One of the problems mentioned when discussing relational databases scalability is that handling storage enforced relationships, ACID and scale do not play well together. In the NoSQL space there is a category of storage solutions that uses highly interconnected data: graph databases. (note also that some of these graph databases are also transactional).
Lately there have been quite a few interesting discussions related to scaling graph databases. Alex Averbuch is working on a sharding Neo4j thesis and his recent post presents some of the possible solutions. Alex’s article is a very good starting point for anyone interesting in scaling graph databases.
Then there is also this article on InfoGrid‘s blog that is presenting a different web-like solution based on a custom protocol: XPRISO: eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects. While I haven’t had the chance to dig deeper into InfoGrid suggested approach there was one thing that caught my attention right away: while the association with web-scale is definitely an interesting idea, having specific knowledge of the nodes location and having to use custom API for it doesn’t seem to be the best solution. Basically the web addressed this by having URIs for each reachable resource (InfoGrid should try a similar idea, get rid of the different API for accessing local vs remote nodes, etc.)
Update: make sure you check the comment thread for more details about InfoGrid perspective on scaling graph databases.
Oren Eini concludes in his post:
After spending some time thinking about it, I came to the conclusion that I can’t envision any general way to solve the problem. Oh, I can think of several ways of reduce the problem:
- Batching cross machine queries so we only perform them at the close of each breadth first step.
- Storing multiple levels of associations (So “users/ayende” would store its relations but also “users/ayende”’s relation and “users/arik”’s relations).
While I haven’t had enough time to think about this topic, my gut feeling is that possible solutions are to be found in the space of a combination of using unique identifiers for distributed nodes and a mapreduce-like approach. I cannot stop wondering if this is not what Google’s Pregel is doing (nb I should have read the paper (pdf) firstly).