Scaling Graph Databases: Sharding and Consistent Routing
The subject of scaling graph databases is popping up every now and then proving that sharding highly connected data is still an unresolved problem.
Jim Webber[1] has published a series of posts — here, here, and here — discussing generic graph sharding solutions, the route Neo4j is taking for addressing this problem, and, in the last post, a simple strategy — suggested by Mark Harwood — for deciding a graph database scaling approach:
Dataset size: Many tens of Gigabytes
Strategy: Fill a single machine with RAMDataset size: Many hundreds of Gigabytes
Strategy:Cache shardingDataset size: Terabytes and above
Strategy: Domain-specific sharding
The cache sharding approach — described here — suggests replacing the problem of graph sharding with “the simpler problem of consistent routing”. But I’m not sure how this solution works effectively:
- if there is a way to provide consistent routing isn’t that equivalent with having an solution for sharding the graph?
- considering graph databases are chatty — in the sense that most of the time there is a complex traversal happening — how smart should the client and the router be to work effectively and efficiently?
For now I feel that being able to solve the consistent routing in a graph implies having a strategy for sharding a graph. The reverse applies too: having a sharding strategy would provide a routing solution. Thus scaling graph databases remains a subject open for research.
-
Jim Webber: Chief Scientist with Neo Technology, @jimwebber ↩
Original title and link: Scaling Graph Databases: Sharding and Consistent Routing (NoSQL databases © myNoSQL)