


Sharding: All content tagged as Sharding in NoSQL databases and polyglot persistence

RavenDB Sharding

Ayende published two articles on implementing sharding with RavenDB: the first using the default round-robin strategy and the second sharding based on a set of rules.

What I’ve noticed in these posts:

  1. RavenDB requires defining the actual shard servers in your sharding implementation (i.e. in your source code)
  2. when performing writes there are a bunch of round trips (for ID generation)

    The key to reducing latency is saving round trips
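Saving round trips on ID generation is usually done by allocating IDs in batches: the client fetches a whole range in one round trip and hands out IDs locally until the range runs out. A minimal sketch of that idea (the class and `fetch_range` callable here are hypothetical, not RavenDB's actual API):

```python
class HiLoIdAllocator:
    """Hands out ids from a locally cached range; only fetches a new
    range from the server (one round trip) when the cache runs out."""

    def __init__(self, fetch_range):
        self.fetch_range = fetch_range   # () -> (lo, hi): the costly round trip
        self.next_id, self.max_id = 0, -1
        self.round_trips = 0

    def next(self):
        if self.next_id > self.max_id:   # local range exhausted
            self.next_id, self.max_id = self.fetch_range()
            self.round_trips += 1
        nid, self.next_id = self.next_id, self.next_id + 1
        return nid

# Simulated server handing out ranges of 32 ids per round trip.
state = {"hi": 0}
def fetch_range():
    lo = state["hi"]
    state["hi"] += 32
    return lo, state["hi"] - 1

alloc = HiLoIdAllocator(fetch_range)
ids = [alloc.next() for _ in range(100)]
print(alloc.round_trips)  # 4 round trips for 100 ids, instead of 100
```

One hundred sequential IDs cost four round trips instead of one per write, which is exactly the latency saving the quote is after.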

Original title and link: RavenDB Sharding (NoSQL database©myNoSQL)

Database Sharding Using a Proxy

ScaleBase’s Liran Zelkha is making the case for database sharding using a proxy:

First and foremost, since the sharding logic is not embedded inside the application, third party applications can be used, be it MySQL Workbench, MySQL command line interface or any other third party product. This translates to a huge saving in the day-to-day costs of both developers and system administrators.

Compare ScaleBase’s proxy-based sharding:

ScaleBase Proxy Sharding

with MongoDB’s sharding:

MongoDB sharding

Another example would be the Hadoop HDFS NameNode, which provides somewhat similar functionality.
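The core of the proxy argument is that routing lives in one place the application never sees: the client talks to a single endpoint, and the proxy picks the backend from the shard key. A toy sketch of that separation (class and method names are illustrative, not ScaleBase's API; dicts stand in for MySQL servers):

```python
import hashlib

class ShardingProxy:
    """Single endpoint the application talks to; routing by shard key
    happens here, not in application code."""

    def __init__(self, backends):
        self.backends = backends  # e.g. one dict per MySQL server

    def _pick(self, key):
        h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return self.backends[h % len(self.backends)]

    def put(self, key, row):
        self._pick(key)[key] = row

    def get(self, key):
        return self._pick(key).get(key)

shards = [{}, {}, {}]
proxy = ShardingProxy(shards)
for uid in range(9):
    proxy.put(uid, {"user_id": uid})
print(proxy.get(5))  # the caller never learns which shard held the row
```

Because the application only ever sees the proxy's interface, any third-party tool speaking the same protocol works unchanged, which is the cost saving Zelkha is pointing at.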

Original title and link: Database Sharding Using a Proxy (NoSQL database©myNoSQL)


MySQL Sharding vs MySQL Cluster

StackExchange Q&A:

Q: Considering performance only, can a MySQL Cluster beat a custom data sharding MySQL solution?

A: I would say that MySQL Cluster could achieve higher throughput/host than sharded MySQL+InnoDB provided that:

  • Queries are simple
  • All data fits in-memory

In terms of latency, MySQL Cluster should have more stable latency than sharded MySQL. Actual latencies for purely in-memory data could be similar. As queries become more complex, and data is stored on disk, the performance comparison becomes more confusing.

Make sure you read the complete answer as it covers some more MySQL Sharding vs MySQL Cluster pros and cons.

Mat Keep

Original title and link: MySQL Sharding vs MySQL Cluster (NoSQL database©myNoSQL)

Shard-Query and Distributed Set Processing

About Shard-Query:

Shard-Query is an open source tool kit which helps improve the performance of queries against a MySQL database by distributing the work over multiple machines and/or multiple cores. This is similar to the divide and conquer approach that Hive takes in combination with Hadoop. Shard-Query applies a clever approach to parallelism which allows it to significantly improve the performance of queries by spreading the work over all available compute resources.

And how it applies the relational algebra to distribute operations:

Every SQL statement breaks down into a relational algebra equation. In algebra you learn that some operations are “distributable”, that is, you can split them up into smaller units, and when you put those units back together the result is the same as if you didn’t take it apart. Projection, GROUP BY, SUM, COUNT, MIN*, and MAX* are distributable. With a little elbow grease, all non-distributable aggregate functions (AVG, STDDEV, VARIANCE, etc.) can be decomposed into distributable functions using simple substitution rules.
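The substitution rule for AVG is the classic example: AVG itself is not distributable, but SUM and COUNT are, so each shard returns its partial (SUM, COUNT) pair and the coordinator divides the combined totals. A small sketch (the shard data is made up for illustration):

```python
# Each shard computes distributable partial aggregates (SUM, COUNT);
# the coordinator combines them, so AVG = total SUM / total COUNT.
shards = [
    [10, 20, 30],        # rows that landed on shard 1
    [5, 15],             # shard 2
    [40, 50, 60, 70],    # shard 3
]

partials = [(sum(rows), len(rows)) for rows in shards]   # runs per shard
total_sum = sum(s for s, _ in partials)                  # combine step
total_count = sum(c for _, c in partials)
avg = total_sum / total_count

flat = [v for rows in shards for v in rows]
assert avg == sum(flat) / len(flat)   # same result as a single-node AVG
print(avg)
```

Note that naively averaging the per-shard averages would be wrong whenever shards hold different row counts; decomposing into SUM and COUNT is what makes the combine step exact.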

Original title and link: Shard-Query and Distributed Set Processing (NoSQL databases © myNoSQL)

RavenDB to Add Auto Sharding


At some point you realize that the data has grown too large for a single server, so you need to shard the data across multiple servers. You bring up another RavenDB server with the sharding bundle installed. You wait for the data to re-shard (during which time you can still read / write to the servers). You are done. At least, that is the goal. In practice, there is one step that you would have to do, you would have to tell us how to shard your data.

In theory, auto sharding would be available in every database. In practice, there are just a few where auto sharding actually works.

Original title and link: RavenDB to Add Auto Sharding (NoSQL databases © myNoSQL)


Scaling Graph Databases: Sharding and Consistent Routing

The subject of scaling graph databases pops up every now and then, proving that sharding highly connected data is still an unresolved problem.

Jim Webber[1] has published a series of posts — here, here, and here — discussing generic graph sharding solutions, the route Neo4j is taking for addressing this problem, and, in the last post, a simple strategy — suggested by Mark Harwood — for deciding a graph database scaling approach:

  1. Dataset size: Many tens of Gigabytes
    Strategy: Fill a single machine with RAM

  2. Dataset size: Many hundreds of Gigabytes
    Strategy: Cache sharding

  3. Dataset size: Terabytes and above
    Strategy: Domain-specific sharding

The cache sharding approach — described here — suggests replacing the problem of graph sharding with “the simpler problem of consistent routing”. But I’m not sure how this solution works effectively:

  1. if there is a way to provide consistent routing, isn’t that equivalent to having a solution for sharding the graph?
  2. considering graph databases are chatty — in the sense that most of the time there is a complex traversal happening — how smart should the client and the router be to work effectively and efficiently?

For now I feel that being able to solve consistent routing for a graph implies having a strategy for sharding it. The reverse applies too: having a sharding strategy would provide a routing solution. Thus scaling graph databases remains a subject open for research.
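To make the cache sharding idea concrete: every replica holds the full graph, so any instance can answer any query; the router's only job is to send traversals that start in the same region of the graph to the same replica, keeping that replica's cache warm for that region. A minimal sketch of such a router (the class and replica names are hypothetical, not Neo4j's actual mechanism):

```python
import hashlib

class ConsistentRouter:
    """Cache sharding router: every replica stores the full graph, but
    queries starting from the same node always go to the same replica,
    so that replica's cache stays warm for that region of the graph."""

    def __init__(self, replicas):
        self.replicas = replicas

    def route(self, start_node_id):
        h = int(hashlib.sha1(str(start_node_id).encode()).hexdigest(), 16)
        return self.replicas[h % len(self.replicas)]

router = ConsistentRouter(["neo4j-a", "neo4j-b", "neo4j-c"])
target = router.route("user:42")
print(target)
```

The sketch also illustrates the objection above: a hash of the start node says nothing about where the traversal wanders afterwards, so a deep traversal may still blow the cache. Routing consistently is easy; routing consistently with the shape of the graph is where it collapses back into the sharding problem.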

  1. Jim Webber: Chief Scientist with Neo Technology, @jimwebber  

Original title and link: Scaling Graph Databases: Sharding and Consistent Routing (NoSQL databases © myNoSQL)

MongoDB Pre-Splitting for Faster Data Loading and Importing

Jeremy Zawodny (Craigslist) explains how to optimize the import of data that far exceeds the amount of RAM available in a sharded MongoDB cluster:

Many of the chunks the balancer decided to move from the busy shard to a less busy shard contained “older” data that had already been flushed to disk to make room in memory for newer data. That means that the process of migrating those chunks was especially painful, since loading in that older data meant pushing newer data from memory, flushing it to disk, and then reading back the older data only to hand it to another shard and ultimately delete it. All the while newer data is streaming in and adding to the pressure.

That extra I/O and flushing eventually manifest themselves as lower throughput. A lot lower. Needless to say, the situation was not sustainable. At all.

This is a perfect example of how to investigate an issue in MongoDB auto-sharding.
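The fix Zawodny describes is pre-splitting: compute the chunk boundaries up front and create them before the import starts, so the balancer never has to migrate freshly written data. A sketch of the boundary computation (the helper is hypothetical and assumes a uniformly distributed integer shard key; in practice each point would be issued to MongoDB via the `split` admin command or `sh.splitAt()` before loading):

```python
def presplit_points(min_key, max_key, num_chunks):
    """Evenly spaced split points for an integer shard key, to be created
    before the import so no chunk migrations happen mid-load."""
    step = (max_key - min_key) / num_chunks
    return [int(min_key + step * i) for i in range(1, num_chunks)]

# 16 chunks over a key space of 0..1,000,000
points = presplit_points(0, 1_000_000, 16)
print(points[:3])  # [62500, 125000, 187500]
```

With the chunks (and their shard assignments) in place before the first insert, writes spread across the cluster immediately and the painful flush-and-reload cycle described above never starts.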

Original title and link: MongoDB Pre-Splitting for Faster Data Loading and Importing (NoSQL databases © myNoSQL)


On Sharding Graph Databases

We can help to maintain a balanced graph by applying domain-specific knowledge to place nodes on shards; we can use insert-time algorithms to help us select the most appropriate shard to place a node; and we can use re-balancing algorithms periodically (sort of like a graph defrag) to help maintain a good node-to-shard mapping at runtime. We can also keep heuristic information on which relationships are traversed most frequently and optimise around those.

For graph databases, the problem is that what is optimal for one scenario can be a huge issue for other scenarios. If only you could rebalance on a per-scenario basis, and do it without killing the inter-node communication.

Original title and link: On Sharding Graph Databases (NoSQL databases © myNoSQL)


MongoDB Replica Sets and Sharding Question List

I’m seeing many questions asked[1] about MongoDB replication and sharding, so I thought it might be a good idea to gather the most interesting ones and submit them to the MongoDB/10gen people to get some answers. So, after reading the documentation on MongoDB replica sets[2] and MongoDB sharding[3], what questions would you want answered?

Please post a comment with your questions and I’ll be adding them to the list. After accumulating a couple of good questions I’ll forward them to MongoDB/10gen people for answers.

To start with, here are a couple from me:

  1. What may cause replica set members to lag behind the primary or become “disconnected” (aside from network partitions)?
  2. How would one determine the best size for sharding “chunks”?
  3. Is it possible to end up with an unbalanced cluster, and what would lead to it?

What are yours? Feel free to forward this to any friends using MongoDB who are looking into using replica sets and sharding.

Note: I’m not planning to create yet another forum or Q&A site, so I’ll make sure that once we get some answers these will be published in a place where everyone interested will find them easily.

Note: If you are interested in getting answers about other NoSQL databases, please let me know and I’ll create the initial list.

  1. Not only on the MongoDB group, but also on blogs
  2. MongoDB replica sets resources
  3. MongoDB sharding resources

Original title and link: MongoDB Replica Sets and Sharding Question List (NoSQL databases © myNoSQL)

Sharding with SQL Azure

Just after posting about an excellent SQL Azure comparison, I found another interesting Microsoft Azure article.

It is about sharding with SQL Azure, covering aspects such as the principles, challenges, and common patterns of horizontal partitioning, a high-level design of an ADO.NET sharding library, and an intro to SQL Azure Federations:

The proposed implementation will map data to specific shards by applying one or more strategies upon a “sharding key” which is the primary key in one of the data entities. Related data entities are then clustered into a related set based upon the shared shard key and this unit is referred to as an atomic unit. All records in an atomic unit are always stored in the same shard.
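The “atomic unit” idea is simply that every entity in the cluster derives its placement from the shared sharding key, so related rows can never end up on different shards. A toy sketch of that invariant (function name and sample data are illustrative, not from the article's ADO.NET library):

```python
import zlib

def shard_for(sharding_key, num_shards):
    """Map a sharding key to a shard index with a stable hash."""
    return zlib.crc32(str(sharding_key).encode()) % num_shards

# Atomic unit: the customer row and its orders all carry customer_id as
# the sharding key, so the whole unit always lands on the same shard.
customer = {"customer_id": 1001, "name": "Acme"}
orders = [{"customer_id": 1001, "order_id": i} for i in range(3)]

target = shard_for(customer["customer_id"], 4)
assert all(shard_for(o["customer_id"], 4) == target for o in orders)
print(target)
```

Keeping the whole unit co-located is what lets joins and transactions within a customer's data stay single-shard, which is the main reason the sharding key is chosen from a root entity's primary key.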

Be aware that the article is quite long, but definitely worth reading.

Original title and link: Sharding with SQL Azure (NoSQL databases © myNoSQL)


NoSQL: The Alternative to MySQL Sharding

Lynn Monson (@lmonson) has a very good list of possible issues while scaling out with MySQL (relational db):

You’re going to have to face a few realities:

  1. Master DB machines go down or get sick. Given a number of read slaves of that master, it’s non trivial to figure out which of the slaves becomes the new master and how to get the other read slaves to begin replicating from that new master at the proper point. (Amazon’s upcoming RDS read slaves are, therefore, pretty amazing technology)
  2. When master DB machines go down, you will incur a delay in setting up a new master
  3. How do you add new shards and rebalance the data?
  4. You need a way to know which shard to talk to. Perhaps by implementing your own version of Consistent Hashing?
  5. Caching is also your problem. You can use off-the-shelf systems, say memcached, to reduce the work, but it’s still not zero work.
  6. You will most likely give up distributed joins between the various shards.
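Point 4's mention of consistent hashing is worth unpacking: unlike modulo hashing, where adding a shard remaps almost every key, a consistent hash ring only remaps the fraction of keys the new shard takes over. A compact sketch (the class is illustrative; real deployments use a library, and the vnode count is an assumption):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent hash ring: adding a shard remaps only a
    fraction of keys, unlike modulo hashing which remaps most of them."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns many virtual points on the ring for even spread.
        self.ring = sorted(
            (self._h(f"{node}:{v}"), node)
            for node in nodes for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # A key belongs to the first virtual point clockwise from its hash.
        i = bisect.bisect(self.keys, self._h(str(key))) % len(self.ring)
        return self.ring[i][1]

before = HashRing(["db1", "db2", "db3"])
after = HashRing(["db1", "db2", "db3", "db4"])
moved = sum(before.node_for(k) != after.node_for(k) for k in range(10_000))
print(moved / 10_000)  # roughly a quarter of keys move, not three quarters
```

That bounded remapping is what makes issue 3 on the list (adding shards and rebalancing) tractable, though the data for the moved keys still has to be physically migrated.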

On the other hand, not all NoSQL databases solve all of these issues.

Original title and link: NoSQL: The Alternative to MySQL Sharding (NoSQL databases © myNoSQL)