


Scaling CouchDB

One of the most often heard questions about CouchDB is: how do I scale CouchDB? While dealing with Google-size data is not a problem many are actually facing, more and more engineers are looking for viable ways to have fewer single points of failure (SPOFs) in their systems.

We already know that, starting with CouchDB 0.11.0, its replication support became smart enough to let us think about a future of CouchDB-backed distributed web data.

But replication is just one part of scaling. Another part, more closely related to the size of the data, is data partitioning, or sharding, and I think the question of how to scale CouchDB mostly refers to that. So, let’s see what options there are for sharding CouchDB.
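The core idea behind all the sharding solutions below is the same: each document ID is deterministically mapped to one of N shards, usually by hashing. A minimal sketch in Python (the shard count and ID format are illustrative assumptions, not part of any of these tools):

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a CouchDB document ID to a shard by hashing it.

    Hashing spreads IDs roughly evenly and always sends the same
    ID to the same shard, so reads find what writes stored.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same ID always lands on the same shard:
assert shard_for("user:alice", 4) == shard_for("user:alice", 4)
```

Note that a plain modulo scheme like this forces most documents to move when the shard count changes, which is exactly why the tools below invest in smarter routing and resharding.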

Right now, probably the best-known solution for CouchDB sharding is ☞ Lounge, developed by the Meebo team:

The Lounge is a proxy-based partitioning/clustering framework for CouchDB.

There are two main pieces:

  • dumbproxy: an NGINX module that handles simple GETs/PUTs for everything that isn’t a view.
  • smartproxy: a Python/Twisted daemon that handles mapping/reducing view requests across all shards in the cluster.

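The two proxies split the work by request type: document requests go to exactly one shard, while view requests must be scattered to every shard and the rows merged. A rough Python sketch of both behaviors (shard URLs and the `fetch` callable are hypothetical, and this is a simplification of what Lounge actually does):

```python
import hashlib

# Hypothetical shard endpoints; Lounge reads these from its own config.
SHARDS = [
    "http://couch0:5984/db",
    "http://couch1:5984/db",
    "http://couch2:5984/db",
]

def route_doc(doc_id: str) -> str:
    """dumbproxy-style routing: a GET/PUT for a document goes to
    the single shard that owns its ID."""
    idx = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % len(SHARDS)
    return f"{SHARDS[idx]}/{doc_id}"

def scatter_gather_view(fetch, view_path: str) -> list:
    """smartproxy-style view handling: query every shard, then merge
    the rows so the client sees one key-ordered result set."""
    rows = []
    for shard in SHARDS:
        rows.extend(fetch(f"{shard}/{view_path}"))  # one request per shard
    rows.sort(key=lambda r: r["key"])  # CouchDB view rows are key-sorted
    return rows
```

The asymmetry is why Lounge needs two proxies: single-document routing is simple enough for an NGINX module, while merging (and re-reducing) view results needs a real daemon.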
Another solution in this space is ☞ Pillow. While ☞ still young, it shows a lot of promise:

While manually repartitioning a CouchDB database is doable, I’d rather have an automatic way of doing it since I don’t want to make mistakes. […]

Now I’ve released version 0.3 of Pillow. This version supports automatic resharding, routing of requests to the right shard, and views. Reducers need to be written in Erlang, but a summing reducer is in place, and mappers without reducers are supported out of the box. As such, this version of Pillow has all the functionality I set out to develop, but it does not support the full CouchDB API.
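To see why a summing reducer is the natural first reducer to ship: once a view is sharded, each shard returns only partial reduce values, and the proxy must re-reduce them per key. A sketch of that merge step in Python (Pillow’s real reducers are written in Erlang; the row shapes here are illustrative):

```python
def sum_rereduce(per_shard_results):
    """Summing re-reduce: combine the partial per-key sums that
    each shard returns for a reduced view into final totals."""
    totals = {}
    for rows in per_shard_results:
        for row in rows:
            totals[row["key"]] = totals.get(row["key"], 0) + row["value"]
    # Return rows key-sorted, as a CouchDB reduced view would.
    return [{"key": k, "value": v} for k, v in sorted(totals.items())]

# Two shards each hold part of the data; re-reducing re-sums per key.
shard_a = [{"key": "clicks", "value": 3}]
shard_b = [{"key": "clicks", "value": 4}, {"key": "views", "value": 10}]
print(sum_rereduce([shard_a, shard_b]))
# [{'key': 'clicks', 'value': 7}, {'key': 'views', 'value': 10}]
```

Summing works as a re-reduce because addition is associative; reducers that aren’t (say, an average computed directly) can’t be merged this way, which is why supporting arbitrary reducers across shards is the harder problem.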

A more generic solution for partitioning data can be built using Twitter’s ☞ Gizzard framework for creating distributed datastores:

Gizzard architecture

As the diagram above shows, Gizzard acts as the “smart” middleware that handles data distribution between the different stores. Quite a few of Gizzard’s features make it a good candidate for the data partitioning problem: programmable partitioning strategies, support for replication trees, fault tolerance, and winged migrations, to name a few.
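Gizzard’s programmable partitioning centers on a forwarding table: ranges of the key-hash space are declared as data and mapped to shards, so the strategy can be changed (and migrated) without touching application code. A minimal Python sketch of the idea (the class, shard names, and boundary values are illustrative, not Gizzard’s actual API):

```python
import bisect

class ForwardingTable:
    """Sketch of a Gizzard-style forwarding table: each entry gives
    the lower bound of a hash range and the shard that owns it."""

    def __init__(self, entries):
        # entries: list of (lower_bound, shard_name), sorted by bound.
        self.bounds = [bound for bound, _ in entries]
        self.shards = [shard for _, shard in entries]

    def lookup(self, key_hash: int) -> str:
        """Find the last range whose lower bound <= key_hash."""
        idx = bisect.bisect_right(self.bounds, key_hash) - 1
        return self.shards[idx]

# Two ranges splitting a 32-bit hash space between two shards:
table = ForwardingTable([(0, "shard-a"), (2**31, "shard-b")])
print(table.lookup(42))         # shard-a
print(table.lookup(2**31 + 1))  # shard-b
```

Because ranges are explicit rows rather than a fixed modulo, resharding means rewriting a few table entries and copying just the affected ranges, which is what makes Gizzard’s migrations tractable.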

Last but not least, during the Berlin Buzzwords conference I also heard about another solution that is in the works and will be released soon.

So, with quite a few possible solutions already available, scaling CouchDB should no longer be considered an issue.

Important update: CouchDB has a new scaling solution through BigCouch

Original title and link for this post: Scaling CouchDB (myNoSQL © NoSQL databases)