NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



The impact of document IDs on performance of CouchDB

One CouchDB trick that is probably not so well known:

Now if the data that is inserted in the B-tree (in this case, the DocIDs) is random, it causes the tree to fan out quickly. As the minimum fill rate is 1/2 for every internal node, the nodes are mostly filled up to the 1/2 (as the data spreads evenly due to its randomness) generating more internal nodes than before.


The B-tree is why using the (semi-)sequential IDs is a real life saver. The new model causes the database to be filled in orderly fashion and the buckets (i.e. leafs) are filled in instead of leaving them half full. Best part here is that the auto generated IDs by CouchDB (which were not an option for us) already use the sequential ID scheme, so using those IDs you don’t really need to worry a thing.

So remember kids: if you cram loads of data in your CouchDB, remember to select your document ID scheme carefully!

The only thing that is missing is a pointer to what makes a good ID for CouchDB.

Original title and link: The impact of document IDs on performance of CouchDB (NoSQL databases © myNoSQL)