


document store: All content tagged as document store in NoSQL databases and polyglot persistence

On Why I Think These Pro MongoDB Arguments Are Not Unique…

A couple of days ago I read ☞ a blog post with what I’d call an extremely catchy title: “Why I Think Mongo is to Databases what Rails was to Frameworks”. While the 7 reasons presented in the article are not wrong by themselves, I think that the features mentioned are not unique to MongoDB.

But let’s take them one by one…

1. Migrations are Dead

[…] migrations are so last year.

Throw a new key into any model and you can start adding data to it.

The whole thing about migrations comes down to the complexity of changing the fixed schema imposed by an RDBMS. In other words, any schema-less solution, be it a document database, a key-value store or even a schema-less RDBMS, will show the same benefit.
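To make the point concrete, here is a minimal sketch in plain Python with no database involved. The `users` list, the `twitter` key and the `twitter_handle` helper are made up for illustration. It shows what "throwing a new key into a model" amounts to in any schema-less store: older records simply lack the key, and readers have to tolerate that.

```python
# A schema-less "collection" modeled as a list of dicts (illustrative only,
# not the API of any particular store).
users = [
    {"name": "alice"},
    {"name": "bob"},
]

# No migration step: a new record simply carries a new key.
users.append({"name": "carol", "twitter": "@carol"})


def twitter_handle(user):
    """Readers must cope with records written before the key existed."""
    return user.get("twitter")  # None for older records


handles = [twitter_handle(u) for u in users]
```

The migration does not disappear entirely: its cost moves into the application code, which now has to handle both record shapes.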

2. Single Collection Inheritance Gone Wild

By using inheritance, they all share the same base keys, validations, callbacks and collection.

Before looking at inheritance we’d first need to separate state from behavior, and then separate behavior that can be implemented close to the data from behavior that belongs to the object/app model.

Behavior specific to the object/app model is not important here as it has nothing to do with the data store. The kind of behavior that can be implemented close to the data (f.e. validations) has long been supported by RDBMS by means of simple data type definitions, constraints or even triggers.

So we are left with mapping inheritable state to the data store. As we already know, key-value stores are most of the time completely unaware of the data structure and so inheritance has no meaning there. For approaches where the store must be aware in some way of the data structure, I’d say that over the years RDBMS and ORMs have come up with an extremely well designed approach for handling it and I’ll just mention three basic strategies: table per class hierarchy, table per subclass, table per concrete class. In case you’d like to read more on this I’d recommend this Hibernate (Java ORM) ☞ doc.
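As a rough illustration of the first strategy (table per class hierarchy), here is a sketch using Python's built-in `sqlite3` module. The `vehicles` table, its columns and the `car`/`truck` subclasses are hypothetical names chosen for the example. The whole hierarchy lives in one table, a discriminator column tells the subclasses apart, and a `CHECK` constraint plays the role of a validation implemented close to the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE vehicles (
        id         INTEGER PRIMARY KEY,
        kind       TEXT NOT NULL CHECK (kind IN ('car', 'truck')), -- discriminator
        name       TEXT NOT NULL,  -- state shared by the whole hierarchy
        seats      INTEGER,        -- only meaningful for cars
        payload_kg INTEGER         -- only meaningful for trucks
    )
    """
)
conn.executemany(
    "INSERT INTO vehicles (kind, name, seats, payload_kg) VALUES (?, ?, ?, ?)",
    [("car", "hatchback", 5, None), ("truck", "hauler", None, 12000)],
)

# Loading one "subclass" is a filter on the discriminator column.
trucks = conn.execute(
    "SELECT name, payload_kg FROM vehicles WHERE kind = 'truck'"
).fetchall()
```

The other two strategies trade the nullable subclass columns for extra tables and joins (table per subclass) or for duplicated base columns (table per concrete class).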

5. Embedding Custom Objects/Hash Keys/Array Keys

For the next three points, I have reversed the order as I do see them as specialized cases of this more generic one.

Mongo natively understand arrays […] you can even index the values and perform efficient queries on arrays

As if array keys were not enough, hash keys are just as awesome.

What is that you say? Arrays and hashes just aren’t enough for you. Well go fly a kite… or just use an embedded object.

Storing custom objects was “always” possible, even if we include key-value stores or even RDBMS here (nb it is obvious that document stores can handle this scenario). Over time, most of the concerns expressed about custom objects were in terms of the efficiency/performance of storing/fetching such data and of data layout (nb what I mean is how transparent it is to operate with such an object).

So I’d say that the real questions/feature here would be:

  • does the engine have an optimal strategy for storing/fetching this sort of ‘objects’? (f.e. how does it deal with array size modifications, etc.)
  • in case your app needs to access details of such an ‘object’, does the store support it? (f.e. can I filter results based on such an ‘object’ field value(s)?)
  • in case such ‘objects’ are used to model relationships, how is your engine helping you avoid the N+1 query issue?

6. Incrementing and Decrementing

I see incrementation/decrementation as just a particular case of the generic “read and modify value” scenario, which is supported by both RDBMS and column-based stores (nb you can correct me on this one as I haven’t checked them all). There is an additional characteristic of this operation that is probably making the difference: atomicity. An even more generic feature that would fit this scenario is the ☞ compare-and-swap.
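To show how an increment reduces to compare-and-swap, here is a small sketch built on a toy in-memory store. The `Store` class and the `increment` helper are hypothetical and do not correspond to the API of any database mentioned here. A lock stands in for whatever mechanism a real engine uses to make the swap atomic.

```python
import threading


class Store:
    """Toy in-memory key-value store with an atomic compare-and-swap."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def compare_and_swap(self, key, expected, new):
        """Set key to new only if its current value still equals expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def increment(store, key, delta=1):
    """Increment expressed as a read-modify-write retry loop on top of CAS."""
    while True:
        current = store.get(key)
        new = (current or 0) + delta
        if store.compare_and_swap(key, current, new):
            return new
        # Another writer got in between the read and the swap: retry.
```

A store that offers a native atomic increment saves the client this retry loop, but the guarantee it provides is the same one compare-and-swap gives to any read-and-modify operation.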

7. Files, aka GridFS

Mongo actually has a really cool GridFS specification that is implemented for all the drivers to allow storing files right in the database. I remember when storing files in the database was a horrible idea, but with Mongo this is really neat.

Well, I guess everyone tried at some point to store files into MySQL or other RDBMS. The whole issue related to it was the performance of the operation and how handy the API was.

In the end, please allow me to say once again that my intention is neither to argue against MongoDB features nor to deny how important these features can be for an application, but rather to clarify that these features are not unique to MongoDB. And if I misinterpreted any of these, please feel free to correct me.

2009 Last NoSQL Releases

I guess these are the last releases for an eventful 2009 NoSQL year:

Mongo 1.2.1

Mongo 1.2.1 is just a minor release featuring the following bug fixes:

  • mongoimport now works on windows
  • gcc 4.4 can be used to compile
  • better map/reduce error handling

You can read the announcement ☞ here, the complete changelog ☞ here and download Mongo 1.2.1 from ☞ here.

In case you are planning on using MongoDB, I’d encourage you to check these MongoDB screencasts and all MongoDB coverage on MyNoSQL.

Terrastore 0.3

A day after our coverage of Terrastore, a consistent, partitioned and elastic document database, the 0.3 version was released featuring a much easier installation tool. You can read the announcement ☞ here. Sergio Bossa, Terrastore’s creator, has published a nice summary of what Terrastore is ☞ here.

Neo4j 1.0-b11

Last, but not least, I should also mention ☞ Neo4j’s latest RC before 1.0. The case for graph databases should give you a quick understanding of why and when Neo4j can be a better fit for your app.

And with this, I am looking forward to more exciting NoSQL releases in 2010.

Riak Presentations and Screencasts

Following the model set by MyNoSQL, the Basho guys have published their list of Riak screencasts and presentations. While some of these have already made it onto MyNoSQL, here is the complete list:

  • Justin Sheehy: Riak: Control your data, don’t let it control you (☞ video)
  • Brian Fink: Riak: web-shaped data storage system (☞ video)
  • Dave Smith: Rebar (☞ video)
  • Rusty Klophaus: Nitrogen and Riak by Example (☞ video)
  • Martin Scholl: Riak, a distributed, web-inspired database (☞ video)
  • Brian Fink: Intro to Riak (☞ video)

And if you are interested in document databases, I’d also recommend the Introduction to MongoDB Screencast.


Presentation: MongoDB by Kyle Banker

Kyle Banker gives a quick intro to the NoSQL world and to MongoDB with Ruby.

Terrastore: A Consistent, Partitioned and Elastic Document Database

Terrastore is a very young Apache licensed document store solution built on top of Terracotta (an in-memory clustering technology) that released its 0.2 version a couple of days ago.

I had the opportunity to chat with Sergio Bossa (@sbtourist) and have him answer a couple of questions about Terrastore.

Alex: What is it that made you create Terrastore in the first place?

Sergio: I wanted a scalable document store with consistency features, because I think that’s an uncovered topic/space in current implementations, which are all geared toward BASE.

Being a document database, Terrastore belongs to the same category as CouchDB, MongoDB, and Riak. In some regards (f.e. partitioning), Terrastore is similar to Riak. You should also check [1] to find out more about Terrastore and the CAP theorem.

Terracotta replication is not full, nor geared toward all nodes, but only those actually requiring the replicated data. This is optimized even further in Terrastore, where, thanks to consistent hashing and partitioning, data is not duplicated at all. Terrastore also guarantees that data will never be duplicated among nodes, unless new nodes are joining or older nodes are leaving, thus requiring data redistribution. A Terrastore client doesn’t need to know where the data is: it can contact any Terrastore node and requests will be routed to the proper node holding the value (note: this is similar to the way Dynamo, Project Voldemort, Cassandra and other distributed stores work).
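For readers unfamiliar with consistent hashing, here is a generic sketch of the technique in Python. It is not Terrastore's actual implementation: the `HashRing` class, the node names and the number of virtual points per node are all assumptions made for the example. Each node owns several points on a ring, and a key is routed to the node owning the first point at or after the key's hash.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual points per node."""

    def __init__(self, nodes=(), replicas=64):
        self._replicas = replicas
        self._ring = []  # sorted list of (hash, node) points
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self._replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        """Return the node owning the first ring point at or after the key."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring
```

The property that matters for the redistribution described above is that when a node leaves, only the keys it owned move to other nodes; every other key stays where it was.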

At this point, more people joined the chat and so more interesting questions and answers came up.

Alex: Considering Terrastore is built on top of Terracotta, is it an in-memory store, making it somewhat similar to Redis?

Sergio: Correct, it stores everything in memory, but it is persistent as well. It is not as fast as Redis mainly due to some overhead related to its distributed features.

Paulo Gaspar: Terrastore looks very much like a persistent, transactional Memcached service.

Sergio: Persistent, transactional, and partitioned/sharded. An interesting difference is that afaik Memcached partitioning is done client side, while Terrastore has builtin support for data partitioning, distribution and access routing.

Terrastore is already HTTP and JSON friendly [2] and the future might bring support for the memcached protocol too.

Please see the following resources to learn more about Terrastore:

Introduction to MongoDB Screencast

The people at Teach Me to Code have published a 3-part screencast about MongoDB. The episodes cover the following aspects:

  • CRUD operations using MongoDB shell
  • creating a Ruby application that accesses MongoDB
  • using MongoMapper (see NoSQL libraries) with your Rails app and MongoDB

You can watch the complete series below (the episodes are 13min, 21min and 10min long, respectively). Also make sure you check Michael Dirolf’s Introduction to MongoDB.

Introduction to MongoDB: CRUD operation using MongoDB shell

Introduction to MongoDB: building a Sinatra based app interacting with MongoDB

Introduction to MongoDB: Rails and MongoMapper

The videos are also available for download (see the reference section). And you can always watch more NoSQL videos by using the video tag.

And while we are at document stores, I’d encourage you to also check Riak Presentations and Screencasts.

Geo NoSQL: CouchDB, MongoDB, and Tokyo Cabinet

A lot of people say that location-enabled services will be the #### [*] of tomorrow, so is there any Geo NoSQL?

Populating a MongoDB with POIs

What I especially liked is the flexibility you get from this kind of databases (nb MongoDB) and the ease of installation and use. The downside for geographic applications is that at the moment there is no built-in support for geometries.

Using MongoDB to Store Geographic Data

Managing GIS data with NoSQL in circumstances where performances and scalability are a major issue could be the way for the win.

GeoCouch: The future

What I call “complex analytics” is things like: “return all apple trees that are located with a 10km range around buildings that have are over 100m high, but only in countries with a population over 50 million people” is not possible with GeoCouch as you would need the attribute values as well. Those are stored in CouchDB, so you would need to request them. What GeoCouch only supports is a simple: give me all IDs within a bounding box/polygon/radius.

Tokyo Cabinet: Loading and querying point data

I’m going to load 500.000 POIs in a database and query them with a bounding box query. I will use the table database from Tokyo Cabinet because it supports the most querying facilities. With a table database you can query numbers with full matched and range queries and for strings you can do full matching, forward matching, regular expression matching,…
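A bounding box query of this kind boils down to two numeric range conditions, one on latitude and one on longitude. The sketch below shows the idea in plain Python rather than against Tokyo Cabinet itself. The `POI` type, the sample points and the function names are invented for the example, and the linear scan stands in for whatever indexes a real store would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class POI:
    name: str
    lat: float
    lon: float


def in_bbox(poi, min_lat, min_lon, max_lat, max_lon):
    """True if the point lies inside the box (both range conditions hold)."""
    return min_lat <= poi.lat <= max_lat and min_lon <= poi.lon <= max_lon


def bbox_query(pois, min_lat, min_lon, max_lat, max_lon):
    """Linear scan; a real store would narrow candidates with range indexes."""
    return [p for p in pois if in_bbox(p, min_lat, min_lon, max_lat, max_lon)]


pois = [
    POI("cafe", 52.52, 13.40),
    POI("museum", 48.86, 2.35),
    POI("park", 52.50, 13.45),
]
hits = bbox_query(pois, 52.0, 13.0, 53.0, 14.0)
```

This simple form does not handle boxes that cross the 180° meridian, and it says nothing about real geometries, which is exactly the limitation the MongoDB quote above points out.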

And so the answer is: yes, we do have some Geo NoSQL!

In some geo parts of the world we are celebrating Christmas today, so Merry Christmas to everyone!

Release: Riak 0.7

The announcement of the Riak 0.7 release came in yesterday and can be read ☞ here.

Probably the most interesting new feature is Riak with embedded Erlang.


CouchDB vs MongoDB: An attempt for a More Informed Comparison

After posting about Scott Motte’s comparison of MongoDB and CouchDB, I thought there should be some more informative sources out there, so I started digging.

The first one I came upon (thanks to Debasish Ghosh @debasishg) is an article about ☞ Raindrop requirements, the issues faced while attacking them with CouchDB, and the pros and cons of possibly replacing CouchDB with MongoDB:

Pros:

  • Uses update-in-place, so the file system impact/need for compaction is less if we store our schemas in one document are likely to work better.
  • Queries are done at runtime. Some indexes are still helpful to set up ahead of time though.
  • Has a binary format for passing data around. One of the issues we have seen is the JSON encode/decode times as data passes around through couch and to our API layer. This may be improving though.
  • Uses language-specific drivers. While the simplicity of REST with CouchDB sounds nice, due to our data model, the megaview and now needing a server API layer means that querying the raw couch with REST calls is actually not that useful. The harder issue is trying to figure out the right queries to do and how to do the “joins” effectively in our API app code.

Cons:

  • easy master-master replication. However, for me personally, this is not so important. […] So while we need backups, we probably are fine with master-slave. To support the sometimes-offline case, I think it is more likely that using HTML5 local storage is the path there. But again, that is just my opinion.
  • ad-hoc query cost may still be too high. It is nice to be able to pass back a JavaScript function to do the query work. However, it is not clear how expensive that really is. On the other hand, at least it is a formalized query language — right now we are on the path to inventing our own with the server API with a “query language” made up of other API calls.

Anyway, while some of the points above are generic, you should definitely try to consider them from the perspective of the Raindrop requirements, about which you can read more here.

Another article comparing MongoDB and CouchDB is hosted by ☞ MongoDB docs. I find it well balanced and you should read it all as it covers a lot of different aspects: horizontal scalability, query expressions, atomicity, durability, mapreduce support, javascript, performance, etc.

I’d also mention this ☞ benchmark comparing the performance of MongoDB, CouchDB and Tokyo Cabinet/Tyrant (note: the author of the benchmark categorizes Tokyo Cabinet as a document database, while Tokyo is a key-value store) and using MySQL results as a reference.

In case you have other resources that you think would be worth including do not hesitate to send them over.

Update: Just found a nice comparison matrix [1].

As a teaser, very soon I will introduce you to a new solution available in this space, so make sure to check MyNoSQL regularly.

Update: The main article about this new document store has been published: Terrastore: A Consistent, Partitioned and Elastic Document Database. I would strongly encourage you to check it, as Terrastore is looking quite promising.

Blog Engine Based on MongoDB

Just in case you want to change your blog to one running MongoDB with MongoMapper. Be prepared for an odd UI though!

Code available on ☞ Gitorious.


CouchDB Full Text Indexing Prototype and Riak Search

A prototype for CouchDB full text indexing based on Joe Armstrong’s code from ☞ Programming Erlang: Software for a Concurrent World

The implementation is quite naive, using a couch database to store the inverted index, but it works surprisingly well for my use case and is very simple.
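For context, an inverted index maps each word to the set of documents containing it, so a full text query becomes a set intersection. The sketch below shows that idea in plain Python. It is not the prototype's Erlang code, and the tokenizer, the function names and the sample documents are assumptions made for illustration; in the prototype the index would live in a couch database rather than in a dict.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric words (deliberately naive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "a": "CouchDB is a document database",
    "b": "Riak is a distributed database",
    "c": "Erlang powers CouchDB and Riak",
}
index = build_index(docs)
```

With these sample documents, `search(index, "couchdb database")` matches only document `a`, while `search(index, "riak")` matches `b` and `c`.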

Not sure though that this prototype would have stopped ☞ the guys from Collecta from migrating to Riak and Riak Search.

The CouchDB full text indexing prototype code can be accessed on ☞ GitHub.


Running a CouchDB cluster on Amazon EC2

If you don’t use EC2 in a way that you always can loose one or two instances, you are using it wrong. If you are not spinning up servers in a way that it takes the same time to set up one instance than it takes to set up 10 instances you are using it wrong.

That’s so true! (make sure you also read [1]). But I’d say the cherry on the cake would be adding partitioning/clustering, which does not seem to be mentioned in the post even though it appears in the title. CouchDB doesn’t yet support partitioning by default, so the solution would require usage of Lounge, the tool developed by the guys at