Offline and Production Notes on MongoDB

Last week featured two of the most interesting posts about MongoDB: the first, from Mathias Meyer (@roidrage), is an ☞ offline investigation of MongoDB; the second is a set of notes from running MongoDB in production, published on the ☞ Boxed Ice blog.

If you are interested in getting started with MongoDB, I’d encourage you to take the time to go through Mathias’ post, which covers the following aspects (I’ve also included a couple of comments):

  • collections and capped collections

    Note: I couldn’t really understand the usage of namespaces and their implications for indexes.

  • data format
  • references

    Note: I’d also strongly suggest taking a look at the MongoDB documentation on ☞ schema design for more details.

  • indexes
  • updates
  • querying
  • durability

    Note: we have previously covered the MongoDB durability tradeoff in detail.

  • replication
  • caching
  • backup
  • storage
  • concurrency

    Note: I’d really appreciate more details on this topic, as it is not completely clear whether all access (both reads and writes) is serialized or only writes are (or neither?); the impact on indexes is not clear either.

  • memory
  • GridFS
  • protocol access

    Note: We argued before that access protocols are extremely important. MongoDB is one of the NoSQL solutions using a proprietary protocol, and it tries to “compensate” for that with tons of libraries.

  • sharding

    Note: I’m probably biased, but I’m still waiting for the moment MongoDB sharding becomes at least beta.

MapReduce support seems to be missing from Mathias’ notes, but luckily we have that covered for you: MongoDB MapReduce tutorial.

While keeping in mind that some of these features are not unique to MongoDB and can be found in other systems, you should be ready to cross-check your app requirements against the lessons learned by the guys at Boxed Ice:

  • namespace limits

    We split our customers across (currently) 3 MongoDB databases because there is a namespace limit of 24,000 per database. This is essentially the number of collections + number of indexes.

  • initial sync/replication of large databases

    Our databases are very large and it takes about 48-72 hours to fully sync all our current data onto a new slave in a different DC (via a site-to-site VPN for security). During this time you’re at risk because the slave is not up to date.

  • initial sync “slows” things

    When doing a fresh sync from a master to a slave, we have observed a “slowdown” in our application response times.

  • index creation blocks

    However, if you have an existing collection and create a new index on it then that process will block the database until the index is created.

  • efficiency of reclaiming diskspace

    We have found that there is a massive discrepancy between a master and a freshly copied slave.
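
To make the namespace limit from the first lesson a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It uses the 24,000 figure quoted above and the "collections + indexes" rule of thumb, and it counts the automatic `_id` index that every collection gets. The function names and the per-customer numbers are hypothetical, made up purely for illustration (they are not Boxed Ice's actual figures), and the sketch ignores any namespaces MongoDB reserves for itself.

```python
import math

# Per-database namespace limit as quoted in the Boxed Ice notes.
NAMESPACE_LIMIT = 24_000


def namespaces_per_customer(collections, extra_indexes_per_collection):
    """Estimate namespaces used by one customer (hypothetical helper).

    Each collection counts once, its automatic _id index counts once,
    and every additional index on it counts once more.
    """
    return collections * (1 + 1 + extra_indexes_per_collection)


def databases_needed(customers, collections, extra_indexes_per_collection,
                     limit=NAMESPACE_LIMIT):
    """Estimate how many databases are needed to stay under the limit."""
    per_customer = namespaces_per_customer(collections,
                                           extra_indexes_per_collection)
    customers_per_db = limit // per_customer
    if customers_per_db == 0:
        raise ValueError("a single customer exceeds the namespace limit")
    return math.ceil(customers / customers_per_db)


if __name__ == "__main__":
    # Assumed workload: 1,000 customers, each with 10 collections
    # and 2 extra indexes per collection.
    print(namespaces_per_customer(10, 2))
    print(databases_needed(1000, 10, 2))
```

The point is only that a collection-per-customer layout eats into the namespace budget quickly, which is why splitting customers across several databases (manual sharding, in effect) becomes necessary.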

Even if not every application will have to deal with the data sizes Boxed Ice is dealing with, I couldn’t help noticing that parts of the process of scaling MongoDB were really painful. Or as Sergio Bossa (@sbtourist) put it in ☞ one of the comments:

Anyways, it seems indeed you had almost the same problems you would had with a MySQL solution:

  • Huge data to deal with.
  • Manual sharding.
  • Sync/replication delays.

So why didn’t you evaluate to switch to a more “large-scale” nosql solution like Cassandra or Riak?

Last but not least, drop me a note if you are planning to use, or are already using, MongoDB in production and you’d like to share your experience with the NoSQL community.