


mongodb case study: All content tagged as mongodb case study in NoSQL databases and polyglot persistence

MongoDB: A 12 month story from Wordnik

An honest account of using MongoDB in production, covering both the good and the bad:

As previously blogged, Wordnik is a heavy user of 10gen’s MongoDB. One year ago today we started the investigation to find an alternative to MySQL to store, find, and retrieve our corpus data. After months of experimentation in the non-relational landscape (and running a scary number of nightly builds), we settled on MongoDB. To mark the one-year anniversary of what ended up being a great move for Wordnik, I’ll describe a summary of how the migration has worked out for us.

On the good side, the post mentions MongoDB’s performance and flexibility; on the side of things that could be better, its reliability and maintainability.

Update: the ☞ Hacker News comments are interesting too, debating the pros and cons of MongoDB versus MySQL.

Original title and link: MongoDB: A 12 month story from Wordnik (NoSQL databases © myNoSQL)


Foursquare MongoDB Outage Post Mortem

We are finally getting some details about how and why MongoDB brought down Foursquare:

On Monday morning, the data on one shard (we’ll call it shard0) finally grew to about 67GB, surpassing the 66GB of RAM on the hosting machine. Whenever data size grows beyond physical RAM, it becomes necessary to read and write to disk, which is orders of magnitude slower than reading and writing RAM. Thus, certain queries started to become very slow, and this caused a backlog that brought the site down.


A few questions come to mind:

  • Were writes configured to use MongoDB’s default fire-and-forget behavior? In that case, whether a write made it to disk wouldn’t matter much for request processing.
  • If replicas were available, were reads distributed among them?
  • If no replicas were available, what would have been the quickest way to bring up read-only replicas?
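A minimal sketch of the two settings those questions refer to, expressed as MongoDB connection-string options. The helper and host names are illustrative, not from the post:

```python
def mongo_uri(host, *, acknowledged, read_from_secondaries):
    """Build a connection string selecting write concern and read preference."""
    opts = []
    # w=0 is the classic "fire and forget" write: the driver does not wait
    # for the server to acknowledge the write; w=1 waits for the primary.
    opts.append("w=1" if acknowledged else "w=0")
    # secondaryPreferred routes reads to replicas when one is available,
    # taking read load off the primary.
    if read_from_secondaries:
        opts.append("readPreference=secondaryPreferred")
    return f"mongodb://{host}/?" + "&".join(opts)

uri = mongo_uri("db0.example.com", acknowledged=False, read_from_secondaries=True)
```

With fire-and-forget writes, a disk-bound primary slows reads but not the request path of writes; with acknowledged writes, every request feels the disk latency directly.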

We first attempted to fix the problem by adding a third shard. We brought the third shard up and started migrating chunks. Queries were now being distributed to all three shards, but shard0 continued to hit disk very heavily. When this failed to correct itself, we ultimately discovered that the problem was due to data fragmentation on shard0. In essence, although we had moved 5% of the data from shard0 to the new third shard, the data files, in their fragmented state, still needed the same amount of RAM.


  • why could the third shard accommodate only 5% of the data?

This can be explained by the fact that Foursquare check-in documents are small (around 300 bytes each), so many of them can fit on a 4KB page. Removing 5% of these just made each page a little more sparse, rather than removing pages altogether.[2]


  • I might be wrong, but it sounds like the problem here is that chunks of data are not stored in contiguous space. Considering that MongoDB is supposed to work with documents of any size, what solutions are planned for addressing this issue?

There has been a separate discussion, in which Dwight Merriman (10gen) provided ☞ more details:

  • If your shard key has any correlation to insertion order, I think you are ok.
  • If you add new shards very early s.t. the source shard doesn’t have high load, i think you are ok.
  • If your objects are fairly large (say 4KB+), i think you are ok.
  • If the above don’t hold, you will need the defrag enhancement which we will do asap.

The first point above seems to confirm my last comment on this subject: sharding keys.

Since repairing shard0 and adding a third shard, we’ve set up even more shards, and now the check-in data is evenly distributed and there is a good deal of extra capacity.

Given that the sharding key is the user id, how exactly can you guarantee an even distribution? And as long as bringing up more shards doesn’t immediately address the automatic balancing issue, wouldn’t it be better to shard on data that grows continuously and can evolve unpredictably?
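For context, sharding a collection in MongoDB comes down to choosing the key in a single admin command. Here is a sketch of the command document involved; the namespace and field names are hypothetical, not Foursquare’s actual schema:

```python
def shard_collection_cmd(namespace, key_field):
    """Build the admin command that shards `namespace` on `key_field`.

    A driver would send this against the admin database (e.g. in pymongo,
    roughly client.admin.command(cmd)). Per Dwight Merriman's first point,
    a key correlated with insertion order (a timestamp, an increasing id)
    keeps new writes clustered at the "end" of the data files; a key like
    user id scatters small documents across pages, which is what made the
    fragmented pages hard to reclaim after migrating chunks off shard0.
    """
    return {"shardCollection": namespace, "key": {key_field: 1}}

cmd = shard_collection_cmd("foursquare.checkins", "uid")
```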

While it’s extremely interesting to hear all these details, and I highly appreciate that Foursquare and 10gen engineers have decided to share this information (nb: I’ve been trying to convince 10gen of this myself), I think there are still a few open questions.

Update: ☞ the Hacker News thread

Original title and link: Foursquare MongoDB Outage Post Mortem (NoSQL databases © myNoSQL)


MongoDB Case Study: From MySQL to MongoDB at GameChanger

Kiril Savino explains in detail the evolution of GameChanger from using MySQL to MongoDB:

It was at about this point that I saw MongoDB hit version 1.1. Here’s what went through my head:

  • I can scrap the whole JSON->dict->Model->SQL->Table overhead
  • My underlying datastore can more exactly mirror my API structure
  • Denormalization is an inherent part of the Document model
  • I’ll get this whole scalability roadmap for free

We never looked back. It took 6 months to frankenstein our way from MySQL onto MongoDB entirely (I don’t believe in big-bang rewrites, so we built cross-DB bridging code, and migrated our data system-by-system across).
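The first two bullets of that list are easy to picture with a minimal sketch (all names here are hypothetical): with a document store, the JSON the API serves is, give or take an id, the document that gets persisted.

```python
import json

def to_document(api_json):
    """With a document store the API payload doubles as the stored record.

    In pymongo this would simply be collection.insert_one(doc), versus
    the relational path: JSON -> dict -> Model -> SQL -> table rows.
    """
    return json.loads(api_json)

doc = to_document('{"game_id": "g1", "score": {"home": 3, "away": 2}}')
```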

There’s a fragment that made me wonder if CouchDB wasn’t an option too:

But at GameChanger we’re tackling this hairy problem: we synchronize data across mobile devices and all kinds of wacky web end-points, and the whole thing is built on a RESTy JSON-based HTTP API. Everything, including our website itself, consumes our API.

Kiril explains in a comment that what ruled out CouchDB was the lack of “secondary indexes or arbitrary querying”. Put differently, they felt uncomfortable having to always go through MapReduce.
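To make the “secondary indexes or arbitrary querying” point concrete: in MongoDB any field can be indexed after the fact and queried ad hoc, while CouchDB (at the time) required a predefined map/reduce view per access path. A sketch of the query spec a driver would send; the field names are hypothetical:

```python
def recent_games_query(team_id, limit=20):
    """Ad-hoc query: filter on one field, sort on another, no view needed.

    In pymongo this would run roughly as:
        games.create_index([("team_id", 1), ("started_at", -1)])
        games.find(spec["filter"]).sort(spec["sort"]).limit(spec["limit"])
    """
    return {
        "filter": {"team_id": team_id},
        "sort": [("started_at", -1)],
        "limit": limit,
    }

spec = recent_games_query("team-42")
```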

Original title and link for this post: MongoDB Case Study: From MySQL to MongoDB at GameChanger (published on the NoSQL blog: myNoSQL)


MongoDB at SourceForge

Mark Ramm talks about why they chose MongoDB for SourceForge, how it compared to other candidate solutions, the problems they encountered and how they fixed them, and the overall lessons learned, then answers audience questions.

From the same InfoQ series:

Original title and link for this post: MongoDB at SourceForge (published on the NoSQL blog: myNoSQL)


Wordnik Reports 9 Billion Records on MongoDB

In a ☞ blog post, Wordnik, a heavy user of MongoDB, announced they are currently storing 9 billion records (nb: I assume that means documents).

Unfortunately a lot of details are missing:

  • what kind of data is stored?
  • why do they need 9 billion live documents?
  • how much of that data is actually accessed on a daily or monthly basis?
  • hardware, infrastructure, deployment details

If someone knows more about these, I’m sure the NoSQL community would be grateful to hear it.

Update: Tony Tam (@fehguy) pointed out a presentation about Wordnik which adds a little more info on the questions above:

Presentation: Scalable Event Analytics with MongoDB & Ruby on Rails

We’ve already seen this kind of analytics MongoDB case study before, when looking at how Eventbrite tracks page views with MongoDB, and also in Hummingbird, a MongoDB-based real-time web traffic visualization tool.

But Jared Rosoff’s presentation contains a series of slides identifying possible issues with each scaling approach:

  • single database
  • master-slave database
  • sharded database
  • key-value stores
  • key-value store with Hadoop for reporting
  • MongoDB

The only part I don’t really understand is how using Hadoop is more complex than scaling MongoDB. Maybe someone could explain?

Meanwhile, Jared Rosoff’s complete slide deck is below.


From MySQL to MongoDB

It took me a while to understand the real reasons behind the decision, but I finally got it: the object-relational impedance mismatch[1]:

Huge amounts of our code (as much as 80%) was dedicated to converting UI Objects to and from Database objects

Video and slides are below (note: the hint above is on slide 18, and at around minute 10 in the video).

This post is part of the MongoDB Case Studies series.

Tekpub: Using both MongoDB and MySQL

You shouldn’t be afraid to use both NoSQL and RDBMS in your projects if they help you address real problems:

We split out the duties of our persistence story cleanly into two camps: reports of things we needed to know to make decisions for the business and data users needed to use our site. Ironically these two different ways of storing data have guided us to do what’s natural: put the application data into a high-read, high-availability environment (MongoDb) - put the historical, reporting data into a system that is built to answer questions: a relational data store.

The high-read stuff (account info, productions and episode info) is perfect for a “right now” kind of thing like MongoDb. The “what happened yesterday” stuff is perfect for a relational system.

We don’t want to run reports on our live server. You don’t know how long they will take - nor what indexing they will work over (limiting the site’s perf). Bad. Bad-bad.

A much better case study than this one!

This post is part of the MongoDB Case Studies series.


Tracking page views with MongoDB

After weighing MongoDB against three other approaches: Google Analytics, sharding the existing MySQL table, and an ETL process (nb: log processing), Eventbrite decided to go the MongoDB way, dismissing the rest:

  • Google Analytics
    • Time-consuming to set up and test
    • Migrating existing page-view data is tricky
    • Not real time
  • Shard the MySQL table
    • Requires downtime to make schema changes
    • Introduces routing at the code level
    • Would still be prone to row locks if we outgrew # of shards
  • ETL process (aka, write to log file and have a process that aggregates and periodically writes to the database)
    • No data integrity
    • Not real time
    • Requires management of log files over multiple web servers

While the post doesn’t really detail the reasons why MongoDB would be able to solve this problem, there are a couple of MongoDB features described in this ☞ post that make it a good fit for this scenario.

Basically, the solution is based on a combination of the following MongoDB features:

  • the upsert option and the $inc operator
  • write operations that return immediately (without waiting for the server to confirm the write)
  • the $inc operator as a very good replacement for a read-modify-write cycle

As a side note, remember there’s another MongoDB-based tool, Hummingbird, that provides real-time web traffic visualization.
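The counter pattern those features add up to can be sketched as follows; the collection and field names are illustrative:

```python
def page_view_op(path):
    """Filter and update documents for recording one page view.

    In pymongo this would run as views.update_one(flt, update, upsert=True):
    the upsert creates the counter document on the first view of a page,
    and $inc then bumps the count atomically on the server, replacing the
    slower read-modify-write round trip entirely.
    """
    flt = {"_id": path}
    update = {"$inc": {"views": 1}}
    return flt, update

flt, update = page_view_op("/events/42")
```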

This post is part of the MongoDB case studies series.


Replacing MySQL with MongoDB

First, it is not a replacement; MongoDB is being used alongside MySQL:

Completely replacing MySQL with MongoDB hasn’t ever been on the table.

Second, the article doesn’t provide real arguments for using MongoDB, and it sounds like coolness-factor-based adoption:

I looked over at our Chief Architect, Blake Carlson, said “Hey I wanna use MongoDB for this.” His reply was “Cool.”


We had a new application coming online (Vendor Connect) that had a certain set of requirements — stat tracking with “documenty” data — which seemed like a good candidate for a document store.


I haven’t done any performance testing, but it’s hard to compare apples to apples.

Third, the version mentioned (i.e. 1.0.0) is so old that I wonder what has happened since then. We’ve already covered use cases in which scaling MongoDB was not as easy as some expect.

This post is part of the MongoDB Case Studies series.