


MongoDB Auto-sharding and Foursquare Downtime

Foursquare is one of the companies we’ve previously mentioned as being excited about the MongoDB geo support added in version 1.4. It looks like they’ve also been using the MongoDB auto-sharding feature included in the 1.6 series, but it didn’t work so well for them:

Starting around 11:00am EST yesterday, we noticed that one of these shards was performing poorly because a disproportionate share of check-ins were being written to it. For the next hour and a half, until about 12:30pm, we tried various measures to ensure a proper load balance. None of these things worked. As a next step, we introduced a new shard, intending to move some of the data from the overloaded shard to this new one.
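For context, adding a shard to a 1.6-era cluster is done through admin commands issued against the mongos router. A rough mongo-shell sketch is below; the host name, namespace, and shard key are made up for illustration and are not from the Foursquare incident:

```javascript
// Run from the mongo shell connected to a mongos router.
use admin

// Register a new shard with the cluster.
db.runCommand({ addshard: "shard3.example.com:27017" })

// If the balancer lags, chunks can also be moved by hand:
db.runCommand({ moveChunk: "foursquare.checkins",
                find: { userid: 500000 },
                to: "shard3" })
```

In principle the balancer should then migrate chunks to the new shard automatically; as the quote above shows, doing this against an already overloaded shard is exactly where things went wrong.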

There are quite a few open questions about what happened with Foursquare’s sharded MongoDB setup:

  • why did adding a new shard bring down the whole system?
  • what is the best approach to migrating live data?
  • why was a shard reindexing needed?
  • was there any data loss or corruption?
  • are there ways to predict possible overloads on specific shards?

Once things calm down, I hope to get some answers from Foursquare and 10gen (if you have contacts, please hook me up).

Original title and link: MongoDB Auto-sharding and Foursquare Downtime (NoSQL databases © myNoSQL)