MongoDB Auto-sharding and Foursquare Downtime
Foursquare is one of the companies that we’ve previously mentioned as being excited about MongoDB geo support added with version 1.4. It looks like they’ve also been the users of MongoDB auto-sharding feature included in the 1.6 series, but that didn’t work so well for them:
Starting around 11:00am EST yesterday, we noticed that one of these shards was performing poorly because a disproportionate share of check-ins were being written to it. For the next hour and a half, until about 12:30pm, we tried various measures to ensure a proper load balance. None of these things worked. As a next step, we introduced a new shard, intending to move some of the data from the overloaded shard to this new one.
There are quite a few questions about what happened there with Foursquare’s sharded MongoDB:
- why adding a new shard brought down the whole system?
- what is the best approach of migrating live data?
- why a shard reindexing was needed?
- was there any data loss or data corruption?
- are there ways to predict possible overloads on specific shards?
Once things will calm down I hope to be able to find some answers from Foursquare and 10gen (if you have contacts please hook me up).
Original title and link: MongoDB Auto-sharding and Foursquare Downtime (NoSQL databases © myNoSQL)
via: http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/