We are finally getting some details about how and why MongoDB brought down Foursquare:
On Monday morning, the data on one shard (we’ll call it shard0) finally grew to about 67GB, surpassing the 66GB of RAM on the hosting machine. Whenever data size grows beyond physical RAM, it becomes necessary to read and write to disk, which is orders of magnitude slower than reading and writing RAM. Thus, certain queries started to become very slow, and this caused a backlog that brought the site down.
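To put "orders of magnitude" in perspective, here is a back-of-the-envelope model; the latency figures are rough, illustrative assumptions of mine, not Foursquare's measurements:

```python
# Rough, illustrative latency assumptions (not measured values):
RAM_NS = 100            # ~100ns for a memory access
DISK_NS = 10_000_000    # ~10ms for a random read on a spinning disk

def avg_access_ns(ram_hit_rate):
    """Average access latency when a fraction of reads fall through to disk."""
    return ram_hit_rate * RAM_NS + (1 - ram_hit_rate) * DISK_NS

all_in_ram = avg_access_ns(1.00)     # 100ns
slightly_over = avg_access_ns(0.98)  # ~200,098ns: ~2,000x slower on average

print(all_in_ram, slightly_over)
```

Once shard0's data exceeded RAM, even a 2% miss rate makes the average access roughly three orders of magnitude slower, which is enough to back up the query queue.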
- were writes configured to use MongoDB's default fire-and-forget behavior? If so, writes going to disk would matter much less for request processing
- if replicas were available, were reads distributed among them?
- if no replicas were available, what would be the quickest approach to bring up read-only replicas, and how quick would it be?
We first attempted to fix the problem by adding a third shard. We brought the third shard up and started migrating chunks. Queries were now being distributed to all three shards, but shard0 continued to hit disk very heavily. When this failed to correct itself, we ultimately discovered that the problem was due to data fragmentation on shard0. In essence, although we had moved 5% of the data from shard0 to the new third shard, the data files, in their fragmented state, still needed the same amount of RAM.
- why could the 3rd shard accommodate only 5% of the data?
This can be explained by the fact that Foursquare check-in documents are small (around 300 bytes each), so many of them can fit on a 4KB page. Removing 5% of these just made each page a little more sparse, rather than removing pages altogether.
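The arithmetic behind this explanation can be checked directly. Assuming ~300-byte documents on 4KB pages, removing 5% of documents at random almost never empties a whole page (a sketch, with my own illustrative numbers):

```python
PAGE_BYTES = 4096
DOC_BYTES = 300
docs_per_page = PAGE_BYTES // DOC_BYTES  # ~13 documents fit on a page

# Probability that all ~13 documents on a given page are among the 5% removed:
p_page_freed = 0.05 ** docs_per_page     # ~1.2e-17, i.e. effectively never

# So after the migration each page holds ~12.35 docs instead of ~13,
# but the number of pages (and therefore the RAM needed) is unchanged.
print(docs_per_page, p_page_freed)
```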
- I might be wrong, but it sounds like the problem here is that chunks of data do not use contiguous space. Considering that MongoDB is supposed to work with documents of any size, what solutions are planned for addressing this issue?
There has been a separate discussion, in which Dwight Merriman (10gen) provided ☞ more details:
- If your shard key has any correlation to insertion order, I think you are ok.
- If you add new shards very early s.t. the source shard doesn’t have high load, I think you are ok.
- If your objects are fairly large (say 4KB+), I think you are ok.
- If the above don’t hold, you will need the defrag enhancement which we will do asap.
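The first point can be illustrated with a toy model (the page and chunk sizes below are made-up assumptions): if the shard key correlates with insertion order, a migrated chunk occupies a contiguous run of pages that can actually be released; with an uncorrelated key, the same 5% of documents is scattered across nearly every page:

```python
import random

random.seed(1)
DOCS_PER_PAGE = 13             # ~300-byte docs on 4KB pages
TOTAL_DOCS = 13_000            # 1,000 pages of data (illustrative)
CHUNK_DOCS = TOTAL_DOCS // 20  # migrate 5% of the documents

def pages_touched(doc_positions):
    """Set of pages holding the given documents (position = insertion order)."""
    return {pos // DOCS_PER_PAGE for pos in doc_positions}

# Shard key correlated with insertion order: the chunk is a contiguous run.
contiguous = len(pages_touched(range(CHUNK_DOCS)))  # 50 pages, all freed

# Uncorrelated shard key: the chunk's documents are scattered on disk.
scattered = len(pages_touched(random.sample(range(TOTAL_DOCS), CHUNK_DOCS)))

print(contiguous, scattered)  # the scattered chunk touches ~480 of 1,000 pages
```

Moving the contiguous chunk frees 50 whole pages; moving the scattered one only thins out pages that still have to stay resident, which is exactly the fragmentation shard0 hit.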
The first point above seems to confirm my last comment on this subject: sharding keys.
Since repairing shard0 and adding a third shard, we’ve set up even more shards, and now the check-in data is evenly distributed and there is a good deal of extra capacity.
Given that the sharding key is user ids, how exactly can you guarantee even distribution? And as long as bringing up more shards doesn’t immediately address the issue of automatic balancing, wouldn’t it be better to shard based on data that grows continuously and can show unpredictable evolution?
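My worry about the user-id shard key can be made concrete with a small simulation (the distribution and numbers are my assumptions, for illustration only): even if user ids split evenly across shards, heavy-tailed per-user check-in activity can leave the shards with noticeably different data volumes:

```python
import random

random.seed(0)
NUM_USERS = 10_000

def shard_for(uid):
    """Range-partition user ids: lower half to shard0, upper half to shard1."""
    return 0 if uid < NUM_USERS // 2 else 1

shard_docs = [0, 0]
for uid in range(NUM_USERS):
    # Assume heavy-tailed activity: a few users check in a lot (Pareto tail).
    checkins = int(random.paretovariate(1.2))
    shard_docs[shard_for(uid)] += checkins

print(shard_docs)  # equal user counts per shard, but unequal document counts
```

Each shard holds exactly 5,000 users, yet the document counts diverge with the activity skew, and nothing about the key itself predicts which half of the id space will grow faster.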
While it’s extremely interesting to hear all these details, and I highly appreciate that Foursquare and 10gen engineers have decided to share this information (nb: I’ve been trying to convince 10gen of this myself), I think there are still a few open questions.
Update: ☞ the Hacker News thread
Original title and link: Foursquare MongoDB Outage Post Mortem (NoSQL databases © myNoSQL)