While some may learn a few new things from the fine details of the outage, or simply have them confirmed, what caught my attention in Heroku's postmortem analysis are the conclusions:
- higher sensitivity and more aggressive monitoring on a variety of metrics
- improved early warning systems
- better containment
- improved flow controls, both manual and automatic
- expanding simulations of unusual load conditions in our staging environment
None of these is specific to a particular storage engine or NoSQL database. But they all reflect the reality of operating at large scale, where even the most operationally friendly solutions (think of the Dynamo-inspired NoSQL databases) cannot and should not be left unmonitored, unsupervised, or without clear recovery strategies and processes in place.
In the NoSQL world, one of the most widely covered outages was the MongoDB outage at Foursquare. In case you don't remember the details, most of the circumstances that led to that event could have been prevented by having:
- better monitoring
- early warnings
- better operational procedures
Don't these two lists look very much alike?
Original title and link: What Can Be Learned From Heroku Outage Postmortem (©myNoSQL)