Dropbox’s service was disrupted over the weekend by a faulty upgrade procedure that created duplicated master/slave MySQL setups:
When running infrastructure at large scale, the standard practice of running
multiple slaves provides redundancy. However, should those slaves fail, the
only option is to restore from backup. The standard tool used to recover
MySQL data from backups is slow when dealing with large data sets.
To speed up our recovery, we developed a tool that parallelizes the replay
of binary logs. This enables much faster recovery from large MySQL backups.
We plan to open source this tool so others can benefit from what we’ve
- A backup and restore strategy that is not continuously tested and timed is of (almost) no value for services that require high availability.
- This is a good example of why highly available services are choosing solutions with no special nodes, such as a single master that every slave depends on.
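To make the first point concrete, here is a minimal sketch of what "tested and timed" could look like: a drill that runs a restore, measures how long it takes, and checks the result against a recovery-time target. Everything in it is an assumption for illustration: `run_restore_drill`, `DrillResult`, `fake_restore`, and the target values are hypothetical names, and none of this reflects Dropbox's tooling. A real drill would restore the latest backup onto a scratch server and verify the data.

```python
import time
from typing import Callable, NamedTuple


class DrillResult(NamedTuple):
    succeeded: bool       # did the restore report success?
    seconds: float        # wall-clock duration of the attempt
    within_target: bool   # succeeded AND finished within the target


def run_restore_drill(restore: Callable[[], bool], target_seconds: float) -> DrillResult:
    """Run one restore attempt, time it, and compare it to the recovery-time target."""
    start = time.perf_counter()
    try:
        succeeded = bool(restore())
    except Exception:
        # A restore that raises counts as a failed drill, not a crashed one.
        succeeded = False
    elapsed = time.perf_counter() - start
    return DrillResult(succeeded, elapsed, succeeded and elapsed <= target_seconds)


def fake_restore() -> bool:
    # Hypothetical placeholder: stands in for loading a backup and verifying it.
    time.sleep(0.01)
    return True


if __name__ == "__main__":
    result = run_restore_drill(fake_restore, target_seconds=5.0)
    print(result)
```

Run on a schedule, a drill like this turns "we have backups" into two numbers you can track: whether the restore works, and whether it still fits the downtime you can afford.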
Original title and link: MySQL backup improvements based on Dropbox’s recent outage