For its billing system, Twilio uses a combination of Redis, for temporary state tracking, and a relational database as the primary source of truth. They recently had a critical incident with this system and posted a detailed timeline and analysis.
Redis was involved in the malfunction, and Salvatore Sanfilippo wrote about a couple of things that could be improved from a Redis perspective.
We experienced a loss of network connectivity between all of our billing redis-slaves and our redis-master. This caused all redis-slaves to reconnect and request full synchronization with the master at the same time.
Redis 2.6 always requires a full resynchronization between a master and a slave after a connection issue between the two. Redis 2.8 addresses this problem, but it is currently a release candidate, so Twilio had no way to use the new feature, called “partial resynchronization”.
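For context, partial resynchronization in 2.8 works off a replication backlog kept in memory on the master; a sketch of the relevant redis.conf directives (the sizes here are illustrative defaults, not Twilio’s actual settings):

```
# In-memory backlog the master keeps so a briefly disconnected
# slave can resume with a partial resync (PSYNC) instead of
# forcing a full RDB transfer.
repl-backlog-size 64mb

# Seconds to retain the backlog after the last slave disconnects;
# 0 means never release it.
repl-backlog-ttl 3600
```

With a backlog sized to cover the duration of a typical network blip, a reconnecting slave only replays the missed portion of the stream.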
Observing extreme load on the host, the redis process on redis-master was misdiagnosed as requiring a restart to recover. This caused redis-master to read an incorrect configuration file, which in turn caused Redis to attempt to recover from a non-existent AOF file, instead of the binary snapshot. As a result of that failed recovery, redis-master dropped all balance data. In addition to forcing recovery from a non-existent AOF, an incorrect configuration also caused redis-master to boot as a slave of itself, putting it in read-only mode and preventing the billing system from updating account balances.
Fortunately Redis 2.8 provides a better workflow for on-the-fly configuration changes […] Basically the config rewriting feature will make sure to change the currently used configuration file, in order to contain the configuration changes operated by CONFIG SET, which is definitely safer.
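In practice the 2.8 workflow looks like this (a sketch of standard redis-cli usage; the hostname is hypothetical):

```
# Change a setting on the running instance...
redis-cli -h redis-master CONFIG SET appendonly yes

# ...then persist it back into the redis.conf the server was
# started with, so a later restart sees the same state.
redis-cli -h redis-master CONFIG REWRITE
```

Had the on-disk configuration always matched the running one, the restart would not have pointed the server at a non-existent AOF file.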
Services relying on the redis-master began to fail due to the load generated by the slave synchronization.
This seems to be an unconfirmed hypothesis, and Salvatore ran a test to double-check it:
Actually for the way Redis works a single slave or multiple slaves trying to resynchronize should not make a huge difference, since just a single RDB is created. […] However what is true is that Redis may use additional memory with many slaves attaching at the same time, since there are multiple output buffers to “record” to transfer when the RDB file is ready. This is true especially in the case of replication over WAN.
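The extra memory Salvatore mentions can be capped via the slave client-output-buffer limits in redis.conf; a sketch (the limits shown are illustrative):

```
# If a slave's pending replication stream exceeds 256mb, or stays
# above 64mb for 60 seconds, the master drops that slave's
# connection rather than growing the buffer unboundedly.
client-output-buffer-limit slave 256mb 64mb 60
```

This trades a dropped (and later retried) synchronization for bounded memory use on the master, which matters most when many slaves resynchronize at once over a WAN.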
- I’m wondering if chained replication, where some of the slaves would act as masters for other slaves, could have helped alleviate the problem and avoided the restart of the main master.
- Another idea could be to have a primary slave (that could also serve as a hot standby) and have all the other slaves connect to it. The pressure on the master would be relieved and there would always be a hot standby. The downside might be the replication lag accumulated by the slaves at the end of the chain.
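Redis supports such sub-slave chains out of the box; the second idea could be wired up roughly like this (hostnames are hypothetical):

```
# The primary slave replicates from the master and doubles
# as a hot standby.
redis-cli -h slave-1 SLAVEOF redis-master 6379

# The remaining slaves attach to the primary slave instead of
# the master, so a mass reconnect hits slave-1, not the master.
redis-cli -h slave-2 SLAVEOF slave-1 6379
redis-cli -h slave-3 SLAVEOF slave-1 6379
```

Under this topology a simultaneous resynchronization of the leaf slaves loads only the intermediate node, at the cost of the extra hop of replication lag noted above.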
Original title and link: Redis at Twilio and What can be learned from an incident ( ©myNoSQL)