Hadoop Chaos Monkey: The Fault Injection Framework
Do you remember the 5 lessons Netflix learned while using Amazon Web Services, where they talked about the Chaos Monkey? (Judging by how much Netflix has shared about their experience in the cloud, including Amazon SimpleDB, I’d say these 5 are only the tip of the iceberg.)
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
Hadoop provides a similar framework, the Fault Injection Framework:
The idea of fault injection is fairly simple: it is an infusion of errors and exceptions into an application’s logic to achieve a higher coverage and fault tolerance of the system. Different implementations of this idea are available today. Hadoop’s FI framework is built on top of Aspect Oriented Paradigm (AOP) implemented by AspectJ toolkit.
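To make the idea concrete, here is a minimal sketch of fault injection in plain Java. It is not Hadoop's implementation: Hadoop weaves AspectJ aspects into its classes, while this sketch swaps in a JDK dynamic proxy so it runs without a weaver. The names `BlockStore`, `InMemoryBlockStore`, `FaultInjector`, and the `probability` parameter are all hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical interface standing in for the application logic under test.
interface BlockStore {
    void write(String blockId, String data) throws IOException;
    String read(String blockId) throws IOException;
}

// Trivial real implementation: keeps blocks in a map.
class InMemoryBlockStore implements BlockStore {
    private final Map<String, String> blocks = new HashMap<>();

    public void write(String blockId, String data) {
        blocks.put(blockId, data);
    }

    public String read(String blockId) throws IOException {
        String data = blocks.get(blockId);
        if (data == null) {
            throw new IOException("missing block " + blockId);
        }
        return data;
    }
}

// Hypothetical injector: wraps a target so that each interface call fails
// with an IOException with the given probability (an assumption of this
// sketch; it is not how Hadoop configures its faults).
final class FaultInjector {
    private FaultInjector() {}

    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface, double probability, Random random) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Leave toString/hashCode/equals alone; they do not declare IOException.
            boolean objectMethod = method.getDeclaringClass() == Object.class;
            if (!objectMethod && random.nextDouble() < probability) {
                throw new IOException("injected fault in " + method.getName());
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                // Rethrow the target's own exception rather than the reflection wrapper.
                throw e.getCause();
            }
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }
}

class FaultInjectionDemo {
    public static void main(String[] args) {
        // Roughly 30% of calls should fail; the caller has to cope with that.
        BlockStore store = FaultInjector.wrap(
                new InMemoryBlockStore(), BlockStore.class, 0.3, new Random());
        for (int i = 0; i < 5; i++) {
            try {
                store.write("block-" + i, "payload");
                System.out.println("write " + i + " succeeded");
            } catch (IOException e) {
                System.out.println("write " + i + " failed: " + e.getMessage());
            }
        }
    }
}
```

The point of the wrapper is the same as in the aspect-based approach: the production class stays untouched, and the faults are layered on from the outside, so tests can exercise error-handling paths that rarely trigger on their own.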
As a side note, this is one of the neatest uses of AspectJ I’ve read about.
Update: Abhijit Belapurkar points out that fault injection using AOP was part of the Recovery Oriented Computing research at Stanford/UCB many years ago: JAGR: An Autonomous Self-Recovering Application Server.
Original title and link: Hadoop Chaos Monkey: The Fault Injection Framework (©myNoSQL)
