Why would anyone use Netflix’s Chaos Monkey?
Failures happen, and they inevitably happen when least desired or expected. If your application can’t tolerate an instance failure, would you rather find out by being paged at 3am or when you’re in the office and have had your morning coffee? Even if you are confident that your architecture can tolerate an instance failure, are you sure it will still be able to next week? How about next month? Software is complex and dynamic, and that “simple fix” you put in place last week could have undesired consequences. Do your traffic load balancers correctly detect and route requests around instances that go offline? Can you reliably rebuild your instances? Perhaps an engineer “quick patched” an instance last week and forgot to commit the changes to your source repository?
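To make the load balancer question concrete, here is a minimal toy sketch of the idea in Python. It is not Chaos Monkey's code or API: the `Instance`, `LoadBalancer`, `unleash_monkey`, and `run_experiment` names are all hypothetical, and the "instances" are plain in-memory objects rather than real servers. It only illustrates the experiment: kill one instance at random, then check that requests are still served by the survivors.

```python
import random


class Instance:
    """Hypothetical stand-in for a server instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is offline")
        return f"{self.name} handled {request}"


class LoadBalancer:
    """Toy round-robin balancer that skips instances failing a health check."""

    def __init__(self, instances):
        self.instances = instances
        self._next = 0

    def route(self, request):
        # Try each instance at most once, skipping any that are offline.
        for _ in range(len(self.instances)):
            instance = self.instances[self._next % len(self.instances)]
            self._next += 1
            if instance.alive:
                return instance.handle(request)
        raise RuntimeError("no healthy instances available")


def unleash_monkey(instances, rng):
    """Terminate one randomly chosen live instance and return it."""
    victim = rng.choice([i for i in instances if i.alive])
    victim.alive = False
    return victim


def run_experiment(seed=0):
    """Kill one of three instances, then send six requests through the balancer."""
    rng = random.Random(seed)
    pool = [Instance(f"web-{n}") for n in range(3)]
    balancer = LoadBalancer(pool)
    victim = unleash_monkey(pool, rng)
    responses = [balancer.route(f"req-{n}") for n in range(6)]
    return victim, responses


if __name__ == "__main__":
    victim, responses = run_experiment()
    print("terminated:", victim.name)
    for line in responses:
        print(line)
```

If the balancer's health check were missing, the same experiment would surface a `ConnectionError` during office hours instead of at 3am, which is the point of failing on purpose.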
GitHub repository is here
Original title and link: The Best Defense Against Major Unexpected Failures Is to Fail Often: Netflix Open Sources Chaos Monkey (©myNoSQL)