In any sufficiently complex software system, failure is inevitable. Chaos engineering, also known as chaos testing, accepts this and provides a method and toolset for deliberately introducing failures and outages into a system. The approach was pioneered by Netflix, which created its chaos engineering process in 2010 and described it in detail in 2014.
Chaos Engineering and the Simian Army
In chaos engineering, a set of automated processes, known collectively as a “Simian Army”, is used to introduce various types of system failure. The colorful name evokes the image of a group of monkeys wreaking unexpected havoc in a data center, an event for which engineers must prepare as best they can.
Knowing for sure how a complex system will react to failures is practically impossible. The only reliable way to learn the results of failures – especially catastrophic or cascading failures – is to have them happen. Creating those failures yourself via chaos engineering – in a controlled way and at a time of your choosing – is therefore a valuable learning exercise.
Understanding the failure modes of your system is particularly important if you have high expectations around reliability, or if you are operating in a less reliable environment – on top of cloud infrastructure, for example. However, injecting chaos requires a certain level of preparedness. You might want to try it out in a pre-production environment first!
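To make the idea of controlled failure injection concrete, here is a minimal Python sketch. It is not how Netflix's tools work; every name in it (FaultInjector, InjectedFailure, fetch_profile, the "staging" environment label) is hypothetical and chosen only for illustration. It assumes a service call can be wrapped so that it fails at a configurable rate, and that injection is switched off outside pre-production environments.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real outage when a fault is injected (hypothetical)."""


class FaultInjector:
    """Wraps calls and makes a configurable fraction of them fail.

    Injection is active only when `environment` is listed in
    `allowed_environments`, so production traffic is left alone by default.
    """

    def __init__(self, failure_rate, environment,
                 allowed_environments=("staging",), seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        self.enabled = environment in allowed_environments
        self._rng = random.Random(seed)  # a seed makes an experiment repeatable
        self.injected = 0                # count of faults injected so far

    def call(self, func, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 0.0 never
        # injects a fault and a rate of 1.0 always does.
        if self.enabled and self._rng.random() < self.failure_rate:
            self.injected += 1
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)


def fetch_profile(user_id):
    """Stand-in for a real downstream call (hypothetical)."""
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(injector, user_id):
    """The behavior under test: degrade gracefully when the call fails."""
    try:
        return injector.call(fetch_profile, user_id)
    except InjectedFailure:
        return {"id": user_id, "name": None}
```

Running fetch_profile_with_fallback against an injector with a high failure rate in a pre-production environment shows whether the fallback path actually works, before a real outage forces the question.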