Did you know that passenger jets can fly without all their engines? They’re designed with resilience in mind and can keep flying quite safely if an engine fails. As software engineers we also want to make resilient systems, avoiding single points of failure that could take down the entire system. But how can we be sure that our software systems do in fact handle failure gracefully? Airplane engine redundancy is tested in a real live plane but in a safe, controlled way. We can do the same with our software systems, using a practice called chaos engineering.
Chaos engineering (sometimes also called chaos testing) involves artificially injecting failure into our live, production systems, as a way to proactively validate that those systems handle a degraded environment.
Enter the Chaos Monkey
The concept of chaos engineering originated at Netflix. They were migrating from data centers to less reliable cloud infrastructure and wanted to ensure that their systems could cope with this less stable environment. To test their resilience to failure, they created an automated process called a Chaos Monkey. This automated process would use cloud APIs to turn off production server instances at random – the software equivalent of letting a monkey wander around your data center randomly unplugging cables and urinating in power supplies. Over the years Netflix has added other agents of chaos to their “Simian Army”, which inject other types of failure at various scales. This goes all the way up to Chaos Kong, a process that simulates an entire AWS region going down.
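To make the idea concrete, here is a minimal sketch of what such a process might look like. It is not Netflix's implementation: `FakeCloud` and `unleash_monkey` are hypothetical names, and the in-memory class stands in for a real cloud provider's API so the example can run anywhere.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider's API."""
    running: set = field(default_factory=set)

    def terminate(self, instance_id: str) -> None:
        # A real monkey would call the provider's terminate-instance API here.
        self.running.discard(instance_id)


def unleash_monkey(cloud: FakeCloud, rng: random.Random):
    """Terminate one running instance chosen at random and return its id."""
    if not cloud.running:
        return None  # nothing left to break
    # Sort first so a seeded rng picks the same victim on every run.
    victim = rng.choice(sorted(cloud.running))
    cloud.terminate(victim)
    return victim


cloud = FakeCloud(running={"web-1", "web-2", "web-3"})
victim = unleash_monkey(cloud, random.Random())
print(f"terminated {victim}, still running: {sorted(cloud.running)}")
```

Passing the random number generator in as a parameter is a deliberate choice in this sketch: it lets a test seed the monkey and replay the same "random" failure.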
Chaos Engineering is About Embracing Failure
Rather than attempting to prevent any failures from occurring – a fool’s errand in any complex system – the goal with chaos engineering is instead to improve the resilience of these systems in the face of inevitable failure. What’s worse than a plane’s engine catching fire? An inability to handle that failure.
Chaos testing can be used to verify that specific failure scenarios are handled correctly – a network partition between database nodes, for example. It can also be used to discover new, unknown failure modes. Teams can also use chaos engineering to rehearse their response to failure, similar to a fire drill that’s planned and announced ahead of time, and run during business hours.
While these artificial failures are typically injected into production systems, the injection is done in a controlled, relatively safe way. Aerospace engineers wouldn’t literally set fire to a plane engine – they’d power it off on a test flight, and be ready to power it back on in case something goes wrong. In the same way, chaos engineering is typically done at scheduled times (usually during core working hours for the teams that operate the systems being exercised), and the simulated chaos that’s introduced can be reverted if things start going sideways.
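One way to encode those two safety rules (only during core hours, always revertible) is sketched below. The names `in_core_hours` and `chaos_experiment`, and the 10:00 to 16:00 weekday window, are assumptions made for illustration, not an established convention.

```python
from contextlib import contextmanager
from datetime import datetime


def in_core_hours(now: datetime, start_hour: int = 10, end_hour: int = 16) -> bool:
    """True on weekdays from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


@contextmanager
def chaos_experiment(inject, revert, now: datetime):
    """Inject a failure only during core hours, and always revert it afterwards."""
    if not in_core_hours(now):
        raise RuntimeError("outside core hours; refusing to inject failure")
    inject()
    try:
        yield
    finally:
        revert()  # runs even if the body raises


# Hypothetical experiment: "power off" an engine, observe, then power it back on.
state = {"engine": "on"}
monday_noon = datetime(2024, 1, 8, 12, 0)
with chaos_experiment(
    inject=lambda: state.update(engine="off"),
    revert=lambda: state.update(engine="on"),
    now=monday_noon,
):
    print("during experiment:", state["engine"])
print("after experiment:", state["engine"])
```

The `finally` clause is the software version of being ready to power the engine back on: the revert step runs whether the experiment finishes cleanly or something goes sideways partway through.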
There are a lot of failure modes that we might want to simulate: hard disks failing (or filling up), a server going offline, or a network connection failing. One of the most reliable ways to bring a system to its knees is to inject packet loss or artificial latency: an API call which responds at a rate of 1 byte a second will usually cause much more mayhem than an API call which doesn’t respond at all.
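Latency is also one of the easier faults to inject in application code. The sketch below wraps a function so that every call is delayed. `with_latency` and `fetch_profile` are hypothetical names, and real chaos tools usually inject latency at the network layer rather than inside the application.

```python
import functools
import time


def with_latency(delay_seconds: float, sleep=time.sleep):
    """Decorator that delays every call by delay_seconds before running it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_seconds)  # the injected fault
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_latency(0.2)
def fetch_profile(user_id: int) -> dict:
    """Hypothetical API call that normally returns immediately."""
    return {"id": user_id}


started = time.monotonic()
profile = fetch_profile(7)
print(profile, f"took {time.monotonic() - started:.2f}s")
```

The `sleep` parameter exists so that a test can swap in a fake and check the injected delay without actually waiting.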
Knowing that aerospace engineers think engine failure is a possibility, and something that they’ve planned for, and tested against, makes me more comfortable in getting on a plane, not less. In the same way, engineers who apply chaos engineering create software that is more robust and resilient in the face of the inevitable failure which befalls any complex system.
Learn More About Modern Software Engineering, Testing, and Deployment
Interested in more ideas on reducing risk and accelerating change? Why not:
- Explore the chaos engineering ecosystem
- Consider breaking up with your staging environment
- Restore your team’s work/life balance by killing release nights
- Learn about Gene Kim’s 5 Ideals for Building Software
- And finally, check out the state of feature delivery in 2020