A Simple Guide to Chaos Engineering

May 19, 2020

Did you know that passenger jets can fly without all their engines? They’re designed with resilience in mind and can keep flying quite safely if an engine fails. As software engineers, we also want to build resilient systems, avoiding single points of failure that could take down the entire system. But how can we be sure that our software systems do in fact handle failure gracefully? Airplane engine redundancy is tested in a real, live plane, but in a safe, controlled way. We can do the same with our software systems, using a practice called chaos engineering.

Chaos engineering (sometimes also called chaos testing) involves artificially injecting failure into our live, production systems, as a way to proactively validate that those systems handle a degraded environment.

Enter the Chaos Monkey

The concept of chaos engineering originated at Netflix. They were migrating from data centers to less reliable cloud infrastructure and wanted to ensure that their systems could cope with this less stable environment. To test their resilience to failure, they created an automated process called Chaos Monkey. This process would use cloud APIs to turn off production server instances at random: the software equivalent of letting a monkey wander around your data center randomly unplugging cables and urinating in power supplies. Over the years Netflix has added other agents of chaos to their “Simian Army”, which inject other types of failure at various scales. This goes all the way up to Chaos Kong, a process that simulates an entire AWS region going down.
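To make the idea concrete, here is a minimal sketch of a Chaos Monkey-style process. It is not Netflix's implementation: `FakeCloud` is a hypothetical in-memory stand-in for a real cloud provider API, and the `min_healthy` guard is an assumption added for illustration, so that the monkey never takes out the last instance in a group.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud provider API."""
    groups: Dict[str, List[str]] = field(default_factory=dict)
    terminated: List[str] = field(default_factory=list)

    def list_instances(self, group: str) -> List[str]:
        # Return a copy so callers cannot mutate our state by accident.
        return list(self.groups.get(group, []))

    def terminate(self, group: str, instance_id: str) -> None:
        self.groups[group].remove(instance_id)
        self.terminated.append(instance_id)


def unleash_monkey(cloud: FakeCloud, group: str, rng: random.Random,
                   min_healthy: int = 1) -> Optional[str]:
    """Terminate one randomly chosen instance in `group`.

    Returns the terminated instance id, or None if terminating anything
    would leave fewer than `min_healthy` instances (an assumed safety rule).
    """
    instances = cloud.list_instances(group)
    if len(instances) <= min_healthy:
        return None
    victim = rng.choice(instances)
    cloud.terminate(group, victim)
    return victim
```

Calling `unleash_monkey(cloud, "web", random.Random())` on a group of three instances leaves two running. Passing the random number generator in explicitly means a run can be seeded and replayed when you want to reproduce what the monkey did.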

Chaos Engineering is About Embracing Failure

Rather than attempting to prevent any failures from occurring — a fool’s errand in any complex system — the goal with chaos engineering is instead to improve the resilience of these systems in the face of inevitable failure. What’s worse than a plane’s engine catching fire? An inability to handle that failure.

Chaos testing can be used to verify that specific failure scenarios are handled correctly — a network partition between database nodes, for example. It can also be used to discover new, unknown failure modes. Teams can also use chaos engineering to rehearse their response to failure, similar to a fire drill that’s planned and announced ahead of time, and run during business hours.
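As an illustration of verifying one specific failure scenario, the sketch below marks a database replica as unavailable and checks that reads still succeed. The `Replica` class and `read_with_failover` function are hypothetical names invented for this example; a real experiment would target your actual data store.

```python
class ReplicaDown(Exception):
    """Raised when a read hits a replica that has been taken offline."""


class Replica:
    """Hypothetical database replica that can be switched off for a test."""

    def __init__(self, name, data):
        self.name = name
        self.data = data
        self.healthy = True

    def read(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    last_error = None
    for replica in replicas:
        try:
            return replica.read(key)
        except ReplicaDown as err:
            last_error = err
    raise RuntimeError("all replicas unavailable") from last_error


# The experiment: inject the failure, then check the expected behavior.
replicas = [Replica("primary", {"user:1": "Ada"}),
            Replica("secondary", {"user:1": "Ada"})]
replicas[0].healthy = False  # simulated outage of the primary
assert read_with_failover(replicas, "user:1") == "Ada"
```

The same shape applies at larger scales: state what "handled correctly" means, inject the failure, and check that the system still meets that expectation.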

While these artificial failures are typically injected into production systems, this is done in a controlled, relatively safe way. Aerospace engineers wouldn’t literally set fire to a plane engine; they’d power it off on a test flight, and be ready to power it back on in case something goes wrong. In the same way, chaos engineering is typically done at scheduled times (usually during core working hours for the teams that operate the systems being exercised), and the simulated chaos that’s introduced can be reverted if things start going sideways.
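Those two safeguards, a scheduled window and a guaranteed revert, can be sketched in a few lines. The 10:00 to 16:00 weekday window is an assumed example of "core working hours", and `inject` and `revert` are hypothetical callables that you would supply for your own system.

```python
from contextlib import contextmanager
from datetime import datetime, time


def in_experiment_window(now: datetime,
                         start: time = time(10, 0),
                         end: time = time(16, 0)) -> bool:
    """True on weekdays between `start` and `end` (assumed core hours)."""
    return now.weekday() < 5 and start <= now.time() < end


@contextmanager
def injected_fault(inject, revert):
    """Apply a fault for the duration of the block, then always undo it."""
    inject()
    try:
        yield
    finally:
        # Runs even if the block raises, so an aborted experiment
        # never leaves the fault in place.
        revert()
```

A run might then look like `if in_experiment_window(datetime.now()): with injected_fault(stop_service, start_service): observe()`, where `stop_service`, `start_service`, and `observe` are placeholders for your own tooling. Putting the revert in a `finally` clause is the software version of being ready to power the engine back on.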

There are a lot of failure modes that we might want to simulate: hard disks failing (or filling up), a server going offline, or a network connection failing. One of the most reliable ways to bring a system to its knees is to inject packet loss or artificial latency. An API call that responds at a rate of 1 byte a second will usually cause much more mayhem than an API call that doesn’t respond at all.
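At the application level, latency and dropped calls can be approximated with a thin wrapper around the function that talks to a dependency. The sketch below is one possible approach, not a standard library feature: `chaotic`, `InjectedFailure`, and the parameter names are all invented for this example. Real packet loss is usually injected lower down, at the network layer.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised in place of a real response when the wrapper drops a call."""


def chaotic(latency_s=0.0, drop_rate=0.0, rng=None, sleep=time.sleep):
    """Decorator that delays calls and/or fails a fraction of them.

    latency_s: seconds of artificial delay added before each call.
    drop_rate: probability (0.0 to 1.0) that a call fails outright.
    rng, sleep: injectable so that tests can be fast and repeatable.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < drop_rate:
                raise InjectedFailure(f"dropped call to {func.__name__}")
            if latency_s > 0:
                sleep(latency_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaotic(latency_s=0.05)
def fetch_profile(user_id):
    # Stand-in for a real API call.
    return {"id": user_id}
```

Slow responses tend to be the more revealing experiment, because callers without timeouts hold threads and connections open while they wait, which is how one sluggish dependency spreads to the rest of the system.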

Knowing that aerospace engineers think engine failure is a possibility, and something that they’ve planned for, and tested against, makes me more comfortable in getting on a plane, not less. In the same way, engineers who apply chaos engineering create software that is more robust and resilient in the face of the inevitable failure which befalls any complex system.

