Graceful Degradation: Building Planned Failure Into Your App

Adil Aijaz on September 6, 2016

When we think about app infrastructure planning, we often ask how it will scale. Equally important, though, is how it will fail. You might not be able to prevent failure entirely, but you can mitigate its impact on your customers by building the capacity for failure, or graceful degradation, into your app.

I find that the best writing on software architecture is often by people who are not software architects. In fact, they may never have written a single line of code. This is because the goal of good software is to fit the needs of the customer so well that the software is invisible (quote repurposed from Donald Norman). And that is a goal we, software engineers, share with designers and marketers whose perspective is also relevant to the design of software.

Seth Godin’s article on Graceful Degradation is a great example of such writing. He says:

Most failures aren't shocking surprises. The law of large numbers is too strong for that. Instead, they are predictable events that smart designers plan for, instead of wishing them away as rare unpredictable accidents.

Failure is not the exception in software; it is the rule. That is why ‘Graceful Degradation’ is such a key concept in software architecture.

Graceful degradation in cloud software is a wide-ranging topic encompassing people, code, and infrastructure. I will highlight four ideas that are critical to graceful degradation:

  • Service Oriented Architecture (or its modern variation, microservices): Independent services allow software architects to localize failure to a single service, preventing a failure in non-critical functionality from disrupting critical functionality.

  • Elastic Hardware: Sudden spikes in traffic are a significant source of failures in cloud systems. Hardware that can spin up on demand goes a long way toward handling this failure scenario.

  • Fault Tolerant Communication: If a service stops meeting its SLA, calling services should taper their calls to it and fall back to a backup behavior. This prevents failures from cascading. A good example is Netflix’s Hystrix.

  • Controlled Rollouts: Every new feature should be rolled out to a subset of traffic and gradually ramped up to all customers. Throughout the rollout, its impact on key performance and business metrics should be measured. If a metric degrades, the feature should be ramped down. Thus, problems can be ironed out without risking the experience of every customer. A good example of controlled rollouts is Split.
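
To make the last two ideas concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not how Hystrix or Split are implemented: the names CircuitBreaker and in_rollout, the failure threshold, the cool-down period, and the 100-bucket hashing scheme are all hypothetical. The breaker stops calling a failing service and returns a fallback until a cool-down passes, and the rollout helper deterministically assigns each user to a bucket so that a feature can be ramped up or down by changing a single percentage.

```python
import hashlib
import time


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (illustrative; not Hystrix's API)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            # Circuit is open: skip the failing service during the cool-down.
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: allow one trial call; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result


def in_rollout(feature, user_id, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets.

    Hashing feature and user together keeps each user's bucket stable across
    requests, so raising `percent` only ever adds users to the rollout.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent
```

A caller might wrap a non-critical request as breaker.call(fetch_recommendations, lambda: []) and guard a new code path with in_rollout("new-checkout", user_id, 10), where fetch_recommendations and the feature name are placeholders. Ramping a feature down is then a matter of lowering the percentage, ideally driven by the metrics described above.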

How do you ensure Graceful Degradation in your systems? Have you come across non-engineers whose writings have influenced your thinking about software? Drop us a line!
