Embracing failure with continuous resilience
This post explains the concept of continuous improvement, and how we can build resilient systems in the cloud.
Jun 08, 2023 • 8 Minute Read
Last week I was lucky enough to attend the AWS London Summit. One of the most popular talks on the agenda was Toward Continuous Resilience with Veliswa Boya. Here’s what I learned about "continuous resilience", and how it applies to building in AWS cloud.
Why does resiliency matter?
In the fast-paced world we live in, successful businesses rely on lightning-fast delivery of resilient software to achieve their competitive edge. Survival of the fittest is determined by how quickly you can adapt and get your products and services to market.
But frequent changes to your software configuration also carry the risk of potentially breaking your application, bringing down your service, and ultimately losing customers. Traditionally, IT and engineering teams have spent lots of time and resources attempting to prevent outages like this. When I worked as a SysOps Administrator, and later as a Solutions Architect, this was certainly something I strived for. Nobody wants to be woken up in the middle of the night because of a broken application. Or suffer reputational damage from a badly managed service outage.
How can we use failure to our advantage when building resilient systems?
Instead of attempting to build systems that can never fail, why not focus on designing and building systems that are capable of fast, automated recovery when the inevitable happens? If we expect and anticipate systems to fail, we can create teams that are well-versed in handling failure scenarios. This, in turn, gives us the freedom to make frequent software changes, safe in the knowledge that our applications will recover quickly and our teams know how to handle a crisis.
Introducing continuous resilience
Unfortunately, in IT things do break. Hardware is guaranteed to break eventually. People break things too – we’re only human! Anyone who has worked in a SysOps or DevOps role has a tale to tell. From simply running a command on the wrong server, to my favourite story about a Systems Administrator who removed the SSH software … from the server we could only connect to using SSH! We all make mistakes, but how we learn and grow from them is infinitely more interesting than dwelling on them.
By applying a DevOps mindset to build resiliency into our systems, we can maintain and continually improve resilience as our systems grow and change over time.
In their book, Resilience Engineering in Practice, Erik Hollnagel et al. identify four essential capabilities that need to be developed when working toward continuous resilience:
- Anticipate potential failures
- Monitor your systems effectively
- Respond appropriately to events
- Learn from your failures
Building in AWS: The four essential pillars for enabling continuous resilience
What can we do within our own environments to anticipate potential failures, monitor our systems effectively, respond appropriately to events, and learn from our failures? Let’s examine each of these four capabilities in the context of AWS.
Here are some great practices that will enable you to anticipate failures.
- Introduce code reviews to improve the quality of your code. You can automate code reviews with Amazon CodeGuru. This tool uses machine learning to review and recommend improvements to optimize code quality.
- Design applications using a failure-oriented approach. Imagine what could go wrong, and design systems that are resilient to the scenarios you have identified. For distributed systems with lots of components that talk to each other, use timeouts set to appropriate limits for your application, then configure retries with backoff and jitter in order to randomize the delay between retries.
- Use safe deployment practices. Immutable deployments involve deploying changes to a brand-new environment that is completely separate from production, so there is zero downtime for your production systems. When the new environment is ready for live traffic, use Route 53 to route requests to it; if you need to roll back, simply update Route 53 to point back to the original environment. AWS CodeDeploy is a great tool that can help you automate deployments in this way. Canary deployments are another way to safely roll out your changes. For instance, you can deploy using CodeDeploy, then use Route 53 or an Application Load Balancer to route only a small proportion of customer traffic, say 5%, to your new application, and test in production with a small number of users before switching all traffic to the new environment.
- Limit the impact of failures by reducing the blast radius of an event. So-called cell-based architectures are based on multiple instances of a service that operate in isolation from each other, so that the failure of one cell does not cause the failure of another. Consider microservices-based architectures, serverless systems that rely on multiple invocations of a Lambda function, or even stovepipe-type architectures consisting of load-balancing, compute, and database services operating as a single cell, independently from other cells, as shown in the diagram below.
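The timeout-and-retry guidance above can be sketched in a few lines of Python. This is a minimal illustration of exponential backoff with full jitter; the function name and parameter defaults are my own choices, not from the post:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky operation with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Cap the exponential backoff, then pick a random ("full jitter")
            # delay so that many retrying clients don't stampede in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

The jitter is the important part: without it, every client that failed at the same moment retries at the same moment, recreating the original spike.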
Here are some great monitoring practices that will enable you to identify failures.
- Observe your application using logs, metrics, and traces. This will help you identify failures as they happen, and perform root cause analysis.
- Use AWS monitoring tools. For example, Amazon CloudWatch enables you to monitor and set alerts for errors and warnings in your application logs, and can alarm on point-in-time numeric metrics like CPU utilization. AWS X-Ray is great for troubleshooting distributed applications: it traces requests as they pass through the different components of your system, making it a great tool for understanding where latency and bottlenecks exist.
- Configure alarms. When performing a canary deployment, configure the same alarms and metrics you would use in production to be alerted of any deployment failure in your new canary environment.
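One lightweight way to get alarmable metrics out of application logs is the CloudWatch Embedded Metric Format (EMF): you print a specially structured JSON log line, and CloudWatch extracts the metric automatically. A minimal sketch, assuming EMF; the namespace, service, and metric names are illustrative:

```python
import json
import time

def emf_record(namespace, service, metric_name, value, unit="Count"):
    """Build a CloudWatch Embedded Metric Format (EMF) log record.

    When this JSON is written to a CloudWatch-connected log stream (e.g. from
    a Lambda function), CloudWatch extracts the metric so you can alarm on it
    without making any PutMetricData API calls.
    """
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        "Service": service,
        metric_name: value,
    })

# Emit one error count for a hypothetical "checkout" service
print(emf_record("MyApp", "checkout", "Errors", 1))
```

Because metrics ride along with the logs, the same record serves both observability goals at once: searchable log context and a numeric series you can alarm on.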
According to John Allspaw, co-founder of Adaptive Capacity Labs (experts in incident analysis), “Being able to recover quickly from failure is more important than having failures less often.”
Here are some great practices for responding to failures.
- Build automated mechanisms to enable your systems to bounce back quickly.
- Design event-driven architectures that automatically respond to events in your environment. For instance, you can use AWS Config to monitor the permissions on an S3 bucket. You can then configure an Amazon EventBridge rule that will trigger if the access permissions on the bucket are ever changed, then invoke separate Lambda functions to correct the permissions and send an SNS notification.
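The S3 remediation flow above might look something like this as a Lambda handler. This is a hedged sketch: the event shape assumes an EventBridge rule matching S3 API calls recorded by CloudTrail, and the handler name is hypothetical; the remediation here re-applies the bucket's public access block rather than editing ACLs directly:

```python
def remediate_public_bucket(event, s3_client):
    """Re-secure a bucket whose permissions were changed.

    `event` is an EventBridge event for an S3 API call (via CloudTrail);
    `s3_client` is a boto3 S3 client, passed in so the handler is easy to
    test with a stub.
    """
    bucket = event["detail"]["requestParameters"]["bucketName"]
    # Block all public access to the bucket, regardless of what was changed
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return {"remediated_bucket": bucket}
```

Injecting the client as a parameter keeps the remediation logic testable without touching AWS; a second Lambda function subscribed to the same rule could publish the SNS notification.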
Here are some great practices to adopt when learning from failures.
- Focus on understanding the problem. Where possible, implement changes to drive continuous improvement using automation. This will help prevent the same thing from happening again.
- Perform a Correction of Errors (COE). This is similar to a post-mortem, where the team examines what happened, the impact, contributing factors, lessons learned, and corrective actions.
- Implement chaos engineering. This involves injecting stress and failures into an application to expose blind spots, observe system behavior, and make improvements as an iterative process. Tools like AWS Fault Injection Simulator make it very easy to break your system in a controlled way. It uses predefined templates, for example, to run CPU stress on an EC2 instance, or even to bring down the instance and see how your system responds. I have a video on chaos engineering if you'd like to try out a hands-on project.
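At the application level, the core idea of fault injection can be sketched in a few lines. This toy wrapper randomly raises an exception at a configured rate; it is a stand-in for what Fault Injection Simulator does at the infrastructure level, and the names and failure rate are arbitrary choices for the sketch:

```python
import random

def chaos_wrapper(func, failure_rate=0.1, rng=None):
    """Wrap a function so it occasionally raises an injected failure.

    Useful for verifying that callers really do handle errors (retries,
    fallbacks, alarms) before a real outage tests them for you.
    """
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        # Inject a failure with probability `failure_rate`
        if rng.random() < failure_rate:
            raise RuntimeError("chaos: injected failure")
        return func(*args, **kwargs)

    return wrapped
```

Running your test suite against chaos-wrapped dependencies is a cheap first step toward the blind-spot hunting that full chaos experiments perform in production-like environments.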
Failure is an inevitable aspect of running complex IT systems, so embracing it and moving toward continuous resilience is a natural choice. By anticipating potential failures, implementing effective monitoring of our systems, responding appropriately to events, and learning from our failures, we can all hope to design systems that are capable of fast, automated recovery when the inevitable happens.
If you’d like to learn more about any of the AWS tools mentioned here, check out the following courses.