
Cloud reliability: Defeating Murphy's Law with chaos engineering

You can improve the resiliency of your systems by introducing some controlled chaos. Amy Coughlin, Pluralsight Principal Author for Cloud, explains how.

Jun 19, 2025 • 9 Minute Read

Ever hear of Murphy’s Law? For those that haven’t, it’s an old adage that goes “Anything that can go wrong, will go wrong, and at the worst possible time.” Similar versions of this same sentiment have been around since at least the 19th century. However, the attribution to one Mr. Murphy goes back to the 1940s, when Captain Ed Murphy, an engineer at Edwards Air Force Base, was dealing with a technician who mistakenly wired a rocket sled backwards. Disgusted at the lack of attention to detail, Murphy said, “If there’s any way these guys can do it wrong, they will.”

Despite all this happening some 80 years ago, Murphy's Law is still true today, especially in cloud computing. There's always the chance of metaphorical and literal wires getting crossed: with the cloud, we're sharing platform services with millions of other users, competing for resources and bandwidth, running it all over the internet on commodity hardware, and expecting instant results.

To make matters worse, system resiliency and reliability are no longer just the domain of system administrators. The responsibility falls to you, me, and anyone else who wants to deploy to a cloud platform. But hey, no pressure! Before we both curl into the fetal position, let me assure you that help is at hand. In fact, embracing the chaos is one way to inoculate your systems against it, like a dose of helpful antibodies. This practice is called chaos engineering.

But before we delve into that, let's cover how systems are typically designed, and where current processes often run into Murphy's Law.

Design patterns as your first line of defense 

Over my thirty-year career, I’ve programmed a lot, and had the cloud fail me more times than I can count. Through it all, I’ve developed a love for design patterns. Design patterns are generalized solutions to common yet tricky software and infrastructure design challenges. 

I’m truly grateful for those people who have thoughtfully anticipated my future cloud hiccups and patterned around them. And fortunately, the major cloud providers have published dozens of cloud-specific patterns, and even have a special category for resiliency. Resiliency is the ability to gracefully and quickly recover in the face of inevitable failure, and it’s a crucial concept in cloud computing. Here are three examples of these design patterns.

1. Queue-based load leveling pattern

Sounds like a mouthful, right? But this is a surprisingly simple concept. When you're using a service like Azure Service Bus to send messages between your applications, this pattern uses a queue—a temporary storage container for messages—as a buffer between the message producers and the message consumer. If the consumer fails or falls behind, messages simply pile up in the queue instead of being lost, and the consumer can work through the backlog once it recovers.
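
To make this concrete, here's a minimal local sketch of the idea in Python. It uses the standard library's queue module rather than Azure Service Bus, and the message count and processing delay are made up for illustration: a burst of messages lands in the queue all at once, and a slower consumer drains the backlog at its own pace without losing anything.

```python
import queue
import threading
import time

# Local illustration of queue-based load leveling (not Azure Service Bus):
# the queue absorbs a traffic spike so the slower consumer can catch up.
work_queue = queue.Queue()

def producer(n_messages: int) -> None:
    # Simulate a burst of traffic arriving all at once.
    for i in range(n_messages):
        work_queue.put(f"message-{i}")
    print(f"Produced {n_messages} messages")

def consumer() -> None:
    # Process messages steadily, even while the producer outpaces us.
    while True:
        msg = work_queue.get()
        if msg is None:          # Sentinel value tells the consumer to stop.
            break
        time.sleep(0.05)         # Simulate per-message processing cost.
        work_queue.task_done()

t = threading.Thread(target=consumer)
t.start()
producer(100)                    # The burst arrives instantly...
work_queue.join()                # ...and the consumer drains it at its own rate.
work_queue.put(None)
t.join()
print("Backlog drained")
```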

2. Retry pattern

The retry pattern is used to address temporary service connection issues. When an application fails to connect to an external service, it waits and retries a set number of times, hence the name. Anyone who’s used an internet-enabled device has seen this pattern at work at least once.
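
Here's a minimal sketch of the retry pattern in Python, using exponential backoff with jitter. The function flaky_call is a hypothetical stand-in for any remote call that occasionally fails with a transient error.

```python
import random
import time

def call_with_retry(operation, max_attempts=4, base_delay=0.5):
    """Retry an operation prone to transient failures, backing off between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError as err:
            if attempt == max_attempts:
                raise                                   # Out of attempts: let the failure surface.
            delay = base_delay * (2 ** (attempt - 1))   # Exponential backoff.
            delay += random.uniform(0, delay / 2)       # Jitter avoids thundering herds.
            print(f"Attempt {attempt} failed ({err}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Hypothetical stand-in for a remote call that sometimes fails transiently.
def flaky_call():
    if random.random() < 0.5:
        raise ConnectionError("temporary network blip")
    return "ok"

# May still raise if every attempt fails -- which is exactly the point of
# capping retries instead of looping forever.
print(call_with_retry(flaky_call))
```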

At some point, though, retries consume limited resources, which can lead to a domino effect of failures. What then? That’s when our third pattern, the circuit breaker pattern, comes into play.

3. Circuit breaker pattern

Much like a household breaker, a circuit breaker stops a retry pattern from running indefinitely, which could eventually take down the whole system. It's like flipping a specific breaker in your electrical panel so that one fault doesn't take out every device in the house, except here the circuit you're isolating is your cloud service.
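
A bare-bones circuit breaker might look like the sketch below; the failure threshold and cooldown values are illustrative assumptions, not recommendations.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: after too many consecutive failures it
    'opens' and fails fast instead of hammering a struggling dependency.
    After a cooldown it lets one trial call through ('half-open')."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed (healthy).

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # Trip the breaker.
            raise
        else:
            self.failures = 0
            self.opened_at = None              # Trial call succeeded: close again.
            return result
```

Wrapping a remote call in breaker.call(...) means that once the dependency has failed repeatedly, callers fail fast during the cooldown window instead of piling on more retries.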

However, these patterns can't anticipate unbridled chaos

As I said, these design patterns are great, but protecting cloud architecture takes more than applying mitigation strategies at a micro level, particularly in an enterprise system with distributed architecture. You can’t always anticipate what form of chaos will require those resiliency measures – that’s why we call it chaos. 

What you can do is better anticipate how a broader system is impacted when disaster strikes. Enter Chaos Engineering.

Chaos Engineering: Your organization’s four-leaf clover

Sometimes, the failure of your applications and systems is the result of a targeted attack. Other times it’s just plain bad luck. Either way, to combat Murphy’s Law, you can use chaos engineering to ward off misfortune ahead of time.

Shelving metaphors for specifics, chaos engineering is the disciplined approach of testing your system’s integrity by intentionally disrupting it, and identifying any system weaknesses through the process. “Disciplined” is the key word here—you’re certainly not wreaking uncontrolled havoc just for fun.

This process produces a number of benefits: it tests your monitoring practices and disaster recovery plans, but more importantly, it reveals what are called “known-unknowns” and “unknown-unknowns.”

What are “known-unknowns” and “unknown-unknowns” in chaos engineering?

Known-unknowns are potential threats or failures we’re aware of, but we don’t know when or how they will hit. Meanwhile, unknown-unknowns are things we’re blissfully unaware of: we don’t know they’re a threat, let alone how or when they will strike. The latter is the far scarier threat to your organization, because you’re completely unprepared for it.

Preparation steps for implementing chaos engineering 

Before you jump head first into the chaos, there are four preparation steps you’ll want to take first. If you’re lucky, your team may already be doing some of them as part of your threat protection and disaster recovery practices. These are threat modeling, resiliency planning, health modeling, and Failure Mode Analysis. Think of them as the four different leaves of our lucky clover.

1. Threat modeling

Threat modeling involves examining a system to identify potential security threats and vulnerabilities in order to develop plans for mitigating those threats and building in resilience measures.

2. Resiliency planning

Resiliency planning and implementation should evolve out of threat modeling. It involves specific actions you plan to take to reduce the threat surface, such as isolation strategies, design pattern implementation, and improvements to monitoring and alerting systems.

3. Health modeling

Health modeling focuses on understanding the system’s vital signs. This is key to chaos engineering because chaos engineering experiments always start with a hypothesis that your systems will recover to a reasonably healthy state. But you can’t really measure that if you don’t have metrics around what constitutes a “healthy state” for your system.
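
As a sketch of what a health model can boil down to in practice, here's a small Python example. The metric names and thresholds are illustrative assumptions, not recommendations, and in a real system the observed values would come from your monitoring platform.

```python
# Pick the vital signs that matter for your system and the thresholds that
# define "healthy." These metrics and numbers are illustrative only.
HEALTH_MODEL = {
    "p95_latency_ms":    {"max": 400},
    "error_rate_pct":    {"max": 1.0},
    "queue_depth":       {"max": 5_000},
    "healthy_instances": {"min": 2},
}

def evaluate_health(observed: dict) -> list[str]:
    """Return a list of violations; an empty list means 'healthy state.'"""
    violations = []
    for metric, bounds in HEALTH_MODEL.items():
        value = observed.get(metric)
        if value is None:
            violations.append(f"{metric}: no data (unhealthy by default)")
            continue
        if "max" in bounds and value > bounds["max"]:
            violations.append(f"{metric}={value} exceeds {bounds['max']}")
        if "min" in bounds and value < bounds["min"]:
            violations.append(f"{metric}={value} below {bounds['min']}")
    return violations

# Example reading, e.g. scraped from your monitoring system:
print(evaluate_health({"p95_latency_ms": 350, "error_rate_pct": 0.4,
                       "queue_depth": 1200, "healthy_instances": 3}))
```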

4. Failure Mode Analysis (FMA)

FMA is similar to threat modeling, but it is more focused on identifying and mitigating potential failures in order to improve reliability and product quality. FMA also sometimes involves a ranking, or prioritization, of potential risks (a small scoring sketch follows the list below). This ranking can help drive:

  • The design of chaos experiments

  • How often you will conduct them

  • Whether they are best conducted in test environments, production environments, or both
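
Here's the small scoring sketch mentioned above. It uses one common FMA-style approach: rate each failure mode for severity, likelihood, and detectability, then rank by the product. The failure modes and scores are purely illustrative.

```python
# Score each failure mode 1-10 on severity, likelihood, and how hard it is
# to detect, then rank by the product. Higher scores are stronger candidates
# for chaos experiments, and for running them more often or closer to production.
failure_modes = [
    {"name": "Region-wide outage of primary database",
     "severity": 9, "likelihood": 2, "detectability": 2},
    {"name": "Message consumer falls behind during sales spike",
     "severity": 6, "likelihood": 7, "detectability": 4},
    {"name": "Expired TLS certificate on internal API",
     "severity": 7, "likelihood": 4, "detectability": 8},
]

for fm in failure_modes:
    fm["risk_priority"] = fm["severity"] * fm["likelihood"] * fm["detectability"]

for fm in sorted(failure_modes, key=lambda f: f["risk_priority"], reverse=True):
    print(f'{fm["risk_priority"]:>4}  {fm["name"]}')
```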

Begin the experiment! How to conduct Chaos Engineering

This is the part where you get to throw up your hands like a mad scientist and get to work. Well, actually, it’s a bit more like the scientific method you may remember from school. Let’s go through the steps.

1. First, you start with a purpose, and assert a hypothesis

Usually, this hypothesis is the same one for all chaos engineering experiments: “If this chaotic event I’m simulating happens to the system in question, the system’s components will remain in a healthy state—or at least operational.” 

The only difference among experiments is what the metrics and thresholds for a “healthy state” are going to be. These will differ based on the specifics of your system and the purpose of the experiment.

2. You design a procedure to test this theory

“How can I imitate this chaotic occurrence?” This is where the mad science comes in. Suppose you wanted to understand how a system reacts when a specific virtual machine shuts down. Your experiment could involve a simple, direct act, where you purposely shut down the machine for a set period of time and then observe how the broader system responds.

Warning: This is not a drill!

It’s important to note that we’re not simulating a VM shutdown: you are really shutting it down. This is a direct way to inject chaos into your system, but we can also be a bit more clever in how we design an experiment.

For example, if you wanted to test the impact of a vital service going down, you could throw up a firewall rule or a network security group (NSG) rule that blocks access from the other components of the system served by that vital service. It would have a similar effect to a service shutdown, but with a smaller impact on the service itself. This is an indirect way of injecting chaos.

However, make no mistake: this isn’t a simulation. In terms of testing the related components, you are actually impacting the operation of those components, especially if you conduct the experiment in production. Yes, chaos engineering in production is allowed, and sometimes even preferable. There are cases where you need to inject chaos into a production environment to account for real workload demands, the impact of external APIs outside your control, or constantly evolving threats from bad actors.
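
Putting the pieces together, a direct fault-injection experiment like the VM shutdown above might be orchestrated along the lines of this sketch. The callables you pass in (fault injection, rollback, and health check) are hypothetical placeholders for calls into your cloud provider's SDK and your monitoring system; the duration and polling interval are arbitrary.

```python
import time

def run_chaos_experiment(inject_fault, roll_back, check_health,
                         duration_s=300, poll_s=15):
    """Minimal experiment loop: verify a healthy baseline, inject the fault
    for real (no simulation), watch the health model, and always roll back.
    inject_fault / roll_back / check_health are supplied by you."""
    baseline = check_health()
    if baseline:                       # Non-empty list = violations.
        print("Unhealthy before injection; aborting:", baseline)
        return False

    print("Injecting fault (this is NOT a drill)")
    inject_fault()
    violations = []
    try:
        deadline = time.time() + duration_s
        while time.time() < deadline:
            violations = check_health()
            if violations:             # Stop early as soon as the hypothesis fails.
                print("Hypothesis falsified:", violations)
                break
            time.sleep(poll_s)         # Observation interval.
    finally:
        roll_back()                    # Always restore, even if something crashed.
    return not violations

# Hypothetical usage for the direct VM-shutdown experiment described above;
# stop_vm/start_vm would wrap your provider's SDK or CLI, and check_health
# could reuse the evaluate_health() sketch from the health-modeling section.
# run_chaos_experiment(lambda: stop_vm("web-vm-01"),
#                      lambda: start_vm("web-vm-01"),
#                      lambda: evaluate_health(collect_metrics()))
```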

3. You run the experiment, and make observations

Chaos ensues (or doesn’t). You take notes based on the metrics and thresholds you decided on in step one, seeing if your system indeed remains in a healthy state.

4. Depending on your results, you change the variables

If your system remained in a healthy state, congratulations! You’ve demonstrated that, given the stress level and time period you tested, your system can withstand that chaotic event. Perhaps next time, you can introduce even more stress for a longer period. If it doesn’t remain healthy, naturally you’ll want to come up with ways to improve your system’s resilience based on how it failed.

5. If it failed, run the experiment all over again

Test your healthy-state hypothesis until it is proven correct. Once it is, you have peace of mind that your system can likely handle that particular disaster scenario.

Why you should use chaos engineering tools (and what to look for)

While you can conduct chaos experiments without tooling, such experiments risk becoming uncontrollable. Tools help you achieve controlled chaos and avoid real damage. When you’re selecting a tool, look for the following features (a small sketch of such guard rails follows the list):

  • Identity and access management (IAM): This helps control who can create experiments, who can run them, and in which environments

  • Safety rails on components: This helps control which components of a system are allowed to be compromised

  • Safety rails on methods: This helps limit which specific types of fault injection are allowed

  • Damage control features: Things like automatic roll-backs to a healthy state, and a “kill switch” if things go even worse than expected
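
To give a feel for what those guard rails look like in practice, here's a tiny sketch of the kind of pre-flight checks a tool might enforce before an experiment is allowed to run. The role names, components, and fault types are illustrative assumptions, not any particular product's configuration.

```python
# Allowlists acting as safety rails: who may run experiments, which
# components may be disrupted, and which fault types are permitted.
ALLOWED_RUNNERS = {"chaos-engineer"}                       # IAM-style role check.
ALLOWED_TARGETS = {"checkout-api", "order-queue-worker"}   # Approved components.
ALLOWED_FAULTS  = {"vm_shutdown", "network_block", "cpu_pressure"}

def authorize_experiment(runner_role: str, target: str, fault: str) -> None:
    """Refuse to run anything that falls outside the guard rails."""
    if runner_role not in ALLOWED_RUNNERS:
        raise PermissionError(f"role '{runner_role}' may not run experiments")
    if target not in ALLOWED_TARGETS:
        raise ValueError(f"'{target}' is not an approved experiment target")
    if fault not in ALLOWED_FAULTS:
        raise ValueError(f"fault type '{fault}' is not approved")

authorize_experiment("chaos-engineer", "checkout-api", "vm_shutdown")   # OK
# authorize_experiment("intern", "billing-db", "disk_wipe")             # Would raise
```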

One of the earliest tools, Chaos Monkey, was developed by Netflix to test random shut-downs of production machines and containers. This tool is maintained as an open-source project on GitHub and can be used across platforms.

All about AWS and Azure’s chaos engineering tools

If you’re using one of the two major cloud providers, you don’t have to stray far to find tools for your chaos engineering: both AWS and Azure have tools designed to work with their particular platform, including some PaaS and IaaS services. Pluralsight also offers courses and articles that cover each provider’s tooling in more depth.

Conclusion

Hopefully after reading this article, you’ve got a high-level understanding of what chaos engineering is, the benefits it provides for improving the resilience of your systems, a general idea of how to go about it, and what you should look for in chaos engineering tools (highly recommended!).

Amy Coughlin

Amy Coughlin is a Pluralsight Author and Senior Azure Training Architect with over 30 years of experience in the tech industry, mainly focused on Microsoft stack services and databases. She's living the dream of combining her love of technology with her passion for teaching others.
