5 steps to take when the cloud goes down

Cloud failures can and will happen. Here's what you can do as a cloud engineer when the cloud goes down (other than, you know, panic).

By Clint Bonnett

Jun 08, 2023 • 9 Minute Read

Imagine you’re a cloud engineer working for a large company, and you’re responsible for keeping the website up and available. You’ve done all that you can to ensure redundancy with failover, high availability, and more. You're enjoying your Monday night meal, when suddenly you get an alert that the website is down. Yikes!

You rush to investigate, and learn that the outage is not caused on your end (whew!). It’s due to the cloud provider’s authentication mechanism failing and you’re not sure how long this outage will last. You need to get the site online as soon as possible, but what exactly can you do?

In this article, we cover what you can do when the cloud goes down (other than, you know, panic).

What are the main causes of cloud services going down?

Cloud failures can and will happen, which is why providers commit to 99% to 99.99% uptime, never 100%. According to the Uptime Institute, software and configuration errors are the top cause of outages. Other common causes are networking or connectivity issues and mechanical or electrical failures at the data center.
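To put those uptime percentages in perspective, here is a minimal Python sketch that converts an uptime commitment into the downtime it still allows. The function name and the 365-day year are assumptions made for illustration, not part of any provider's SLA.

```python
# Convert an uptime percentage into the downtime it still permits.
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted by an uptime commitment."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


for uptime in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(uptime)
    print(f"{uptime}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99% uptime still leaves room for roughly 87.6 hours of downtime a year, while 99.99% leaves under an hour.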

When software or configuration errors bring down the cloud, the cause can be anything from a bad deployment package to a misconfigured application. One example happened to Slack in February 2022, when a configuration change to a database led to a widespread outage lasting about three hours.

Networking and connectivity are what hold the cloud together, so disruptions can cause all sorts of issues. In this category, the most common types of failure are related to configuration (are you noticing a trend?), change management, and third-party network provider errors.

If we look at an outage from January 2022 on Google Cloud, we can see that a configuration error caused a few hours of increased latency where “the checkpoint data was incorrectly missing a particular piece of configuration information; this was propagated to ~15% of the network switches serving us-west1-b.”

On the mechanical or electrical side, most outages are caused by an uninterruptible power supply failure or by a utility or generator failure. In an AWS outage from July 2022, for example, a loss of power to an availability zone caused a widespread outage of about two hours.

What happens when the cloud goes down?

As we saw in the introduction scenario, when the cloud goes down it's rarely pretty or fun for us as the end users! It usually causes stress, anxiety, and a mad scramble to fix the issue, or find a backup solution or alternative.

In a best-case scenario, our website is only down for a couple of minutes and only a small number of users are affected. In the worst-case scenario, our website could be down for days, preventing all of our customers from using it. This could cost us not only lost data from the site crashing but also reputation damage, loss of business, and more.

A study conducted in 2015 by the Ponemon Institute determined that the average cost of an outage is nearly $9,000 per minute. A more recent study by the Uptime Institute found that more than half of the organizations surveyed said that a recent outage cost them more than $100,000!
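For a rough sense of scale, the per-minute figure can be multiplied out. This is only back-of-the-envelope arithmetic: the $9,000 value is the rounded average cited above, and the function name is made up for this example. Your own cost per minute could be very different.

```python
# Rough outage cost estimate using the rounded per-minute average cited above.
COST_PER_MINUTE_USD = 9_000  # assumption: rounded industry average, not your actual cost


def estimated_outage_cost(duration_minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return an estimated outage cost in US dollars."""
    if duration_minutes < 0:
        raise ValueError("duration_minutes cannot be negative")
    return duration_minutes * cost_per_minute


three_hours = 3 * 60
print(f"A three-hour outage: about ${estimated_outage_cost(three_hours):,.0f}")
```

At that average rate, a three-hour outage works out to about $1.6 million.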

Case Study: 2021’s Large-scale AWS Outage

We’ve seen a handful of smaller-scale outages earlier in this blog, but now let's take a look at one of the most recent large-scale outages.

On a cold December morning back in 2021, a large-scale AWS outage affecting multiple services took place. From 10:30 AM ET until approximately 9:40 PM ET (for full service recovery), several services including API Gateway, Fargate, EventBridge, and EC2 instances were affected. This caused widespread outages for several businesses and many Amazon services. People couldn’t order pizza and even AWS’s own service health dashboard was down.

So what happened to cause all of this mayhem?

In short, an automated system in Amazon’s “us-east-1” region (Northern Virginia) tried to scale up an internal service running on AWS’s internal network. Unfortunately, there was an issue with this automated process, and it flooded the network with traffic, essentially causing an unintentional Distributed Denial of Service (DDoS) attack.

To fix the issue, Amazon engineers first tried to move DNS traffic away from the congested paths. While this seemed to help, it did not resolve the problem. Next, they disabled event delivery for EventBridge to reduce the load on the affected network devices. At this point the congestion started improving and before long, AWS operators reported “all network devices fully recovered by 2:22 PM Pacific Standard Time.” However, some services still took a while to fully stabilize, namely API Gateway, Fargate, and EventBridge.

Any outage or IT issue should result in some lessons learned and takeaways for the future. The AWS team resolved to fix the bug in the automated process and to improve communication with customers during an outage like this. If you would like to learn more about the AWS outage, check out the blog post here at ACG by Mattias Andersson.

5 steps to take when the cloud goes down

Now that we’ve seen the effects and aftermath of cloud outages, how should you prepare for the next outage? Let’s walk through five steps you can take when the cloud goes down.

1. Before the cloud outage: Consider a multi-cloud strategy

First up, before an outage even happens, consider a multi-cloud strategy for your environment. This approach has pros and cons: depending on your environment, architecture, and teams, a multi-cloud strategy might be more of a burden than a boon.

An alternative you might consider is using multiple regions with your preferred cloud provider. This gives you increased redundancy and protection from regional outages without all of the baggage that comes with multiple cloud providers.

2. Before the cloud outage: Back up essential data

Second, before an outage occurs, make sure you back up your essential data.

If you use Azure, Azure Backup is a solution that will back up data on your VMs, SQL servers, Azure Blobs, and more.
On the AWS side, you can use AWS Backup which supports the RDS service, EC2 instances, and much more.
With GCP, Google Cloud Backup and DR is going to keep you protected by backing up your data in GKE, VMs, and a whole lot more.

If you have all of your essential data backed up before an outage, you’ll be able to restore it if there is data loss due to the outage or if the outage lasts for multiple hours or days.

3. Check for user errors first

For our third step, we can look at what to do after an outage has occurred. At this point, the best thing to do is determine if the issue is just on your end or not.

The fastest way to rule out an issue with your internet connection is to head over to Down Detector and enter the URL of the website. Down Detector will let you know if other users are reporting errors and if there is a widespread outage. It also includes helpful links to the website’s support page, Twitter, or Facebook, if available.

Another helpful tool for quickly checking whether a website is down and ruling out local connectivity issues is IsItDownRightNow.com. Is It Down Right Now will help you determine if the site you are checking is available and what its response time looks like.

If those detectors are not revealing any issues, you can check on the status page of your cloud provider. For example, to check on Google Cloud’s status, you can head to their status page that will reveal if they are having any service issues or service degradations. These status pages will sometimes contain updates about the issue, how long until resolution, and what steps are being taken to resolve the issue.

If the internet is completely down on your end or the power goes out, you can head to a local coffee shop and use their Wi-Fi to see if the provider is truly having an outage. Once you have ruled out any problems locally, you can move on to the next step in our list.
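If you'd rather script a first pass yourself, the idea behind this step can be sketched in a few lines of Python: check a well-known reference site first, and only then your own site. This is a minimal sketch using only the standard library. The function names, the three status labels, and the URLs in the usage note are assumptions for illustration, and it does not replace the tools and status pages above.

```python
import urllib.error
import urllib.request
from typing import Callable


def check_site(url: str, timeout: float = 5.0) -> str:
    """Classify a URL as 'up', 'error' (the server answered with 4xx/5xx), or 'unreachable'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return "up" if response.status < 400 else "error"
    except urllib.error.HTTPError:
        # The server responded, but with an error status code.
        return "error"
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return "unreachable"


def diagnose(site_url: str, reference_url: str, check: Callable[[str], str] = check_site) -> str:
    """Rule out local connectivity before blaming the site or the provider."""
    if check(reference_url) == "unreachable":
        return "local connectivity problem"
    if check(site_url) == "up":
        return "site is up"
    return "site or provider problem"
```

For example, `diagnose("https://www.your-site.example", "https://example.com")` (both URLs are placeholders) returns "local connectivity problem" when even the reference site cannot be reached, which suggests the issue is on your end rather than with the provider.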

4. Contact your cloud provider

After we have ruled out any local connectivity issues, we can go ahead and contact the cloud provider to get more information on the outage. Be prepared to provide specific information about the issue you are experiencing, including what services are affected, any error messages, and what time the issue started.

Each provider has a different method for contacting support and multiple ways to contact them:

For Azure, you can use the Azure Portal or tweet Azure Support on Twitter. The latter is particularly helpful to get a quick response.
With AWS you can use the AWS support page or tweet AWS Support on Twitter.
Google Cloud gives you the option to use their support page.
If you are not using one of the big three cloud providers, the easiest way to find support information is on the provider’s website or through Down Detector, which will usually have a link to the provider’s support site.

Once you have contacted the provider, please remember to be patient. During an outage the provider’s support team is scrambling to help customers and answer questions, so it may take a while to get a response.

5. Check your cloud service agreement

Finally, you’ll want to check your provider's cloud service agreement. Reviewing your agreement is important for understanding the provider’s obligations and your rights as a customer.

First, you’ll want to check your service level agreements (SLAs). An SLA is a commitment from the provider to maintain a certain level of availability. For example, if you are using AWS and your API Gateway service is impacted, AWS has three tiers in its SLA for API Gateway. The amount of downtime the service experienced in a given month determines whether you are entitled to a partial or full service credit.

Let's say the API Gateway service was down for three hours earlier in the month. In a 30-day month, that equates to about 99.58% uptime. According to the SLA provided by AWS, you would be entitled to a 10% service credit. So, make sure you are reviewing your cloud service agreements!
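The arithmetic behind that example can be sketched in Python. The tier thresholds below are assumptions modeled on a three-level structure like the one described above. Only the 10% credit for roughly 99.58% uptime comes from the example in this article, so check your provider's current SLA for the real numbers.

```python
# Assumed credit tiers as (minimum uptime %, credit %), checked from highest to lowest.
# Verify these against your provider's current SLA before relying on them.
CREDIT_TIERS = [
    (99.95, 0),   # at or above 99.95% uptime: no credit
    (99.0, 10),   # below 99.95% but at least 99.0%: 10% credit
    (95.0, 25),   # below 99.0% but at least 95.0%: 25% credit
]
FULL_CREDIT = 100  # below 95.0% uptime


def monthly_uptime_percent(downtime_hours: float, days_in_month: int = 30) -> float:
    """Return the monthly uptime percentage for a given amount of downtime."""
    total_hours = days_in_month * 24
    return 100 * (1 - downtime_hours / total_hours)


def service_credit_percent(uptime_percent: float) -> int:
    """Return the service credit percentage for a monthly uptime figure."""
    for minimum_uptime, credit in CREDIT_TIERS:
        if uptime_percent >= minimum_uptime:
            return credit
    return FULL_CREDIT


uptime = monthly_uptime_percent(downtime_hours=3)
print(f"Uptime: {uptime:.2f}%, service credit: {service_credit_percent(uptime)}%")
```

With three hours of downtime in a 30-day month (720 hours), the uptime comes out to about 99.58%, which lands in the 10% tier under these assumed thresholds.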

Develop a multi-cloud strategy to protect your data

Cloud outages can be frustrating for anyone that relies on cloud services to perform their daily activities or run their business. By following the steps and resources in this article, you’ll be better prepared for an outage. However, as we have seen, cloud outages can and will happen at any time.

To protect your business from an outage, determine whether you can engineer your application or services to run from multiple regions, either in an active-active style or in an active-passive style where you can fail over to another region when there is an issue.
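As a rough illustration of the active-passive idea, the sketch below picks the first healthy region from a priority-ordered list. The region names and the health lookup are hypothetical stand-ins. In a real setup, the health check would come from your monitoring, and the switch itself would typically be handled by DNS or a load balancer.

```python
from typing import Callable, Optional, Sequence


def pick_active_region(regions: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Hypothetical example: the primary region is down, so traffic fails over to the secondary.
region_health = {"us-east-1": False, "us-west-2": True}
active = pick_active_region(["us-east-1", "us-west-2"], lambda r: region_health.get(r, False))
print(f"Serving traffic from: {active}")
```

An active-active design would instead send traffic to every healthy region at once, which avoids the failover step but usually makes data consistency harder.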

If you are still concerned about downtime with a single cloud provider, the next step is to develop a multi-cloud strategy to protect your data. You’ll want to make sure that you have the right people and processes in place to make this strategy a success, and we recommend reviewing the pros and cons of going multi-cloud.
