
SRE: How making systems observable improves their reliability

If you want to implement Site Reliability Engineering (SRE) principles and make your systems more reliable, you should start by making them observable.

Apr 10, 2024 • 8 Minute Read


If your application provides a poor experience, and you’re not there to observe it, do your users make a sound?

The answer is a resounding “yes”, and it probably involves muttering and swearing. A reliable service experience goes hand in hand with user experience and trust, and is a key goal of Site Reliability Engineering (SRE). However, you can’t make sure these services are meeting user expectations if you can’t see what’s going on. 

This all-important principle is called observability. Observability is about measuring health metrics accurately and consistently, using the right tools, processes, and procedures to instrument, collect, organize, and analyze your telemetry data. Without observability, you can’t achieve reliability, because you are quite literally flying blind.

In this article, we’ll delve deeper into the concepts of observability and reliability, and how you can use this knowledge to build dependable systems.

What is reliability in SRE?

Reliability refers to a specific set of performance targets for a service, and how consistently we achieve them. In Site Reliability Engineering, these performance targets are called Service Level Objectives (SLOs). SLOs must be measurable using Service Level Indicators (SLIs), which are the real numbers used to gauge performance.

Below are two examples of SLOs:

  • 99% of requests succeed within 800 milliseconds over a rolling four-week window.

  • 95% of files are processed with fewer than 25 failed records per day.

Can you spot the SLIs? In this case, they are the response time (latency within 800 milliseconds) and the number of failed records (fewer than 25). These are measurable, numeric indicators of reliability.

Keep your SLIs simple, and choose the right metrics to track. Tracking too many metrics that don't matter overcomplicates things and can hide the metrics that really do matter.
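To make this concrete, here is a minimal sketch of how the latency SLI from the first example SLO might be computed. The request records and the "success" rule are illustrative assumptions; in practice these numbers would come from your observability backend.

```python
# Minimal sketch: computing the latency SLI from raw request data.
# The records and the "success" rule below are illustrative assumptions.

requests = [
    {"duration_ms": 120, "status": 200},
    {"duration_ms": 950, "status": 200},  # too slow: misses the 800 ms target
    {"duration_ms": 310, "status": 200},
    {"duration_ms": 45,  "status": 500},  # server error: counts against the SLO
]

# A request is "good" if it succeeded (no 5xx) and met the latency target.
good = sum(1 for r in requests
           if r["status"] < 500 and r["duration_ms"] <= 800)
sli = good / len(requests) * 100

print(f"SLI: {sli:.1f}% of requests succeeded within 800 ms")  # 50.0%
# The SLO requires this value to stay at or above 99% over a
# rolling four-week window.
```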

Is availability the same as reliability in SRE?

No. Availability is the percentage of time a service successfully serves requests, while reliability is how well it performs when it does. Being available is part of building reliable systems, but it's not the only factor. For example, a system experiencing high latency wouldn't be very reliable, even if it were highly available.

However, this doesn’t play in reverse — a system that is constantly unavailable can’t be considered reliable. As such, availability may be one of your SLOs for achieving reliability. To this end, it’s useful to know how to measure availability, as shown below.

How to calculate service availability

You can easily calculate availability using the following formula:

Availability (%) = Uptime / (Uptime + Downtime) × 100

You can also calculate availability using MTTF (Mean Time to Failure) and MTTR (Mean Time to Repair). MTTF refers to the average time a system is functional before experiencing an outage, and MTTR refers to the average time it takes to recover a system from an outage:

Availability (%) = MTTF / (MTTF + MTTR) × 100

Since calculating MTTF and MTTR accurately can be challenging, the more straightforward approach of using uptime and downtime is usually used to calculate availability.
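Here is a short sketch of both calculations. The uptime and MTTF/MTTR figures are invented for illustration.

```python
# Sketch: the two availability calculations described above.
# All figures are invented for illustration.

# 1. Straightforward approach: uptime and downtime (in minutes)
uptime_min = 40_285
downtime_min = 35
availability = uptime_min / (uptime_min + downtime_min) * 100
print(f"Availability: {availability:.3f}%")  # ~99.913%

# 2. Using MTTF and MTTR (both in hours)
mttf_h = 720   # average running time before a failure
mttr_h = 0.5   # average time to restore service
availability_mt = mttf_h / (mttf_h + mttr_h) * 100
print(f"Availability (MTTF/MTTR): {availability_mt:.3f}%")  # ~99.931%
```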

Now that we’ve covered reliability and availability, let’s loop back to observability.

What is observability in SRE?

Observability is the ability to figure out the inner state of a system based on the data you can collect from it. In Information Technology, this means the ability to measure the health of a service using the telemetry data it produces. This service can be a web application, a batch processing application, a database system, or any piece of software that provides a service.

How do you achieve observability?

To implement observability, we must have the means to generate, collect, and organize telemetry data, and we must carefully consider this during the initial stages of application development. In practice, observability is achieved by using purpose-built tools available in both open-source and commercial forms.

Some examples of observability products are Splunk, Dynatrace, Datadog, the ELK stack (Elasticsearch, Logstash, Kibana), and the TIG stack (Telegraf, InfluxDB, Grafana).

The pillars of observability: Logs, metrics, and traces

There are three types of telemetry that observability systems work with, commonly referred to as the "pillars of observability": logs, metrics, and traces. A short code sketch of all three follows the list.

  • A log is an immutable, timestamped record of discrete events that happen over time. These record events, warnings, and errors.

  • A metric is a numeric representation of data measured over intervals of time. These might be transactions per second, memory resource consumption, or other useful figures.

  • A trace is data that tracks an application request as it flows through the various parts of an application. Using a trace, you can identify which part of the application triggered an error.
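To make the three pillars concrete, here is a standard-library-only Python sketch that emits one of each for a single request. Real systems would emit these through an instrumentation SDK such as OpenTelemetry; the signal shapes here are simplified for illustration.

```python
# Stdlib-only sketch of the three pillars for a single request.
# Real systems use an instrumentation SDK; these shapes are simplified.

import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("checkout")

trace_id = uuid.uuid4().hex  # correlates all three signals for one request

# 1. Log: an immutable, timestamped record of a discrete event
log.info("order submitted trace_id=%s", trace_id)

start = time.perf_counter()
time.sleep(0.05)  # stand-in for real work (e.g. charging a payment)
latency_ms = (time.perf_counter() - start) * 1000

# 2. Metric: a numeric measurement over an interval of time
print(f"metric checkout.request.latency_ms={latency_ms:.1f}")

# 3. Trace (one span): one step of the request's path through the system
span = {"trace_id": trace_id, "name": "charge_payment",
        "duration_ms": round(latency_ms, 1)}
print(json.dumps(span))
```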

Example: A typical observability system

  1. Systems are instrumented with agents that can generate telemetry (for example, Dynatrace OneAgent). Alternatively, a system can expose telemetry that can be pulled by a telemetry collector.

  2. A telemetry collector receives telemetry such as logs, metrics, and traces.

  3. The telemetry pipeline can optionally enhance or filter data and export it to observability backends. The pipeline can take several forms depending on the observability solution chosen; a toy sketch of this stage follows the list.

  4. The observability backend is where telemetry is organized for consumption. Examples include Splunk, Dynatrace, and Datadog.
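Here is a toy sketch of step 3, the enrich/filter/export stage. The record shape, the filter rule, and the export target are all hypothetical stand-ins for what a real pipeline (such as an OpenTelemetry Collector) would do.

```python
# Toy sketch of a telemetry pipeline stage: enrich, filter, export.
# The record shape, filter rule, and export target are hypothetical.

from typing import Iterable


def enrich(records: Iterable[dict], env: str) -> Iterable[dict]:
    """Attach deployment metadata so the backend can slice by environment."""
    for r in records:
        yield {**r, "environment": env}


def drop_debug(records: Iterable[dict]) -> Iterable[dict]:
    """Filter out low-value telemetry before it reaches the backend."""
    return (r for r in records if r.get("level") != "DEBUG")


def export(records: Iterable[dict]) -> None:
    """Stand-in for shipping records to an observability backend."""
    for r in records:
        print("exporting:", r)


raw = [
    {"level": "INFO",  "msg": "request served", "latency_ms": 120},
    {"level": "DEBUG", "msg": "cache probe"},
    {"level": "ERROR", "msg": "db timeout"},
]
export(drop_debug(enrich(raw, env="prod")))
```

Filtering low-value telemetry before export, as in this sketch, is a common way to keep backend ingest costs under control.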

How you can use observability to enhance service reliability

1. Reducing the MTTR (Mean Time To Repair) 

When a system fails, either entirely or partially, the speed at which you can find out what went wrong is crucial to restoring full service. Observability directly impacts the MTTR, which in turn must stay low to maintain high reliability. Using the telemetry signals collected and aggregated by an observability system, a team can quickly analyze the data and locate the root cause of an issue.

Here is an example. Your users start reporting that they're receiving the dreaded 500 Internal Server Error or 502 Bad Gateway responses. Using your observability system, you quickly discover what is going wrong: a remote web service is down, a backend database server is not responding, or a file system is full. All of a sudden, you've shifted from asking what went wrong to planning how to fix it ASAP.

Observability can also help you catch potential issues ahead of time. Going back to the example above, it could help you flag when your file system is reaching capacity before it becomes a problem. By proactively monitoring system health and utilizing well-defined alerts, we can start troubleshooting before end users are impacted.
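As a minimal illustration of that kind of proactive check, the sketch below warns when a file system crosses a capacity threshold. The 85% threshold is an illustrative choice, and a real deployment would page through your alerting tool rather than print.

```python
# Sketch: flag a file system before it fills up, rather than after.
# The 85% warning threshold is an illustrative choice, not a standard.

import shutil

WARN_PCT = 85.0

usage = shutil.disk_usage("/")
used_pct = usage.used / usage.total * 100

if used_pct >= WARN_PCT:
    # In a real system, this would page or open a ticket via your alerting tool.
    print(f"ALERT: file system at {used_pct:.1f}% (threshold {WARN_PCT}%)")
else:
    print(f"OK: file system at {used_pct:.1f}%")
```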

2. Providing code-level diagnostics to solve issues faster

When developers need to diagnose an issue with an application, finding the offending bit of code can be like finding a needle in a haystack. And while they search, the system isn't as reliable as it could be.

Traces provide code-level visibility into an application. By collecting traces from distributed systems and providing a user interface to visualize them (typically a waterfall diagram), observability gives developers request-scoped visibility across services.

[Image: a trace view from Dynatrace. Source: dynatrace.com]

Code-level diagnostics are so powerful that development teams are sometimes surprised to discover a backend call in the application's execution path that they weren't previously aware of.

3. Helping shift development focus to reliability improvements

Development teams typically strive to release as often as possible, so they can add new application features. Support teams, such as SRE and application support (operations), tend to want the opposite: limiting the number of changes to applications. The latter instinct is often justified, as changes cause 75% of outages.

By monitoring SLOs, we can tell when one is breached or about to be breached. Armed with this data, we can drive the development focus to be on fixes that enhance reliability, instead of working on new features. Additionally, we can determine if we need to put a pause on future releases to maintain an application’s reliability.

For example, consider an application with an SLO of 99.9% availability over a rolling four-week window. Four weeks is 40,320 minutes, so the 0.1% error budget amounts to roughly 40.3 minutes of allowable downtime. If a single outage consumes 35 minutes of it, it is reasonable to pause new releases for the remainder of the window, as only about 5.3 minutes of budget remain before the SLO is breached.
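The arithmetic is simple enough to sketch in a few lines:

```python
# Sketch: the error-budget arithmetic from the example above.

WINDOW_MIN = 28 * 24 * 60  # rolling four weeks = 40,320 minutes
SLO = 0.999                # 99.9% availability target

budget_min = WINDOW_MIN * (1 - SLO)
outage_min = 35
remaining = budget_min - outage_min

print(f"Error budget: {budget_min:.1f} min")           # 40.3 min
print(f"Remaining after outage: {remaining:.1f} min")  # ~5.3 min
if remaining < 10:
    print("Consider pausing feature releases for the rest of the window.")
```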

4. Ensuring there is enough capacity for service delivery

A significant reliability factor is having adequate capacity (compute, memory, storage, and network resources) to meet the demands of an application. Public cloud platforms with auto-scaling make capacity planning a bit easier, as resources are usually available on demand. However, cost becomes a significant factor if we don't plan for capacity ahead of time.

By utilizing observability, you can measure application throughput, resource utilization, resource saturation, and anticipated usage of these resources. This information is invaluable for determining future capacity needs. Observability can also show if we have over-provisioned resources, a major problem in public cloud platforms. By regularly monitoring resource utilization, we can avoid overspending.
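As a simple illustration, an over-provisioning check might scan utilization telemetry for consistently idle hosts. The sample data and the 20% threshold below are assumptions, not standards.

```python
# Sketch: spot over-provisioned hosts from utilization telemetry.
# The sample data and the 20% threshold are illustrative assumptions.

hosts = {
    "web-1": {"cpu_pct": 62, "mem_pct": 71},
    "web-2": {"cpu_pct": 8,  "mem_pct": 14},  # candidate for downsizing
    "db-1":  {"cpu_pct": 45, "mem_pct": 80},
}

UNDERUSED_PCT = 20

for name, u in hosts.items():
    if u["cpu_pct"] < UNDERUSED_PCT and u["mem_pct"] < UNDERUSED_PCT:
        print(f"{name}: consistently underutilized; consider a smaller instance")
```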

5. Using monitoring to improve the CI/CD pipeline

Many organizations use some form of CI/CD (Continuous Integration/Continuous Deployment) to release their applications, but often they don't monitor the metrics around CI/CD itself. Measuring release frequency, release duration, deployment errors, rollback duration, and similar metrics provides valuable insights and is especially useful for identifying bottlenecks within the pipeline.
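As a sketch, these metrics can be derived from simple deployment records. The record format below is hypothetical; real data would come from your CI/CD system's API.

```python
# Sketch: deriving CI/CD health metrics from deployment records.
# The record format is hypothetical; real data would come from your
# CI/CD system's build and deployment events.

deployments = [
    {"duration_min": 12, "failed": False, "rolled_back": False},
    {"duration_min": 18, "failed": True,  "rolled_back": True},
    {"duration_min": 11, "failed": False, "rolled_back": False},
    {"duration_min": 25, "failed": False, "rolled_back": False},
]

n = len(deployments)
avg_duration = sum(d["duration_min"] for d in deployments) / n
failure_rate = sum(d["failed"] for d in deployments) / n * 100

print(f"Releases this week: {n}")                        # 4
print(f"Average release duration: {avg_duration:.1f} min")  # 16.5 min
print(f"Change failure rate: {failure_rate:.0f}%")          # 25%
```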

6. Empowering the business with report generation 

Many open-source and commercial observability platforms come with decent reporting and dashboard capabilities. Reports can be generated out of the box, but work is usually required (for example, writing a specific query) to create an easily digestible report. Reporting is effective when it can be automated and shared easily with relevant consumers. It is especially useful for executive leadership to drive investment decisions. 

Below are some valuable reports:

  1. Weekly SLO report
  2. Weekly cost analysis report
  3. Daily release performance report

Conclusion

By achieving observability of your applications, you can reduce your MTTR, gain deeper insights into how your applications run, monitor CI/CD pipelines, plan for capacity, and automate valuable reports. All of these help you achieve greater reliability, which in turn leads to improved customer satisfaction.

It may seem daunting to achieve end-to-end visibility from the user’s browser to your backend systems. However, it is possible to implement a good observability ecosystem within a short period of time. A great place to start is by defining your SLOs and implementing monitoring for the SLIs associated with them. 

Want to learn more about SRE best practices?

Pluralsight offers a Fundamentals of Site Reliability Engineering (SRE) learning path that teaches you all about SRE and how to implement it. It covers a wide range of topics, from the foundations of SRE and how to incorporate it into your system design to more advanced topics like managing SRE teams and implementing effective incident response and change management.

If you liked this article, I’d highly recommend checking out my course, “Implementing Site Reliability Engineering (SRE) Reliability Best Practices.” Best of luck on your journey to implement observable, reliable systems!

Karun Subramanian

Karun is passionate about IT operations. He has 20+ years of hands-on experience in diverse technologies ranging from Linux administration to cloud technologies, and everything in between. He specializes in modernizing IT operations with automation, end-to-end monitoring, CI/CD, and containerization. He has helped numerous companies implement DevOps, CI/CD, monitoring, log aggregation, and cloud migrations (AWS and Azure).
