Fundamentals of Site Reliability Engineering (SRE)

Authors: Elton Stoneman, Wilvie Anora, Karun Subramanian

The DevOps ecosystem is a constantly evolving scene, and there are any number of methodologies to choose from. To help you out, this series focuses on Site Reliability Engineering...

What you will learn

  • Understand the basic ideas behind Site Reliability Engineering
  • Design Service Level Indicators/Service Level Objectives/Error Budgets for a system
  • Design a basic structure for a vendor-agnostic monitoring and alerting system
  • Manage team toil levels
  • Implement effective incident response
  • Manage change via Site Reliability Engineering principles in a fast-moving organization
  • Structure an optimal Site Reliability Engineering function for an organization
  • Implement best practices for system reliability
  • Manage the human impact of working as a Site Reliability Engineer
  • Explain the benefits of using SRE

Prerequisites

There are no hard prerequisites.

Site Reliability Engineering (SRE): The Big Picture

by Elton Stoneman

Mar 5, 2020 / 1h 41m

Description

Site Reliability Engineering (SRE) is a set of principles and practices that supports software delivery, keeping production systems stable while still delivering new features at speed. In this course, Site Reliability Engineering (SRE): The Big Picture, you'll get a thorough overview of how SRE works and why it's a good choice for many organisations. First, you'll learn the differences between SRE, DevOps, and traditional operations. Next, you'll discover how engineering practices help to reduce toil and provide more time to focus on high-value tasks. Finally, you'll learn how SRE approaches monitoring, alerting, and incident management. When you're finished with this course, you'll be able to evaluate SRE and see if it's a good fit for your organisation.

Table of contents
  1. Course Overview
  2. Introducing Site Reliability Engineering
  3. Automation and Eliminating Toil
  4. Service Levels, Monitoring, and Alerting
  5. Incident Management: On-call and Postmortems

Incorporating Site Reliability Engineering (SRE) in Your System Design

by Elton Stoneman

Nov 19, 2021 / 1h 37m

Description

Before you adopt SRE, you need to be sure that your systems are designed to work well with SRE practices. In this course, Incorporating Site Reliability Engineering (SRE) in Your System Design, you’ll learn how to design systems with SRE in mind and assess what's missing in your existing systems. First, you’ll discover how to architect apps for reliability, so temporary problems are managed automatically and bigger issues quickly trigger alerts. Next, you’ll explore how observability design supports SRE and helps you get your apps back online. Finally, you’ll delve into how to effectively measure and report on service levels. When you’re finished with this course, you’ll have the skills and knowledge of system design needed to bring your own apps into SRE.

Table of contents
  1. Course Overview
  2. Architecting Systems for Reliability
  3. Designing Observability for Fault Diagnosis
  4. Driving Continuous Improvement with Service Levels

Managing Teams for Site Reliability Engineering (SRE)

by Wilvie Anora

Sep 20, 2021 / 2h 15m

Description

Managing a team effectively is challenging, especially when that team must fulfill a specialized function such as Site Reliability Engineering (SRE). In this course, Managing Teams for Site Reliability Engineering (SRE), you’ll learn how to manage an SRE team effectively, considering everything from human impact to team structure. First, you’ll explore how to manage the human impact of working in an SRE team by understanding psychological safety, managing workloads, and minimizing mental health impact and burnout. Next, you’ll discover how to manage team toil levels by first measuring toil and then reducing it. Finally, you’ll learn how to structure an optimal SRE function for organizations of different sizes, including designing the hiring pipeline and planning for career progression. When you’re finished with this course, you’ll have the skills and knowledge needed to organize and manage the engineers and other personnel who make up an SRE function.

Table of contents
  1. Course Overview
  2. Managing Human Impact in Site Reliability Engineering
  3. Managing Team Toil Levels
  4. Structuring an Optimal Site Reliability Engineering Team

Implementing Site Reliability Engineering (SRE) Reliability Best Practices

by Karun Subramanian

Sep 13, 2021 / 1h 52m

Description

Site Reliability Engineering is the implementation of efficient DevOps. In this course, Implementing Site Reliability Engineering (SRE) Reliability Best Practices, you’ll learn to implement Site Reliability Engineering best practices. First, you’ll explore managing incident response, which is a vital part of service management. Next, you’ll discover the steps to set up an efficient change management process. Finally, you’ll learn how to identify the best solutions for several common technical issues such as DNS, load balancing, health checks, and distributed consensus. When you’re finished with this course, you’ll have the skills and knowledge of Site Reliability Engineering needed to effectively manage your application or service.

Table of contents
  1. Course Overview
  2. Implementing Effective Incident Response
  3. Implementing Effective Change Management
  4. Implementing SRE Best Practices
  5. Benefits of SRE
Learning Paths

Fundamentals of Site Reliability Engineering (SRE)

  • Number of courses: 4
  • Duration: 7 hours

The DevOps ecosystem is a constantly evolving scene, and there are any number of methodologies to choose from. To help you out, this series focuses on Site Reliability Engineering and how it can help scale and evolve your various DevOps processes.

Courses in this path

Incorporating Site Reliability Engineering (SRE) in Your System Design

Managing Teams for Site Reliability Engineering (SRE)

Implementing Site Reliability Engineering (SRE) Reliability Best Practices
