Course

Improving a Distributed System Post-Incident

In this session, we will dive into a case study of how a team can recover and improve a distributed system after a major incident.

Intermediate

33m

(4)

Created by Gremlin

Last Updated Dec 14, 2022

This course is included in the libraries shown below:

Core Tech

What you'll learn

In this session, we will dive into a case study of how a team can recover and improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their complexity and scale, and incidents are a fact of life with these systems. This year, my team faced a week-long incident in our IP address management system that impacted our customers. As a result, we had to reevaluate our system's performance and overhaul several key areas of our codebase, as well as improve our monitoring, testing processes, database interactions, and reliability. Viewers will learn about these improvements and how to apply them to their own systems to achieve greater reliability and performance. Additionally, viewers will learn how to use monitoring practices to uncover inefficiencies in their system, how to create a testing process that properly stresses a system before deploying to production, and how to rally a team together during a high-pressure incident.

About the author

Gremlin

32 courses

3.7 author rating

18 ratings

Gremlin's enterprise Chaos Engineering platform makes it easy to build more reliable applications in order to prevent outages, innovate faster, and earn customer trust.
