Culturing Resiliency with Data: A Taxonomy of Outages

By Gremlin
This talk provides an overview of how outages at Uber over the past few years were categorized by root cause type.
Course info
Level
Intermediate
Updated
Nov 30, 2020
Duration
29m
Description

This talk provides an overview of how outages at Uber over the past few years were categorized by root cause type. We'll start with some background information, including definitions, the incident management framework, and existing preventive techniques, also known as best practices. Next, we'll cover the details and rationale behind the individual categories and sub-categories, along with their relative distribution. Then we'll take a deep dive into two of the biggest categories, deployment and capacity, with a focus on time-series-based data mining techniques that assist in detecting and simulating some of the common root causes. Finally, we'll discuss how the lessons learned were propagated as policy and process changes based on these insights.

About the author

Gremlin's enterprise Chaos Engineering platform makes it easy to build more reliable applications in order to prevent outages, innovate faster, and earn customer trust.

More from the author
Failing over without Falling over
Intermediate
21m
Nov 30, 2020
Automating Chaos Attacks at Expedia
Intermediate
24m
Nov 30, 2020