Course

Culturing Resiliency with Data: A Taxonomy of Outages

by Gremlin

This talk provides an overview of how outages at Uber over the past few years have been categorized by root cause type.

What you'll learn

We'll start with background information, including definitions, the incident management framework, and existing preventive techniques (best practices). Next come the details and rationale behind the individual categories and sub-categories, along with their relative distribution. Then we'll dive deep into two of the biggest categories, deployment and capacity, with a focus on time-series-based data mining techniques that assist in detecting and simulating some of the common root causes. Finally, we'll discuss how the lessons learned are propagated as policy and process changes based on these insights.

Table of contents

Culturing Resiliency with Data: A Taxonomy of Outages
29 mins

About the author

Gremlin is a Chaos Engineering service on a mission to help build a more reliable internet. Its solutions turn failure into resilience by offering engineers a fully hosted SaaS platform to safely experiment on complex systems and identify weaknesses before they impact customers and cause revenue loss. Founded in 2016 by CEO Kolton Andrus and CTO Matthew Fornaciari, the company has since raised $26.8 million in funding from Redpoint Ventures, Index Ventures, and Amplify Partners.
