Failing over without Falling over

by Gremlin

What you'll learn

Many organizations have disaster recovery (DR) failover plans that are poorly tested and implemented, and they are afraid to test or use them in a realistic manner. This talk will show how we can use System Theoretic Process Analysis (STPA), as advocated by Professor Nancy Leveson’s team at MIT, to analyze failover hazards. Observability and human understanding of safety margins and the state of a failover are critical to having a real DR capability. Chaos engineering, game days, and a high level of automation provide continuously tested resilience and confidence that systems will fail over without falling over.

Table of contents

Failing over without Falling over
22 mins

About the author

Gremlin is a Chaos Engineering service on a mission to help build a more reliable internet. Its solutions turn failure into resilience by offering engineers a fully hosted SaaS platform to safely experiment on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss. Founded by CEO Kolton Andrus and CTO Matthew Fornaciari in 2016, the company has since raised $26.8 million in funding from Redpoint Ventures, Index Ventures, and Amplify Partners.