Chaos Conf 2020

Advanced

Author: Gremlin

Chaos Conf 2020 was the largest Chaos Engineering event ever with 18 talks over 3 days and more than 3,500 people registered. That’s 5x more than 2019, and nearly 10x more than...

What You Will Learn

  • Chaos Engineering
  • Incident Management
  • Reliability
  • DevOps

Pre-requisites

None.

Chaos Conf 2020 Sessions

Top 5 Things You Can Do to Reduce Operational Load

by Gremlin

Nov 24, 2020 / 27m

Description

With the world shifting to everything online, digital dependency and pressure are higher than ever. In March, PagerDuty saw incidents double across the board for its customers, with significant spikes in industries like online learning and ecommerce. The pressure isn't letting up, nor are customer expectations. Based on PagerDuty's data and conversations with thousands of customers, Rachel will talk about the easiest things you can do to make a big difference in reducing operational work from incidents. She'll also discuss ways to reduce duplicative efforts, surface issues, and improve response times to build more reliable teams.

Table of contents
  1. Top 5 Things You Can Do to Reduce Operational Load

The More You Know: A Guide to Understanding Your Systems

by Gremlin

Nov 24, 2020 / 24m

Description

For a platform provider, incidents and outages cost customers money. Whatever your role (developer, quality engineer, SRE, or even technical management), you must deliver trust. Delivering trust is accomplished by shipping secure and reliable systems, and you have to know your systems in order to do that. This talk will share how we developed a template that enables anyone at Twilio to understand their systems better, identify critical metrics to watch, and use Chaos Engineering to verify it all.

Table of contents
  1. The More You Know: A Guide to Understanding Your Systems

Stabilizing and Reinforcing H-E-B's Curbside Fulfillment Systems While Reinventing Them

by Gremlin

Nov 24, 2020 / 26m

Description

While going through the process of reinventing H-E-B's curbside and home delivery fulfillment systems, we had to spend significant effort to stabilize and reinforce the existing mission-critical systems to give us the cover needed to get to the finish line. It took a blend of utilizing new services as anti-corruption layers as well as addressing complex technical debt and performance issues to improve our uptime and reduce business impact. It also took using our newly developed chaos engineering mindset to get creative in introducing failure to validate our fixes.

Table of contents
  1. Stabilizing and Reinforcing H-E-B's Curbside Fulfillment Systems While Reinventing Them

Self-service Chaos Engineering: Fitting Gremlin into Grubhub's DevOps Culture

by Gremlin

Nov 24, 2020 / 21m

Description

In the era of DevOps and self-service culture, human processes are often harder than technical ones. Rolling out Gremlin to our infrastructure was easy, but enabling engineering teams to efficiently and safely practice Chaos Engineering was trickier. In this session, Doug Cambell will share how we rolled out Gremlin at Grubhub and how we educated and enabled all engineering teams to use it.

Table of contents
  1. Self-service Chaos Engineering: Fitting Gremlin into Grubhub's DevOps Culture

Scaling Culture of Resiliency in the Enterprise at Charter Communications

by Gremlin

Nov 24, 2020 / 22m

Description

How do you build a culture of reliability in a massive organization with well-established expectations of how to operate? A common assumption about enterprises is that everything moves at a glacial pace. After growing Charter’s product data engineering team from a handful of engineers to 30, the company implemented a large reorg. The new data platforms group quadrupled in size to over 120 engineers and took on responsibility for a mission-critical services platform that backs customer self-service digital applications and portals. This set of services needed to grow its reliability and Chaos Engineering practice. Nate Vogel, VP, Data Platforms, will share how he grew the data engineering team with an emphasis on building a culture of reliability. He’ll discuss the processes and tools his team used to ensure Charter and its customers have the data and analytics necessary to drive the business. Nate will also provide insight on how to share a culture of reliability in the face of sudden team expansion.

Table of contents
  1. Scaling Culture of Resiliency in the Enterprise at Charter Communications

Let Devs Be Devs: Abstracting Compliance and Reliability to Accelerate JPMC's Cloud Deployments

by Gremlin

Nov 24, 2020 / 31m

Description

Reliability gets harder as complexity grows, and it makes shipping software difficult. The rigorous compliance requirements of the financial industry add further challenges to developer velocity on modern cloud platforms. When you scale that up to an organization of JP Morgan Chase’s size, with over 6,500 apps and 50,000 engineers working across a global organization, it can bring everything to a grinding halt.

In this session, Rahul Arya, Managing Director & Head of Global Technology Solutions Architecture at JPMC, will share how they built a platform to abstract away compliance, make reliability with Chaos Engineering completely self-serve, and enable developers across the organization to ship code faster than ever.

Table of contents
  1. Let Devs Be Devs: Abstracting Compliance and Reliability to Accelerate JPMC's Cloud Deployments

Lessons from Incident Management and Postmortems at Atlassian

by Gremlin

Nov 24, 2020 / 27m

Description

How do you run incidents and postmortems at a company with thousands of engineers spread across the globe? Jim Severino shares what did and didn't work for Atlassian.

Table of contents
  1. Lessons from Incident Management and Postmortems at Atlassian

Lead Times and Psychological Safety within the Five Ideals

by Gremlin

Nov 24, 2020 / 29m

Description

The biggest challenges engineering organizations face are not technical. They’re fundamental problems with how we think and go about doing work, and the environments that we work in. In this talk, Gene Kim will share the Five Ideals and how they relate to Chaos Engineering. He’ll also show how the Five Ideals help build stronger, better performing, and ultimately more reliable companies.

Table of contents
  1. Lead Times and Psychological Safety within the Five Ideals

Identifying Hidden Dependencies

by Gremlin

Nov 24, 2020 / 20m

Description

You don't need to write automation or deploy on Kubernetes to gain benefits from resilience engineering! Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose. We'll discuss the initial manual experiments we ran, the bugs in our automatic replacement tools we uncovered, and what steps we needed to progress towards continuously running the experiments. Today, no node at Honeycomb lives longer than 12 months, and we automatically recycle nodes every week.

Table of contents
  1. Identifying Hidden Dependencies

IBM's Principles of Chaos Engineering

by Gremlin

Nov 24, 2020 / 20m

Description

IBM has a long history of improving the reliability and availability of systems ranging from the largest of mainframes to the smallest of microservices. As part of cultural and organisational improvements, we’ve sat down and codified a list of Chaos Engineering principles which define our view of Chaos Engineering. These principles do not replace existing principles, but adapt them and match them to the requirements we have from our clients and from our own internal services. In this session, we will describe a little of the process of getting engineers from across IBM to agree on these principles, and present the principles and lessons which we agreed upon.

Table of contents
  1. IBM's Principles of Chaos Engineering

Failing over without Falling over

by Gremlin

Nov 30, 2020 / 21m

Description

Many organizations have disaster recovery (DR) failover plans that are poorly tested and implemented, and they are scared to test or use them in a realistic manner. This talk will show how we can use System Theoretic Process Analysis (STPA), as advocated by Professor Nancy Leveson’s team at MIT, to analyze failover hazards. Observability and human understanding of safety margins and the state of a failover are critical to having a real DR capability. Chaos engineering, game days, and a high level of automation provide continuously tested resilience, and confidence that systems will fail over without falling over.

Table of contents
  1. Failing over without Falling over

Culturing Resiliency with Data: A Taxonomy of Outages

by Gremlin

Nov 30, 2020 / 29m

Description

This talk provides an overview of how outages at Uber over the past few years were categorized by root cause type. We'll start with some background information, including definitions, the incident management framework, and existing preventive techniques, aka best practices. We'll then cover the details and rationale behind individual categories and sub-categories, along with their relative distribution. Next, we'll take a deep dive into two of the biggest categories, deployment and capacity, with a focus on time-series-based data mining techniques that assist in detecting and simulating some of the common root causes. Finally, we'll discuss the propagation of lessons learned in terms of policy and process changes based on these insights.

Table of contents
  1. Culturing Resiliency with Data: A Taxonomy of Outages

Convergence of Chaos Engineering and Revolutionized Technology Techniques

by Gremlin

Nov 30, 2020 / 24m

Description

Novel research areas such as the Internet of Things (IoT), Artificial Intelligence (AI), Cybersecurity, and Human Augmentation (HA) have demonstrated great potential for solving specific problems. Medicine, Transportation, Software, Education, and Finance have all benefited from their progress. However, reaching this success requires taking risks and failing many times to gain resilience. This journey involves terms and techniques that we study in Chaos Engineering, so in this talk we will explore how these emerging paradigms can use Chaos Engineering to manage the pains on the path toward providing a solution. Conversely, we will show how Chaos Engineering can benefit from, for example, Artificial Intelligence. Further, we will propose a conceptual model to explore the influence of these emerging paradigms on Chaos Engineering and how to use the Chaos Principles to identify risks and vulnerabilities and to generate resilience solutions.

Table of contents
  1. Convergence of Chaos Engineering and Revolutionized Technology Techniques

Chaos Engineering: The Path to Reliability

by Gremlin

Dec 9, 2020 / 26m

Description

We’re all here for the same purpose: to ensure the systems we build operate reliably. This is a difficult task, one that must balance people, process and technology during difficult conditions. We operate with incomplete information, assessing risks and dealing with emerging issues. We’ve found Chaos Engineering to be a valuable tool in addressing these concerns. Learn from real world examples what works, what doesn’t, and what the future holds.

Table of contents
  1. Chaos Engineering: The Path to Reliability

Certainty among the Chaos

by Gremlin

Nov 30, 2020 / 20m

Description

Chaos engineering tests your application resiliency by thoughtfully injecting failure and starving resources. Complete failure is obvious, but how do you detect the warning signs of pre-failure stress? This session takes the capabilities of chaos engineering beyond resiliency to support capacity optimization. You already need to monitor performance to see when your code is bending before it breaks. Why not glean more insight from the data so you can prioritize efforts and respond rapidly?

Table of contents
  1. Certainty among the Chaos

Can Chaos Coerce Clarity from Compounding Complexity? Certainly.

by Gremlin

Nov 30, 2020 / 25m

Description

Let's go Black Swan hunting together. This is a very different kind of hunting, and the tool we need is chaos. You see, the swans we're hunting aren't sitting in a tranquil pond or gliding majestically over a clear lake on a beautiful, sunny day. These swans are hiding in your products. They are hiding in your architecture, your infrastructure, and every dark-corner-turned-refuge created by layer upon layer of increasing system complexity. And these swans, these Black Swans, are not friendly or majestic creatures. They are wild maniacs, whose singular purpose is to watch your products burn. So suit up! Grab some coffee, put on something comfortable, and follow me, chaos tools in hand. Let's get some birds.

Table of contents
  1. Can Chaos Coerce Clarity from Compounding Complexity? Certainly.

Building a Reliable Community

by Gremlin

Nov 30, 2020 / 19m

Description

As Chaos Engineering transitions from early adopters to a mainstream practice, the Chaos Engineering community has continued to grow and expand. As we think about how to make our applications and organizations more reliable, we must also reflect on how to become a more reliable community. Strengthening the community helps us all build stronger Chaos Engineering practices, accelerates the adoption of Chaos Engineering, and ultimately helps us all build a more reliable internet.

Table of contents
  1. Building a Reliable Community

Breaking Serverless Things on Purpose - Chaos Engineering in Stateless Environments

by Gremlin

Nov 30, 2020 / 24m

Description

Serverless enabled us to build highly distributed applications that led to more granular functions and ultimate scalability. However, it also brought the risk of failure from a single microservice to many serverless functions and resources. You might be able to predict and design for certain troublesome issues but there are many, many more that you probably will not be able to easily plan for. How do you build a resilient system under these highly distributed circumstances? The answer is Chaos Engineering: Breaking things on purpose just to experience how the whole system will react.

Join us as we walk through:

  • The unique challenges of building a highly resilient serverless app
  • Why you need to design for problems you cannot predict and cannot easily test for
  • How you can plan your game days for chaos experiments with serverless components
  • How you can take advantage of out-of-the-box and third-party observability solutions to measure the impact of chaos experiments

Table of contents
  1. Breaking Serverless Things on Purpose - Chaos Engineering in Stateless Environments

Automating Chaos Attacks at Expedia

by Gremlin

Nov 30, 2020 / 24m

Description

In an effort to build resilience into our services, we at Hotels.com and Expedia Group explored processes and tools to stress and 'break' our systems on purpose. In this session, we will show you how to run attacks in both manual and automated ways. This includes attacks that run as part of the CI pipeline, ones that run randomly in production using automation, or even experiments with chaos-as-a-service platforms which can be used in GameDays.

Table of contents
  1. Automating Chaos Attacks at Expedia