Failover Conf 2020

Author: Gremlin

Being a resilient engineer means building systems that are hardened against the expected failures and resilient enough to withstand the unexpected ones. This year we expected the...

What You Will Learn

  • Resilience Engineering
  • DevOps
  • Reliability Engineering Practices

Pre-requisites

None.

Failover Conf 2020 Sessions

Reliability Matters More Than Ever

by Gremlin

Nov 6, 2020 / 26m

Description

Chaos and uncertainty are all around us. Tammy Butow kicks off Failover Conf by sharing why reliability and resilience matter now more than ever, and how you can achieve them.

Table of contents
  1. Reliability Matters More Than Ever

Human-in-the-Loop DevOps

by Gremlin

Nov 6, 2020 / 27m

Description

Within DevOps, automation has become a North Star. We want to automate the toil away, but the goal of "no toil" is unattainable. Many runbooks can only be partially automated because they still require human intervention and insight. Human-in-the-Loop DevOps is the idea that we can benefit from automating toil while still embracing human interaction in specific tasks. In this talk, we'll discuss the spectrum of automation in DevOps, common patterns of tasks that can be automated away, like CI/CD and monitoring, and ones that can be partially automated with Human-in-the-Loop DevOps, like incident response. We'll share examples of interfaces that pull humans into the loop at critical junctures and allow humans to add maximal value while automating the tedium. Lastly, we'll discuss how Human-in-the-Loop DevOps can improve both the on-call experience and efficiency.

Table of contents
  1. Human-in-the-Loop DevOps

Slowdown Is the New Outage

by Gremlin

Nov 6, 2020 / 31m

Description

While outage-driven news headlines can cause stock prices to plummet in the short term, performance-driven reputation loss is a slow burn that leads to longer-term customer loss. This session compares slowdowns with outages and explains why insight is needed more than observability. By understanding the difference, you'll be ready to drive agile applications, gain funding for lowering technical debt, and focus on customer retention.

Table of contents
  1. Slowdown Is the New Outage

Pitfalls in Measuring SLOs

by Gremlin

Nov 6, 2020 / 28m

Description

We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process, we learned that there were a number of important aspects that we didn’t expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, the unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure. As an SLO advocate and a design researcher, we collected user feedback through iterative deployments to learn what challenges users were running into. This conversation will discuss how we iterated on our design based on user feedback; how we deployed, learned, and re-deployed; and how we collected information from our users and from the alerts our system fired. We will discuss how we brought the theory of SLOs to practice and what we learned that we hadn’t expected in the process. We’ll also discuss implementing the SLO feature and burn alerts, and our experiences working with the SRE team who started using the alerts. Our hope is that when you buy or build your SLO tools, you’ll know what to look for and how to get started; that implementors will be able to start on more solid ground; and that we will be able to advance the state of SLO support for all teams that wish to implement them. The major design points will be broken into a discussion of what we actually built, a number of unexpected technical features, and ways that we had to educate users beyond the standard SLO guidelines. The talk is largely conceptual: no live code will be shown, although some innocent servers may well die in the process of being visualized.

Table of contents
  1. Pitfalls in Measuring SLOs

How to Fail with Serverless

by Gremlin

Nov 6, 2020 / 30m

Description

Everything fails all the time. Knowing how to deal with these failures in serverless applications becomes essential to building resilient, highly available systems. In traditional monolithic applications, catching errors and handling retries is relatively straightforward. But as our systems become more distributed, we now have multiple (often asynchronous) components processing events from several sources, all with vastly different retry behaviors and failure mechanisms. Utilizing old patterns can cause errors to get swallowed, creating brittle, unreliable systems that are difficult to debug and hard to maintain. In this talk, we’ll explore the built-in tools and processes that AWS has in place to appropriately deal with failures in distributed serverless applications. We’ll discuss retry behaviors and strategies for dealing with errors in asynchronous Lambda function invocations (DLQs, retries, and throttling), event source mappings (Kinesis, SQS, and DynamoDB streams), Step Functions (task failures, transient issues, and fallback states), Lambda invocations from AWS services (synchronous and asynchronous), calls to AWS services (using the AWS SDK and other protocols), and third-party API calls (utilizing circuit breakers and other fallback methods). While this talk focuses on the AWS ecosystem, many of these strategies are adaptable to other cloud providers as well.

Table of contents
  1. How to Fail with Serverless

Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation

by Gremlin

Nov 6, 2020 / 29m

Description

Every disaster is a concatenation of smaller failures. How can we design software and processes to accept that we live in an imperfect world? Explore the concepts of resiliency, harm reduction, over-engineering, and planning for failure with real examples. Risk Reduction is trying to make sure bad things happen as rarely as possible. It's anti-lock brakes and vaccinations and irons that turn off by themselves and all sorts of things that we think of as safety modifications in our life. We are trying to build lives where bad things happen less often. Harm Mitigation is what we do so that when bad things do happen, they are less catastrophic. Fire sprinklers, seatbelts, and needle exchanges are all about making the consequences of something bad less terrible. This talk is focused on understanding where we can prevent problems and where we can just make them less bad, and what kinds of tools we can use to make every disaster a disappointing fizzle. Audiences will leave with a clearer understanding of risk and harm, and a set of tools that can be used to minimize future problems. I'm going to talk about why we need to understand both avoiding problems and making them less catastrophic, and what kinds of tools are appropriate to each. I think that developers need to be thinking about failure states more than we currently do. We talk about avoiding them, or testing them away, but we don't talk about how to make even failure a better experience.

Table of contents
  1. Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation

Built-in Application Resiliency

by Gremlin

Nov 6, 2020 / 29m

Description

Starting a new application build with an eye on resiliency prevents headaches down the line. There are many ways to tackle this, some specific to particular language environments and system ecosystems and many shared across them all. This talk dives into these approaches. Viewers will come away with a high-level takeaway list to use as a reference later, and will learn how to develop software that is more fault-tolerant and able to withstand the impact of failures.

Table of contents
  1. Built-in Application Resiliency

Fight, Flight, or Freeze

by Gremlin

Nov 6, 2020 / 24m

Description

In this talk, Matt Stratton will explain the background of fight, flight, and freeze, and how it applies to organizations. Based on personal experiences with post-traumatic stress (PTS), Matt will give examples and suggestions on how to identify your own organizational trauma and how to help heal it.

Table of contents
  1. Fight, Flight, or Freeze

Swim Don’t Sink: Why Reliability Matters to a Site Reliability Engineering Practice

by Gremlin

Nov 6, 2020 / 29m

Description

Do you offer training to the engineers in your organization or do you throw them into the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place. The specific training needs of each engineer vary depending on several factors, including the maturity of your organization in adopting DevOps / SRE principles, practices, and culture; the knowledge those individuals have about your organization and infrastructure; and the experience of the individuals being trained, both in terms of technical skill and familiarity with the SRE / DevOps model. This talk will explore the business case for training, the trade-offs between cost and effectiveness, and best practices for training design and deployment depending on where your organization lies on the spectrum of size and maturity. Learn why training is not about unleashing a fire hose of information upon unsuspecting engineers but about giving those engineers the confidence to run production systems at scale.

Table of contents
  1. Swim Don’t Sink: Why Reliability Matters to a Site Reliability Engineering Practice

Performing Chaos in a Serverless World

by Gremlin

Nov 6, 2020 / 30m

Description

Chaos engineering is the practice of hypothesis testing through planned experiments to gain a better understanding of a system’s behavior. The principles of chaos engineering have been around for years, and we have now reached the point where chaos engineering has gone from being a buzzword and a practice used by a few large organizations in very specific fields to being put into use by companies of all sizes and industries. Planning and performing chaos experiments on traditional infrastructure with virtual machines and microservices using containers have been battle-tested by many large organizations, but serverless functions and managed services present different failure modes and levels of abstraction. In this talk, we focus on how to apply the principles of chaos engineering to serverless, both for serverless functions and managed services. This covers how hypotheses can be formed to fit serverless, what the experiments can achieve, and how to practically perform them. Although tools for chaos engineering, both commercial and open-source, are getting more mature, most of them still focus primarily on virtual machines and containers. We’ll look at what tools are out there to help with chaos experiments for serverless and managed services, but also how you can build your own. Join as we move from talking about the principles to performing real chaos in a serverless world!

Table of contents
  1. Performing Chaos in a Serverless World

The Future of DevOps Is Resilience Engineering

by Gremlin

Nov 6, 2020 / 30m

Description

For more than a decade, many of us have been working to bring DevOps to organizations around the world. We’ve made amazing progress, but there’s so much more to do. Now that we have continuous integration & deployment widespread and developers are taking more ownership of production, what’s next? Amy will talk about what Resilience Engineering is, how it relates to DevOps, and how she thinks it gives us the science and research we need to take our organizations to the next level of robustness while remaining agile and growing our ability to care for the people around us.

Table of contents
  1. The Future of DevOps Is Resilience Engineering

Improving a Distributed System Post-Incident

by Gremlin

Nov 6, 2020 / 33m

Description

In this session, we will dive into a case study of how a team can recover and improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their incredible complexity and scale, and incidents are a fact of life with these systems. This year, my team faced a week-long incident in our IP address management system which impacted our customers. After this incident, we had to reevaluate our system's performance and overhaul several key areas of our codebase, as well as improve our monitoring, testing processes, database interactions, and reliability. Viewers will learn about these improvements and how they can apply them to their own systems to achieve greater reliability and performance. Additionally, viewers will learn how to effectively leverage monitoring practices to uncover inefficiencies in their system, tips for creating a testing process to properly stress your system before deploying to production, and how to rally a team together during a high-pressure incident.

Table of contents
  1. Improving a Distributed System Post-Incident

The Halo of Resilience Engineering

by Gremlin

Nov 6, 2020 / 32m

Description

Recent world-impacting events have forced us all to rethink the way we go about our daily work. In this talk, we'll look at how some of the pillars of Resilience Engineering might help you and your team deal with the changes we're all being forced to confront.

Table of contents
  1. The Halo of Resilience Engineering