Everything fails all the time. Knowing how to deal with failures in serverless applications is essential to building resilient, highly available systems. In traditional monolithic applications, catching errors and handling retries is relatively straightforward. But as our systems become more distributed, we have multiple (often asynchronous) components processing events from several sources, each with its own retry behavior and failure modes. Applying old patterns can cause errors to get swallowed, creating brittle, unreliable systems that are hard to debug and maintain. In this talk, we’ll explore the built-in tools and processes that AWS provides for dealing with failures in distributed serverless applications. We’ll discuss retry behaviors and error-handling strategies for: asynchronous Lambda function invocations (DLQs, retries, and throttling); event source mappings (Kinesis, SQS, and DynamoDB Streams); Step Functions (task failures, transient issues, and fallback states); Lambda invocations from AWS services (synchronous and asynchronous); calls to AWS services (using the AWS SDK and other protocols); and third-party API calls (using circuit breakers and other fallback methods). While this talk focuses on the AWS ecosystem, many of these strategies can be adapted to other cloud providers as well.