👋 Hi, this is Gergely with a subscriber-only issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. If you’ve been forwarded this email, you can subscribe here.

How AWS deals with a major outage

What happens when there’s a massive outage at AWS? A member of AWS’s Incident Response team lifts the lid, after playing a key role in resolving the leading cloud provider’s most recent major outage.

In October, the largest Amazon Web Services (AWS) region in the world suffered an outage lasting 15 hours, with global impact as thousands of sites and apps crashed or degraded – including Amazon.com, Signal, Snapchat, and others. AWS released an incident summary three days later, revealing that the outage in us-east-1 started with a failure inside DynamoDB’s DNS system, which then spread to Amazon EC2 and to AWS’s Network Load Balancer.

The incident summary left open questions such as why the outage took so long to resolve, and some media coverage sought to fill the gap. The Register claimed that an “Amazon brain drain finally sent AWS down the spout”, because some AWS staff who knew the systems inside out had quit the company, and their institutional knowledge was sorely missed.

For more clarity and detail, I went to an internal source at Amazon: Senior Principal Engineer Gavin McCullagh, who was part of the crew that resolved this outage from start to finish. In this article, Gavin shares his insider perspective and some new details about what happened, and we find out how incident response works at the company. This article is based on Gavin’s account of the incident to me. We cover:
Spoiler alert: this outage was not caused by a brain drain. In fact, many of the engineers who originally built DNS Enactor (the service responsible for updating routes in Amazon’s DNS service) roughly three years ago are very much still at AWS, and five of them hopped onto the outage call in the dead of night, which likely helped resolve the outage more quickly. As it turns out – and as readers of this newsletter likely already know! – operating distributed systems is simply hard, and it’s even harder when several things go wrong at once.

Note: if you work at Amazon, get full access to The Pragmatic Engineer with your corporate email here. It includes deepdives like Inside Amazon’s Engineering Culture, similar ones on Meta, Stripe, and Google, and a wide variety of other engineering culture deepdives.

The bottom of this article could be cut off in some email clients. Read the full article uninterrupted, online.

1. Incident Response team at AWS

Amazon is made up of many different businesses, of which the best known are:
These organizations operate pretty much independently, with their own processes and setups. The processes are similar but not identical, and both groups evolve how they work over time, separately. In this deepdive, our sole focus is AWS, though it’s safe to assume Retail has similar functions, such as its own Incident Response teams.

Regions and AZs: AWS is made up of 38 cloud regions. Each one consists of at least 3 Availability Zones (AZs), each of which is one or more independent data centers, connected via a low-latency network. Note: “Region” and “AZ” mean slightly different things among cloud providers, as covered in Three cloud providers, three outages, three different responses.
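To make the region/AZ structure concrete, here is a minimal sketch – mine, not Gavin’s or AWS’s – that enumerates regions and their Availability Zones with boto3. It assumes boto3 is installed, AWS credentials are configured locally, and it only lists the regions enabled for your account.

```python
# Illustrative sketch: enumerate AWS regions and their Availability Zones.
# Assumes AWS credentials are already configured (e.g. via ~/.aws/credentials).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Regions enabled for this account (a subset of all 38 unless every opt-in region is enabled)
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

for region in sorted(regions):
    regional_ec2 = boto3.client("ec2", region_name=region)
    # Count only true Availability Zones, not Local Zones or Wavelength Zones
    azs = regional_ec2.describe_availability_zones(
        Filters=[{"Name": "zone-type", "Values": ["availability-zone"]}]
    )["AvailabilityZones"]
    print(f"{region}: {len(azs)} AZs -> {[az['ZoneName'] for az in azs]}")
```

Run against a typical account, each region reports three or more zone names (us-east-1, for example, spans us-east-1a through us-east-1f), and that per-region redundancy is what multi-AZ deployments rely on when a single data center fails.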