AWS US-EAST-1 Outage: Postmortem and Lessons Learned

On December 7th AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia (us-east-1). The cloud provider released an analysis of the incident, which sparked community discussions about redundancy on AWS and multi-region approaches.

The outage began at 10:30 AM ET, impacting many customers (including Netflix, Disney+ and Delta Air Lines) and cascading through Amazon's retail operation, Alexa voice service and Ring security cameras. Recovery time varied by service, but the region was not fully operational until late in the day. The incident was the most significant in the Northern Virginia region for many years and lasted longer than the S3 disruption in 2017.

Affected services included CloudWatch, API Gateway, Security Token Service (STS) and container services like Fargate, ECS and EKS. Instances and containers that were already running were unaffected, but failing API requests prevented customers from modifying them or launching new ones. The cloud provider acknowledges that the incident "impacted many customers in significant ways" and explains:

An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
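The feedback loop described above, where errors lead to more connection attempts and retries, is the classic shape of a retry storm. As an illustration only, and not a description of AWS's internal clients or of the fix AWS applied, the following minimal Python sketch shows one common client-side mitigation: capped exponential backoff with random jitter, so that failing clients spread their retries out instead of reconnecting in lockstep. The function names and default values are hypothetical.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Yield jittered delays: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(operation, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on ConnectionError wait a jittered delay and retry.

    Makes at most `attempts` calls, then re-raises the last error.
    """
    delays = backoff_delays(base, cap, attempts - 1, rng)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted, give up
            sleep(delay)
```

The cap bounds the longest wait, the jitter avoids synchronized retry waves, and the fixed attempt budget ensures a client eventually stops adding load to an already congested network.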

During the outage the Service Health Dashboard was not kept up to date and users were unable to create support cases for many hours, with many joking about the real status of the region and the lack of updates. AWS attributes this to a failure of its monitoring systems and promises a revamp of the status page:

The impairment to our monitoring systems delayed our understanding of this event, and the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region. (...) We expect to release a new version of our Service Health Dashboard early next year.

The incident triggered many threads and articles on redundancy, multi-region and multi-cloud approaches with the goal of minimizing the impact of future cloud outages. In "Lessons in Trust From us-east-1", Corey Quinn, cloud economist at The Duckbill Group, questions the various service interdependencies at AWS and warns:

You cannot have a multi-region failover strategy on AWS that features AWS’s us-east-1 region. Too many things apparently single-track through that region for you to be able to count on anything other than total control-plane failure when that region experiences a significant event. A clear example of this is Route 53’s impairment.

Jeremy Daly, author of the weekly serverless newsletter Off-by-none, believes that developers should not overreact:

What am I going to do about it? Probably nothing (...) There are some systems that simply can’t go down, and those systems most certainly should invest in redundancy, especially if human life is at risk. For the other 99.99% of us building workloads in the cloud, a multi-hour outage may sting, and perhaps even result in significant revenue losses, but compared to the cost of implementing and maintaining solutions to mitigate these outages (that only happen every few years), it’s a drop in the bucket. I’ve got more important things to focus on.

Zack Kanter, founder & CEO at Stedi, started a popular Twitter thread asking:

If AWS were rebuilt today, what high-level incidental complexity do you wish would be eliminated via different design decisions?

Replies showed that users mostly wish for spending limits, a better free tier and better support for data sovereignty, with multiple regions in the same country.

On December 15th the cloud provider suffered further, though shorter, connectivity problems in two US regions, us-west-1 and us-west-2.
