
A Human Error Took Down AWS S3 US-EAST-1

by Abel Avram on Mar 03, 2017 · Estimated reading time: 2 minutes

A mistake took down more S3 servers than intended, including servers supporting two subsystems essential to S3 operation. The result was an S3 outage that also affected other services depending on S3. Normal operation was restored in about four hours.

Amazon AWS rarely fails, but when it does, the internet finds out and many are directly affected. That was the case with the recent AWS S3 outage that took place on February 28th in the Northern Virginia (US-EAST-1) region. It was not just S3 that was disrupted, but also a number of AWS services that depend on it, including EC2, EFS, API Gateway, Athena, CloudSearch, Elastic MapReduce and others. These services either returned elevated error rates or did not work at all.

During the four-hour disruption, Expedia, GitLab, GitHub, GroupMe, IFTTT, Medium, Nest, Quora, Slack, The Verge, Trello, Twitch, Wix and many others were reportedly taken offline or severely affected. Even Amazon's Alexa had trouble working, and the AWS status dashboard was not updated for about two hours. Amazon relied on Twitter to report on the problem and added a hand-edited temporary banner to the status page.

Amazon published a post-mortem analysis of the incident. An Amazon team was debugging an S3 billing issue, and an engineer entered a command meant to remove a small number of servers from one of the S3 billing subsystems. The command was entered incorrectly and removed a larger set of servers, including servers supporting two other subsystems. One was the index subsystem for the entire S3 region, which services GET, PUT, LIST and DELETE requests; the other was the placement subsystem, which allocates storage for new objects. With these two subsystems down, S3 returned errors at a high rate, affecting many customers.

While AWS has measures in place to quickly restore functionality when these subsystems fail, in this case they took considerably longer to restart: they had not been rebooted in years, and the index had grown very large in the meantime. Functionality was eventually restored, but later than expected. Amazon had already planned to partition the index subsystem some time this year, breaking the index into smaller chunks that can be restarted faster; they will now proceed with the partitioning immediately to be better prepared for a future disruption. They have also changed their tools to limit the number of servers a single command can take down and to avoid taking an entire subsystem offline. Finally, the status dashboard, which itself depends on S3, has been distributed across multiple regions so it keeps working even when one region is down.
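The tooling safeguard Amazon describes amounts to validating a capacity-removal request against a throttle and a minimum-capacity floor. The sketch below is purely illustrative, assuming hypothetical names (`Subsystem`, `remove_capacity`); AWS's actual tooling is internal and not public.

```python
# Hedged sketch of a capacity-removal guard of the kind Amazon described
# adding after the incident. All names here are hypothetical.

from dataclasses import dataclass

@dataclass
class Subsystem:
    name: str
    servers: int
    min_capacity: int  # floor below which the subsystem cannot operate

MAX_REMOVAL_PER_COMMAND = 2  # throttle: remove capacity slowly

def remove_capacity(subsystem: Subsystem, requested: int) -> int:
    """Remove at most `requested` servers, honoring the safety floor and
    the per-command throttle. Returns the number actually removed."""
    allowed = min(requested,
                  MAX_REMOVAL_PER_COMMAND,
                  subsystem.servers - subsystem.min_capacity)
    allowed = max(allowed, 0)
    subsystem.servers -= allowed
    return allowed
```

With a guard like this, a mistyped `remove_capacity(index, 50)` removes at most a couple of servers per command and can never push a subsystem below the capacity it needs to operate.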

Back in 2011 Amazon experienced another disruption in the US East region, that one a four-day blackout. Then as now, the basic lesson is to build systems that do not rely solely on one region and can switch to another region when the current one is down. Netflix has such measures in place, and others could too, but multi-region redundancy increases cloud hosting costs, which businesses prefer to keep as low as possible. Amazon AWS is generally considered reliable, but services get disrupted from time to time, and that happens with any cloud provider.
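The cross-region lesson above can be sketched as a simple failover loop: try the primary region first and fall back to a secondary one when it fails. The region names and the `fetch` callable below are illustrative stand-ins, not a real AWS SDK API.

```python
# Hedged sketch of multi-region failover, assuming a generic fetch
# callable rather than any particular cloud SDK.

from typing import Callable, Sequence

def fetch_with_failover(regions: Sequence[str],
                        fetch: Callable[[str], bytes]) -> bytes:
    """Attempt fetch(region) for each region in order; return the first
    successful result, or re-raise the last error if all regions fail."""
    last_error: Exception | None = None
    for region in regions:
        try:
            return fetch(region)
        except Exception as exc:  # in practice, catch the SDK's error types
            last_error = exc
    raise last_error

# Illustrative stub: us-east-1 is down, but us-west-2 still answers.
def stub_fetch(region: str) -> bytes:
    if region == "us-east-1":
        raise ConnectionError("region unavailable")
    return b"object-bytes"
```

Real failover also has to account for data replication lag and DNS or routing switchover, which is where most of the cost mentioned above comes from; the loop itself is the easy part.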


A Human Error Took Down AWS S3 US-EAST-1 by Dave Wieneke

AWS S3 US-EAST-1: Let me put it this way, Mr. Amer. The AWS S3 series is the most reliable computer ever made. No AWS S3 computer has ever made a mistake or distorted information. We are all, by any practical definition of the words, foolproof and incapable of error. Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before and it has always been due to human error.

Re: A Human Error Took Down AWS S3 US-EAST-1 by Abel Avram

AWS Sydney suffered a massive disruption last year because of a power failure. aws.amazon.com/message/4372T8/
