Amazon Web Services Stability and the September 13th US East 1 Outage

by Chris Swan on Sep 20, 2013

Amazon Web Services (AWS) suffered another outage of its US East 1 region during the morning of Friday 13th September. A number of popular applications such as Heroku, GitHub and CMSWire were disrupted, along with many other customers in Amazon's largest, oldest and busiest location.

A few days before this most recent failure, cloud commentator Ben Kepes wrote, 'Every time AWS has an outage it seems to be the Eastern zone that brings the service down.' Kepes goes on to refer to a post from analyst René Büst that describes US East 1 as 'old, cheap and fragile'.

Amazon hasn't released a detailed post mortem, but the problems last Friday are attributed to networking issues. A previous outage in April 2011 was also network related, though more recent issues in December 2012 and October 2012 were traced back to problems with services such as Elastic Load Balancer (ELB) and Elastic Block Store (EBS). Network and EBS failures have been particularly pernicious because they have caused disruption across availability zones (which are supposed to be fault boundaries) or brought down higher level services (like ELB) that are supposed to provide fault tolerance.

Typically application owners have used traditional architectures rather than designing for cloud and its inherent instability, with many applications failing to use multiple availability zones in a region, or multiple regions. Designing for failure doesn't always save the day, however. Netflix, with its 'simian army' of chaos monkeys, is often paraded as a paragon of cloud-ready design. It deliberately causes faults in its platform on a continuous basis to prove that it can keep working, but sometimes (such as the Christmas Eve outage) there just isn't enough capacity to absorb load elsewhere, and some customers are left with a degraded service.
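The capacity point can be made concrete with a small calculation. The following is a minimal sketch, not anything Amazon or Netflix publishes: the function name, the zone labels and the capacity and load figures are all hypothetical, and it assumes load from a failed zone can be redistributed freely across the survivors.

```python
def survives_zone_loss(zone_capacity, zone_load):
    """For each zone, report whether the remaining zones could absorb
    the total load if that zone failed.

    Both arguments are dicts keyed by a (hypothetical) zone name, with
    capacity and load expressed in the same arbitrary units.
    """
    total_load = sum(zone_load.values())
    result = {}
    for failed in zone_capacity:
        # Capacity left once the failed zone is taken out of service.
        remaining = sum(c for z, c in zone_capacity.items() if z != failed)
        result[failed] = remaining >= total_load
    return result


# Illustrative numbers only: three equal zones of 100 units each.
capacity = {"zone-a": 100, "zone-b": 100, "zone-c": 100}

# At 60% utilisation per zone, any single zone can fail safely.
light = survives_zone_loss(capacity, {z: 60 for z in capacity})

# At 70% utilisation, losing any one zone leaves too little headroom.
heavy = survives_zone_loss(capacity, {z: 70 for z in capacity})
```

With three equal zones, the survivors can only carry the full load if each zone runs at no more than two thirds of its capacity, which is the kind of headroom a multi-zone design has to pay for up front.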

The succession of outages in US East 1, and the failure of services that are supposed to help (like ELB), provide an opportunity for Amazon's competitors in the infrastructure as a service market. Google has recently released its own load balancing service for Google Compute Engine, along with recommendations for designing robust systems.

InfoQ.com and all content copyright © 2006-2014 C4Media Inc.