Amazon Web Services Stability and the September 13th US East 1 Outage
Amazon Web Services (AWS) suffered another outage of its US East 1 region on the morning of Friday 13th September. A number of popular applications such as Heroku, GitHub and CMSWire were disrupted, along with many other customers in Amazon's largest, oldest and busiest location.
A few days before this most recent failure, cloud commentator Ben Kepes wrote, 'Every time AWS has an outage it seems to be the Eastern zone that brings the service down.' Kepes goes on to refer to a post from analyst René Büst that describes US East 1 as 'old, cheap and fragile'.
Amazon hasn't released a detailed post mortem, but the problems last Friday are attributed to networking issues. A previous outage in April 2011 was also network related, though more recent issues in December 2012 and October 2012 were traced back to problems with services such as Elastic Load Balancing (ELB) and Elastic Block Store (EBS). Network and EBS failures have been particularly pernicious because they have caused disruption across availability zones (which are supposed to be fault boundaries) or brought down higher level services (like ELB) that are supposed to provide fault tolerance.
Typically, application owners have used traditional architectures rather than designing for the cloud and its inherent instability, with many applications failing to use multiple availability zones in a region, or multiple regions. Design for failure doesn't always save the day, however. Netflix, with its 'Simian Army' of 'chaos monkeys', is often paraded as a paragon of cloud ready design. The company deliberately causes faults in its platform on a continuous basis to prove that it can keep working, but sometimes (such as during the Christmas Eve outage) there just isn't enough capacity to absorb load elsewhere, and some customers are left with a degraded service.
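The capacity point can be made concrete with some simple arithmetic. The following is a minimal sketch, not anything published by Amazon or Netflix: it assumes identical zones, load spread evenly across them, and load from failed zones redistributed evenly over the survivors. The function names are hypothetical and chosen only for illustration.

```python
def surviving_utilization(zones: int, failed: int, utilization: float) -> float:
    """Per-zone utilization after the load of `failed` zones is spread
    evenly over the remaining zones (assumes identical, evenly loaded zones)."""
    if zones < 1 or not 0 <= failed < zones:
        raise ValueError("need at least one surviving zone")
    return utilization * zones / (zones - failed)


def max_safe_utilization(zones: int, failed: int) -> float:
    """Highest steady-state utilization at which the survivors can still
    absorb the load of `failed` lost zones without exceeding 100%."""
    if zones < 1 or not 0 <= failed < zones:
        raise ValueError("need at least one surviving zone")
    return (zones - failed) / zones


def can_absorb(zones: int, failed: int, utilization: float) -> bool:
    """True if the surviving zones stay at or below full capacity."""
    return utilization <= max_safe_utilization(zones, failed)


if __name__ == "__main__":
    # Three zones running at 60%: losing one pushes the other two to about 90%.
    print(surviving_utilization(3, 1, 0.60))
    # Three zones running at 80%: losing one would need 120%, so service degrades.
    print(can_absorb(3, 1, 0.80))
```

Under these assumptions, an application spread over three zones has to run each zone at no more than about two thirds of its capacity to survive the loss of one. Running hotter than that saves money day to day but leaves no room to absorb a zone failure.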
The succession of outages in US East 1, and the failure of services that are supposed to help (like ELB), provide an opportunity for Amazon's competitors in the infrastructure as a service market. Google has recently released its own load balancing service for Google Compute Engine, along with recommendations for designing robust systems.