Avoiding Downtime When Cloud Services Fail
Another AWS outage hit several large websites and their services last week. What can be done to avoid downtime? Architect for failover, not just for scale.
According to the AWS Service Health Dashboard, a number of cloud services (EC2, EMR, RDS, Elastic Beanstalk) running in an Availability Zone of Amazon's US East Region (Northern Virginia) were down from approximately 7:25 PM PDT on June 29th to 3 PM PDT on June 30th, affecting many companies, their services and websites, including Netflix, Instagram, Pinterest, and Heroku. According to an Amazon report, the main culprit was a power event “triggered during a large scale electrical storm which swept throughout the Northern Virginia area”, part of a violent storm that hit the Eastern US that Friday night, killing 13 people and leaving 3 million without power.
In addition to utility power fluctuations, a large voltage spike hit two datacenters, forcing them to switch to generator power, but the generators at one of them failed to run. That failure led to a power outage in that datacenter once the UPSes supporting the servers were depleted. Power was restored soon afterward, but restoring the services to full functionality took much longer. Mike Kavis, VP of Architecture at Inmar, explained why it took so long: “The truth is that Amazon’s backup power sources kicked in but not all compute resources failed over successfully. The impact was that a subset of virtual servers was knocked offline for a period of time until AWS was able to restore them. How a customer deals with that use case determines whether their applications go down or stay resilient. Many sites went down.”
Taking such events into account is important for understanding what is needed to avoid downtime. Kavis said in a blog post following the AWS power event that his company has gone through 5 AWS outages since 2009, but its service was never down. The explanation: using multiple zones and regions:
Amazon has a 99.95% SLA for each zone within each region. They have never had multiple zones down within a region and have never had multiple regions down at the same time. In essence, they have provided us with 100% uptime for compute resources. It is up to us to architect a system to take advantage of multiple zones and regions. …
We expect every server and every service in our platform to fail at some point and design for ways to continue to process transactions on redundant compute resources in multiple zones. In other words, we expect zones within regions to fail and design our platform to not be dependent on a single zone.
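To make the arithmetic behind this argument concrete, here is a rough sketch. It takes the 99.95% figure from the quote above and assumes, as a simplification that does not come from Kavis or Amazon, that zone failures are statistically independent; correlated events such as a regional storm break that assumption. The function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime, in hours, for an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that stays up while at least one zone is up.

    Assumption: zone failures are independent of each other.
    """
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The service is down only when every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


if __name__ == "__main__":
    sla = 0.9995  # per-zone figure quoted above
    for n in (1, 2, 3):
        a = combined_availability(sla, n)
        print(f"{n} zone(s): availability {a:.10f}, "
              f"{downtime_hours_per_year(a):.6f} hours down per year")
```

A single zone at 99.95% allows roughly 4.4 hours of downtime per year. Under the independence assumption, spreading the same workload across two zones shrinks the expected figure to a matter of seconds, which is the intuition behind designing for multiple zones rather than relying on one.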
But multiple zones are not enough, said Kavis:
One pattern I have noticed with these outages is that AWS’s RDS service, a service for automating database administration processes, seems to always go down when Amazon has issues….
The fact that we manually manage our MySQL databases has been one of the many reasons why we have stayed resilient during these outages. Had we been reliant on RDS, we may not have been so lucky. Does that mean AWS customers should not use RDS? No. We may still use it for certain features that do not require extremely high Service Level Agreements (SLA) and provide real time connectivity to Point-of-sale systems.
Kavis remarked that some of the companies hit by last week’s AWS outage have “some of the most impressive and advanced architectures ever seen in a high scaling environment.” But they traded uptime for scaling:
A free social media site who has no SLAs to meet may choose to invest more time in scaling to millions of concurrent users and risk going down for an hour or two. It may be a better investment for them to handle surges in traffic than to focus on the rare event of AWS having an outage. Nobody ever died when they could not post a picture to Facebook.
The conclusion: a service that needs to be available all the time must be architected for failover, not just for scaling. As Kavis said: “What we need to understand is that many companies in the Virginia area who built their own datacenters were down too and some still are. Power outages happen. Data centers fail both in the cloud and on-premise. Everything fails eventually. The secret to uptime is how you design for these failures.”
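As a minimal illustration of what "architecting for failover" can mean in code, the sketch below tries the same operation against redundant zones in order and returns the first success. It is not Inmar's or Amazon's implementation; the zone names, the handler functions, and the `call_with_failover` helper are all hypothetical, and a production system would add health checks, timeouts, and data replication.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllZonesFailed(Exception):
    """Raised when every configured zone failed to serve the request."""


def call_with_failover(
    zone_handlers: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (zone_name, handler) pair in order; return the first success."""
    errors: List[str] = []
    for zone, handler in zone_handlers:
        try:
            return zone, handler()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(f"{zone}: {exc}")
    raise AllZonesFailed("; ".join(errors))


if __name__ == "__main__":
    def primary() -> str:
        # Hypothetical zone that is currently down.
        raise ConnectionError("zone unavailable")

    def secondary() -> str:
        # Hypothetical redundant zone that is still healthy.
        return "transaction processed"

    zone, result = call_with_failover(
        [("zone-a", primary), ("zone-b", secondary)]
    )
    print(zone, result)  # prints "zone-b transaction processed"
```

The point of the design is that the caller never depends on a single zone: a failure in the first zone is recorded and the request moves on, and an error surfaces only when every zone has failed.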
InfoQ Sep 01, 2015