Amazon S3 Outage: Do SLAs Lead to Trust?
On AWS developer boards, the outage initially led some users to question whether AWS was reliable:
The s3 service is great but this just proves you can't rely on it, this is a major issue especially since it's been down for so long.
Other users were quick to point to S3's long reliable track record:
This is the first outage I have experienced since I joined the service nearly a year ago.
InfoQ interviewed a number of longtime S3 users and found a consistent story on S3's reliability. Over the past year there have been only one or two minor hiccups lasting less than two minutes.
Amazon offers a "99.9% Monthly Uptime Percentage" Service Level Agreement for S3. Amazon began offering the SLA in October. S3 is the only one of the eleven Amazon web services for which Amazon currently offers an SLA. What does Amazon's SLA mean for cloud-based storage solutions?
Perhaps not much. The S3 SLA commits to an average availability of 99.9% across all the five-minute intervals in a month. The worst case that still meets the SLA is about 40 minutes of complete unavailability in a month. This is a couple of orders of magnitude away from the reliability expected by financial applications or medical devices, though missing Twits for a half hour would, for most people, be just a minor annoyance.
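The arithmetic behind that worst case can be sketched as follows. This is a minimal illustration, assuming a 30-day month and that the monthly figure is a plain average of per-interval error rates; the function names are hypothetical and are not part of any Amazon API.

```python
INTERVAL_MINUTES = 5  # the SLA is evaluated over five-minute intervals


def monthly_uptime_pct(error_rates):
    """Average availability across intervals; each rate is between 0.0 and 1.0."""
    return 100.0 * (1.0 - sum(error_rates) / len(error_rates))


def month_with_outage(outage_minutes, days=30):
    """Build a month of intervals in which one outage fails every request.

    Assumption: the outage covers whole five-minute intervals and the
    service is otherwise error-free.
    """
    total_intervals = days * 24 * 60 // INTERVAL_MINUTES
    down_intervals = outage_minutes // INTERVAL_MINUTES
    return [1.0] * down_intervals + [0.0] * (total_intervals - down_intervals)


if __name__ == "__main__":
    for minutes in (40, 45, 300):
        pct = monthly_uptime_pct(month_with_outage(minutes))
        verdict = "meets" if pct >= 99.9 else "misses"
        print(f"{minutes:>3} minute outage: {pct:.3f}% uptime, {verdict} 99.9%")
```

Under these assumptions a 40 minute outage stays just inside the 99.9% commitment, while 45 minutes falls just outside it.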
If the SLA is not met, Amazon provides a service credit that may not seem adequate to many S3 users, considering that revenues and reputations are on the line. When availability falls below 99.9%, Amazon credits 10% of the month's charges toward the next month. Amazon credits 25% of charges if availability drops below 99.0%, which works out to, at best, a loss of roughly 7 hours of service in a month. To put this in perspective, consider a user that stores 500 GB of data. The cost of moving 500 GB of data into S3 and serving it fully 10 times in a month would be around $1000. The refund to such a user for a 5 hour outage is $100, the amount such a user would expect for last week's outage. In this scenario, the credit for anywhere from 7 hours of downtime to a complete loss for the month is $250.
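The credit tiers described above reduce to a small lookup. The sketch below is only an illustration of the 10% and 25% tiers as this article states them; `service_credit` is a hypothetical helper, and the $1000 monthly bill is the article's example figure.

```python
def service_credit(monthly_charges, uptime_pct):
    """Return the credit toward next month's bill for a given monthly uptime.

    Tiers as described in the article: 25% below 99.0%, 10% below 99.9%.
    """
    if uptime_pct < 99.0:
        return monthly_charges * 25 / 100
    if uptime_pct < 99.9:
        return monthly_charges * 10 / 100
    return 0.0


if __name__ == "__main__":
    bill = 1000  # example monthly charges in dollars
    for uptime in (99.95, 99.3, 98.5, 0.0):
        print(f"{uptime:6.2f}% uptime: ${service_credit(bill, uptime):.2f} credit")
```

The credit is capped at a quarter of the bill no matter how long the service is down, which is the point of the $250 figure.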
For most applications that need to leverage cloud computing resources, the SLA may not provide significant assurance. Amazon's reputation and track record of reliability are likely more important to most users than the SLA in determining whether S3 is appropriate for any particular application.
Perhaps the anemic nature of SLAs in general may be why salesforce.com, considered by many to be the gold standard in SaaS computing, does not offer one. Salesforce builds trust in their service through the trust.salesforce.com website that provides sophisticated realtime information on the health of their services. Salesforce.com's health monitor was a reaction to a similar outage. Another significant measure of satisfaction with a service provider is how incidents are handled, since problems are expected under the best of circumstances. For example, Technorati received kudos for the way they handled scrambled blogs.
Amazon may be learning these lessons. The outage exposed a contrast between the effectiveness of Amazon's technical services, which appear capable to most customers, and their communication about the health of the system, which was a major pain point.
InfoQ interviewed an Amazon spokesperson about the outage. Amazon seemed to have a handle on the problem and was taking corrective action early on:
In one of our locations we started seeing elevated levels of authenticated requests from multiple users. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types. Within a short amount of time, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authentication requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location.
Meanwhile, some users were frustrated by the lack of communication during the outage. Rien Swagerman, owner of Viewbook.com, told InfoQ:
What's quite amazing is that ... Amazon is giving very little status information when something like this happens. You have to dive deep in a forum somewhere to get some info. And this forum was down for posting [during] the first hour of the outage.
Amazon's spokesperson told us that Amazon.com and their developer boards were affected by the outage. Amazon eats its own dog food, which is usually a good sign, but cloud computing may be changing the calculus.
In response to customer complaints about the level of communication, Amazon expects to release a service level dashboard "shortly." Cloud computing and SaaS technologies are still works in progress, and the S3 outage is clearly a growing pain. Ivo Beckers of FocusFriends.net said:
There's no other vendor yet that delivers the combination of these services for this quality and price. Actually, I'm happy that this happened ... it will challenge them to provide an even better service.
Amazon is indeed going to be challenged in the burgeoning cloud computing market. Earlier this year, EMC launched EMC Fortress, a SaaS storage platform that is initially targeting backup, by leveraging their Mozy acquisition. This week, EMC announced that it hired Paul Maritz, a former Microsoft executive, to lead its new Cloud Infrastructure and Storage Division. EMC will likely be targeting a higher-end market segment than Amazon, providing more options on the price/reliability scale.
What can architects do to improve availability while keeping costs low? Many on Amazon's developer board lamented the fact that their website's reliability was entirely dependent on S3. Other users were less affected because they used S3 as the store of record, but cached copies locally. InfoQ uses S3 as a back end store for videos, but keeps a local cache on an EC2 instance, so InfoQ.com was not affected by the outage. In addition to improving availability, local caches reduce costs by reducing the amount of data transferred directly from S3.
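A local cache in front of a store of record might look roughly like the sketch below. It is not InfoQ's implementation: `ReadThroughCache` and `fetch_from_origin` are hypothetical names, and the origin is passed in as a plain callable so that no particular S3 client library is assumed.

```python
import hashlib
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Serve objects from local disk, falling back to the store of record."""

    def __init__(self, cache_dir: Path, fetch_from_origin: Callable[[str], bytes]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # Assumption: this callable downloads one object (for example from S3)
        # and raises an exception when the origin is unavailable.
        self.fetch_from_origin = fetch_from_origin

    def _path(self, key: str) -> Path:
        # Hash the key so that any object name maps to a safe file name.
        return self.cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()

    def get(self, key: str) -> bytes:
        path = self._path(key)
        if path.exists():
            # Cache hit: no request to the origin, so an outage there is invisible.
            return path.read_bytes()
        # Cache miss: only now does availability depend on the origin.
        data = self.fetch_from_origin(key)
        path.write_bytes(data)
        return data
```

With this shape, objects that have already been requested keep being served during an origin outage, and only first-time requests fail. The sketch never expires or invalidates entries, so it suits content that does not change once stored, such as published videos.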
Are you using S3? What do you do to ensure availability?
Using S3 heavily...
But the net effect of a two hour outage for our business model wasn't such a big deal (well...especially since we were not officially launched at the time).
If you ask most start ups/microisvs/small businesses using S3, they may tell you the same thing: downtime sucks, but a two hour outage every year is a lot less downtime than you'd see if I was managing my own set of storage servers!
Granted, Amazon does need to do a better job of communicating downtime, but it looks like they'll be doing that soon now.
Whether it's cloud computing or not, developers need to assume that the resource, whatever it is, is going to suffer from downtime. If it hadn't been S3 itself, it may have been a network backbone instead, or one of a hundred things. The point is that external resources fail. Most developers know that - so if they were caught off guard that S3 could go down, they shouldn't have been. And I'd say for an external resource, S3 does a pretty good job at up time:).
www.sendalong.com - Send large files to anyone
Openness is an important part of building trust, but empathy comes first. Convince me that you know what it's like on my end, then I'm interested in hearing about what happened. The next time this happens, report brutally frankly about what it was like for users, then explain yourselves.
Re: Missing empathy
Even with books like "Human SIGMA" which pretty much describes how a well run service organization should behave towards its customers, I don't think that a lot of companies will do what you asked. And the simple reason being that their legal departments won't allow it for fear of being sued.
Being honest opens the door to be sued unfortunately.
I just posted some thoughts on "Cloud Availability" at mukulblog.blogspot.com/2008/07/cloud-availabili... . Your thoughts are welcome.