Amazon S3 Outage: Do SLAs Lead to Trust?

Amazon Web Services' Simple Storage Service (S3), a cloud-based storage platform used by many popular websites including Twitter, G.ho.st, and 37signals' Basecamp, suffered a major outage last week. The outage occurred in one of S3's three geographical sites and lasted a little over two hours.

On the AWS developer boards, the outage initially led some users to question whether AWS was reliable:

The s3 service is great but this just proves you can't rely on it, this is a major issue especially since it's been down for so long.

Other users were quick to point to S3's long, reliable track record:

This is the first outage I have experienced since I joined the service nearly a year ago.

InfoQ interviewed a number of longtime S3 users and found a consistent story on S3's reliability. Over the past year there have been only one or two minor hiccups lasting less than two minutes.

Amazon offers a "99.9% Monthly Uptime Percentage" Service Level Agreement for S3. Amazon began offering the SLA in October. S3 is the only one of the eleven Amazon web services for which Amazon currently offers an SLA. What does Amazon's SLA mean for cloud-based storage solutions?

Perhaps not much. The S3 SLA commits to an average availability of 99.9% across all the five-minute intervals in a month. The worst case that still meets the SLA is a lack of availability for roughly 43 minutes in a 30-day month. This is a couple of orders of magnitude away from the reliability expected by financial applications or medical devices, though missing tweets for half an hour would, for most people, be just a minor annoyance.
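The downtime budget implied by an uptime percentage is simple arithmetic. The sketch below assumes a 30-day month and treats availability as a plain fraction of elapsed time, which is a simplification of the SLA's averaging over five-minute intervals; the function name is illustrative only.

```python
def downtime_budget_minutes(uptime_pct: float, days_in_month: int = 30) -> float:
    """Minutes of unavailability a month can contain while still meeting uptime_pct."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - uptime_pct) / 100.0


# 99.9% leaves about 43 minutes; 99.0% leaves about 7.2 hours.
print(round(downtime_budget_minutes(99.9), 1))       # 43.2
print(round(downtime_budget_minutes(99.0) / 60, 1))  # 7.2
```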

If the SLA is not met, Amazon provides a service credit that may not seem adequate to many S3 users, considering that revenues and reputations are on the line. When the 99.9% service level is not met, Amazon credits 10% of the month's charges toward the next month. Amazon credits 25% of charges if availability drops below 99.0%, which works out to, at best, a loss of about 7 hours of service in a month. To put this in perspective, consider a user who stores 500 GB of data. The cost of moving 500 GB of data into S3 and serving it fully 10 times in a month would be around $1000. The credit to such a user for a 5-hour outage is $100, the same amount such a user would expect for last week's outage. In this scenario, the credit for anything from 7 hours of downtime to a complete loss of service for the month is $250.
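The credit tiers described above can be expressed as a small calculation. This is a rough sketch rather than Amazon's actual billing logic: it assumes a 30-day month, treats availability as a plain fraction of elapsed time, and uses hypothetical function names.

```python
def uptime_pct_for_outage(outage_hours: float, days_in_month: int = 30) -> float:
    """Monthly uptime percentage after a single outage of the given length."""
    total_hours = days_in_month * 24
    return 100.0 * (1 - outage_hours / total_hours)


def service_credit(monthly_charges: float, uptime_pct: float) -> float:
    """Credit in dollars under the two tiers described in the article."""
    if uptime_pct < 99.0:
        return monthly_charges * 0.25  # below 99.0%: 25% credit
    if uptime_pct < 99.9:
        return monthly_charges * 0.10  # below 99.9%: 10% credit
    return 0.0                         # SLA met: no credit


# The article's example: about $1000 in monthly charges.
print(service_credit(1000.0, uptime_pct_for_outage(2)))  # 100.0
print(service_credit(1000.0, uptime_pct_for_outage(5)))  # 100.0
print(service_credit(1000.0, uptime_pct_for_outage(8)))  # 250.0
```

Note that a two-hour and a five-hour outage land in the same tier, so the credit does not grow with the length of the outage until availability falls below 99.0%.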

For most applications that need to leverage cloud computing resources, the SLA may not provide significant assurance. Amazon's reputation and track record of reliability are likely more important to most users than the SLA in determining whether S3 is appropriate for any particular application.

The anemic nature of SLAs in general may be why salesforce.com, considered by many to be the gold standard in SaaS computing, does not offer one. Salesforce builds trust in its service through the trust.salesforce.com website, which provides sophisticated realtime information on the health of its services. Salesforce.com's health monitor was a reaction to a similar outage. Another significant measure of satisfaction with a service provider is how incidents are handled, since problems are expected even under the best of circumstances. For example, Technorati received kudos for the way they handled scrambled blogs.

Amazon may be learning these lessons. The outage exposed a contrast between the effectiveness of Amazon's technical services, which appear capable to most customers, and their communication about the health of the system, which was a major pain point.

InfoQ interviewed an Amazon spokesperson about the outage. Amazon seemed to have a handle on the problem and was taking corrective action early on:

In one of our locations we started seeing elevated levels of authenticated requests from multiple users. While we carefully monitor our overall request volumes and these remained within normal ranges, we had not been monitoring the proportion of authenticated requests. Importantly, these cryptographic requests consume more resources per call than other request types. Within a short amount of time, we began to see several other users significantly increase their volume of authenticated calls. The last of these pushed the authentication service over its maximum capacity before we could complete putting new capacity in place. In addition to processing authentication requests, the authentication service also performs account validation on every request Amazon S3 handles. This caused Amazon S3 to be unable to process any requests in that location.

Meanwhile, some users were frustrated by the lack of communication during the outage. Rien Swagerman, owner of Viewbook.com, told InfoQ:

What's quite amazing is that ... Amazon is giving very little status information when something like this happens. You have to dive deep in a forum somewhere to get some info. And this forum was down for posting [during] the first hour of the outage.

Amazon's spokesperson told us that Amazon.com and their developer boards were affected by the outage. Amazon eats its own dog food, which is usually a good sign, but cloud computing may be changing the calculus.

In response to customer complaints about the level of communication, Amazon expects to release a service level dashboard "shortly." Cloud computing and SaaS technologies are still works in progress, and the S3 outage is clearly a growing pain. Ivo Beckers of FocusFriends.net said:

There's no other vendor yet that delivers the combination of these services for this quality and price. Actually, I'm happy that this happened ... it will challenge them to provide an even better service.

Amazon is indeed going to be challenged in the burgeoning cloud computing market. Earlier this year, EMC launched EMC Fortress, a SaaS storage platform that is initially targeting backup, by leveraging their Mozy acquisition. This week, EMC announced that it hired Paul Maritz, a former Microsoft executive, to lead its new Cloud Infrastructure and Storage Division. EMC will likely be targeting a higher-end market segment than Amazon, providing more options on the price/reliability scale.

What can architects do to improve availability while keeping costs low? Many on Amazon's developer board lamented the fact that their websites' reliability was entirely dependent on S3. Other users were less affected because they used S3 as a store of record, but cached copies locally. InfoQ uses S3 as a back end store for videos, but keeps a local cache on an EC2 instance, so InfoQ.com was not affected by the outage. In addition to improving availability, local caches reduce costs by reducing the amount of data transferred directly from S3.
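The caching pattern described above can be sketched as a read-through cache: serve an object from local disk when a copy exists, and go to the store of record only on a miss. The example below is a minimal illustration, not a description of how InfoQ or any other site implemented it. The class name, the `fetch_from_origin` callable, and the fake origin are all hypothetical stand-ins for a real S3 client, and the sketch ignores eviction and invalidation.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Serve objects from local disk, fetching from the origin store only on a miss."""

    def __init__(self, cache_dir: Path, fetch_from_origin: Callable[[str], bytes]):
        self.cache_dir = cache_dir
        self.fetch_from_origin = fetch_from_origin  # stand-in for an S3 GET
        cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, key: str) -> Path:
        # Hash the key so that any object name maps to a safe local file name.
        return self.cache_dir / hashlib.sha256(key.encode()).hexdigest()

    def get(self, key: str) -> bytes:
        path = self._path_for(key)
        if path.exists():
            return path.read_bytes()        # hit: the origin is not contacted
        data = self.fetch_from_origin(key)  # miss: may raise if the origin is down
        path.write_bytes(data)
        return data


# Hypothetical origin that can be switched off to simulate an outage.
origin_state = {"up": True}


def fake_origin(key: str) -> bytes:
    if not origin_state["up"]:
        raise ConnectionError("origin unavailable")
    return b"bytes for " + key.encode()


cache = ReadThroughCache(Path(tempfile.mkdtemp()), fake_origin)
first = cache.get("videos/talk.mp4")   # miss: fetched from the origin and stored
origin_state["up"] = False             # simulate the outage
second = cache.get("videos/talk.mp4")  # hit: still served from the local copy
```

Objects that were never requested before the outage would still fail, so a site that needs every object to remain available would have to populate the cache ahead of time.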

Are you using S3? What do you do to ensure availability?
