Exploring the Windows Azure Outage
Microsoft's Azure cloud computing platform recently suffered a partial service outage due to a leap year bug. As the servers and software that form the service rolled over to 00:00 UTC on February 29, failures began to occur. The date change exposed a flaw in code that did not correctly account for the leap year, and as additional servers were deployed, failures began to cascade throughout the Windows Azure cloud platform.
So what was the root cause of the failure? Microsoft's Bill Laing explains that application virtual machines (VMs) use transfer certificates to facilitate secure communications between the application VM and the host operating system (OS). When transfer certificates are created, they are designed to last for only one year. The original flawed code created an expiration date by taking the current date and adding 1 to the year field. Consequently, transfer certificates created on February 29, 2012 were given an expiration date of February 29, 2013, a date that does not exist. This error prevented new transfer certificates from being created, and the fuse was lit.
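To make the failure mode concrete, here is a minimal Python sketch of the same class of bug. It is not Microsoft's code: the function names naive_expiry and safe_expiry are hypothetical, and falling back to February 28 is only one of several reasonable policies (rolling forward to March 1 is another).

```python
from datetime import date


def naive_expiry(issued: date) -> date:
    """Mirror the flawed approach described above: add 1 to the year field only."""
    return issued.replace(year=issued.year + 1)


def safe_expiry(issued: date) -> date:
    """Hypothetical fix: use February 28 when the target year has no February 29."""
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        # Only February 29 can fail here, so clamp the day to the 28th.
        return issued.replace(year=issued.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_expiry(leap_day)
except ValueError as exc:
    # Python refuses to build February 29, 2013 and raises instead.
    print("naive approach fails:", exc)

print(safe_expiry(leap_day))  # 2013-02-28
```

Python happens to raise an error for the invalid date. Other languages and libraries may silently normalize it or produce a malformed value, so the visible symptom depends on the platform.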
Laing concludes with several steps Microsoft is taking with respect to prevention, detection, response, and recovery. Microsoft is providing a 33% credit "to all customers of Windows Azure Compute, Access Control, Service Bus and Caching" regardless of whether their usage was affected.
Windows Azure Outage Timeline (All times PST)
2012-02-28 16:00 Errors begin as UTC reaches 00:00 February 29, 2012. New virtual machines are unable to generate proper certificates and as a result self-terminate. After 25 minutes of silence, the host OS restarts the VM creation process, but this also fails. Procedure calls for a total of 3 restart attempts, 25 minutes apart.
2012-02-28 17:15 Precisely 75 minutes (3 x 25 minutes) after failures begin, the error threshold is reached and the failing clusters notify human first responders that affected systems are unable to recover.
2012-02-28 18:38 Response team has identified the initial date/time bug causing the problems.
2012-02-28 18:55 Based on the team's assessment of the problem and to prevent customers from causing additional damage, Microsoft disabled service management ability in all clusters worldwide.
2012-02-28 22:00 Response team created test and rollout plan for updated code.
2012-02-28 23:20 Updated code complete.
2012-02-29 01:50 Response team completed test rollout and application of updated code to a test cluster.
2012-02-29 02:11 Completed rollout of patch to one production cluster, at which point the team began deploying patch to all clusters.
2012-02-29 02:47 Seven clusters were left in a partially updated state as they were being updated when the original bug affected them. These seven received a separate update that was not tested before deployment, and this update introduced a separate bug which deactivated their network connectivity.
2012-02-29 03:30 A revised patch was made for the separate seven clusters, but this time tests were made before deployment. The revision was scheduled for installation at 05:40.
2012-02-29 05:23 Microsoft announces majority of customers have had their service management restored. (Excludes the separate seven clusters.)
2012-02-29 05:40 The separate seven clusters begin to receive their update to restore network functionality.
2012-02-29 08:00 The separate seven clusters become largely operational. Several individual servers have had errors introduced by the various outages, and Microsoft staff continued to work throughout the day to repair them.
2012-03-01 02:15 Full functionality to all clusters and services restored.
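The timing in the first two timeline entries can be checked with a little date arithmetic. The sketch below is illustrative only: it assumes a fixed UTC-8 offset for PST, and the constant names are made up for this example rather than taken from Microsoft's systems.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST modeled as a fixed UTC-8 offset (no daylight saving in February).
PST = timezone(timedelta(hours=-8), "PST")

RETRY_INTERVAL = timedelta(minutes=25)  # wait before each restart attempt
ATTEMPTS = 3                            # restart attempts before escalating

# First failures, as listed in the timeline (local PST time).
first_failure = datetime(2012, 2, 28, 16, 0, tzinfo=PST)

# The same instant expressed in UTC: midnight at the start of the leap day.
first_failure_utc = first_failure.astimezone(timezone.utc)
print(first_failure_utc.strftime("%Y-%m-%d %H:%M"))  # 2012-02-29 00:00

# Three 25-minute waits elapse before human responders are notified.
escalation = first_failure + ATTEMPTS * RETRY_INTERVAL
print(escalation.strftime("%Y-%m-%d %H:%M"))  # 2012-02-28 17:15
```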