Exploring the Windows Azure Outage

Microsoft's Azure cloud computing platform recently suffered a partial service outage due to a leap year bug. As the servers and software that form the service rolled over to 00:00 UTC on February 29, failures began to occur. The date change exposed a flaw in code that did not correctly account for the leap year, and as additional servers were deployed, failures began to cascade throughout the Windows Azure cloud platform.

So what was the root cause of the failure? Laing explains that application virtual machines (VMs) use transfer certificates to facilitate secure communications between the application VM and the host operating system (OS). When transfer certificates are created, they are designed to last for only one year. The original flawed code created an expiration date by taking the current date and adding 1 to the year field. Consequently, transfer certificates created on February 29, 2012 were given an expiration date of February 29, 2013, a date that does not exist. This error prevented new transfer certificates from being created, and the fuse was lit.
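The flaw described above can be illustrated with a minimal Python sketch. It is not Microsoft's code: the function names `naive_expiry` and `safe_expiry` are hypothetical, and falling back to February 28 is only one possible policy for handling a missing anniversary date.

```python
from datetime import date


def naive_expiry(issued: date) -> date:
    # Mirrors the flawed approach described in the article: bump only the
    # year field. date.replace raises ValueError when the resulting
    # calendar date does not exist (for example February 29, 2013).
    return issued.replace(year=issued.year + 1)


def safe_expiry(issued: date) -> date:
    # Assumed fix, for illustration only: when the one-year anniversary
    # does not exist, fall back to February 28 of the following year.
    try:
        return issued.replace(year=issued.year + 1)
    except ValueError:
        return issued.replace(year=issued.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_expiry(leap_day)
    except ValueError as err:
        print("naive approach failed:", err)
    print("safe expiry:", safe_expiry(leap_day))
```

On any date other than February 29 the two functions return the same result, which is why a bug of this kind can stay hidden for up to four years.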

Laing concludes with several steps Microsoft is taking with respect to prevention, detection, response, and recovery. Microsoft is providing a 33% credit "to all customers of Windows Azure Compute, Access Control, Service Bus and Caching" regardless of whether or not their usage was impacted.

Windows Azure Service Dashboard

Windows Azure Outage Timeline (All times PST)

2012-02-28 16:00 Errors begin as the clock reaches 00:00 UTC on February 29, 2012. New virtual machines are unable to generate valid certificates and as a result self-terminate. After 25 minutes of silence, the host OS restarts the VM creation process, but this also fails. The procedure calls for a total of 3 restart attempts, 25 minutes apart.

2012-02-28 17:15 Precisely 75 minutes (3 x 25 minutes) after failures begin, the error threshold is reached and the failing clusters notify human first responders that affected systems are unable to recover.

2012-02-28 18:38 Response team has identified the initial date/time bug causing the problems.

2012-02-28 18:55 Based on the team's assessment of the problem, and to prevent customers from causing additional damage, Microsoft disabled service management in all clusters worldwide.

2012-02-28 22:00 Response team created test and rollout plan for updated code.

2012-02-28 23:20 Updated code complete.

2012-02-29 01:50 Response team completed test rollout and application of updated code to a test cluster.

2012-02-29 02:11 Completed rollout of patch to one production cluster, at which point the team began deploying patch to all clusters.

2012-02-29 02:47 Seven clusters were left in a partially updated state as they were being updated when the original bug affected them. These seven received a separate update that was not tested before deployment, and this update introduced a separate bug which deactivated their network connectivity.

2012-02-29 03:30 A revised patch was prepared for the separate seven clusters, and this time it was tested before deployment. The revision was scheduled for installation at 05:40.

2012-02-29 05:23 Microsoft announces majority of customers have had their service management restored. (Excludes the separate seven clusters.)

2012-02-29 05:40 The separate seven clusters begin to receive their update to restore network functionality.

2012-02-29 08:00 The separate seven clusters become largely operational. Several individual servers have had errors introduced by the various outages, and Microsoft staff continued to work throughout the day to repair them.

2012-03-01 02:15 Full functionality to all clusters and services restored.
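
The timing of the first two timeline entries follows from simple arithmetic, sketched below in Python. The names `PST`, `RETRY_INTERVAL`, `ATTEMPTS`, and `escalation_time` are illustrative assumptions rather than anything from Microsoft's systems, and PST is modeled as a fixed UTC-8 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST modeled as a fixed UTC-8 offset (no daylight saving in February).
PST = timezone(timedelta(hours=-8))

# The rollover to leap day in UTC, converted to the timeline's time zone.
rollover_utc = datetime(2012, 2, 29, 0, 0, tzinfo=timezone.utc)
first_failure = rollover_utc.astimezone(PST)

# Retry policy as described in the timeline: 3 attempts, 25 minutes apart.
RETRY_INTERVAL = timedelta(minutes=25)
ATTEMPTS = 3


def escalation_time(start, interval=RETRY_INTERVAL, attempts=ATTEMPTS):
    # Time at which the final retry window elapses and humans are notified.
    return start + interval * attempts


alert = escalation_time(first_failure)

if __name__ == "__main__":
    print("first failure (PST):", first_failure.strftime("%Y-%m-%d %H:%M"))
    print("human alert (PST):  ", alert.strftime("%Y-%m-%d %H:%M"))
```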
