Several sub-regions of the Microsoft Windows Azure cloud were affected by a leap-year bug, making some services unavailable to many customers for 12 to 24 hours.
According to the Windows Azure Service Dashboard, a number of services were interrupted across several sub-regions or worldwide for more than 24 hours, starting in the early hours of February 29th (UTC) and ending some time on the morning of March 1st. The services affected were:
- The Windows Azure Compute Service was partially down in 4 out of 6 sub-regions, affecting from 6.7% of the hosted services in the North Central US sub-region to 28% in South Central US and 37% in North Europe, and impacting a number of other Azure services: Access Control 2.0, Marketplace, Service Bus and the Access Control & Caching Portal.
- The Service Bus was down in one region (South Central US) for more than 24 hours.
- The Marketplace, located in South Central US, was also partially affected for more than 12 hours, especially for services that required OAuth access.
- The Service Management service was affected worldwide, in some cases for about 12 hours and in other regions for more than 24 hours, due to a certificate issue triggered on February 29th, 2012.
The storage, CDN and other services seemed unaffected. Caused by a different problem, the Platform Management Portal was affected worldwide for about 3 hours on March 1st due to “a backend setting which had been misconfigured”.
Bill Laing, Corporate VP Server and Cloud, briefly informed Azure customers about the outage and its cause. According to Laing, the Azure team became aware of the problem on Feb 28th at 5:45 PM PST (Feb 29th, 1:45 AM UTC). The culprit was a small software bug triggered by the leap day, February 29th:
The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year.
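Microsoft has not published the exact faulty code, but the class of bug described, a time calculation that goes wrong on a leap day, is easy to reproduce. A common variant is naively computing a date "one year later" by incrementing the year, which works on 365 days of the year and fails on February 29th. A minimal sketch (the function name and scenario are illustrative, not Azure's actual code):

```python
from datetime import date

def naive_one_year_later(d: date) -> date:
    # Naive "valid for one year" calculation: just bump the year.
    # Works for every date except February 29th, because the same
    # month/day combination does not exist in the following year.
    return d.replace(year=d.year + 1)

# Fine on an ordinary day:
print(naive_one_year_later(date(2012, 3, 1)))   # 2013-03-01

# But on the leap day the calculation blows up:
try:
    naive_one_year_later(date(2012, 2, 29))
except ValueError as e:
    print("leap-year bug:", e)  # Feb 29, 2013 does not exist
```

A robust implementation has to decide explicitly what "one year later" means for February 29th (typically clamping to February 28th or rolling to March 1st) rather than assuming every month/day pair recurs annually.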
Symantec reported that the leap-year bug affected Software Delivery 6.1, and Alex Papadimoulis, a managing partner at Inedo, reported that some of their customers were hit by it. A number of Point-of-Sale devices also malfunctioned in New Zealand.
While such a bug might be forgivable for a smaller company, it is an embarrassment for Microsoft, especially since it affected the cloud platform hosting their customers’ services. It is interesting how small things can take down large computing platforms, as happened to Amazon a year ago when the traffic in one availability zone in the US East Region was mistakenly shifted to a lower-capacity router that could not handle the load, affecting several EBS nodes and eventually taking down the entire zone. We will most likely see more of these blackouts. To err is human, after all.