Inherent Complexity of the Cloud: VS Online Outage Postmortem
On August 14th, Visual Studio Online (VSO) experienced a 5.5-hour outage that began at approximately 14:00 UTC (10:00 EDT) and ended just before 19:30 UTC (15:30 EDT). Fixing the underlying problems that led to the outage took an additional 4 days, concluding on the following Sunday (August 17).
The exact cause has yet to be determined, but according to Microsoft’s Brian Harry, the outage appears to be due at least in part to some license checks that had been improperly disabled, causing unnecessary traffic to be generated. Adding to the confusion (and possible causes) was the observation of “…a spike in latencies and failed deliveries of Service Bus messages”.
Core databases for the Shared Platform Services (SPS) that communicated over this Service Bus became overloaded with database updates. SPS was so overwhelmed that it began queuing requests, which eventually led to blocked callers. SPS is used for both authentication and licensing verification, so it could not be easily removed from the system. Harry did observe that in the interest of stability it may have been prudent to forgo some licensing checks, implying that there is sufficient granularity to separate authentication requests from licensing requests.
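The idea of forgoing licensing checks while keeping authentication can be sketched as follows. This is only an illustration of the degradation strategy, not how SPS is implemented; `authorize`, `authenticate`, `check_license`, and `overloaded` are hypothetical names introduced for the example.

```python
def authorize(user, authenticate, check_license, overloaded):
    """Always authenticate; skip the license check while the backend is overloaded.

    All three callables are hypothetical stand-ins supplied by the caller:
      authenticate(user)  -> bool
      check_license(user) -> bool
      overloaded()        -> bool, True when the licensing backend is struggling
    """
    if not authenticate(user):
        return False  # authentication is never skipped
    if overloaded():
        return True   # degrade gracefully: temporarily assume the user is licensed
    return check_license(user)
```

The trade-off is that some unlicensed use may slip through during an incident, in exchange for removing load from the struggling service.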
At this time Harry believes that the outage was triggered by the accumulation of several bugs that individually may have been relatively harmless but together formed a cascading failure. Harry identified these 4 bugs as prime contributors:
- Calls from TFS to SPS were incorrectly updating TFS properties, which in turn generated more messages from SPS to TFS in a self-reinforcing feedback loop.
- A bug in 401-handling was generating extra cache flushes.
- A bug in an Azure Portal extension service was retrying 401 errors every 5 seconds (which compounded the effects of the 401-handling bug above).
- Invalidation events were being resent multiple times.
A couple of secondary bugs were contributing to the problems by causing cache invalidations and unnecessary property updates, which in turn generated additional SQL requests.
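To illustrate why retrying 401 errors at a fixed 5-second interval is harmful, here is a minimal sketch of a retry policy that gives up on authentication failures and backs off exponentially, with jitter, on transient errors. The function name, the set of retryable status codes, and the default delays are assumptions made for the example and are not taken from the VSO code.

```python
import random

# Hypothetical choice of transient statuses; 401 is deliberately absent,
# since repeating an unauthorized request unchanged will not make it succeed.
RETRYABLE_STATUS = {429, 502, 503, 504}


def next_retry_delay(status, attempt, base=1.0, cap=60.0, rng=random.random):
    """Return the seconds to wait before retry number `attempt` (0-based),
    or None if the request should not be retried at all."""
    if status not in RETRYABLE_STATUS:
        return None
    # Exponential growth, bounded by `cap`, so retries thin out over time.
    ceiling = min(cap, base * (2 ** attempt))
    # "Full jitter": pick a random point in [0, ceiling) so that many
    # clients do not retry in lockstep.
    return ceiling * rng()
```

A caller would stop retrying when the function returns `None`, and otherwise sleep for the returned number of seconds before trying again.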
Harry feels that beyond the specific bugs described above, the conceptual problems that the team faced were due to unnecessary abstractions. Overreliance on abstractions caused developers to lose sight of the overall project architecture, and as a result they weren’t able to foresee how their changes were affecting the rest of the system. Combined with the lack of automated regression tests to detect changes in resource consumption from one build to the next, the trap was set for poor code to enter the system without sufficient awareness of the impact it would have. Harry stressed the need, going forward, for the team to bulk up their testing, both in test environments and in controlled production situations.
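A regression test for resource consumption could, for instance, compare per-operation SQL call counts between a baseline build and the current one. The sketch below assumes the counts have already been collected into dictionaries; the function name, the operation names, and the 10% tolerance are hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return {operation: (baseline_count, current_count)} for every operation
    whose count grew by more than `tolerance` relative to the baseline."""
    regressions = {}
    for op, base_count in baseline.items():
        cur_count = current.get(op, 0)
        if cur_count > base_count * (1 + tolerance):
            regressions[op] = (base_count, cur_count)
    # Operations that issue calls now but were absent from the baseline
    # are also flagged, since their cost is entirely new.
    for op, cur_count in current.items():
        if op not in baseline and cur_count > 0:
            regressions[op] = (0, cur_count)
    return regressions


# Example with made-up numbers: SQL calls per operation in two builds.
baseline_build = {"login": 10, "get_license": 4}
current_build = {"login": 10, "get_license": 9}
report = find_regressions(baseline_build, current_build)
```

A build pipeline could fail the build whenever the returned dictionary is non-empty, which surfaces a jump in resource usage before the change reaches production.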
Harry added in follow-up comments that the team is in the process of adding circuit breaker patterns. Commenter John Smith linked to the Circuit Breaker Pattern described on MSDN as well as the Hystrix project created and open sourced by Netflix.
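For readers unfamiliar with the pattern, a minimal circuit breaker might look like the following. This is a simplified, single-threaded sketch and not the MSDN or Hystrix implementation; the class name, thresholds, and injectable clock are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock      # injectable so tests can control time
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit immediately;
            # otherwise open once the failure threshold is reached.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a dependency such as SPS in `breaker.call(...)` means that once the dependency starts failing, callers get an immediate error instead of piling more requests onto it, which is the kind of cascading load the postmortem describes.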