Inherent Complexity of the Cloud: VS Online Outage Postmortem

by Jeff Martin on Aug 28, 2014 |

On August 14th, Visual Studio Online (VSO) experienced a 5.5 hour outage, beginning at approximately 14:00 UTC (10:00 EDT) and ending just before 19:30 UTC (15:30 EDT).  Fixing the underlying problems that led to the outage took an additional 4 days, concluding on the following Sunday (August 17).

The exact cause has yet to be determined, but according to Microsoft’s Brian Harry, the outage appears to have been due at least in part to some license checks that had been improperly disabled, causing unnecessary traffic to be generated.  Adding to the confusion (and possible causes) was the observation of “…a spike in latencies and failed deliveries of Service Bus messages”.

Core databases for the Shared Platform Services (SPS) that communicated over this Service Bus became overloaded with database updates.  They were so overwhelmed that they began queuing requests, which eventually blocked callers.  SPS is used for both authentication and licensing verification, so it could not be easily removed from the system.  Harry did observe that in the interests of stability it may have been prudent to forgo some licensing checks, implying that there is sufficient granularity to separate authentication requests from the licensing requests.
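The failure mode Harry describes, an overloaded service queuing requests until callers block, is a common argument for bounding queues and shedding load instead. A minimal Python sketch of the idea (the queue size and `submit` API are illustrative, not VSO's actual design):

```python
import queue

# Bounded work queue: reject new requests when full, rather than queuing
# without limit until every caller ends up blocked behind the backlog.
work_queue = queue.Queue(maxsize=100)

def submit(request, q=work_queue):
    """Accept a request if there is capacity; otherwise shed load fast."""
    try:
        q.put_nowait(request)
        return "accepted"
    except queue.Full:
        return "rejected"  # fail fast instead of blocking the caller
```

Rejecting excess work quickly keeps the service responsive for the requests it can actually handle, which is roughly the trade-off Harry alludes to when suggesting some licensing checks could have been skipped.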

At this time Harry believes that the outage was triggered by the accumulation of several bugs, each of which may have been relatively harmless on its own but which combined to cause a cascading failure.  Harry identified these four bugs as prime contributors:

  1. Calls from TFS to SPS were incorrectly updating TFS properties, which in turn generated more messages from SPS to TFS in a positive feedback loop.
  2. A bug in 401-handling was generating extra cache flushes.
  3. A bug in an Azure Portal extension service was retrying 401 errors every 5 seconds (which compounded the effects of bug #2).
  4. Invalidation events were being resent multiple times.

A couple of secondary bugs were contributing to the problems by causing cache invalidations and unnecessary property updates, which in turn generated additional SQL requests.

Harry feels that beyond the specific bugs described above, the conceptual problems that the team faced were due to unnecessary abstractions.  Over-reliance on abstractions caused developers to lose sight of the overall project architecture, and as a result they weren’t able to foresee how their changes would affect the rest of the system.  Combined with the lack of automated regression tests to detect changes in resource consumption from one build to the next, the trap was set for poor code to enter the system without sufficient awareness of the impact it would have.  Going forward, Harry stressed the need for the team to bulk up their testing, both in test environments and in controlled production situations.
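One way to catch the resource-consumption regressions Harry mentions is to assert against a recorded baseline in the build. A hypothetical sketch (the baseline value, slack, and function names are invented for illustration; Harry's post does not describe a specific mechanism):

```python
# Hypothetical regression gate: fail the build if an operation issues
# noticeably more SQL queries than a baseline captured from a known-good build.
BASELINE_QUERY_COUNT = 12  # assumed baseline, recorded from a prior build

def check_query_budget(observed_queries, baseline=BASELINE_QUERY_COUNT, slack=0.10):
    """Return True if the observed query count stays within `slack` of baseline."""
    return observed_queries <= baseline * (1 + slack)
```

A check like this would have flagged the secondary bugs above, each of which silently multiplied the SQL requests per operation from one build to the next.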

Harry added in follow-up comments that the team is in the process of adding circuit breaker patterns.  Commenter John Smith linked to the Circuit Breaker pattern described on MSDN as well as Hystrix, the project created and open-sourced by Netflix.
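The circuit breaker pattern wraps calls to a dependency, trips open after repeated failures so callers fail fast instead of piling up, and periodically lets a trial call through to see if the dependency has recovered. A simplified sketch of the pattern (not Hystrix's or Microsoft's actual implementation; the thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after repeated failures,
    rejects calls while open, and retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Had SPS callers been wrapped this way, requests to the overloaded databases would have failed fast rather than queuing and blocking, limiting the blast radius of the cascading failure.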
