Inherent Complexity of the Cloud: VS Online Outage Postmortem

On August 14, Visual Studio Online (VSO) experienced a 5.5-hour outage that began at approximately 14:00 UTC (10:00 EDT) and ended just before 19:30 UTC (15:30 EDT). Fixing the underlying problems that led to the outage took an additional 4 days, concluding on the following Sunday (August 17).

The exact cause has yet to be determined, but according to Microsoft’s Brian Harry, the outage appears to be due at least in part to some license checks that had been improperly disabled, causing unnecessary traffic to be generated. Adding to the confusion (and to the list of possible causes) was the observation of “…a spike in latencies and failed deliveries of Service Bus messages”.

The core databases for the Shared Platform Services (SPS), which communicate over this Service Bus, became overloaded with database updates. They were so overwhelmed that requests began to queue, which eventually led to blocked callers. SPS is used for both authentication and licensing verification, so it could not easily be removed from the system. Harry did observe that, in the interests of stability, it may have been prudent to forgo some licensing checks, implying that there is sufficient granularity to separate authentication requests from licensing requests.

At this time, Harry believes that the outage was triggered by an accumulation of several bugs that may each have been relatively harmless on their own but that together formed a cascading failure. Harry identified these 4 bugs as prime contributors:

1. Calls from TFS to SPS were incorrectly updating TFS properties, which in turn generated more messages from SPS to TFS in a self-reinforcing feedback loop.
2. A bug in 401-handling was generating extra cache flushes.
3. A bug in an Azure Portal extension service was retrying 401 errors every 5 seconds (which compounded the effects of bug #2).
4. Invalidation events were being resent multiple times.
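
The third bug in the list above, a client retrying 401 errors on a fixed 5-second interval, illustrates a common retry pitfall: an authorization failure will not succeed on a second attempt with the same credentials, so each retry only adds load. The following is a minimal sketch of a retry policy that gives up on such errors and backs off exponentially on others. It is not taken from the VSO codebase; the function name `next_retry_delay`, the `NON_RETRYABLE` set, and the default timings are assumptions made for illustration.

```python
import random

# Hypothetical classification: status codes that another attempt cannot fix.
NON_RETRYABLE = {400, 401, 403, 404}


def next_retry_delay(status, attempt, base=1.0, cap=60.0, rng=random.random):
    """Return seconds to wait before the next attempt, or None to give up.

    status  -- HTTP status code of the failed call
    attempt -- number of attempts already made (1 after the first failure)
    rng     -- callable returning a float in [0, 1); injectable for testing
    """
    if status in NON_RETRYABLE:
        # A 401 needs fresh credentials, not another identical request.
        return None
    # Exponential backoff with full jitter, capped at `cap` seconds.
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng() * ceiling
```

With these defaults, a 401 is never retried, while a 503 is retried after a random delay of up to 1, 2, 4, ... seconds, never more than 60. The jitter spreads retries from many clients over time instead of having them all arrive in the same instant.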

A couple of secondary bugs contributed to the problems by causing cache invalidations and unnecessary property updates, which in turn generated additional SQL requests.

Harry feels that beyond the specific bugs described above, the conceptual problems that the team faced were due to unnecessary abstractions. Overreliance on abstractions caused developers to lose sight of the overall project architecture, and as a result they were not able to foresee how their changes would affect the rest of the system. Combined with the lack of automated regression tests to detect changes in resource consumption from one build to the next, the trap was set for poor code to enter the system without sufficient awareness of the impact it would have. Going forward, Harry stressed the need for the team to bulk up its testing, both in test environments and in controlled production situations.
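
One way to picture the kind of regression test described above is a check that replays a fixed workload against two builds and compares resource counters. The sketch below assumes such counters (for example, the number of SQL calls or Service Bus messages) have already been collected into dictionaries. The function name, the metric names, and the 10% tolerance are hypothetical and are not drawn from Harry's post.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return metrics whose value grew by more than `tolerance` vs baseline.

    baseline, current -- dicts mapping a metric name to a count measured
    while running the same fixed workload against each build.
    The result maps each regressed metric to an (old, new) tuple.
    """
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name, 0)
        allowed = old * (1 + tolerance)
        if new > allowed:
            regressions[name] = (old, new)
    return regressions


# Illustrative numbers only, not measurements from the incident.
baseline = {"sql_calls": 1200, "service_bus_messages": 300}
current = {"sql_calls": 1250, "service_bus_messages": 900}
flagged = find_regressions(baseline, current)
```

Here `flagged` contains only `service_bus_messages`, because its count tripled, while the small rise in `sql_calls` stays within tolerance. A build pipeline could fail the build whenever the result is non-empty.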

Harry added in follow-up comments that the team is in the process of adding circuit breaker patterns. Commenter John Smith linked to the Circuit Breaker Pattern described on MSDN as well as the Hystrix project created and open sourced by Netflix.
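
For readers unfamiliar with the pattern, a circuit breaker wraps calls to a dependency and, after repeated failures, rejects further calls immediately for a cooling-off period instead of letting callers queue up behind an overloaded service. The following is a minimal, single-threaded sketch of the general idea. It is not the VSO team's implementation and does not use the Hystrix API; the class name, thresholds, and injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.clock = clock                  # injectable time source for testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller that catches `CircuitOpenError` can fall back to degraded behavior, such as temporarily skipping a non-essential check, rather than blocking. A production version would also need locking for concurrent callers, which this sketch omits.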
