GitHub Incident Analysis Shows How to Improve Service Reliability

On October 21, 2018, GitHub users experienced degraded service for 24 hours due to an incident that affected the MySQL topology GitHub uses to store its metadata. The incident, caused by routine maintenance work, led to the display of outdated and inconsistent information and to the unavailability of webhooks and other internal services. GitHub's post-incident report shows where things failed and suggests how to improve site reliability.

At the root of the incident, as GitHub engineer Jason Warner writes, was the replacement of failing 100G optical equipment, which caused a short break in connectivity between GitHub's US East Coast network hub and its primary US East Coast data center. During those 43 seconds of lost connectivity, the US East Coast data center accepted writes that were not replicated to the US West Coast. At the same time, Orchestrator, which GitHub uses to manage MySQL cluster topologies and handle automated failover, detected the network partition and reorganized the database cluster topology so that all write traffic was redirected to the US West Coast data center.

By the time GitHub engineers understood what was happening, the two data centers had diverged and could not be easily synced to restore normal operations:

"This effort was challenging because by this point the West Coast database cluster had ingested writes from our application tier for nearly 40 minutes. Additionally, there were the several seconds of writes that existed in the East Coast cluster that had not been replicated to the West Coast and prevented replication of new writes back to the East Coast."
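The divergence described in the report can be illustrated with a toy model. The sketch below is not how MySQL replication or Orchestrator works internally; it only assumes that a site can resume following another site when its own write history is a prefix of the other's. The `Primary` class, the `can_replicate_from` helper, and the transaction IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Toy model of a database primary that records applied write IDs in order."""
    name: str
    log: list[str] = field(default_factory=list)

    def write(self, txn_id: str) -> None:
        self.log.append(txn_id)


def can_replicate_from(source: Primary, target: Primary) -> bool:
    """The target can follow the source only if the target's log is a prefix of the source's log."""
    return source.log[: len(target.log)] == target.log


# Shared history replicated to both sites before the partition (hypothetical IDs).
east = Primary("east", ["t1", "t2"])
west = Primary("west", ["t1", "t2"])

# During the partition, East accepts writes that never reach West.
east.write("e3")
east.write("e4")

# After the failover, West is promoted and accepts new application writes.
west.write("w3")
west.write("w4")
west.write("w5")

# Neither history is a prefix of the other, so neither site can simply follow the other.
east_can_follow_west = can_replicate_from(source=west, target=east)
west_can_follow_east = can_replicate_from(source=east, target=west)
print(east_can_follow_west, west_can_follow_east)
```

In this model both checks come out false, which mirrors the situation in the quote: the unreplicated East Coast writes blocked replication of new West Coast writes back to the East Coast.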

To complicate matters further, many GitHub services and applications could not cope with the increased cross-country latency and started to fail. At that point, GitHub engineers decided to give priority to data confidentiality and integrity over service availability, so they paused webhook delivery and GitHub Pages builds and took a longer path towards data recovery, which happened in three steps: restoring from backups, synchronizing replicas, and eventually resuming queued jobs.

Restoring from backups took several hours, mostly spent decompressing, checksumming, and loading multiple terabytes of data. When the restore was completed and the topology stabilized, after about eight hours, GitHub performance started to improve, but there were still dozens of replicas holding old data, which caused many users to see inconsistent data:

"We spread the read load across a large pool of read replicas and each request to our services had a good chance of hitting a read replica that was multiple hours delayed."

To help the replicas catch up with the latest data, GitHub engineers increased the number of read replicas so the aggregate read traffic could be split more evenly across them.
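The effect of a lagging replica pool and of adding replicas can be made concrete with a little arithmetic. This is a minimal sketch assuming reads are routed uniformly at random; the function names, lag values, and traffic figures are hypothetical and are not taken from GitHub's report.

```python
def stale_read_probability(lags_seconds: list[float], max_lag_seconds: float) -> float:
    """Chance that a uniformly routed read lands on a replica lagging beyond the limit."""
    if not lags_seconds:
        raise ValueError("at least one replica is required")
    stale = sum(1 for lag in lags_seconds if lag > max_lag_seconds)
    return stale / len(lags_seconds)


def reads_per_replica(total_reads_per_second: float, replica_count: int) -> float:
    """Read load each replica serves when traffic is split evenly."""
    if replica_count <= 0:
        raise ValueError("replica_count must be positive")
    return total_reads_per_second / replica_count


# Hypothetical pool: two replicas are hours behind, two are nearly current.
lags = [5.0, 7200.0, 10800.0, 3.0]
print(stale_read_probability(lags, max_lag_seconds=60.0))  # 0.5

# Hypothetical traffic: the same read volume spread over a larger pool.
print(reads_per_replica(12_000, 24))  # 500.0
print(reads_per_replica(12_000, 36))
```

With fewer reads to serve, each replica presumably has more capacity left for applying the replicated writes it is behind on, which is the intuition behind enlarging the pool.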

A further issue that GitHub engineers had to tackle in the recovery phase was the increased load created by the backlog of webhooks and Pages builds that had been queued, which amounted to over five million hook events and 80 thousand Pages builds. GitHub's status eventually went back to green slightly more than 24 hours after the incident began.
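How long such a backlog takes to drain depends on how much processing capacity is left over after serving newly arriving events. The sketch below shows that relationship; only the five million figure comes from the article, while the processing and arrival rates are hypothetical values chosen for illustration.

```python
def drain_time_seconds(backlog: int, process_rate: float, arrival_rate: float) -> float:
    """Time to clear a backlog when new events keep arriving while it is processed.

    The backlog only shrinks at the net rate (process_rate - arrival_rate).
    """
    net_rate = process_rate - arrival_rate
    if net_rate <= 0:
        raise ValueError("backlog never drains unless process_rate exceeds arrival_rate")
    return backlog / net_rate


# Backlog size from the article; both rates are assumptions, in events per second.
seconds = drain_time_seconds(backlog=5_000_000, process_rate=2_000.0, arrival_rate=750.0)
print(seconds)         # 4000.0
print(seconds / 3600)  # roughly 1.1 hours under these assumed rates
```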

Throughout this whole process, GitHub engineers identified a number of technical initiatives that need to take place:

Preventing Orchestrator from promoting database primaries across region boundaries. Leader election within a region is generally safe, but the sudden introduction of cross-country latency was a major contributing factor during this incident.

Re-engineering GitHub data centers in an active/active/active design to support N+1 redundancy at the facility level. This should make it possible to tolerate full failure of a data center without user impact.

Further investments in fault injection and chaos engineering tooling.
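The first initiative, keeping primary promotion inside a region, amounts to a filter on failover candidates. The following is a minimal sketch of that idea, not Orchestrator's actual logic or configuration; the `Replica` type, the `pick_promotion_candidate` function, and the host and region names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    host: str
    region: str
    lag_seconds: float


def pick_promotion_candidate(failed_primary_region: str,
                             replicas: list[Replica]) -> Optional[Replica]:
    """Pick the least-lagged replica in the failed primary's own region.

    Returns None when no same-region replica exists, so that the decision
    to cross a region boundary is left to a human operator.
    """
    same_region = [r for r in replicas if r.region == failed_primary_region]
    if not same_region:
        return None
    return min(same_region, key=lambda r: r.lag_seconds)


# Hypothetical topology: the West Coast replica is the freshest, but it is
# skipped because promoting it would cross a region boundary.
replicas = [
    Replica("db-west-1", "us-west", 0.1),
    Replica("db-east-2", "us-east", 0.5),
    Replica("db-east-3", "us-east", 2.0),
]
candidate = pick_promotion_candidate("us-east", replicas)
print(candidate.host if candidate else "no same-region candidate")  # db-east-2
```

The trade-off in this design is deliberate: returning no candidate means accepting a longer outage of writes in exchange for never introducing cross-country latency automatically.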

As a final note, data restoration is not yet complete as of this writing. Due to the few seconds of US East Coast data that were not synced with the US West Coast data center, GitHub is currently analyzing its MySQL logs to understand which writes can be reconciled and which require getting in touch with affected users.
