GitHub Engineering Adopts New Architecture for MySQL High Availability

by Hrishikesh Barua on Jul 08, 2018. Estimated reading time: 3 minutes

GitHub.com uses MySQL as a backbone for many of its critical services like the API, authentication and the GitHub.com website itself. GitHub's engineering team replaced its previous DNS and Virtual IP (VIP)-based setup with one based on Orchestrator, Consul and the GitHub Load Balancer in order to get around split brain and DNS caching issues.

GitHub runs multiple MySQL clusters for different services and tasks, making it imperative to have them highly available (HA). GitHub's MySQL infrastructure is spread across multiple data centers and consists of around 15 clusters, close to 150 production servers and 15 TB of MySQL tables. Each MySQL cluster has a single master, which responds to write requests, and multiple replicas, which serve read requests. The master node is a single point of failure: without it, all writes to the cluster fail. The HA requirements for this setup include auto-detection of failure, auto-promotion of a replica node to master and auto-advertisement of the new master node to client applications.
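The three HA requirements above (detect, promote, advertise) can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the `Node` and `Cluster` classes, the `applied_position` field and the rule of promoting the most up-to-date healthy replica are assumptions made for illustration, not a description of GitHub's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True
    applied_position: int = 0  # hypothetical marker of replication progress


@dataclass
class Cluster:
    master: Node
    replicas: list = field(default_factory=list)
    advertised_master: str = ""  # the name clients are told to write to

    def detect_failure(self) -> bool:
        # Requirement 1: notice that the master is gone.
        return not self.master.alive

    def promote(self) -> Node:
        # Requirement 2: pick a healthy replica and make it the master.
        # Assumption: prefer the replica that has applied the most changes.
        candidates = [r for r in self.replicas if r.alive]
        if not candidates:
            raise RuntimeError("no healthy replica to promote")
        new_master = max(candidates, key=lambda r: r.applied_position)
        self.replicas.remove(new_master)
        self.master = new_master
        return new_master

    def advertise(self) -> None:
        # Requirement 3: tell clients where the master now lives.
        self.advertised_master = self.master.name

    def failover_if_needed(self) -> bool:
        if not self.detect_failure():
            return False
        self.promote()
        self.advertise()
        return True
```

In a real deployment each of these steps is distributed and can fail independently, which is why the rest of the article focuses on how detection and advertisement are implemented.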

GitHub's engineering team has employed several strategies for HA over the years, gradually moving towards uniformity across the organization. Because this effort is not restricted to MySQL, requirements for an HA solution also include cross-data-center availability and split-brain prevention. There are different possible approaches for MySQL master discovery. Previously, GitHub used DNS and VIPs to discover the MySQL master node. Client applications would connect to a fixed hostname, which DNS resolved to a VIP. A VIP is an address that is not tied to a single host, so traffic can be moved from one host to another. The VIP would always be owned by the current master node. However, the VIP acquire-and-release process could go wrong during failover events and lead to split-brain situations, in which two different hosts hold the same VIP and traffic can be routed to the wrong one. In addition, when the new master node is in a different data center, DNS changes are required, and these can take time to propagate because of DNS caching at clients.
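The VIP split-brain scenario can be made concrete with a small simulation. This is only a sketch under stated assumptions: the `Host` class, the `reachable` flag and the `vip_failover` helper are hypothetical, and the address comes from the 192.0.2.0/24 documentation range.

```python
class Host:
    def __init__(self, name):
        self.name = name
        self.vips = set()
        self.reachable = True  # can the failover tooling talk to this host?

    def acquire(self, vip):
        self.vips.add(vip)

    def release(self, vip):
        # An unreachable host never receives the release request,
        # so it keeps answering for the VIP.
        if not self.reachable:
            return False
        self.vips.discard(vip)
        return True


def owners(hosts, vip):
    """Names of all hosts that currently claim the VIP."""
    return sorted(h.name for h in hosts if vip in h.vips)


def vip_failover(old_master, new_master, vip):
    """Move the VIP; returns whether the old master released it."""
    released = old_master.release(vip)
    new_master.acquire(vip)
    return released


VIP = "192.0.2.10"
old = Host("old-master")
new = Host("new-master")
old.acquire(VIP)

old.reachable = False          # e.g. a network partition during failover
released = vip_failover(old, new, VIP)
split_brain = len(owners([old, new], VIP)) > 1
```

With the old master unreachable, `released` is `False` and both hosts end up claiming the address, which is the situation the article describes as traffic being routed to the wrong host.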

The latest setup at GitHub includes the Orchestrator toolkit, Consul for service discovery and the GitHub Load Balancer (GLB). In this architecture, a client application looks up the master by name in DNS, and the name resolves to an Anycast IP address. The advantage of Anycast is that while the name resolves to the same IP address in every data center, client traffic to that IP is routed to the nearest GLB deployment, which is the one in the client's own data center. GLB knows the current active MySQL master backend and routes the traffic to it.

GitHub MySQL HA architecture
Image courtesy: https://GitHubengineering.com/mysql-high-availability-at-GitHub/

Orchestrator, an open source project from GitHub engineering, is responsible for master failure detection and the failover process. It draws on the collective knowledge of all MySQL nodes, including the replicas, to arrive at an informed decision about the master’s state. When a write master fails, the Orchestrator leader node detects the failure and starts the failover process to choose a new MySQL master. The rest of the Orchestrator cluster nodes notice this change and update their local Consul daemon with the new master details. Consul, a service discovery tool from HashiCorp, keeps track of the master nodes by storing them as key-value pairs. Consul can run in a distributed mode across data centers, but in GitHub's case each Consul cluster is independent at the data center level. On a failover event, GLB learns of the master change through Consul Template, which queries the Consul clusters and updates the GLB state; GLB then routes traffic to the new master.
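The flow from failover to load balancer update can be sketched in memory. This is a simplified model, not the real Consul or Consul Template API: the `ConsulKV` class, the `mysql/master/<cluster>` key layout, the port and the rendered line format are all assumptions chosen for illustration.

```python
class ConsulKV:
    """In-memory stand-in for one data center's independent Consul KV store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def publish_master(kv_stores, cluster, host):
    # Models each Orchestrator node writing the new master's identity
    # to the Consul cluster in its own data center.
    for kv in kv_stores.values():
        kv.put(f"mysql/master/{cluster}", host)


def render_backend(kv, cluster):
    # Models what a template renderer does: read the key and emit
    # the load balancer's backend definition (hypothetical format).
    host = kv.get(f"mysql/master/{cluster}")
    if host is None:
        raise LookupError(f"no master recorded for {cluster}")
    return f"backend {cluster} -> {host}:3306"


stores = {"dc1": ConsulKV(), "dc2": ConsulKV()}
publish_master(stores, "cluster1", "db1")
before = render_backend(stores["dc1"], "cluster1")

# Failover: the newly promoted master is published in every data center.
publish_master(stores, "cluster1", "db3")
after = {dc: render_backend(kv, "cluster1") for dc, kv in stores.items()}
```

Because each data center's store is written separately, a data center whose store misses the update would keep rendering the old master, which is one reason the independence of the Consul clusters matters for failure scenarios.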

In the article, Shlomi Noach, senior infrastructure engineer at GitHub, mentions that although the new setup keeps the maximum outage time to "between 10 to 13 seconds" in most cases, some scenarios need more work, such as data center isolation leading to a split-brain, or a Consul outage at the time of failover. GitHub’s new setup is a move away from traditional techniques based on networking to ones based on proxying and service discovery. It completely replaces the VIP-based setup, but there is debate about whether it would have been easier to adopt a different approach using the Border Gateway Protocol (BGP).

Welcome to 2018 by Robert Van Dell II

Leader/follower is the modern preferred terminology over master/slave (e.g. "leader election" from failure detection).

Re: Welcome to 2018 by Daniel Bryant

Thanks for your comment Robert, and I understand your concern (I too personally prefer the modern terminology).

I've looked in the MySQL docs (dev.mysql.com/doc/refman/8.0/en/replication.html) and they do still use the older terminology, and so I can understand why Hrish chose the original words for this news piece. However, the source article from GitHub used the terms "master-replica" and so I have updated this piece to reflect their choice.

Best wishes,

Daniel
InfoQ News Manager
