
Migrating GitHub's Web and API to Kubernetes Running on Bare Metal

by Daniel Bryant on Sep 12, 2017. Estimated reading time: 4 minutes


Over the last year, GitHub has evolved the internal infrastructure that runs the Ruby on Rails application responsible for github.com and api.github.com so that it now runs on Kubernetes. The migration began with web and API applications running as Unicorn processes deployed onto Puppet-managed bare metal ("metal cloud") servers, and ended with all web and API requests being served by containers running in Kubernetes clusters deployed onto that same metal cloud.

According to the GitHub engineering blog, the basic approach to deploying and running GitHub did not significantly change over the initial eight years of operation. However, GitHub itself changed dramatically, with new features, larger software communities, more GitHubbers on staff, and many more requests per second. As the organisation grew, the existing operational approach began to exhibit new problems: many teams wanted to extract functionality into smaller services that could run and be deployed independently; and as the number of services increased, the SRE team found themselves increasingly performing maintenance, leaving little time for enhancing the underlying platform. GitHub engineers needed a self-service platform they could use to experiment with, deploy, and scale new services.

Several qualities of Kubernetes stood out from the other platforms that were initially evaluated: the vibrant open source community supporting the project; the first-run experience, which allowed deployment of a small cluster and an application within the first few hours of the initial experiment; and a "wealth of information available about the experience that motivated its design", particularly the acmqueue article "Borg, Omega, and Kubernetes".

At the earliest stages of this project, the GitHub team made a deliberate decision to target the migration of the critical web traffic workload. Many factors contributed to this decision, for example:

  • The deep knowledge of this application throughout GitHub would be useful during the process of migration.
  • The team wanted to make sure the habits and patterns they developed were suitable for large applications as well as smaller services.
  • Migrating a critical, high-visibility workload would encourage further Kubernetes adoption at GitHub.

Given the critical nature of the workload chosen to migrate, a high level of operational confidence was needed before serving any production traffic. Accordingly, a series of prototype Kubernetes "review lab" clusters were constructed. The end result was a chat-based interface for creating an isolated deployment of GitHub for any pull request. Labs are cleaned up one day after their last deploy, and as each lab is created in its own Kubernetes namespace, cleanup is as simple as deleting the namespace, which the deployment system performs automatically when necessary.
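The namespace-per-lab design makes the cleanup policy described above almost trivial. A minimal sketch, assuming a hypothetical one-day TTL keyed on each lab's last deploy time (the names and the `delete_namespace` callback are illustrative, not GitHub's actual code):

```python
from datetime import datetime, timedelta

# Each review lab lives in its own Kubernetes namespace; labs whose last
# deploy is older than one day become candidates for deletion.
LAB_TTL = timedelta(days=1)

def stale_labs(labs: dict[str, datetime], now: datetime) -> list[str]:
    """Return the namespaces whose last deploy happened more than LAB_TTL ago."""
    return [ns for ns, last_deploy in labs.items() if now - last_deploy > LAB_TTL]

def cleanup(labs: dict[str, datetime], now: datetime, delete_namespace) -> None:
    # Deleting the namespace removes every resource in the lab in one step.
    for ns in stale_labs(labs, now):
        delete_namespace(ns)
```

Because every resource in a lab belongs to its namespace, a single namespace deletion tears down the whole deployment, which is what makes the automated cleanup so simple.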

To satisfy the performance and reliability requirements of the flagship GitHub web service - which depends on low-latency access to other data services - a Kubernetes infrastructure was implemented on top of the metal cloud that was running in GitHub’s physical data centers and POPs. Many subprojects were involved in this effort, including: using container networking via the Project Calico network provider; following Kelsey Hightower’s Kubernetes the Hard Way tutorial; Puppetizing the configuration of Kubernetes nodes and Kubernetes apiservers; and enhancing GitHub’s internal load balancing service (GLB) to support Kubernetes NodePort Services.
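A NodePort Service of the kind GLB was extended to target exposes the same port on every node of the cluster, so an external load balancer needs no knowledge of pod placement. A minimal sketch of such a manifest — the name, labels, and port numbers are illustrative assumptions, not GitHub's configuration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: github-web        # hypothetical name
spec:
  type: NodePort
  selector:
    app: github-web       # matches the web application's pods
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30080     # an external balancer can target this port on any node
```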

GitHub running Kubernetes

The GitHub deployment system was enhanced to deploy a new set of Kubernetes resources to a github-production namespace in parallel with the existing production servers, and GLB was enhanced to route staff requests to a different backend based on a feature-toggling, Flipper-influenced cookie. Staff could then opt in to the experimental Kubernetes backend with a button in the mission control bar. The load from internal users helped the team identify problems, fix bugs, and start getting comfortable with Kubernetes in production.
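The cookie-driven backend selection can be sketched as a small routing function. This is a hedged illustration of the mechanism described above; the cookie name and backend labels are assumptions, not GLB's actual configuration:

```python
# Hypothetical opt-in cookie set when a staff member clicks the
# mission-control button; GLB inspects it on each request.
KUBERNETES_OPT_IN_COOKIE = "backend-kubernetes"

def choose_backend(cookies: dict[str, str]) -> str:
    """Route staff who opted in to the Kubernetes backend; everyone else
    stays on the existing metal-cloud Unicorn backend."""
    if cookies.get(KUBERNETES_OPT_IN_COOKIE) == "true":
        return "kubernetes"
    return "metal-cloud"
```

Because the toggle lives in a cookie rather than in server state, rolling back an individual user is as simple as clearing the cookie.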

Several initial failure tests produced unexpected results. In particular, a test that simulated the failure of a single apiserver node disrupted the cluster in a way that negatively impacted the availability of running workloads. Given this observation that a Kubernetes cluster could degrade in a way that might disrupt service, the web application now runs on multiple clusters in each physical site, and the process of diverting requests away from an unhealthy cluster to the other healthy ones has been fully automated.
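The automated failover policy amounts to removing an unhealthy cluster from the candidate set at a site and spreading requests over what remains. An illustrative sketch under that assumption, with health checks stubbed out as booleans (none of these names come from GitHub's systems):

```python
def healthy_clusters(clusters: dict[str, bool]) -> list[str]:
    """clusters maps cluster name -> result of its last health check."""
    return [name for name, healthy in clusters.items() if healthy]

def route(request_id: int, clusters: dict[str, bool]) -> str:
    """Pick a healthy cluster at this site for the given request."""
    candidates = healthy_clusters(clusters)
    if not candidates:
        raise RuntimeError("no healthy cluster at this site")
    # Naive round-robin over healthy clusters; real load balancing is richer.
    return candidates[request_id % len(candidates)]
```

Running multiple clusters per site means a whole-cluster degradation shrinks capacity rather than causing an outage, which is the property the failure tests showed was needed.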

The frontend transition was completed in a little over a month while keeping performance and error rates within targets. During this migration, an issue was encountered that persists to this day: during times of high load and/or high rates of container churn, some Kubernetes nodes will kernel panic and reboot. Although the SRE team is not satisfied with this situation, and is continuing to investigate it with high priority, they are happy that Kubernetes is able to route around these failures automatically and continue serving traffic within target error bounds.

The GitHub engineering team is "inspired by our experience migrating this application to Kubernetes", and while the scope of this first migration was intentionally limited to stateless workloads, there is excitement about experimenting with patterns for running stateful services on Kubernetes, for example using StatefulSets.

Additional information on the GitHub adoption of Kubernetes can be found on the GitHub Engineering Blog.


Broken link to GitHub Engineering Blog by Chris Neave

The last paragraph contains a broken link to the GH eng blog. It should be githubengineering.com/kubernetes-at-github/.

Re: Broken link to GitHub Engineering Blog by Daniel Bryant

Many thanks for pointing this out, Chris - this was completely my fault: a cut-and-paste error from another document I linked to in the article.

I've updated the link as you've suggested.

Thanks again!
