Migrating GitHub's Web and API to Kubernetes Running on Bare Metal

Over the last year, GitHub has evolved the internal infrastructure that runs the Ruby on Rails application responsible for github.com and api.github.com so that it now runs on Kubernetes. The migration began with web and API applications running on Unicorn processes that were deployed onto Puppet-managed bare metal ("metal cloud") servers, and ended with all web and API requests being served by containers running in Kubernetes clusters deployed onto the metal cloud.

According to the GitHub engineering blog, the basic approach to deploying and running GitHub did not significantly change over the initial eight years of operation. However, GitHub itself changed dramatically, with new features, larger software communities, more GitHubbers on staff, and many more requests per second. As the organisation grew, the existing operational approach began to exhibit new problems: many teams wanted to extract the functionality into smaller services that could run and be deployed independently; and as the number of services increased, the SRE team found they were increasingly performing maintenance, which meant there was little time for enhancing the underlying platform. GitHub engineers needed a self-service platform they could use to experiment, deploy, and scale new services.

Several qualities of Kubernetes stood out from the other platforms that were initially evaluated: the vibrant open source community supporting the project; the first-run experience, which allowed the team to deploy a small cluster and an application within the first few hours of its initial experiment; and a "wealth of information available about the experience that motivated its design", particularly the acmqueue article "Borg, Omega, and Kubernetes".

At the earliest stages of this project, the GitHub team made a deliberate decision to target the migration of the critical web traffic workload. Many factors contributed to this decision, for example:

The deep knowledge of this application throughout GitHub would be useful during the process of migration.
The team wanted to make sure the habits and patterns it developed were suitable for large applications as well as smaller services.
Migrating a critical, high-visibility workload would encourage further Kubernetes adoption at GitHub.

Given the critical nature of the workload chosen to migrate, a high level of operational confidence was needed before serving any production traffic. Accordingly, a series of prototype Kubernetes "review lab" clusters were constructed. The end result was a chat-based interface for creating an isolated deployment of GitHub for any pull request. Labs are cleaned up one day after their last deploy, and as each lab is created in its own Kubernetes namespace, cleanup is as simple as deleting the namespace, which the deployment system performs automatically when necessary.
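The lab expiry rule described above can be illustrated with a minimal sketch. It assumes a mapping from namespace name to last-deploy time; the function name, the namespace names, and the timestamps are hypothetical and are not taken from GitHub's deployment system. The sketch only selects the expired namespaces, and the deletion itself would be left to the cluster tooling (for example, `kubectl delete namespace <name>`).

```python
from datetime import datetime, timedelta, timezone

# Labs are cleaned up one day after their last deploy (per the article).
LAB_TTL = timedelta(days=1)


def expired_lab_namespaces(last_deploys, now):
    """Return, sorted by name, the namespaces whose last deploy is older than LAB_TTL.

    last_deploys maps a namespace name to the datetime of its last deploy.
    Both the mapping and this helper are hypothetical, for illustration only.
    """
    return sorted(
        namespace
        for namespace, deployed_at in last_deploys.items()
        if now - deployed_at > LAB_TTL
    )


# Hypothetical example data: one stale lab and one recently deployed lab.
now = datetime(2017, 8, 16, 12, 0, tzinfo=timezone.utc)
labs = {
    "review-lab-pr-101": now - timedelta(hours=30),
    "review-lab-pr-102": now - timedelta(hours=2),
}
stale = expired_lab_namespaces(labs, now)
print(stale)
```

Because each lab lives in its own namespace, a selection step like this is all the cleanup logic has to decide; removing the namespace removes every resource created inside it.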

To satisfy the performance and reliability requirements of the flagship GitHub web service - which depends on low-latency access to other data services - a Kubernetes infrastructure was implemented on top of the metal cloud that was running in GitHub’s physical data centers and POPs. Many subprojects were involved in this effort, including: using container networking via the Project Calico network provider; following Kelsey Hightower’s Kubernetes the Hard Way tutorial; Puppetizing the configuration of Kubernetes nodes and Kubernetes apiservers; and enhancing GitHub’s internal load balancing service (GLB) to support Kubernetes NodePort Services.

GitHub running Kubernetes

After enhancing the GitHub deployment system to deploy a new set of Kubernetes resources to a github-production namespace in parallel with the existing production servers, and enhancing GLB to support routing staff requests to a different backend based on a feature-toggling Flipper-influenced cookie, staff were allowed to opt in to the experimental Kubernetes backend with a button in the mission control bar. The load from internal users helped the team identify problems, fix bugs, and become comfortable with running Kubernetes in production.
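As a rough illustration of cookie-based backend selection, the sketch below routes a request to one of two backend pools. It is not GLB's implementation: the cookie name, the backend labels, and the function are all assumptions made for this example.

```python
# Hypothetical cookie name; the article does not state the real one.
OPT_IN_COOKIE = "k8s_backend_opt_in"

# Hypothetical labels for the two backend pools.
KUBERNETES_BACKEND = "kubernetes"
METAL_BACKEND = "metal"


def choose_backend(cookies, is_staff):
    """Pick a backend pool for a single request.

    Only staff requests carrying the opt-in cookie go to the experimental
    Kubernetes backend; everything else stays on the existing servers.
    """
    if is_staff and cookies.get(OPT_IN_COOKIE) == "1":
        return KUBERNETES_BACKEND
    return METAL_BACKEND


print(choose_backend({OPT_IN_COOKIE: "1"}, is_staff=True))
print(choose_backend({}, is_staff=True))
```

Defaulting to the existing backend means a missing or malformed cookie can never send a request to the experimental path.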

Several initial failure tests produced unexpected results. In particular, a test that simulated the failure of a single apiserver node disrupted the cluster in a way that negatively impacted the availability of running workloads. Having observed that a Kubernetes cluster can degrade in a way that might disrupt service, the team now runs the web application on multiple clusters in each physical site, and the process of diverting requests away from an unhealthy cluster to the other healthy ones has been fully automated.
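The idea of diverting requests away from an unhealthy cluster can be sketched as follows. This is an illustrative model only: the article does not describe how GitHub's automation measures health or distributes traffic, so the health map, the cluster names, and the even split across healthy clusters are assumptions.

```python
def traffic_weights(cluster_health):
    """Split traffic evenly across the clusters currently marked healthy.

    cluster_health maps a cluster name to True (healthy) or False (unhealthy).
    Unhealthy clusters receive no traffic. If no cluster is healthy, an empty
    mapping is returned and the caller must decide how to fail.
    """
    healthy = [name for name, is_healthy in cluster_health.items() if is_healthy]
    if not healthy:
        return {}
    share = 1 / len(healthy)
    return {name: share for name in healthy}


# Hypothetical site with three clusters, one of which has degraded.
site = {"cluster-a": True, "cluster-b": False, "cluster-c": True}
weights = traffic_weights(site)
print(weights)
```

Running several clusters per site is what makes this kind of diversion possible: the loss of one cluster reduces capacity rather than availability.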

The frontend transition was completed in a little over a month while keeping performance and error rates within targets. During this migration, an issue was encountered that persists to this day: during times of high load and/or high rates of container churn, some Kubernetes nodes will kernel panic and reboot. Although the SRE team is not satisfied with this situation, and is continuing to investigate it with high priority, they are happy that Kubernetes is able to route around these failures automatically and continue serving traffic within target error bounds.

The GitHub engineering team is "inspired by our experience migrating this application to Kubernetes", and while the scope of the first migration was intentionally limited to stateless workloads, there is excitement about experimenting with patterns for running stateful services on Kubernetes, for example using StatefulSets.

Additional information on the GitHub adoption of Kubernetes can be found on the GitHub Engineering Blog.
