The GitHub engineering team recently blogged about how it ensures fast and reliable deployments. Raffaele Di Fazio, a software engineer at GitHub, provided a deep dive into the company's deployment mechanics.
GitHub is deployed on multiple Kubernetes clusters as well as on physical servers. Because the deployment process affects both GitHub customers and internal users, safeguarding deployment reliability was critical for the team. The GitHub engineering team started collecting data from its deployment tooling in order to fully understand the problem space, tracking metrics such as the following (a sketch of how these might be recorded appears after the list):
- CI/CD build duration
- Duration of individual steps of the deployment pipeline
- The total duration of the deployment pipeline
- The final state of a deployment pipeline
- Number of rolled-back deployments
- Occurrences of deployment retries in any of the steps of the pipeline
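To make the list above concrete, the following Python sketch shows one way such per-deployment metrics could be recorded. The record shape, field names, and retry logic are illustrative assumptions, not GitHub's actual tooling.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DeploymentRecord:
    """Metrics captured for one run of a deployment pipeline (hypothetical shape)."""
    pipeline_id: str
    step_durations: dict[str, float] = field(default_factory=dict)  # seconds per step
    retries: int = 0                 # retries across any step of the pipeline
    rolled_back: bool = False
    final_state: str = "pending"     # e.g. "succeeded", "failed", "canceled"

    @property
    def total_duration(self) -> float:
        """Total pipeline duration as the sum of its step durations."""
        return sum(self.step_durations.values())


def run_step(record: DeploymentRecord, name: str, fn, max_attempts: int = 3) -> None:
    """Run one pipeline step, timing it and counting retries on failure."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            fn()
            break
        except Exception:
            if attempt == max_attempts - 1:
                record.final_state = "failed"
                raise
            record.retries += 1
    record.step_durations[name] = time.monotonic() - start
```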
The team also analyzed some general metrics, such as the time it takes for a pull request to be merged and the number of pull requests deployed or merged. Many of the collected metrics align with the four key metrics identified in Forsgren et al.'s "Accelerate" as differentiating low, medium, and high software delivery performers: lead time, deployment frequency, mean time to restore (MTTR), and change failure percentage.
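As a rough illustration of how these four key metrics could be derived from deployment data, the snippet below computes them over a handful of deployment records. The record format and values are invented for the example and do not reflect GitHub's data model.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records: when the pull request merged, when it reached
# production, whether the change failed, and when service was restored if it did.
deployments = [
    {"merged_at": datetime(2021, 6, 1, 10), "deployed_at": datetime(2021, 6, 1, 11),
     "failed": False, "restored_at": None},
    {"merged_at": datetime(2021, 6, 1, 12), "deployed_at": datetime(2021, 6, 1, 14),
     "failed": True, "restored_at": datetime(2021, 6, 1, 15)},
]

window_days = 7  # assumed observation window

lead_time = median(d["deployed_at"] - d["merged_at"] for d in deployments)
deployment_frequency = len(deployments) / window_days          # deployments per day
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)         # fraction of failed changes
mttr = median(d["restored_at"] - d["deployed_at"] for d in failures)  # time to restore

print(lead_time, deployment_frequency, change_failure_rate, mttr)
```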
As the customer-facing GitHub is a monolithic Ruby on Rails application deployed onto Kubernetes, the team has to start pods in multiple Kubernetes clusters when deploying a new version. The blog cited a situation where the deploy tooling did not provide much information, forcing engineers to look directly into Kubernetes to resolve the issue.
"We introduced changes to our tooling to provide better information on a deployment while it is being rolled out and proactively providing specific lower level information in case of failures, which includes a view of the Kubernetes events without the need to directly access Kubernetes itself."
The GitHub engineering team also started tracking several Service Level Objectives (SLOs) for deployments. Unlike the traditional practice of defining SLOs for a web application's success rate or latency, applying SLOs to deployments allowed the team to shift its focus towards improving the production deployment process.
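As an illustration of what a deployment-focused SLO could look like, the sketch below checks whether a target share of deployments completed within a time budget. The 30-minute threshold and 95% objective are invented for the example, not GitHub's published targets.

```python
from datetime import timedelta

TARGET_DURATION = timedelta(minutes=30)  # assumed threshold for a "fast enough" deploy
OBJECTIVE = 0.95                         # assumed SLO: 95% of deployments meet the target


def deployment_slo_met(durations: list[timedelta]) -> bool:
    """True if the share of deployments within the target duration meets the objective."""
    if not durations:
        return True
    within_target = sum(1 for d in durations if d <= TARGET_DURATION)
    return within_target / len(durations) >= OBJECTIVE


# Example: 19 of 20 deployments finish within 30 minutes -> 95%, so the SLO is met.
sample = [timedelta(minutes=20)] * 19 + [timedelta(minutes=45)]
print(deployment_slo_met(sample))  # True
```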
As discussed in Pitfalls in Measuring SLOs, a recent Failover Conf talk by Danyel Fisher and Liz Fong-Jones, collecting and analyzing this data is not trivial. The Google Site Reliability Engineering books, available free online, remain the recommended starting point for readers interested in learning more.
At GitHub, a dedicated team is responsible for continuous deployment, helping other teams deploy their applications and developing best practices for doing so. The SLOs help this team prioritize its work so that more features and deployment improvements can be shipped reliably each day.
The Building GitHub blog series provides a deep dive into the GitHub engineering organization, showing how teams across the organization work together behind the scenes. The posts in the series started back in Q3 2020. A previous post in the series described how the teams improved the deployment experience for github.com.