How GitHub Improved Code Push Processing Reliability

GitHub has rolled out several technical upgrades to improve the reliability and efficiency of code pushes, one of the most frequent actions developers perform on the platform. The changes target failure modes in push processing and aim to give users who regularly push code to GitHub a smoother experience.

William Haltom, software engineer at GitHub, elaborated on the background of this technical upgrade. To set the context, Haltom explained that pushing code to GitHub sets off a series of actions, such as synchronizing pull requests, dispatching webhooks, triggering workflows, installing apps, publishing GitHub Pages, and updating Codespaces configuration. In addition, each push activates over 60 internal processes within GitHub, which power different features and automated tools for developers.

Previously, all the actions triggered by a code push were handled by a single monolithic background job known as the RepositoryPushJob. This job, which lived inside GitHub's Ruby on Rails monolith, executed all push processing logic sequentially. Its size and complexity caused problems: retrying individual tasks within the job was difficult, and most steps weren't retried at all.

This lack of reliable retry mechanisms meant that errors in the early stages of the job could cascade and impact subsequent steps, creating a wide range of potential problems.

Source: How we improved push processing on GitHub
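The failure mode of a sequential job without per-step retries can be illustrated with a minimal Python sketch. It is not GitHub's code (the real RepositoryPushJob is part of a Ruby on Rails application); the step names and the `monolithic_push_job` function are hypothetical and exist only to show how one early failure leaves later steps unprocessed.

```python
# Illustrative sketch only: step names are hypothetical, not GitHub's code.

def sync_pull_requests(push):
    return "pull requests synced"

def dispatch_webhooks(push):
    # Simulate a failure in an early step of the job.
    raise RuntimeError("webhook service unavailable")

def trigger_workflows(push):
    return "workflows triggered"

def monolithic_push_job(push, steps):
    """Run every step in order inside one job, with no per-step retry."""
    completed = []
    try:
        for step in steps:
            step(push)
            completed.append(step.__name__)
    except Exception as error:
        # The first failure aborts the job; every later step is skipped.
        failed_index = len(completed)
        return {
            "completed": completed,
            "failed": steps[failed_index].__name__,
            "skipped": [s.__name__ for s in steps[failed_index + 1:]],
            "error": str(error),
        }
    return {"completed": completed, "failed": None, "skipped": [], "error": None}

push = {"repo": "octo/demo", "ref": "refs/heads/main"}  # hypothetical payload
result = monolithic_push_job(
    push, [sync_pull_requests, dispatch_webhooks, trigger_workflows]
)
print(result)
```

In this sketch the workflow step never runs, even though it has nothing to do with the webhook failure. Retrying the whole job would also repeat the steps that already succeeded.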

GitHub revamped its code push process by breaking down the long, sequential job into multiple independent, parallel processes. To achieve this, they implemented a new Kafka topic to broadcast push events. Then, they analyzed and categorized the numerous push processing tasks based on their owning services or logical relationships, such as dependencies and retry requirements.

Each group of tasks was assigned to a new background job with a designated owner and appropriate retry settings. These jobs were then configured to be triggered by the new Kafka events.
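As a rough sketch of this design, the Python below models one push event delivered to several independent jobs, each with its own owner and retry budget. It uses an in-memory loop rather than a real Kafka client, runs the jobs one after another instead of in parallel, and every name in it (`PushJob`, `fan_out`, the team names, the retry counts) is an assumption made for illustration, not GitHub's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PushJob:
    name: str
    owner: str          # hypothetical owning team
    max_attempts: int   # per-job retry budget
    handler: Callable[[dict], None]

def run_with_retries(job, event):
    """Run one job, retrying up to job.max_attempts; return (ok, attempts)."""
    for attempt in range(1, job.max_attempts + 1):
        try:
            job.handler(event)
            return True, attempt
        except Exception:
            continue  # a real system would back off and log here
    return False, job.max_attempts

def fan_out(event, jobs):
    """Deliver one push event to every subscribed job independently."""
    return {job.name: run_with_retries(job, event) for job in jobs}

def make_flaky(failures):
    """Build a handler that fails `failures` times, then succeeds."""
    remaining = {"count": failures}
    def handler(event):
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("transient failure")
    return handler

def always_fail(event):
    raise RuntimeError("permanent failure")

def succeed(event):
    return None

jobs = [
    PushJob("webhooks", "webhooks-team", 3, make_flaky(1)),
    PushJob("pages", "pages-team", 2, always_fail),
    PushJob("workflows", "actions-team", 1, succeed),
]
results = fan_out({"repo": "octo/demo", "ref": "refs/heads/main"}, jobs)
print(results)
```

Here the flaky job recovers on its second attempt, the failing job exhausts its own retry budget, and the third job is unaffected by either. That isolation is the property the new architecture is after.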

For this architecture, GitHub used an internal system that queues background jobs in response to Kafka events. Supporting the new design required several improvements: developing a reliable publisher for the Kafka events, setting up a dedicated worker pool to handle the increased number of jobs, enhancing observability to monitor the push event flow, and establishing a system for consistent feature flagging to ensure a safe and controlled rollout of the new system.

Source: How we improved push processing on GitHub
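The article does not describe how the feature flagging works. One plausible reading of "consistent" is that every component must reach the same decision for the same push, so that each push is handled by exactly one of the old or new paths. The Python sketch below shows a common way to get that property, a deterministic percentage rollout keyed on a hash. The flag name, the `repo_id` key, and all function names are assumptions for illustration.

```python
import hashlib

FLAG = "push_events_via_kafka"  # hypothetical flag name

def rollout_bucket(flag_name: str, repo_id: int) -> int:
    """Map (flag, repo) to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{repo_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def is_enabled(flag_name: str, repo_id: int, percent: int) -> bool:
    """True when the repo falls inside the current rollout percentage."""
    return rollout_bucket(flag_name, repo_id) < percent

def route_push(repo_id: int, percent: int) -> str:
    """Pick the processing path; the same inputs always give the same path."""
    if is_enabled(FLAG, repo_id, percent):
        return "new_jobs"
    return "legacy_job"

# The publisher and a consumer evaluate the flag separately but must agree.
publisher_view = route_push(repo_id=12345, percent=25)
consumer_view = route_push(repo_id=12345, percent=25)
print(publisher_view == consumer_view)
```

Because the bucket depends only on the flag name and the repository, raising the percentage only ever moves repositories from the legacy path to the new one, which keeps a gradual rollout predictable.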

GitHub recently made news by introducing Arm64 support on GitHub Actions, providing developers with Arm-built images to release their software on Arm architecture. This announcement sparked a conversation within the tech community on Hacker News. Obviyus, a GitHub and HN user, expressed their excitement for the introduction of Arm64 support, stating that they had been relying on self-hosted Arm runners for their projects. They noted how compiling code on their small Arm VPS could significantly slow down other tasks and welcomed the addition of official Arm64 support as a much-needed improvement.

Earlier this year, a Hacker News post also discussed Copilot Workspace, a tool designed to streamline the development process by letting developers use natural language to brainstorm, plan, code, test, and execute projects.

Haltom further explained the results of the architectural revamp, stating that the smaller, decoupled processes have led to a reduced blast radius for problems. Issues with one part of the push handling logic no longer cascade and impact other areas, leading to improved stability and reliability. This decoupling has also decreased dependencies.

Additionally, the new architecture has clarified ownership, distributing responsibility for push processing code among more than 15 service owners. This allows teams to add and iterate on push functionality without unintended consequences for others. Lastly, the smaller, less complex jobs allow for more reliable push processing.
