
GitHub Uses eBPF to Eliminate Deployment Risks and Prevent Circular Failures

GitHub has introduced a new approach to improving deployment safety by leveraging eBPF, enabling the company to detect and prevent hidden circular dependencies that could block recovery during outages. The technique, detailed in a recent engineering blog post, allows GitHub to monitor and selectively restrict the network behavior of deployment processes at the kernel level, ensuring that critical systems can still be updated even when parts of the platform are unavailable.

The innovation addresses a long-standing risk in large-scale systems: circular dependencies, where deployment tooling relies, directly or indirectly, on the very services it is meant to fix. GitHub highlighted scenarios where deployment scripts might attempt to fetch binaries, call internal services, or trigger background updates that depend on GitHub itself. In failure conditions, these dependencies can cascade, preventing remediation and prolonging outages. By using eBPF to isolate deployment processes and control their outbound network access, GitHub can proactively block such calls and surface them to engineers before they cause incidents.

At the core of the solution is eBPF's ability to run custom programs inside the Linux kernel, hooking into low-level system events such as network requests. GitHub uses this capability to place deployment scripts inside controlled environments (cgroups), where their network traffic can be inspected, filtered, or blocked based on predefined rules. This allows the platform to enforce fine-grained, per-process network policies without affecting the broader system or production traffic.
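
To illustrate the general technique, the following is a minimal, hypothetical sketch of a cgroup-attached eBPF egress filter written against libbpf conventions. The map name, allowlist semantics, and fail-open behavior are illustrative assumptions, not GitHub's published implementation. A program of type BPF_PROG_TYPE_CGROUP_SKB attached at the cgroup/egress hook sees every outbound packet from processes in that cgroup and returns a per-packet verdict:

    // egress_filter.bpf.c - illustrative sketch, not GitHub's code.
    // Attached to a deployment process's cgroup at the cgroup/egress hook,
    // it drops any IPv4 packet whose destination is not on an allowlist
    // map populated from user space.
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);   // destination IPv4 address, network byte order
        __type(value, __u8);  // 1 = allowed
    } allowed_dsts SEC(".maps");

    SEC("cgroup_skb/egress")
    int filter_egress(struct __sk_buff *skb)
    {
        struct iphdr iph;

        // Only inspect IPv4 traffic; let everything else pass in this sketch.
        if (skb->protocol != bpf_htons(ETH_P_IP))
            return 1;

        // cgroup_skb programs see the packet starting at the IP header.
        if (bpf_skb_load_bytes(skb, 0, &iph, sizeof(iph)) < 0)
            return 1; // unparsable; fail open in this sketch

        __u32 dst = iph.daddr;
        if (bpf_map_lookup_elem(&allowed_dsts, &dst))
            return 1; // destination explicitly allowed

        return 0; // verdict 0 = drop: block the unexpected outbound call
    }

    char LICENSE[] SEC("license") = "GPL";

User space would load this with libbpf and attach it to the deployment cgroup's file descriptor (for example via bpf_program__attach_cgroup), then populate allowed_dsts with the destinations the deployment is permitted to reach. Blocked lookups can be logged and traced back to the offending process.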

To overcome the challenge of managing dynamic infrastructure, GitHub extended this approach with DNS-aware filtering. By intercepting DNS queries and routing them through a proxy, the system can evaluate outbound requests based on domain names rather than static IP addresses, making it far more adaptable in large, fast-changing environments. The system also maps blocked requests back to specific processes and commands, giving teams clear visibility into what triggered the issue and how to fix it.
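
A common way to implement this kind of DNS interception with eBPF - plausibly similar in spirit to GitHub's proxy routing, though the details below are assumptions rather than their published code - is a BPF_PROG_TYPE_CGROUP_SOCK_ADDR program on the cgroup/connect4 hook that transparently rewrites DNS destinations to a local proxy:

    // dns_redirect.bpf.c - illustrative sketch, not GitHub's code.
    // Transparently rewrite UDP connections to port 53 from processes in
    // the monitored cgroup so they land on a local DNS proxy, which can
    // then apply domain-based policy and record who asked for what.
    #include <linux/bpf.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    #define PROXY_ADDR 0x7F000001  // 127.0.0.1 (assumed proxy address)
    #define PROXY_PORT 5353        // assumed proxy port

    SEC("cgroup/connect4")
    int redirect_dns(struct bpf_sock_addr *ctx)
    {
        // Only touch UDP connections headed for the standard DNS port.
        if (ctx->protocol != IPPROTO_UDP)
            return 1;
        if (ctx->user_port != bpf_htons(53))
            return 1;

        // Rewrite the destination so the query reaches the local proxy.
        ctx->user_ip4 = bpf_htonl(PROXY_ADDR);
        ctx->user_port = bpf_htons(PROXY_PORT);
        return 1; // 1 = allow the (rewritten) connect to proceed
    }

    char LICENSE[] SEC("license") = "GPL";

A production version would also cover cgroup/sendmsg4 for unconnected UDP sends and the corresponding IPv6 hooks. The proxy itself resolves the name, checks it against the domain allowlist, and maps the query back to the originating process.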

Traditionally, identifying circular dependencies has been a manual and reactive process, often discovered only during incidents. GitHub's approach shifts this to proactive detection: if a deployment introduces a risky dependency - whether direct, hidden, or transitive - the system flags it immediately. This reduces the likelihood of deployment failures during outages and shortens mean time to recovery by ensuring that remediation paths remain available.

The system was rolled out over a six-month period and is now actively used to safeguard deployments across GitHub's infrastructure. It also provides additional benefits, including auditing outbound calls during deployments and enforcing resource limits to prevent runaway scripts from impacting production workloads.
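
The mention of resource limits maps naturally onto cgroup v2 controls. As a rough sketch - the cgroup path and limit values below are hypothetical, not GitHub's settings - capping a deployment cgroup from user space amounts to writing its control files:

    // limit_cgroup.c - illustrative sketch: cap memory and CPU for a
    // deployment script's cgroup v2 so a runaway process cannot starve
    // production workloads. Path and limits are assumptions.
    #include <stdio.h>
    #include <stdlib.h>

    static int write_file(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        if (fputs(value, f) == EOF) { perror(path); fclose(f); return -1; }
        fclose(f);
        return 0;
    }

    int main(void)
    {
        const char *cg = "/sys/fs/cgroup/deploy-scripts"; // hypothetical cgroup
        char path[256];

        // Hard memory cap: the kernel OOM-kills the cgroup past this point.
        snprintf(path, sizeof(path), "%s/memory.max", cg);
        if (write_file(path, "512M") != 0) return EXIT_FAILURE;

        // CPU quota: at most half a CPU (50ms of runtime per 100ms period).
        snprintf(path, sizeof(path), "%s/cpu.max", cg);
        if (write_file(path, "50000 100000") != 0) return EXIT_FAILURE;

        puts("limits applied");
        return EXIT_SUCCESS;
    }

With limits like these in place, a runaway script is OOM-killed or CPU-throttled within its own cgroup rather than competing with production workloads on the same host.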

GitHub's use of eBPF reflects a wider industry trend toward kernel-level observability and control as systems grow more complex. Increasingly, organizations are turning to eBPF not just for monitoring, but for enforcing runtime policies, improving security, and managing system behavior in real time. The approach allows platform teams to move beyond traditional application-level controls and gain deeper visibility into how systems behave under real-world conditions.

The development also highlights a key evolution in deployment practices: ensuring that systems can recover from failure. As platforms become more interconnected, hidden dependencies can create unexpected failure modes. By embedding safeguards directly into the operating system layer, GitHub demonstrates how modern infrastructure can be made more resilient, ensuring that the tools used to fix systems remain independent of the systems themselves.

Other large-scale platforms face similar challenges around hidden dependencies and deployment safety, and many are adopting comparable, but not identical, approaches. For example, Google has long emphasized dependency isolation and hermetic builds in its build tooling, such as Bazel, ensuring that build and deployment processes do not rely on external or runtime state that could fail during incidents. This reduces the risk of circular dependencies by design, as deployments are constructed to be reproducible and self-contained. Similarly, Amazon Web Services promotes cell-based architecture, where services are segmented into isolated units so that failures and their dependencies are contained, ensuring that deployment and recovery paths remain available even when parts of the system are degraded.

In the cloud-native ecosystem, projects like Kubernetes and networking layers such as Cilium are also evolving toward runtime policy enforcement and observability at the kernel and network layers, similar to GitHub's use of eBPF. Meanwhile, platforms like GitLab focus on pipeline isolation and dependency control, encouraging practices such as artifact pinning, offline runners, and restricted network access during CI/CD execution.

Across these approaches, a common theme emerges: rather than relying solely on process or documentation to avoid circular dependencies, leading platforms are embedding guardrails directly into infrastructure and execution environments, ensuring that deployment systems remain reliable even under failure conditions.
