How Cloudflare Solved a Congestion Bug in quiche

Cloudflare has recently shared how they uncovered an issue in their Rust implementation of CUBIC, a congestion control algorithm, which prevented it from recovering from heavy packet loss at the start of a connection.

The problem affected their open-source Quick UDP Internet Connections (QUIC) implementation, quiche, which sits in the critical path for a significant part of the traffic they serve. The team traced the issue back to a change in the Linux kernel that was intended to fix a real problem in TCP.

Esteban Carisimo, systems engineer, and Antonio Vicente, principal systems engineer at Cloudflare, explain that their investigation started from reports of unexpected failures in their ingress proxy integration test pipeline. All the failing tests had one thing in common: they were evaluating CUBIC in scenarios of heavy loss during the initial phase of the connection.

As the authors explain, CUBIC and other loss-based algorithms operate on a fundamental premise:

(1) if there is no packet loss, increase the sending rate (i.e. increase the bandwidth utilization); (2) if there is loss, loss-based algorithms assume that the network's capacity has been exceeded, and the sender must back off (i.e. decrease the bandwidth utilization).
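The two rules can be illustrated with a minimal sketch. It uses a Reno-style additive increase and halving on loss rather than CUBIC's cubic growth curve, and the type and method names are hypothetical, not part of quiche:

```rust
/// Hypothetical, simplified loss-based controller (Reno-style AIMD).
/// This is an illustration only; it is not CUBIC and not quiche's API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LossBasedWindow {
    /// Congestion window, in packets.
    cwnd: f64,
    /// Floor below which the window never shrinks (assumed value).
    min_cwnd: f64,
}

impl LossBasedWindow {
    fn new(initial: f64, min_cwnd: f64) -> Self {
        LossBasedWindow { cwnd: initial, min_cwnd }
    }

    /// Rule (1): a round trip without loss grows the window by one packet.
    fn on_round_without_loss(&mut self) {
        self.cwnd += 1.0;
    }

    /// Rule (2): a loss halves the window, but never below the floor.
    fn on_loss(&mut self) {
        self.cwnd = (self.cwnd / 2.0).max(self.min_cwnd);
    }
}
```

Repeated losses drive the window down to the floor, which is the situation the heavy-loss tests create at the start of a connection.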

To better understand the problem, the team created a simulated setup that runs a quiche HTTP/3 client and server on localhost, with CUBIC as the congestion controller and a configured round-trip time (RTT) of 10 ms. The client downloads a 10 MB file over HTTP/3, with 30% randomised packet loss injected during the first two seconds. The test is expected to complete in about four to five seconds, so a timeout of 10 seconds seemed generous.

The simulated test confirmed what had been observed in the failing integration test pipeline: across multiple 100-run executions, around 60% of runs did not complete before the 10-second timeout.

The team instrumented their implementation to collect detailed information on the faulty behaviour. They noticed that, after the packet loss phase, the congestion window did not grow as expected and showed no signs of recovery. The instrumentation also showed that, during the loss-free phase, CUBIC was switching rapidly between two states: congestion avoidance and recovery. Specifically, it recorded 999 changes in approximately 6.7 seconds, or one transition per ~14 ms, which the team deemed suspiciously close to the connection's configured RTT of 10 ms.
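As a quick check of the quoted figures, dividing 6.7 seconds by 999 gives about 6.7 ms per individual state change. If each trip into recovery and back is counted as two changes (an assumption, not something stated in the article), that works out to about 13.4 ms per cycle, in line with the ~14 ms figure. The helper name below is hypothetical:

```rust
/// Average interval between events, in milliseconds.
fn mean_interval_ms(total_secs: f64, events: u32) -> f64 {
    total_secs * 1000.0 / f64::from(events)
}

/// Average duration of a full cycle, assuming one cycle is two state changes
/// (congestion avoidance -> recovery -> congestion avoidance).
fn mean_cycle_ms(total_secs: f64, state_changes: u32) -> f64 {
    2.0 * mean_interval_ms(total_secs, state_changes)
}
```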

 

Connection overview of a failing test: cumulative packet loss vs. congestion window size with faulty behaviour. Source: Cloudflare blog.

The team also wanted to rule out a broader issue, so they replaced CUBIC with Reno, another loss-based congestion control algorithm. The simulated Reno test passed 100% of the time, scoping the problem to CUBIC only.

According to the team, the way CUBIC calculates idle time caused the implementation to enter a never-ending recovery loop. The loop was triggered when incoming ACK packets brought the number of in-flight bytes to zero during a noisy slow start. At a minimum congestion window of two packets, the idle period optimisation, the authors say, becomes a self-fulfilling prophecy. The loop keeps the congestion controller in the recovery state, with a recovery time aggressively in the future, which effectively blocks the congestion window from increasing.
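One way to picture such a loop is the toy model below. It is a heavily simplified sketch and not quiche's code: it assumes that when a packet is sent with nothing in flight, the time since the previous send is treated as idle and the recovery start time is shifted forward by that amount. All names and the millisecond timeline are hypothetical.

```rust
/// Hypothetical, heavily simplified model of an idle-time adjustment.
/// Times are in milliseconds; none of these names come from quiche.
struct IdleModel {
    last_sent_ms: u64,
    recovery_start_ms: u64,
}

impl IdleModel {
    /// Called when a packet is sent. If nothing was in flight, the time since
    /// the last send counts as idle and shifts the recovery start forward.
    fn on_packet_sent(&mut self, now_ms: u64, bytes_in_flight: usize) {
        if bytes_in_flight == 0 {
            let idle = now_ms - self.last_sent_ms;
            self.recovery_start_ms += idle;
        }
        self.last_sent_ms = now_ms;
    }

    /// A packet sent at or before the recovery start does not grow the window.
    fn in_recovery(&self, sent_ms: u64) -> bool {
        sent_ms <= self.recovery_start_ms
    }
}

/// Counts how many of `rounds` sends land in recovery when every ACK
/// drains the in-flight bytes to zero before the next send.
fn rounds_stuck_in_recovery(rtt_ms: u64, rounds: u32) -> u32 {
    // A packet sent at t = 0 is declared lost at t = rtt, which starts recovery.
    let mut model = IdleModel { last_sent_ms: 0, recovery_start_ms: rtt_ms };
    let mut stuck = 0;
    let mut now = rtt_ms;
    for _ in 0..rounds {
        model.on_packet_sent(now, 0);
        if model.in_recovery(now) {
            stuck += 1;
        }
        // The ACK arrives one RTT later and empties the pipe again.
        now += rtt_ms;
    }
    stuck
}
```

In this model each round trip is mistaken for an idle period of one RTT, so the recovery start always stays ahead of the latest send and no round ever counts towards window growth.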

According to RelevantKnowledge485 on Reddit, others have encountered similar issues:

We hit something eerily similar with high-frequency timer-dependent workloads where C-state transitions were adding unpredictable latency spikes. The debugging approach here — correlating the kernel's power management decisions with protocol-level retransmit behavior — is really solid.

The journey, the authors say, ends well and with a rather simple solution compared to the complexity of the behaviour. “It has a happy ending: an elegant (near-)one-line fix that broke the cycle,” they wrote.

Instead of measuring idle time only from the last data sent, the team now also measures it from the last ACK received.
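The difference between the two measurements can be sketched as follows. Taking the later of the two timestamps is one reading of the description above, not a copy of the actual patch, and the function names are hypothetical:

```rust
/// Idle time measured from the last send only (the behaviour described as faulty).
fn idle_since_last_send(now_ms: u64, last_sent_ms: u64) -> u64 {
    now_ms.saturating_sub(last_sent_ms)
}

/// Idle time measured from the most recent activity, whether a send or an ACK.
fn idle_since_last_activity(now_ms: u64, last_sent_ms: u64, last_ack_ms: u64) -> u64 {
    now_ms.saturating_sub(last_sent_ms.max(last_ack_ms))
}
```

With a packet sent at 10 ms and its ACK arriving at 20 ms, the first function reports 10 ms of idle time at the moment of the next send, while the second reports none, because the connection was busy waiting for that ACK.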

 

Cumulative packet loss vs. Congestion window size with corrected behaviour. Source: Cloudflare blog.

The fix was enough to break the loop and allow the congestion window to recover as expected, restoring the test pass rate to 100%.
