A customer backup job running an hour late on a Friday evening should have been straightforward to diagnose. For engineers at Gearset, it wasn't.
Despite comprehensive dashboards, metrics, and logs, the team found themselves guessing at root causes rather than identifying them. Metrics showed the forest but not the trees; logs showed individual trees but made it nearly impossible to trace a path between them.
During their QCon London 2026 presentation, Julian Wreford and Oli Lane from Gearset explained how distributed tracing with OpenTelemetry filled that gap. Tracing provides a hierarchical structure that groups events from a single operation, offering visibility into cause and effect across service boundaries. While HTTP tracing often works automatically, queues require custom work to maintain context. The team implemented OpenTelemetry's context propagation standard by creating wrappers for their queue clients. The wrappers attach trace IDs and parent span IDs as message metadata, so the full journey of a request remains intact.
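The talk summary does not include Gearset's wrapper code, so the following is only a minimal sketch of the idea, assuming the W3C Trace Context `traceparent` format that OpenTelemetry propagates by default. The `TracedQueue` class, the in-memory list standing in for a real queue client, and the dictionary-based spans are all hypothetical; in practice the OpenTelemetry SDK's propagators would handle the injection and extraction.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(trace_id: str, span_id: str, metadata: dict) -> dict:
    """Producer side: copy the current span's identity into message metadata."""
    out = dict(metadata)
    out["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return out


def extract(metadata: dict):
    """Consumer side: return (trace_id, parent_span_id), or None if absent or invalid."""
    match = TRACEPARENT_RE.match(metadata.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None


class TracedQueue:
    """Hypothetical wrapper; a plain list stands in for the real queue client."""

    def __init__(self):
        self._messages = []

    def send(self, body, trace_id: str, span_id: str) -> None:
        self._messages.append({"body": body, "metadata": inject(trace_id, span_id, {})})

    def receive(self):
        message = self._messages.pop(0)
        parent = extract(message["metadata"])
        # Continue the producer's trace when context is present, otherwise start a new one.
        trace_id, parent_span_id = parent if parent else (secrets.token_hex(16), None)
        span = {
            "trace_id": trace_id,
            "span_id": new_span_id(),
            "parent_span_id": parent_span_id,
        }
        return message["body"], span
```

Because the consumer's span carries the producer's trace ID and names the producer's span as its parent, the two sides of the queue appear as a single trace rather than two unrelated ones.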

A central theme of the transition involved moving away from infrastructure metrics toward Service Level Objectives (SLOs) based on customer experience. The speakers used a traffic analogy to explain why alerting on queue size is often misleading. Just as Google Maps alerts drivers based on expected delay rather than the number of cars on the road, the Gearset team shifted to alerting on latency. They noted that a thousand items on a queue might be processed instantly, while five items could be significantly delayed. Latency is more stable and directly reflects the customer experience, reducing the need for constant re-tuning as system characteristics change.

To implement this strategy, the team adopted a three-step framework: define the Service Level Indicator (SLI), set the SLO, and then configure alerts.
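The three steps can be sketched in a few lines. This is an illustration under stated assumptions rather than the team's implementation: the `Slo` class, the function names, and the threshold and target values used in the usage note are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    threshold_seconds: float  # SLI: an event is "good" if processed within this time
    target: float             # SLO: fraction of events that must be good


def is_good(latency_seconds: float, slo: Slo) -> bool:
    """Step 1, the SLI: classify a single event as good or bad."""
    return latency_seconds <= slo.threshold_seconds


def budget_consumed(latencies, slo: Slo) -> float:
    """Step 2, the SLO: fraction of the error budget used in this window (1.0 = exhausted)."""
    total = len(latencies)
    if total == 0:
        return 0.0
    bad = sum(1 for latency in latencies if not is_good(latency, slo))
    allowed_bad = (1.0 - slo.target) * total
    return bad / allowed_bad


def should_alert(latencies, slo: Slo, max_budget_fraction: float = 1.0) -> bool:
    """Step 3, the alert: fire once the window has used the allowed share of the budget."""
    return budget_consumed(latencies, slo) >= max_budget_fraction
```

With `Slo(threshold_seconds=2.0, target=0.999)` and a window of 10,000 messages, the budget allows 10 slow messages, so 5 slow messages consume about half of it and 20 would trigger the alert. Note that queue depth does not appear anywhere in the calculation.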
They bucketed events into good or bad categories, such as whether a message was processed within two seconds, and defined an error budget for a target of 99.9% success. Once an alert fires, engineers can jump directly from a macro visualisation of the distribution to specific problem traces. The team shared a specific implementation trick for tracking total duration in asynchronous traces by using OpenTelemetry's trace state to embed and propagate the root span's start timestamp. This allows for calculating the time since the trace began for any child span, regardless of how many queues or services it has traversed.
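The trace state trick can be illustrated with a small stand-alone sketch. It assumes the W3C `tracestate` header format (comma-separated `key=value` members), which OpenTelemetry carries alongside the trace and span IDs. The key name `rootstart` and the helper functions are hypothetical, and the real implementation would go through the OpenTelemetry SDK's trace state API rather than string handling.

```python
import time

ROOT_START_KEY = "rootstart"  # hypothetical key name, not taken from the talk


def parse_tracestate(header: str) -> dict:
    """Parse a tracestate header into an ordered dict of key/value members."""
    entries = {}
    for member in header.split(","):
        member = member.strip()
        if "=" in member:
            key, value = member.split("=", 1)
            entries[key] = value
    return entries


def format_tracestate(entries: dict) -> str:
    """Serialise members back into the comma-separated header form."""
    return ",".join(f"{key}={value}" for key, value in entries.items())


def start_root(now_ns=None) -> str:
    """At the root span: record the start time as nanoseconds since the Unix epoch."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    return format_tracestate({ROOT_START_KEY: str(now_ns)})


def seconds_since_root(tracestate: str, now_ns=None):
    """In any child span: elapsed time since the root started, or None if unknown."""
    raw = parse_tracestate(tracestate).get(ROOT_START_KEY)
    if raw is None or not raw.isdigit():
        return None
    now_ns = time.time_ns() if now_ns is None else now_ns
    return (now_ns - int(raw)) / 1e9
```

Because the timestamp travels with the propagated context, a consumer several queues downstream can compute total elapsed time without looking up the root span. One caveat of any approach like this is that it compares wall clocks on different hosts, so the result is only as accurate as their clock synchronisation.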

The adoption of "wide events", which involves attaching as much metadata as possible to spans, unlocked discovery-based debugging. By including created, sent, and received timestamps along with FIFO subgroup IDs, the team could query attributes in real time to discover hidden waste. In one instance, they identified a long-standing bottleneck caused by an obsolete piece of code that was easily removed. The Gearset engineers also highlighted the role of the OpenTelemetry Collector, which they use to automatically enrich traces with Kubernetes metadata and scrub sensitive data before it reaches their query engine.
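As a rough sketch of what a wide event for a queue message might carry, and of the kind of ad hoc question it makes answerable, consider the following. The attribute names and both helper functions are hypothetical, and the in-memory grouping merely stands in for a query that would normally run in the tracing backend.

```python
from collections import defaultdict
from statistics import mean


def wide_event(created_at: float, sent_at: float, received_at: float, group_id=None) -> dict:
    """Build span attributes holding the raw timestamps plus derived durations (seconds)."""
    attributes = {
        "message.created_at": created_at,
        "message.sent_at": sent_at,
        "message.received_at": received_at,
        "message.time_to_send_s": sent_at - created_at,
        "message.time_in_queue_s": received_at - sent_at,
    }
    if group_id is not None:
        attributes["message.group_id"] = group_id
    return attributes


def average_by(events, group_key: str, value_key: str):
    """Group events by one attribute and rank the groups by the mean of another."""
    buckets = defaultdict(list)
    for event in events:
        if group_key in event and value_key in event:
            buckets[event[group_key]].append(event[value_key])
    ranked = ((key, mean(values)) for key, values in buckets.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Calling `average_by(events, "message.group_id", "message.time_in_queue_s")` ranks FIFO groups by average time spent in the queue. That question was not anticipated when the attributes were recorded, which is the point of attaching them up front.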
While the technical implementation was significant, the speakers emphasised that cultural change was the most challenging aspect. They recommended engaging with teams on their own terms and proving the value of tracing by resolving real-world incidents, rather than promoting the technology as a universal cure for all operational problems. When engineers see the tangible benefits of discovery-based debugging, and of techniques such as using trace links to split "mega traces" into navigable sub-traces, they naturally begin to enrich their own spans, turning observability from a top-down mandate into a self-reinforcing practice.