
Yuri Shkuro on Tracing Distributed Systems Using Jaeger

Aug 28, 2019

Podcast with

Yuri Shkuro

Wesley Reisz

The three pillars of observability are logs, metrics, and tracing. Most teams are able to handle logs and metrics, while proper tracing can still be a challenge. On this podcast, we talk with Yuri Shkuro, the creator of Jaeger, author of the book Mastering Distributed Tracing, and a software engineer at Uber, about how the Jaeger tracing backend implements the OpenTracing API to handle distributed tracing.

Key Takeaways

Jaeger is an open-source tracing backend, developed at Uber. It also has a collection of libraries that implement the OpenTracing API.
At a high level, Jaeger is very similar to Zipkin, but Jaeger has features not available in Zipkin, including adaptive sampling and advanced visualization tools in the UI.
Tracing is less expensive than logging because data is sampled. It also gives you a complete view of the system. You can see a macro view of the transaction, and how it interacted with dozens of microservices, while still being able to drill down into the details of one service.
If you have only a handful of services, you can probably get away with logging and metrics, but once the complexity increases to dozens, hundreds, or thousands of microservices, you must have tracing.
Tracing does not work with a black box approach to the application. You can't simply use a service mesh then add a tracing framework. You need correlation between a single request and all the subsequent requests that it generates. A service mesh still relies on the underlying components handling that correlation.


Show Notes

01:06 OpenTracing was created as a standard API for people to instrument their code, with a very specific focus on observability. It does not deal with what you do with that data.
01:30 Jaeger is an open-source tracing backend, developed at Uber. It also has a collection of libraries that implement the OpenTracing API.
02:08 Jaeger has been an incubating project with CNCF since 2017, and hopes to graduate soon.
02:21 Shkuro is an invited expert in the W3C Distributed Tracing Working Group. With representatives from several vendors, the group tries to establish standard formats for tracing. The most important format is how to encode trace context within the request. For example, if you make a call to a Google API, how do you make sure that Google can understand what Jaeger's trace context looks like and continue the trace?
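To make "encoding trace context within the request" concrete, here is a minimal sketch of the `traceparent` HTTP header defined by the working group's Trace Context specification. The header layout follows the specification; the type and function names (`TraceContext`, `format_traceparent`, `parse_traceparent`, `new_child`) are illustrative assumptions, not part of Jaeger or any tracing library, and the sketch handles version 00 only.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Layout of the W3C Trace Context `traceparent` header (version 00):
#   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str   # identifies the whole distributed trace
    parent_id: str  # identifies the span that made the call
    sampled: bool   # lowest bit of trace-flags


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a context into the header value sent with a request."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a header value; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 1))


def new_child(ctx: TraceContext) -> TraceContext:
    """Keep the trace ID and mint a fresh span ID for an outgoing call."""
    return ctx._replace(parent_id=secrets.token_hex(8))
```

A receiving service that understands this format can parse the header, keep the same trace ID, and attach its own spans, which is what lets a trace continue across vendor boundaries.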
03:03 Back in May 2019, OpenCensus and OpenTracing merged to create OpenTelemetry. Jaeger sees these as complementary technologies, because they only specify how to produce the tracing data and get it out of your system. Jaeger can receive that data and start processing it. OpenTracing focused on tracing, and OpenTelemetry added metrics.
04:02 At a high level, Jaeger is very similar to Zipkin, and this is described more in a blog post Shkuro wrote about the creation of Jaeger at Uber. Jaeger has features not available in Zipkin, including adaptive sampling and advanced visualization tools in the UI.
04:48 There are two reasons why Jaeger was created, rather than adding to Zipkin. First, Uber had custom tracing collection and storage code since before Zipkin existed. The Zipkin UI was created to provide visualization of that custom data. Second, OpenTracing was started at about the same time, with some early involvement from Zipkin, but Zipkin eventually went with a custom implementation. Jaeger always wanted to support OpenTracing data models. It is still possible to send data from Zipkin to Jaeger.
07:12 Regarding observability, scale is an interesting question. It can mean either the throughput of the system, or the complexity of the system. When you're dealing with scale in terms of throughput, you can probably get away with just logging and metrics. When you have a microservices architecture with lots of components, tracing becomes critical to understand the communication network.
08:13 Metrics are useful for monitoring individual components of the system because they are aggregatable and provide very precise measurements. But metrics are less useful if you want to troubleshoot and understand what is happening in a system. Logs can be helpful, but logs at scale become very expensive.
08:41 Tracing is less expensive than logging because data is sampled. It also gives you a complete view of the system. You can see a macro view of the transaction, and how it interacted with dozens of microservices, while still being able to drill down into the details of one service.
09:15 Martin Fowler wrote a blog post that said, "You must be this tall to use microservices." If you have only a handful of services, you can probably get away with logging and metrics, but once the complexity increases to dozens, hundreds, or thousands of microservices, you must have tracing.
10:03 Tracing is a pretty young skill, with not many people using it. One common mistake people make is thinking that implementing any one of the three pillars of observability means they've solved the observability problem. Even at Uber, when Jaeger was being developed, the initial assumption was that tracing would help improve latency problems, but that ended up not being an important use case. Instead, tracing was useful for understanding what happens with a request and identifying the location of a problem within a massive architecture.
11:42 With OpenTracing, there's a pretty standard way for instrumenting your code, so there aren't too many questions you need to ask when getting started. Once you have instrumentation, then you are able to use tracing for many different use cases, based on the needs of the organization.
13:00 Metrics, logging and tracing each have their appropriate uses. If you want to build alerts, you should use metrics, and not traces, which are sampled and averaged. At Uber, we have an independent, highly-scalable backend for metrics, M3DB.
13:55 As soon as an alert fires and you need to look into why something went wrong, then metrics won't help you, and you'll need either logs or traces. Because you can't afford to log everything on every microservice, sampling will always be required. In general, the sampling of tracing is better than any sampling of logs.
14:45 Logging can be useful for certain things, like troubleshooting the startup of an application, where there is no context, so there is no trace.
15:30 The biggest challenge in rolling out tracing is getting instrumentation into the correct places in your application. This is more organizational than technical. There's almost a conflict of interest. If you are creating a new service, you care about metrics and logging to understand your service. Tracing helps understand the system as a whole, but the owner of one service doesn't get as much value.
16:47 Rather than having services at Uber depend on open-source Jaeger, they build internal wrappers for the Jaeger client which provide customization specific to Uber. This helps standardize the practices across the software development organization.
17:37 The ongoing challenge, at Uber and any large organization, is implementing tracing everywhere.
18:36 Tracing does not work with a black box approach to the application. You can't simply use a service mesh then add a tracing framework. You need correlation between a single request and all the subsequent requests that it generates. A service mesh still relies on the underlying components handling that correlation.
19:15 Ironically, if you try to instrument any application, the propagation of the context is the hardest part of the process. OpenTracing and OpenTelemetry help by providing a mechanism and framework to include the context from ingress to egress of requests within your system. Service meshes aren't able to provide this.
20:21 Context propagation means carrying a correlation ID with a request to all the subsequent microservice requests. Implementing it is heavily dependent on the programming language you're using. For example, in Go, the Context object can be used, but nothing similar exists natively for Node.js.
22:23 Percentiles are one type of aggregation you can do with tracing.
22:36 Jaeger generates several billion traces per day, even with sampling rates of 1-in-a-thousand to 1-in-a-million. Obviously people can only look at a tiny fraction of all those traces. If you're not doing aggregations and data mining, then you're not getting value out of the data you're capturing.
23:29 If you look at one, single trace, it doesn't give you any context into the normal behavior of the system. Aggregation allows you to compare the single trace to the normal behavior of the system. Then you can decide whether the one trace is an outlier you can ignore, or if it's part of a common problem to investigate.
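The idea of judging one trace against aggregated behavior can be sketched with a percentile check. This is only an illustration under simple assumptions: trace durations are available as a list of milliseconds, and the functions `percentile` and `is_outlier` are hypothetical names, not part of Jaeger.

```python
from statistics import quantiles
from typing import Sequence


def percentile(durations_ms: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) of the observed durations.

    Needs at least two data points.
    """
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points that split the data
    # into 100 equally sized groups; index pct - 1 is the pct-th one.
    return quantiles(durations_ms, n=100, method="inclusive")[pct - 1]


def is_outlier(trace_ms: float, history_ms: Sequence[float], pct: int = 99) -> bool:
    """True if this trace is slower than the pct-th percentile of history."""
    return trace_ms > percentile(history_ms, pct)
```

With a history of durations for the same endpoint, `is_outlier(trace_ms, history)` says whether the trace in front of you is slower than 99% of comparable requests. Whether an outlier can be ignored or points to a common problem still depends on how often such traces occur.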
24:03 This is the future for tracing, and not many tools do this today.
24:18 The challenge with outages is they typically come from your metrics showing some problem. If you see a problem with high-level business metrics, like riders in New York cannot take a trip, how do you translate that to a place in the architecture where the problem can be identified? At Uber, a custom tool helps compare individual traces to the aggregated average. Shkuro presented this tool at QCon New York. The visualization is open source within Jaeger, but the specific aggregation tool is currently proprietary at Uber.
25:18 Tracing may not actually solve the problem, but it can save 30 minutes of having to find where to start.
26:02 Expedia's Haystack and Apache SkyWalking are application performance monitoring (APM) tools that can provide certain types of aggregations, but they're focused on time series data, not tracing data. Shkuro has not seen any tools that can alert when the shape of the call graph has changed from yesterday.
27:05 Fundamentally, tracing serverless is no different than tracing microservices, but there are some practical implementation details that are different. The typical recommendation is to run the Jaeger agent on your microservice host, either as a sidecar or within your host agent. In serverless, this is not possible, so you need another method to talk to the agent.
28:32 In Jaeger, a new visualization tool called deep dependency graph, or transitive dependency graph, is coming soon. Historically, the visualization showed only dependencies one hop away. Transitive dependency graphs allow a much more thorough investigation of the architecture and understanding of the problem space.
29:39 Shkuro is also trying to open source the adaptive sampling feature, which has been in use at Uber. The backend constantly monitors how many traces each service generates, and then adjusts the sampling policy. For example, you may have an API gateway with one endpoint called once per minute, and another called a million times per second. If you pick one sample rate, you won't have reasonable data for both endpoints.
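The feedback loop described above can be sketched as follows. This is not Jaeger's or Uber's algorithm, only a minimal illustration under stated assumptions: each endpoint has its own sampling probability, the backend periodically reports how many sampled traces per second the endpoint produced, and the goal is a fixed target rate per endpoint. The class `AdaptiveSampler`, its fields, and the endpoint names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AdaptiveSampler:
    target_traces_per_sec: float = 1.0  # assumed goal for every endpoint
    min_probability: float = 1e-7
    max_probability: float = 1.0
    probabilities: Dict[str, float] = field(default_factory=dict)

    def probability(self, endpoint: str) -> float:
        """Current sampling probability; unseen endpoints start at the maximum."""
        return self.probabilities.setdefault(endpoint, self.max_probability)

    def update(self, endpoint: str, observed_traces_per_sec: float) -> float:
        """Scale the probability so the observed trace rate moves to the target."""
        current = self.probability(endpoint)
        if observed_traces_per_sec <= 0:
            # Nothing was sampled in this interval: sample as much as allowed.
            new = self.max_probability
        else:
            new = current * self.target_traces_per_sec / observed_traces_per_sec
        # Keep the result inside the configured bounds.
        new = min(self.max_probability, max(self.min_probability, new))
        self.probabilities[endpoint] = new
        return new
```

With a target of one trace per second, an endpoint that yields one trace per minute stays at probability 1.0, while an endpoint that yields a million traces per second at probability 1.0 is scaled down to about one in a million. Each endpoint ends up with usable data without one global sample rate.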
30:45 Shkuro's book, Mastering Distributed Tracing, began when he started doing tracing around 2015, and there was not much information on how to do distributed tracing. The book is based on his experience with technical and organizational issues around distributed tracing.
