Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Presentations Conquering Microservices Complexity @Uber with Distributed Tracing

Conquering Microservices Complexity @Uber with Distributed Tracing



Yuri Shkuro presents a methodology that uses data mining to learn the typical behavior of the system from massive amounts of distributed traces, compares it with pathological behavior during outages, and uses complexity reduction and intuitive visualizations to guide the user towards actionable insights about the root cause of the outages.


Yuri Shkuro is a software engineer at Uber Technologies, working on distributed tracing, observability, reliability, and performance problems. He is the author of the book "Mastering Distributed Tracing"; the creator of Jaeger, Uber's open source distributed tracing system (a CNCF project); and co-founder of the OpenTracing and OpenTelemetry CNCF projects.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Shkuro: This is a talk about distributed tracing microservices, and about complexity of those things. An agenda - I will introduce briefly what distributed tracing is and why we're using it. Then I will talk about different ways of using tracing data that we found useful at Uber. Basically there are four sections that I will be presenting, and then we have time for Q&A.

Just to introduce myself, Suzanne mentioned I work at Uber in observability team, been there about four years. I primarily work on distributed tracing and our tracing system, Jaeger. It's open-source, we donated that to Cloud Native Computing Foundation a couple of years ago.

I also work with two other CNCF projects OpenTracing and OpenTelemetry, which are the sort of the instrumentation side of OpenTracing and of tracing and metrics. I'll touch on those briefly. Also, as Suzanne mentioned, I published a book recently, I had about four years of experience doing the tracing so I decided to write about it. It covers a lot of tracing history, current state of the industry, a lot of different standards and it's also a very practical guide. You can just dive in and start instrumenting the applications and plus a lot of advice from my own experiences. If you're interested, you can find it from my website.

Before I start, I just want to take a quick poll. I'm assuming everyone is using microservices if you are in this room, but how many people actually know distributed tracing is? I think it's roughly 40% maybe. How many of you actually use distributed tracing in production? It's a good amount, about 10%.

Why Distributed Tracing

That's worth speaking about why we're doing distributed tracing. It's not a new thing, the papers on X-Trace and Dapper at Google appeared more than 10 years ago but tracing is gaining in popularity recently. In part it's because I think of the microservices movements as well. I want to cover a few trends in the industry that lead to why tracing becomes important now.

One is not really a trend, it's a done deal, but as internet companies started having to serve millions of users, then basically, we had to start building very complex distributed systems, scale from one computer to 200,000. That basically created a lot of complexity already. The other trend is, as people were developing monoliths, you could scale the monolith to 1,000 hosts, that's fine, but at some point, as your organization grows you reach a point where the monolith creates an organizational challenge.

You can't deploy features fast enough because everything is intertwined, and so people start breaking up monoliths into microservices, which allows a lot of different benefits, but also microservices creates enormous amount of challenges for operating them and understanding them, which is what I'm going to be talking about.

A third trend is an evolution of our program and paradigms. Traditionally, maybe in early 2000, if you had the Java server it would run a few threads, and each thread would execute one request at a time. The logic of that was pretty simple and if you need to debug that system, it was fairly straightforward. Then we've introduced new types of concurrency. We introduced asynchronous programming where a single request like think about futures or promises or event-driven frameworks where a single request can start with one thread and get paused, get put on another thread. If you try to debug that, that becomes much harder because, where do you look for it? How do you look for a log, for example, from that request?

Finally, when we blew that up into microservices world, we created the distributed concurrency where a single request not only jumps across threads but across threads in different processes, different nodes. Understanding these systems and debugging them became really hard. Not that it was easy before with distributed systems.

The culminations of these is sort of microservices architecture like the one at Uber where we have over 3,000 services, and they're all in multiple programming languages. This diagram was rendered by Jaeger, is a portion of that architecture. I think nodes represent the services and the size of them is roughly proportional to the traffic, and edge is a connection between those services, the RPC calls that they make.

What's interesting is that if you click a button at Uber, a transaction can hit that architecture which looks like this. It can touch hundreds of nodes, multiple types of surfaces from different business domains. What do we do when something goes wrong in this architecture?

Another thing that microservices architecture introduces is an increase in the number of failure domains because we know that, if you read any distributed systems paper, they would mention communications are not reliable. This is the basic premise of distributed systems design. Well, microservices - let's bet down on this unreliable communications and use it for everything inside our system. This number of communications increases exponentially in the microservices architecture and that increases the number of failure modes.

In order to operate such complex systems, it's almost impossible without very strong and good observability systems into what's going on. Specifically, we can't understand what's going on in the system if we're looking at just one single node of it. From the transaction before, if it touches 50 nodes, I need to understand what happened along the path of that whole transaction before it can reason about what happened in the request. We need to monitor the distributed transactions.

Now I want to touch on what is observability. Specifically, how is it different from monitoring? Sometimes people think that observability is a fancy name for monitoring, but I think of them very differently. To me, if you think about the word monitoring from the dictionary, it says you decided what to measure. You decided that it's important for you to measure that and you monitor that. That is not the thing that humans should be doing in production because there are so many things that are important and you need to monitor so you just go and automate that. You put alerts on those things. To me, monitoring those things that you knew a priori are going to be important, and that you need to automate completely. The human should not be involved in monitoring. The human is involved once the alert fires, then you have to involve human to debug the problem.

Observability is really an ability to ask questions about your system and those are the questions that you don't know upfront. That's the main difference for monitoring. Distributed tracing is in a unique position across all observability tools, it can answer questions that no other systems can answer. I'm not going to read all this but just to give a taste, if you have a long request, like "Which service created the fold in that request?", how do you find out? What if you got an alert from your business metrics saying, users can't take trips on Uber?

That's what we should monitor, that's the metric that's important to us, but how do you go from that to some service X that's actually misbehaving or having a capacity issue? That's what we want to be able to answer. These are types of questions. The last one question is kind of funny, but it's one of the most important one because if you imagine the architecture like 3,000 microservices at Uber or any organization which has a sizable architecture, I can guarantee you there's not a single person that knows what those services are doing. It's impossible to keep everything in your head.

If you are responding to an alert about the business metrics falling down, your first goal is to basically isolate where the problem is, and find the person who can actually solve that problem. You probably are not going to know what the service X is actually about and what it's doing. Distributed tracing is in a unique position to point you quickly in that direction.

I couldn't find, unfortunately for the slide, but there was a recent tweet from Charity Majors, CEO of Honeycomb. She said that one of the hardest things in observability is actually not finding why something is failing but where it's failing, and that actually rings very true to me because I've seen that at Uber with the first responders, where they just spent half an hour trying to narrow down the problem to a service. Then once it's there, figuring out the problem becomes much easier. Distributed tracing is very well-positioned to answer these questions and accelerate root cause analysis which is an important part in outage mitigation.

Let me start just very quickly an overview for distributed tracing and how it works. The idea is exceptionally simple, you have a request coming in, in architecture, we assign the unique ID, and we make sure that we pass that unique ID through all the calls that that service makes and all the downstream services make. If you're using messaging, it works the same way.

While we're doing that, we have a background process in each service which collects the telemetry data tagged with those unique IDs, and then sends it out to the tracing system, which allows us to reconstruct basically a workflow, the lifecycle of that request and it collects time and data, causality data, who called whom, things like that, and all kinds of other attributes about the request, like what's the URL for a given endpoint.

On the right side, you can see that there's a very typical Gantt chart view of the trace where it's a collection of spans. I will probably be using term "span”, so I want to bring it up upfront. Span is just like one unit of work, usually like a service endpoint, but could be something smaller within that as well.

Trace as a Narrative

What trace is, it's a narrative about a request. It's what happened to that request in a distributed architecture. Let's take a look at the real example of a trace. It's not quite real, it's a toy example still. It's Gantt chart, and we can see that on the left side, there is a sort of a hierarchy of parent-child relationships between services and operations, who called whom?

Then, we have a timeline. With the time measures above the timeline there is a mini-map of that whole trace because sometimes the full trace is very many pages long, so if you want to get the grasp of what it looks overall, then that mini-map shows you the total structure of it. It's specific to Jaeger, but some other tracing systems have that visualization as well.

Then some of the things that tracing makes immediately apparent is that here's the blocking transaction. This operation, MySQL query, blocks all the up calls above it. It's nothing can proceed until these things. We can very clearly see that just from the layout of the Gantt chart.

Then on the other hand, on the right side, we have a series of steps which look like a staircase that indicates that there's a number of operations happening, but they all seem to be happening sequentially. If you are troubleshooting a specific performance issue, then that clearly points you to places where you can potentially improve things. Try to reduce the SQL or try to paralyze these things why are they doing it all in serial. There might be good reason why they doing it in serial, but in this case, if you knew that application, that there isn't, that could be paralyzed.

Finally, there's that tiny thing, this is like a red exclamation point. When there is an error, instrumentation captures that and tags it with a span and so we can display it. It's not very visible, but that there sort of bubbles up if you start collapsing the hierarchy on the left, so sometimes it's actually very easy to navigate from the very top one. You will get an error, you expand it out, the error actually move to one of the five services or something and you can drill down this way.

Finally, when you click on one span, you can get into details about that span and those details are usually very rich. You obviously have some timing information when it started but you also have two sets of metadata called tags and logs. Tags are more like some key-value pairs attached to span. One of them, for example, is this kill query, which could be very useful when you're investigating a problem that it could be very detailed. You can include all the parameters of that query. There are very few tools that give you that kind of information.

The logs - you can think of timed events within the lifecycle of a span, but they look like logs, just contextually limited to the span. I think an important distinction of the tracing as a tool, is that all information in a trace is very contextual. If that SQL query belongs just to this span, if you have another span, there is going to be another SQL query. Or the logs, they're also structured within the single span, so that's very different from the normal login tools where you just have a stream of things and then you need to somehow filter them and still, "What happened in this one single request?" Even if it's on a single thread, it's hard.

Tracing makes it a lot easier to navigate. I spoke to one of the users of Jaeger open source and they said they stopped doing logging altogether and they log only through the traces because that allows them much easier investigation of things rather than doing a normal login into ELK or something.

We can also trace asynchronous workflows with this tool. This example was purely RPC but if you have a system-based on Kafka - in fact, when I was writing my book I had the chapter about exactly that, so I decided to put it in this slide. There's this mock application Tracing Talk, which is online chat application where you can chat with someone and then you can also put a giphy and some title and it will pull an image, very standard thing.

The architecture of that application looked like this. You have a JavaScript frontend, an APS, API server which receives the messages and that's what represents the actual trace, like the transaction coming into the API service. Redis is used to store the messages, Kafka is used to communicate between all the components within one, and then making calls to the giphy as the external port. If you take the trace, that's what it looks like, very similar to the previous RPC trace.

On the left, we have the hierarchy of sort of operations happening in another, and then this span, it says storage service receive, like chat API send. That's the chat API application sending a message and that's recorded as a long span. I don't why it's long, but it's performance issues with the drivers. Then the other two spans, the tiny ones are the two other services receiving this message. Even though it's a Kafka, it's a messaging application, you can still use tracing and investigate and all the power of that tool that I described before is still there. You can drill down and do the interesting things.

To summarize this, so a single trace is very powerful, rich source of data to investigate the behavior of a given request. It tells the story of what happened for that request across all the microservices and it's contextual. It allows very deep drilldown into what exactly happened at every step, the more instrumentation you put in, the more details you will get. It's like a knob you have to control.

You can think of it as a distributed stack trace sometimes. This is one of the issues with microservices, as you split the monolith. In the monolith when the exception happens, you get the full stack trace so you know where the things happened. Whereas in distributed system, you just get one single piece of it, but trace gives you the stack traceback essentially across microservices.

However, there are some issues with using that as to solve all the troubleshooting problems. One of them is, it talks about a single transaction. How do you know that that transaction is actually representative of anything? Yes, if it was a specific problem that you were trying to investigate and that's the trace representing that transaction, then it's useful, but what if you're trying to analyze some performance issue? Let's say, users can't take rides everywhere, what happened? You drill down, you need to find out what causes this. If you pick one transaction, and it has some issue in it, that may not in any way correlate with the issue of the actual alert that you get.

Another problem which you couldn't see in the previous examples is that traces themselves can get exceptionally complex. I will show some statistics soon. This is a real example from Jaeger production at Uber. Driver goes online, that's a type of request and when the drivers start their app. We can see here that there's a number of 30 services involved in that request, 200 spans, so it's about 100 or more RPC calls in that request. That's a pretty big thing. I couldn't display it, so I put a mini-map at the bottom, you can see that's pretty complex structure already. If you put it as a Gantt chart on the screen, then you have pages and pages to scroll if you want to try to actually reason about this thing. That's not the largest trace at all.

Here I have a heat map in Jaeger production, it trace sizes at Uber. We can see that most of the time we're in the small ranges, but we pretty often get traces which are worth about 60,000, 50,000 spans in a trace. Occasionally we get 100 or 500, sometimes we get this one with 10 million spans in it. What do you do with that? How do you even visualize this thing or try to troubleshoot that?

There are various techniques of doing that. One is more number-based, and actually, Facebook is doing that. They have a very advanced infrastructure because they've been doing tracing for 10 years, where you can run a feature extraction on the two populations of traces and then they compare them, like the performance characteristics, saying, on average this call was here, but in this population this same call is taking twice as long, across many traces.

That removes the anomalies, it’s a very good one. One issue with that approach - maybe not issue but going back to the observability versus monitoring - is that when you start talking numbers, that means that you are already looking for answers to questions you knew to ask upfront. You decided that that number is going to be important. I'm approaching it from what if I didn't know what to ask? If I knew what to ask, maybe I even knew what to fix in the first place. Why should I have the monitoring for this one? I really want to be able to ask questions that I didn't think I needed to ask upfront.

That led us into a different direction with our tooling at Uber. We decided to explore various visualization techniques that allow us to tackle complexity of these traces in a different way. One of them is taking a different representation of the trace. Instead of a Gantt chart, it's just the graph of RPC calls. It's vertically time-ordered so you can read it again as a narrative, these things happened in that order.

Another thing is that a lot of times services make repeated calls to each other. If you call in Redis – in the previous example, the toy example where it was like a staircase, there are 10 calls to Redis from one service. In this graph, there's going to be just one single edge, we collapsed the repeated edges because while they're important if you're actually drilling down, they're still there. It's just visualization, it's not like an aggregation.

In order to comprehend them, they're just noise, they don't add too much to your comprehension. This are basic techniques that may not be exceptionally useful just on this graph, but we will keep using them for more visualizations that are helpful that I'll show later.

Another thing that we also tried is the color-coding here. The color-coding is, I think, by service, which again, is not super useful. Although you can even imagine other color-coding, maybe if you color it by the type of technology using by the service, so you can say, "These are my Java services, these are my database services," things like that. That might be a lot more useful utilization than just a random by series color-coding.

The other type of color-coding is a heat map or latencies. Here we essentially just use the red heat map to represent the longest duration of a span, and that gives you immediately a pass to understand that if you're looking for latency in this trace, it's obvious where it is. The top one is obviously the longest one because it was the experience but if you follow in and down, this is the section where bad things happen. You see that and that's your pass to understand this.

However, it's not good enough. Imagine you have two software, two versions of a software deployed and the latest version has a bug, so there's an error or something. How do you understand that? Do you just go dump the source code of that latest version and start reading it front to back? No, you look at the diffs, what happened between those two versions. That's the primary way to understand what's changed in the code.

Trace vs. Trace

Diffs are actually are the traditional way of looking at performance profiles and comparing them. If you are investigating a memory leak, how do you do it typically? You take two snapshots and you compare them. That was natural for us to explore that we can do that with traces as well.

This leds us to a tool which does comparison of two traces. You can think of it as, you pick a trace which is successful at the given endpoint and then a trace which failed. Then you compare them and you try to understand.

Looking at one of them is very hard. Like we've said, there can be millions of spans in it, but comparison reduces that complexity a lot. Here is an example that I'm going to walk through. The left trace has 500 spans, the right one has 300 spans, so it's already twice as large as that scary trace that I showed before for a driver going online.

The same principles of graph constructions apply here. It's like a chronologically sorted top-down and repeated edges are removed. Then, we do the structural comparison of this here, and we color-code that structural comparison. Number one, we see a bunch of gray nodes, which means that they are present in both traces on left and right, so there's no change in the behavior.

Then, at the bottom, we see there's a whole section in red and the color-coding is inspired by the code diffs tool so that everyone is familiar with it. This means it's missing from the right, it's present on the left but missing on the right, that's the red color.

Then, at the top right we have slightly lighter colors. Because the edges are collapsed, there might be differences in number of those edges. Those colors’ structure is very similar, but there are some statistical differences in how many calls happen between these things.

Finally, the important part of the tool was that you look at this and you immediately see that there's something wrong going on at the bottom of it because the whole swath of calls didn't happen. If you want to investigate the problem, that's probably where you should look for it.

This is actually a real production trace from Uber Eats application, so the order couldn't complete. I think there was some problem with a credit card so transaction rolled back, and therefore the whole section below didn't happen in the failed trace on the right, so that's the explanation.

One other thing is that because this is just visualization and not an aggregation, you can go back to the raw trace and investigate. We can go to the failed trace, and we drill down and we can see that these are the details, and this is an exception saying you have outstanding bonds. That's why the credit card check failed or something like that.

We could have gotten here by looking at that trace from the beginning, but if that trace has 10,000 spans then it would have been pretty hard. Whereas with the diff tool, first we narrowed down the place where the problem resides, and then we have a deep link from that going into the spans details that you can drill down further.

In other production story, last year or maybe two years ago, we're doing the migration from one data center to another, and the data centers were nearby. What I mean by nearby is, in America distance is measured in units of time. "How far are you?" "I'm 10 minutes away." Same here, the data centers were few milliseconds away. When we migrated a whole bunch of services to the second data center, we were expecting latency to not be affected much. Within a second it's fine but instead, we saw that the latency almost doubled. We were trying to investigate why that is. We tried that difference tool that I just showed before and that was the outcome. There's very similar structure, a lot of gray nodes, not a lot of changes. The 2.7 seconds and the right one is 2-4 seconds, so that's almost 50% increase in latency rate.

When we look at the differences here, we see the new section of calls being made, which is kind of strange the moving services to data center, but we still don't know from this view whether these green nodes actually responsible for this massive increase in latency. Another option may be this whole increase is actually spread out across multiple nodes because this view just doesn't show that, it's not built for a latency investigation.

As a result, we said, "Let's use the heat maps instead," and that's the color-coding we already saw. That became a lot more useful in this case. We saw that before but this is the difference now between two traces. We have a section where there are no changes, there are similar durations between thr spans. Then we have a section with white nodes, that's saying we can't compare it because that was the section that was new in trace B, so we don't know what's going on there. But even if we ignore that, we can see that there is already a very clear path of where the latency was introduced.

Once we drill down into that service, we realize that when the migration happened - 3,000 microservices, you can't actually migrate all of them at once, it's a process - some of them weren't migrated properly, and so that particular one was experiencing latency because of that. It didn't understand that the configuration was messed up, so instead of calling locally within the data center it was calling across the region and introducing these huge delays. It was doing it many times so that was responsible for the issue.

You can obviously mouse over and see the details, it says what’s the percentage of latency increase. This one is the most responsible. This one is about 50% of total latency. Another type of heat map is the so-called self-time over span where you take all the time spent on the children, subtract it, and then keep that only as your indicator for latency. That highlights the different types of issues.

To summarize, how are approaches these different from just a single trace? One surfaces a lot less information about the trace, we condense that information, and that allows us to actually comprehend the complex structure of a trace much better, and emphasizing the difference is really the key part because when you investigate the performance that gives you a lot narrower surface to focus than if you look at the whole thing at once.

Finally, distinct comparison modes are very helpful because you can tune them to specific problem that you're investigating, and they're helpful more in that way. There are some challenges with this one still. What are traces A and B? One is successful, another is unsuccessful, but first of all, you have to pick the right baseline. What is a successful trace? Even if you pick one which had no errors, which looked normal, from statistical distribution, it may still be an anomaly - something wrong happened here, on average, it doesn't happen. You're comparing, necessarily and not necessarily the best things. What you really want is to comparea trace versus some sort of average behavior of the system. That gives you a much better understanding of how that failing trace is actually different from the normal behavior. For that, there's a next tool that I will present right now.

Traces vs Trace

We have millions of traces in the database so creating an aggregate of the behavior is not hard. By the way, up until now, I think all of that was open source in Jaeger. This new tool is internal for Uber for now, we haven't spent time on open sourcing.

The tool is specifically designed for root cause analysis. It uses slightly different color-coding and some other improvements, but overall, it's still very similar. We have a top-level, iteration, and endpoint. It's actually empty here, but you have a request and response that can contain a payload of the request, sometimes very useful to drill down to the top outcome, you can understand what's going on. Also, you can always go back to the raw data representing that trace if you want to drill down more on the details.

Finally, missing nodes are represented in black. Our attention is immediately visually drawn to this place saying, this is the area of the trace that something bad happened. The color-coding is different, the white is "no change" and red is reserved for nodes with errors specifically happening because that's also sometimes very useful to look at. This is a node that shows there's an error, and then there's actually two errors. When you click on that you get the details on the right that describes what type of error was actually happening there.

This tool has been exceptionally helpful in the past few months at Uber. Once we rolled it out, when the first responders go from the business metrics failing to investigating where the issues occur, they had anecdotes were instead of 30 minutes now they spent 2 minutes to solve that thing. To solve that biggest problem was, where the things are failing, not why. In some cases it tells you why, in some cases it doesn't, but at least where is the very time-consuming thing. Very often, once you know where you can already start to think of how to remediate it without even figuring out why it's failing because you can say, "Let's just bypass this service. Let's reroute around it." There are lots of opportunities at that point once you know where things are wrong.

What is the main difference of this approach? Mostly, it's much broader context. You don't compare one trace with another where the other one could be an anomaly and not represented, if you're really comparing with the population, which is guaranteed to sort of smooth out outages. The other one is that that tool was specifically built for root cause analysis. We're not trying to make it a very general-purpose investigative tool or even performance. There's no performance analysis in that tool, outage mitigation is the first thing.

Tackling Data Complexity

My last section is, complexity of microservice is not just about this sort of communications of microservices, it's also about data complexity. Imagine how much data actually goes through that architecture and tracking that data is also important, and trying to understand where it goes.

Uber obviously is a data company as such. We compute ETAs, we compute roots, we use a lot of data all the time to actually operate the system. Our typical sort of flow of data is that there is a lot of exchanges happening in the RPC space, then there is some asynchronous processing via Kafka messaging. Then, in the end, some of that data ends up in the data lakes with HDFS.

Data Lineage

Understanding the quality of the data can be very hard. If you have a Hive table and suddenly it looks completely wrong, and you're trying to understand where this data came from, it came from those 500 transformations upstream. Now you need to be able to trace that somehow. Tracing is not necessarily always able to solve that, but it's part of the solution. The solution is that we will build a tool internally. I'm sure it's too small, but it just conceptually. On the left, there's the RPCs space, something happened and then the database transformed, then in the middle there is some Kafka stream, and then on the right side, there's a series of Hive tables like Kafka ends up in one table as is and then it gets transformed, potentially adjoined.

This tool actually looks at individual data elements. In this case, our focus was the Kafka stream. How are the messages in this Kafka stream? Where do they come from, and where they end up? That's what the graph shows. You can refocus it to other things, you can also filter down to specific data elements. Let's say, I'm not interested in the whole trip, I'm interested in the start point of the trip, where that came from. You can drill down, the tracing part here is really the RPC space and some partially Kafka space, but the Hive transformations are done with static analysis of the relationship between tables.

Tracing sometimes has a problem with that space because of tracing works when you have a very well-defined transaction. When you start joining and aggregating data the notion of a transaction becomes very weak, because you're actually combining a lot of different data. Tracing naturally stops there, you can't really model that only just on the logical scale.

Just to close this quickly, all of this is only possible if we have very high-quality instrumentation in the systems and instrumentation is actually the hardest part of tracing. Getting it in there and collecting data is hard, building tools is the easy part. Today, all our software's very highly composable, we tend to use a lot of open-source frameworks in our application. If you have a microservice, you probably have some sort of framework that you use. You have some RPC framework to make calls downstream, some database drivers, some message drivers.

If we want to trace all of that, we need instrumentations and add all those components, and that instrumentation needs to work together. If they're open source, you need some sort of a standard to actually have them speak the same language because if they don't, we have traces which come into your service and die, and then the next race goes out of your service. That's not what we want. We want to continuity of the trace.

I want to just call out to two projects that are happening in industry today. One is in OpenTelemetry, a new project announced at CNCF. It's a next iteration of two previous projects, OpenTracing and OpenSensors, if you've heard of those. Those two projects decided to merge efforts and basically become one single standard rather than competing standards.

The other one is, in the W3C, there's a working group, which works on the data formats for tracing. That includes how you pass data between processes, so-called trace context, and in the future it will also include how you send traces out of process. Early work started but hasn't been done a lot of it.

To summarize, distributed tracing is definitely one of the most helpful tools for well, observability tools at least, for understanding and tackling the complexity introduced by microservices. Visualizations are an important part of that because the tools itself collect so much data that it's very hard to just look at that data in the raw format. Visualizations help our visual cortex to process that data much quicker and narrow down things that are important out of this massive data.

Finally, distributed tracing - I personally think we've just scratched the surface with that because we think that it's going to be like a core driver for automating a lot of our operations with microservices, because that gives us unparalleled insights about how our architecture operates in reality. We have engineers who design those systems and they think they know how the system works. The system behaves the way it wants to behave. We have so many services and all the interactions happening all the time, so tracing monitors that in real-time gives a very accurate picture of what happens today and not what you thought you designed in the system.

Questions and Answers

Participant 1: This OpenTelemetry combining OpenSensors and OpenTracing, which API it's going to choose?

Shkuro: It's a new API, it will provide backwards compatibility streams for both of those, so the existing instrumentation would work with OpenTelemetry.

Participant 2: Do you use some sampling strategies specifically in Uber or you trace every single request?

Shkuro: Yes, we do sample all the requests.

Participant 2: And what is the reasonable sampling configuration?

Shkuro: It's actually very varied because we have adaptive sampling which is sensitive to the type of service and the traffic that service generates. If you have low KPS service we'll sample probably 100% of it. If it's very high then it's going to decrease that.

Participant 2: What is operational complexity costs to manage all those traces, store this data?

Shkuro: There isn't much. It depends on how well you have infrastructure support in terms of storage because we had a bunch of features with Cassandra we had to manage it. Once we moved to manage Cassandra by another TM the operational challenge went away for us, almost. It depends on the maturity of the organization more than the tracing technology, by itself it’s not really that hard. We depend on Kafka and Cassandra, those are our main dependencies.

Participant 3: What's the maturity level of distributed tracing within Uber?

Shkuro: What I count as adoption is, how many services actually generate in traces. We're probably 70%, so it's pretty high but it's very hard to get to the long tail because we started tracing when there were already a lot of services. You better start when you're just designing the microservices architecture.

In my book, I have a whole chapter about how to do that, there's a lot of steps you want to take upfront to make sure that this is a sustainable process. Another aspect of maturity is how much usage the tracing gets and that I think we're still pretty new to that.

What's interesting is that tracing was originally created to solve performance problems. People think this is for a latency investigation, that was not the use cased at all at Uber ever. There are probably a few power users who use that for latency optimizations. I've seen examples of people doing that, but by far, that's not what Uber uses trace for. It's for understanding the complexity for root cause analysis. It's basically dealing with the microservices complexity, rather than, "I want to sort of start optimizing performance." I guess Uber is still a young company, we probably haven't reached the maturity where we actually need to drill down into performance issues that much.

Participant 4: In asynchronous call across the systems, how is the tracing metadata passed along? Is it part of the payload?

Shkuro: It can be, but internally at Uber I strongly advised against that. What we're doing is, we upgraded Kafka to 1.1 which supports records headers, and so the tracing context is passed in the headers so that you don't touch the payload, the business payload. Deploying it becomes much harder if you introduce it in all kinds of various business payloads. Metadata headers is the same principle as works for RPC calls and HTTP, etc. I would recommend sticking to that.

Participant 4: If you have different messaging system you would have to have custom logic to pass along this metadata?

Shkuro: No. There's no custom logic because people use standard drivers to write messages to Kafka, let's say. That driver has an instrumentation, a middleware in it which says, "I'm going to take trace context and package it into the record headers."


See more presentations with transcripts

Recorded at:

Sep 04, 2019