Transcript
Martin: Welcome to the Bare Knuckles Performance track, and to my talk about high-resolution performance telemetry at scale. First, a little bit about me. I'm Brian [Martin], I work at Twitter. I'm a Staff Site Reliability Engineer. I've been at Twitter for over five years now. I work on a team called IOP, which is Infrastructure Optimization and Performance. My background is in telemetry, performance tuning, benchmarking, that kind of stuff. I really like nerding out about making things go faster. I'm also heavily involved in our Open Source program at Twitter. I have two open-source projects currently, hopefully more in the future. I also work on our process of helping engineers actually publish open-source software at Twitter.
What are we going to talk about today? Obviously, high-resolution performance telemetry, but specifically we're going to talk about various sources of telemetry, sampling resolution, and the issues that come along with it. We're going to talk about how to achieve low cost and high resolution at the same time, and some of the practical benefits that we've seen at Twitter from having high-resolution telemetry.
Telemetry Sources
Let's begin by talking about various telemetry sources. What do we even mean by telemetry? Telemetry is the process of recording and transmitting the readings of an instrument. With systems performance telemetry, we look at instrumentation that helps us quantify the run-time characteristics of our workload, and the amount of resources it takes to execute. Essentially, we're trying to quantify what our systems are doing, and how well they are doing it. By understanding these aspects of performance, we're then able to optimize for increased performance and efficiency. If you can't measure it, you can't improve it.
What do we want to know when we're talking about systems performance? In Brendan Gregg's "Systems Performance" book, he talks about the USE method. The USE method is used to quickly identify resource bottlenecks or errors. USE focuses on the closely related concepts of utilization and saturation, and also errors. Utilization and saturation are two sides of the same coin. Utilization is about making the most of the resources our infrastructure has. We actually want to maximize this, within the constraint of saturation though. Once we have hit saturation, we have hit a bottleneck in the system, and our system is going to start to have degraded performance as load increases. We want to try and avoid that, because that leads to bad user experience. Errors are maybe somewhat self-explanatory. Things like packet drops, retransmits, and other errors can have a huge impact on distributed systems, so we want to be able to quantify those as well.
Beyond USE, we begin to look at things like efficiency. This is a measure of how well our workload is running on a given platform or resource. We can think of efficiency in a variety of ways. It might be the amount of QPS per dollar that we're getting. Or we might be trying to optimize a target like power efficiency, trying to get the most workload for the least amount of power expense. The other way to look at efficiency is how well we're utilizing a specific hardware resource. We might look at things like our CPU cache hit rate, our branch predictor performance, etc. Another really important aspect of systems performance, particularly in distributed microservice architectures, is latency, which is how long it takes to perform a given operation.
There are a bunch of traditional telemetry sources that are generally exposed through command-line tools and common monitoring agents. These help us get a high-level understanding of utilization, saturation, and errors. Things like top, the basic stuff that you would start to look at when you're looking at systems performance. Brendan actually has a really good rundown of Linux performance analysis in 60 seconds, or something like that. These are the types of things that you might use for that.
We're looking at things like CPU time, which can tell us both utilization and saturation, because we know that the CPU can only run for so many nanoseconds within each second, across the number of cores that you have. We also know that disks have a maximum bandwidth, and a maximum number of IOPS that they are able to perform. The network also has limits there. The network is where errors become obvious, because you have these nice protocol statistics, packets dropped, and all sorts of wonderful things. These are the traditional things that a lot of telemetry agents expose.
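As a rough illustration of where this traditional telemetry comes from, here is a minimal sketch in Rust that reads the aggregate CPU counters from /proc/stat twice, one second apart, and turns the delta into a utilization number. This is just the shape of the idea, not how any particular agent is implemented.

```rust
use std::{fs, thread, time::Duration};

// Read the aggregate CPU counters (in jiffies) from the first line of /proc/stat.
// The fields are: user nice system idle iowait irq softirq steal ...
fn cpu_jiffies() -> (u64, u64) {
    let stat = fs::read_to_string("/proc/stat").expect("read /proc/stat");
    let fields: Vec<u64> = stat
        .lines()
        .next()
        .expect("cpu summary line")
        .split_whitespace()
        .skip(1) // skip the "cpu" label
        .map(|v| v.parse().unwrap_or(0))
        .collect();
    let idle = fields[3] + fields.get(4).copied().unwrap_or(0); // idle + iowait
    let total: u64 = fields.iter().sum();
    (total, idle)
}

fn main() {
    let (total_a, idle_a) = cpu_jiffies();
    thread::sleep(Duration::from_secs(1));
    let (total_b, idle_b) = cpu_jiffies();

    // Utilization over the interval: everything that was not idle.
    let delta_total = total_b - total_a;
    let delta_idle = idle_b - idle_a;
    let utilization = (delta_total - delta_idle) as f64 / delta_total as f64;
    println!("CPU utilization over the last second: {:.1}%", 100.0 * utilization);
}
```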
The next evolution is starting to understand more about the actual hardware behaviors. We can do this by using performance counters, which allow us to instrument the CPU and some of the software behaviors. Performance events help us measure really granular things, like we can actually count the number of cycles that the CPU has run. We can count the number of instructions retired. We can count the number of cache accesses, at all these different cache levels within the CPU.
This telemetry is more typically exposed by profiling tools, things like Intel VTune and the Linux tool perf. However, some telemetry agents actually do start to expose metrics from these performance counters. They become really interesting if we're starting to look at how the workload is running on the hardware. You can start to identify areas where tuning might make a difference. You might realize that you have a bunch of cache misses, and that by changing your code a little bit you can start to actually improve your code's performance dramatically.
Performance events for efficiency. We can start to look at things like power. We can look at cycles per instruction, which is how many clock cycles it actually takes for each assembly instruction to execute. CPI is actually a pretty common metric for looking at how well your workload is running on a CPU. You can look at, as I said, cache hit rates. You can also look at how well the branch predictor is performing in the CPU. Modern CPUs are crazy complicated, and they do all sorts of really interesting stuff behind the scenes to make your code run fast. On modern superscalar processors, we should actually expect less than one cycle per instruction, because they are able to retire multiple instructions in parallel, even on the same core, but in reality that's rare to see on anything but a very compute-heavy workload. Anytime you're accessing things like memory, it's more likely that you're going to see multiple cycles per instruction, as the CPU winds up essentially waiting for that data to come in.
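These derived metrics are just ratios of counter deltas between two samples. Here is a hedged sketch of that arithmetic in Rust; the CounterSnapshot struct and its fields are hypothetical, and the raw values would come from perf_event_open or a wrapper library rather than the made-up numbers in main.

```rust
// Hypothetical snapshot of a few hardware counters read at one instant,
// e.g. via perf_event_open() or a wrapper crate. The struct and field names
// are illustrative, not any particular library's API.
struct CounterSnapshot {
    cycles: u64,
    instructions: u64,
    cache_references: u64,
    cache_misses: u64,
    branches: u64,
    branch_misses: u64,
}

// Derived efficiency metrics over the interval between two snapshots:
// cycles per instruction, cache hit rate, and branch miss rate.
fn efficiency(a: &CounterSnapshot, b: &CounterSnapshot) -> (f64, f64, f64) {
    let cycles = (b.cycles - a.cycles) as f64;
    let instructions = (b.instructions - a.instructions) as f64;
    let cpi = cycles / instructions; // > 1.0 usually means stalls (memory, branches)
    let cache_hit_rate = 1.0
        - (b.cache_misses - a.cache_misses) as f64
            / (b.cache_references - a.cache_references) as f64;
    let branch_miss_rate =
        (b.branch_misses - a.branch_misses) as f64 / (b.branches - a.branches) as f64;
    (cpi, cache_hit_rate, branch_miss_rate)
}

fn main() {
    // Illustrative numbers only: pretend these were read one second apart.
    let a = CounterSnapshot {
        cycles: 0, instructions: 0, cache_references: 0,
        cache_misses: 0, branches: 0, branch_misses: 0,
    };
    let b = CounterSnapshot {
        cycles: 3_000_000_000, instructions: 2_400_000_000, cache_references: 90_000_000,
        cache_misses: 4_500_000, branches: 500_000_000, branch_misses: 5_000_000,
    };
    let (cpi, cache_hit, branch_miss) = efficiency(&a, &b);
    println!("CPI {:.2}, cache hit rate {:.3}, branch miss rate {:.4}", cpi, cache_hit, branch_miss);
}
```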
Then, we have eBPF, which is really cool and gives us superpowers. It's arguably one of the most powerful tools that we have for understanding systems performance. Really, I think a lot of us are just starting to tap into it to get better telemetry. eBPF is the extended Berkeley Packet Filter. As you might guess, it evolved from tooling to filter packets, but it's turned into a very powerful tracing tool. It gives us the ability to trace things that are occurring both in kernel space and userspace, and actually executes custom code in the kernel to produce summaries, which we can then pull into userspace. While traditional sources can measure the total number of packets transferred, or total bytes, with eBPF we can actually start to do things like get histograms of our packet size distribution. That's really exciting. Similarly, we are able to get things like block I/O size distributions, block device latencies, all sorts of really cool stuff, which helps us understand down to actual individual events. We're doing actual tracing with eBPF. It gives us a really interesting view of how our systems are performing.
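The in-kernel summaries that these tracing tools keep usually boil down to power-of-two histograms: each observed value bumps a bucket counter inside a kernel map, and only the bucket counts cross into userspace. Here is a userspace sketch of that bucketing logic in Rust, just to show the shape of the summary; in a real eBPF tool this bookkeeping happens per event, inside the eBPF program itself.

```rust
// A power-of-two ("log2") histogram: the same shape of summary that eBPF
// tracing tools typically keep in a kernel map and export to userspace.
struct Log2Histogram {
    buckets: [u64; 64], // bucket i counts values in [2^i, 2^(i+1))
}

impl Log2Histogram {
    fn new() -> Self {
        Self { buckets: [0; 64] }
    }

    fn record(&mut self, value: u64) {
        // floor(log2(value)) picks the bucket; zero goes into bucket 0.
        let bucket = if value == 0 { 0 } else { 63 - value.leading_zeros() as usize };
        self.buckets[bucket] += 1;
    }

    fn print(&self) {
        for (i, count) in self.buckets.iter().enumerate().filter(|(_, c)| **c > 0) {
            println!("[{}, {}): {}", 1u64 << i, 1u128 << (i + 1), count);
        }
    }
}

fn main() {
    let mut hist = Log2Histogram::new();
    // Illustrative packet sizes in bytes.
    for size in [64u64, 72, 128, 200, 1448, 1500, 9000] {
        hist.record(size);
    }
    hist.print();
}
```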
eBPF is very powerful for understanding latency. We can start to understand how long runnable tasks are waiting in the run queue before they get scheduled to run on a CPU. We can actually measure the distribution of file system operation latencies, like read, write, open, fsync, all that stuff. We can measure how long individual requests are waiting on the block device queue. That helps us to evaluate how different queueing algorithms actually wind up performing for our workload. We can also measure things like outbound network connect latencies, the delta in time between sending a SYN and getting a SYN-ACK back. All very cool stuff.
eBPF is also insanely powerful for workload characterization, which is an often neglected aspect of systems performance telemetry. It becomes very difficult to work with a vendor if they pose the question of, "What are your block I/O sizes?" and you're just, "I don't know. It's this type of workload, I guess." You actually need to be able to give them numbers, so they can help you. With eBPF, you can start to get things like the block I/O size distribution for read and write operations. Now, you can actually be like, "It's not 4K. It's actually this size." You can give them the full distribution, and that really starts to help things. You can also look at your packet size distribution, so you can understand actually, if you're approaching full MTU packets, or if jumbo frames would help, or if you're just sending a lot of very small packets. A lot of my background is from optimizing cache services, like Memcached and Redis type stuff, where we have very small packet sizes generally. Yes, it can be interesting that you can start to hit actual packet rate limits within the kernel, way before you hit actual bandwidth saturation.
Sampling Resolution
One of the most critical aspects about measuring systems performance is how you're sampling it, and how often you're sampling it. It all comes down to sampling resolution. The Nyquist-Shannon theorem basically states that when we're sampling a signal, there is an inherent bandwidth limit, or a filter, that's defined by our sampling rate. This imposes an upper bound on the highest frequency component that we can capture in our measured signal.
More practically, if we want to capture bursts that are happening, we need to sample at least twice within the duration of the shortest burst we wish to capture. When we're talking about web-scale distributed systems type things, 200 milliseconds is a very long time actually. A lot of users will not think your site is responsive if it takes more than 200 milliseconds. When we're talking about things like caches, 50 milliseconds is a very long time. You start to get into this interesting problem where now things that take a very long time on a computer scale are actually very short on a human scale. You actually need to sample very frequently, otherwise, you're going to just miss these things that are happening.
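Putting rough numbers on that rule of thumb, this is just the standard Nyquist relation applied to burst durations, where $f_s$ is the sampling rate and $T_{\text{burst}}$ is the shortest burst we want to capture:

$$ f_s \;\ge\; 2 f_{\max} \quad\Longleftrightarrow\quad \Delta t_{\text{sample}} \;\le\; \tfrac{1}{2}\, T_{\text{burst}} $$

So a 200 millisecond burst needs a sample at least every 100 milliseconds, a 50 millisecond burst needs one at least every 25 milliseconds, and minutely sampling can only faithfully represent behavior that plays out over two minutes or longer.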
In order to demonstrate the effect that sampling rate has on telemetry, we're going to take a look at CPU utilization on a random machine I sampled for an hour. There it is. This is sampled on a minutely basis, and is just the CPU utilization. The actual scale doesn't matter for the purpose of our talk here. We can tell from this that our utilization is relatively constant. At the end, there's like this brief period where it's a little bit higher and then comes back down to normal. I would say that other than that, it's pretty even.
No, that's not actually how it is. Here, we're seeing secondly data in blue, with the minutely data in red, and we now see a much different picture. We can now see that there are regular spikes and dips around this minutely average that we were otherwise capturing. The time series is also a lot fuzzier in general, and arguably a lot harder to read. I wouldn't want to look at this from thousands of servers at that resolution. It just becomes too much cognitive burden for me as a human.
Here, we're actually looking at a histogram of the distribution of those values, normalized into utilization (nanoseconds per second), so they're comparable. Really, the most interesting thing is the skew of these distributions. The minutely data gives you this false impression that it's all basically the same. The reason the chart looks very boring toward the right-hand side is that the secondly data has a very long tail off to the right that just doesn't show with this particular scaling. A log scale might have been better here. The secondly data, even though it's right-skewed, is more of a normal distribution, whereas the minutely data is weird-looking.
This problem isn't restricted to just sampling gauges and counters. It can also appear in things like latency histograms. rpc-perf, which is my cache benchmarking tool, has the ability to generate these waterfall plots. Essentially, you have time running downwards, and increasing latency to the right-hand side of the chart. The color intensity sweeps from black through all the shades of blue, and then from blue all the way to red, to indicate the number of events that fell into that latency bucket.
Here, we're looking at some synthetic testing, with rpc-perf against a cache instance. In this particular case, when I was looking at minutely data, my p999 and p9999, the three- and four-nines tail latencies, were higher than I really expected them to be. When I took a look at this latency waterfall, you can see that there are these periodic spikes of higher latency that are actually skewing my minutely tail latencies. To me, this indicated that there was something we needed to dig into and see what was going on. It's pretty obvious that these things are occurring on a minutely basis, and always at the same offset in the minute. These types of anomalies or deviations in performance can actually have a huge impact on your systems.
How can we capture bursts? We just increase our sampling resolution. This comes at a cost. You have the overhead of just collecting the data, and that's pretty hard to mitigate. Then, you also need to store it and analyze it. These are areas where I think we can improve.
Low Cost and High Resolution
How can we get the best of both worlds? We want to be able to capture these very short bursts, which necessitates a very high sampling rate, but we don't want to pay for it. We need to think about what we're really trying to achieve with our telemetry. We need to go back to the beginning, and think about what we're trying to capture and what we're trying to measure. We want to capture these bursts, and we know that reporting percentiles of latency distributions is a pretty common way of being able to understand what's happening at rates of thousands or millions of events per second, so very many events.
Let's see what we might be able to do by using distributions to help us understand our data. Instead of a minutely average as the red line here, we actually have a p50, the 50th percentile, across each of these minutes. This tells us that half the time within the minute the utilization is this red line value or less, but also that it's higher than this the other half of the time. It actually looks pretty similar to the minutely average, just because of how this data is distributed within each minute.
If we shift that up to the 90th percentile, we're now capturing everything except those spikes. This could be a much better time series to look at if we're trying to do something like capacity planning. Particularly when you're looking at network link utilization, you might have 80% utilization as your target for the network, so you can actually handle bursts that are happening. Looking at the p90 of a higher resolution time series might give you a better idea of what you're actually using, while still allowing for some bursts that are happening outside of that.
Here, we've abandoned our secondly data entirely, but we're reporting at multiple percentiles. We have the min, the p10, 50, 90, and the max. We can really get a sense of how the CPU utilization looked within each minute. You can tell that we still have this period of increased CPU utilization at the end. Now, we're actually capturing these spikes. Our max is now reflecting those peaks in utilization. We can also tell that generally, it's a lot lower than that. The distance between the max and the p90 is actually pretty intense. We can actually start telling that our workload is spiky just based on the ratio between these two time series. Particularly bursty workloads might be an indicator that there's something to go in and look at, and areas where you could benefit from performance analysis and tuning.
The other really cool thing about this is that, instead of needing 60 times the amount of data to get secondly instead of minutely, now we only need five times the amount of data to be stored and aggregated, and we still have a pretty good indicator of what the sub-minutely behaviors are. Also, this is a lot easier to look at as a person. I can look at this and make sense of it. The fuzzy secondly time series is very hard, especially with thousands of computers. To me, this winds up being a very interesting tool for understanding our distribution without burning your eyes out looking at charts.
How can we make use of this? We want to be able to sample at a high resolution, and we want to be able to produce these histograms of what the data looks like within a given window. You're doing a moving histogram across time, shoving your values in there, and then exporting these summaries about what the percentiles look like, what the distribution of your data looks like.
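A minimal sketch of that idea in Rust, to make it concrete: keep a rolling window of the most recent high-resolution samples and, at each export interval, summarize them into a handful of percentiles. This is not the actual metrics library shared by Rezolus and rpc-perf, which is histogram-based so inserts and summaries stay cheap at any window size; sorting the raw samples is just the simplest way to show the shape of it.

```rust
use std::collections::VecDeque;

// Keep the most recent `capacity` high-resolution samples (e.g. 60 secondly
// readings) and summarize them into percentiles at export time.
struct WindowSummary {
    window: VecDeque<u64>,
    capacity: usize,
}

impl WindowSummary {
    fn new(capacity: usize) -> Self {
        Self { window: VecDeque::with_capacity(capacity), capacity }
    }

    fn record(&mut self, value: u64) {
        if self.window.len() == self.capacity {
            self.window.pop_front();
        }
        self.window.push_back(value);
    }

    fn percentile(&self, p: f64) -> Option<u64> {
        if self.window.is_empty() {
            return None;
        }
        let mut sorted: Vec<u64> = self.window.iter().copied().collect();
        sorted.sort_unstable();
        let rank = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
        Some(sorted[rank])
    }

    // The five time series exported instead of all raw samples:
    // min, p10, p50, p90, max.
    fn export(&self) -> Option<[u64; 5]> {
        Some([
            self.percentile(0.0)?,
            self.percentile(10.0)?,
            self.percentile(50.0)?,
            self.percentile(90.0)?,
            self.percentile(100.0)?,
        ])
    }
}

fn main() {
    let mut summary = WindowSummary::new(60);
    // Illustrative: a mostly flat utilization signal with a few spikes.
    for second in 0..60u64 {
        let value = if second % 20 == 0 { 900 } else { 250 + second % 7 };
        summary.record(value);
    }
    println!("min/p10/p50/p90/max = {:?}", summary.export().unwrap());
}
```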
This leads me to Rezolus. I had to write a tool to do this. Rezolus does high-resolution sampling of our underlying sources. There is a metrics library that's shared between Rezolus and my benchmarking tool, rpc-perf. It's able to produce these summaries based on data that's inserted into it. Rezolus gives us the ability to sample from all those sources that we talked about earlier, our traditional counters about CPU utilization, disk bandwidth, stuff like that. It can also sample performance counters, which gives us the ability to start to look at how efficiently our code is running on our compute platform. It has eBPF support, which is very cool for being able to look at our workload and latency of these very granular events.
Rezolus is able to produce insightful metrics. Even if we're only externally aggregating on a minutely basis, we get those hints about what our sub-minutely data distribution looks like. The eBPF support has been very cool for helping us to actually measure what our workloads look like, and to capture their performance characteristics. These types of things would be unavailable to us otherwise. There is just no way to do it except to trace these things. eBPF is also very low overhead, just because it's running in the kernel and you are then just pulling the summary data over, so it winds up being fairly cheap to instrument these things that are happening all the time. We're able to measure our scheduler latency, our packet size distribution, our block I/O size distribution, very important stuff when you're talking to a hardware vendor to try to optimize your workload on a given platform, or the platform for the workload.
Rezolus is open-source. It's available on GitHub. Issues and pull requests are welcome. I think that it's been able to provide us a lot of interesting insight into how our systems are running. It's helped us to capture some bursts that we wouldn't have seen otherwise. I think it's useful even for smaller environments. I've worked in small shops before I worked at Twitter, and I really wish I had a tool like this before. At small shops, often you don't have the time to go write something like this, so it's really great that Twitter allowed me to open source something like this so it can become a community thing. Oftentimes at smaller shops, your performance actually matters a lot, because you don't necessarily have a huge budget to just throw money at the problem, so understanding how your system is performing and being able to diagnose run-time performance issues is actually very critical. At Twitter's scale, even just a percent is a very large number of dollars, so we want to be able to squeeze the most out of our systems.
How Has It Helped in Practice?
In practice, how has Rezolus helped us? Going back to this terrible latency waterfall: we actually saw something like this in production. When we were looking at it in production, we measured CPU utilization, and periodically the CPU utilization of the cache instances was bursting. Rezolus, in addition to providing this histogram distribution, also looks for the peak within each rolling interval, and will report the offset from the top of the minute at which the peak occurred. That can be very useful to start correlating with your logs and your tracing data within your environment, and helps to really narrow down what you're looking at.
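In spirit, that's a tiny amount of bookkeeping layered on top of the rolling window: remember the largest sample in the interval and when it was seen, so it can be lined up with logs and traces. A hedged sketch in Rust; the names here are mine, not Rezolus's.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Track the largest value seen in the current reporting interval and the
// offset into the minute at which it was observed.
#[derive(Default)]
struct PeakTracker {
    peak: u64,
    offset_secs: u64, // seconds past the top of the minute when the peak occurred
}

impl PeakTracker {
    fn record(&mut self, value: u64) {
        if value >= self.peak {
            let now = SystemTime::now()
                .duration_since(UNIX_EPOCH)
                .expect("clock before epoch")
                .as_secs();
            self.peak = value;
            self.offset_secs = now % 60;
        }
    }

    // Report and reset at the end of each interval.
    fn report(&mut self) -> (u64, u64) {
        let out = (self.peak, self.offset_secs);
        *self = PeakTracker::default();
        out
    }
}

fn main() {
    let mut tracker = PeakTracker::default();
    for value in [250u64, 260, 900, 255] { // illustrative CPU readings
        tracker.record(value);
    }
    let (peak, offset) = tracker.report();
    println!("peak {} at ~{}s past the top of the minute", peak, offset);
}
```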
Rezolus was able to identify this peak in CPU utilization, which was twice what the baseline was, and it had a fixed offset in the minute. We narrowed it down to a background task that was running and causing this impact. For the particular case of cache we were able to just eliminate this background task, and that made our latency go back down to normal.
Rezolus has also helped us detect CPU saturation that happened on a sub-minutely basis. This is something where if you have minutely data, you're just not going to see this at all, because it winds up getting smoothed out by this low pass filter that the Nyquist-Shannon theorem talks about. Rezolus was able to detect this in production, and capture and reflect the CPU saturation that was happening. That helped the backend team actually go in and identify that our upstream is sending us this burst of traffic at that period of time. They were able to work with the upstream service to smooth that out and have it not be so spiky.
Summary
In summary, there are many sources of telemetry for understanding our systems performance. Sampling resolution is very important, otherwise you're going to miss these small things that actually do matter. In-process summarization can reduce your cost of aggregation and storage. Instead of taking 60 times the amount of data to store a secondly time series, you can just export maybe five percentiles. The savings become even more apparent when you are sampling at even higher rates: you might want to sample every hundred milliseconds, 10 times per second, or something like that. You might want to sample even faster than that for certain things. Really, it all goes back to what is the smallest thing I want to be able to capture. As you start multiplying how much data you need within a second, the savings just become even greater. Having high-resolution telemetry can help us diagnose run-time performance issues as well as steer optimization and performance tuning efforts. Rezolus has helped us at Twitter to address these needs, and it's, again, available on GitHub.
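To put rough numbers on that tradeoff, assuming external aggregation stays minutely:

$$ \frac{60\ \text{secondly samples/min}}{5\ \text{percentile exports/min}} = 12\times\ \text{less data}, \qquad \frac{600\ \text{samples/min at 10 Hz}}{5\ \text{percentile exports/min}} = 120\times\ \text{less data} $$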
Questions and Answers
Participant 1: You mentioned this sub-second sampling. What are the tools you use to sample, say, CPU utilization in microseconds or disk usage in microseconds? What are those basic tools that you are using?
Martin: Those basic or traditional telemetry sources are exposed by procfs and sysfs, so you are able to just open those files and read out of them periodically.
Participant 1: Ok, procfs, but then you need root access maybe, or is it something that any person can access?
Martin: That kind of stuff actually does not need root access. When you start looking at things like perf events, you do need sysadmin privileges, essentially. You do need root access, or for the binary to have the CAP_SYS_ADMIN capability. eBPF also requires high-level access, because you're injecting code into the kernel. Actually, in some recent kernel version, I forget which, they're starting to lock down things like that by default, and it seems like they're still trying to work out how to deal with that exactly.
Participant 1: Another question is, how do you do it? I don't think you can do it on production. Do you have to come up with a system that is very similar to production where you run these tests, or are you running it on production itself?
Martin: We run Rezolus on our production fleet. Having the tooling so we're able to rapidly identify run-time performance issues, and give teams insights to help diagnose those issues, root cause them, and resolve them, is definitely worthwhile. Rezolus winds up being not super expensive to run, actually. It takes about 5% of a single core, and maybe 50 megabytes of resident memory, to sample at, I believe, 10 times per second with all the default samplers. So none of the eBPF functionality, but all of the perf counters, and all of the traditional systems performance telemetry, in what I think is a very small footprint. Definitely the insight is worth whatever cost that has to us.
Participant 2: In your example, does Rezolus record the diagnostics once something is abnormal in terms of statistics? If it does, what does the footprint look like, and what's the cost of it?
Martin: That's a very interesting question about whether Rezolus records as it detects anomalies. That's something that we've been talking about internally, and I think it would be a very cool feature to add. One could imagine Rezolus being able to use that metrics library to easily detect that something abnormal is happening, and either increase its sampling resolution, or dump a trace out. Rezolus winds up in a very interesting position in terms of observability and telemetry, where it could pretty easily have an on-disk buffer of this telemetry available at very high resolution. This hasn't been implemented yet, but it is definitely a direction that I want to take the project in.
Participant 3: Is it possible to extend Rezolus to provide telemetry on business KPIs as opposed to just systems KPIs?
Martin: The answer is yes. Rezolus also has this mode that we use on some of our production caches, the Memcache compatible servers. We use Twemcache at Twitter, which is our fork. There's a stats thing built into the protocol. Rezolus can actually sit in between, and sample at higher resolution the underlying stats source, and then expose these percentiles about that. We actually use that on some of our larger production caches, to help us capture peaks in QPS, and what the offset is into the minute. Yes, one could imagine extending that to capture data from any standard metrics exposition endpoint, and ingest that into Rezolus, and then expose those percentiles.
Participant 3: Is there some development work that's needed to do that, or can it do it as-is?
Martin: There would probably be some development work needed, but I would call it almost trivial. If one were familiar with the codebase, it would definitely be trivial. Yes, there's already a little bit of that framework in there, just from being able to ingest stats from Twemcache, but it would need to be extended to pull from HTTP and parse JSON, or something like that, whatever the metrics exposition format is. This is actually another thing that we've been talking about internally, because Twitter uses Finagle for a lot of things, and our standard stack exposes metrics via an HTTP endpoint. The ability to sit in between our traditional collector and the application, and provide this increased insight into things without that high cost of aggregating, is very interesting to us. That is work that I think is likely to happen, although there are some competing priorities right now. That would be the stuff that I would definitely welcome a pull request on.
Participant 4: I just have a follow-up on the CPU and memory overhead of Rezolus itself. Do you consider that low enough that you can freely run it all the time on all the prod servers? Do you prefer to have a sample set of servers? Or would either of those approaches have an impact on the data volume and counts for you [inaudible 00:42:31]? What's your approach to [inaudible 00:42:34] in production and when do you [inaudible 00:42:36]?
Martin: The question was about the resource footprint of Rezolus and the tradeoff between running it all the time and getting that telemetry from production systems versus the cost and do we do sampling, or run it everywhere all of the time.
My goal with the footprint that it has is that it could be run everywhere, all the time, in production. We always have telemetry agents running. I think that Rezolus can actually take over a lot of what our current agent is doing, fit within that footprint, and still give us this increased resolution. Yes, I'm trying to think of what the current percentage of rollout is, but the goal is to run it everywhere all the time. Things become a little more resource-constrained in some areas, if you think about containerization environments where you have basically a small system slice dedicated to things that aren't running in containers. Again, I believe that Rezolus can help take away some of that work that's being done, and we'll be able to actually get rid of existing agents. The resource footprint of things like Python telemetry agents just gets really unpredictable. Rezolus is in Rust, so the memory footprint is actually very easy to predict. The CPU utilization is pretty constant. It doesn't have any GC overheads, or stuff like that. I think that we'll be able to shift where we're spending resources, and be able to run Rezolus everywhere all the time. I think we will be looking into things like dynamically increasing the sampling resolution, having an on-disk buffer, being able to capture tracing information, possibly integrating with our distributed tracing tool, Zipkin, stuff like that, and being able to start tying all these pieces together. Everywhere, all the time.
Participant 5: You use Rezolus, you find a spike in performance. Then, how do we magically map that back to what's going on? What application is causing the performance issue?
Martin: That is where things become more art than science. Typically, I've had to go in and trace on the individual system and really look to understand what's causing that. We can look at things like which container is using the resource at that point of time. I think at some point you wind up falling back onto other tools while you're doing root cause analysis. Those might be things like getting perf-record data. That might be things like looking at your application logs, and just seeing what requests were coming in at that point in time, or doing more analysis of the log file.