Bio: Gil Tene is CTO and co-founder of Azul Systems. He has been involved with virtual machine and runtime technologies for the past 25 years. His pet focus areas include system responsiveness and latency behavior. Gil is a frequent speaker at technology conferences worldwide, and an official JavaOne Rock Star. Gil pioneered the Continuously Concurrent Compacting Collector (C4) that powers Azul's GC.
Hi. I'm Gil Tene. I am the CTO and one of the co-founders of Azul Systems. I have been working with Java since there's been a Java, so 20 and a bit years. I have been working on Java virtual machines; Azul makes Java virtual machines, and I have been working in that space for 13 years now. And over the years I have done a whole bunch of other things that are not Java-related: operating systems and drivers and big applications that either used Java or didn't.
One of my pet areas to talk to people about and teach people about is performance and latency measurement: the techniques, the metrics, the mistakes, and the ways of doing it right. I have been doing a lot of that kind of speaking over the last three years or so, with titles that vary from names like “How not to measure latency” to more management-friendly names like “Understanding latency and responsiveness”. I find that there is an accumulation of a lot of interesting mistakes, questions, and needs that keeps adding to the material, so I just keep the titles and change all the slides every time and hope that things are good.
So when I talk about latency, I look at latency and response time as synonymous. Latency is the time it takes an operation to go from point A to point B. Point A might be me requesting something and point B getting the response, or it might be a message entering a service until the message leaves the service on the other end. It's the latency through a system as observed from the outside; when I use the term, that's what I mean. I am specifically not talking about the subcomponents of latency, which are things like the service time of a specific operation or the length of a wire. Those all add up to make up the latency that is experienced from the outside.
The way latency affects value varies dramatically by application and domain. At Azul we work with people across an extremely wide spectrum: everything from low latency trading on Wall Street, where microseconds matter, to human-interactive online analytics, where a very complex question is being asked and it is going to take 15 seconds to crunch the answer, but people don't want it to take ten minutes. So the range is extremely wide, and what the value is, is very different.
So it's probably intuitive to understand how latency matters in trading. Whoever gets to the better price first, whoever is able to figure out gaps and take advantage of them first: that's a direct win through latency. It is war, and latency is the weapon. But there are many places in our much more fungible, regular human lives where latency matters. Latency, for people interacting with an application, equates to happiness, or probably the other way around: bad latency equates to misery, and how miserable you make people determines whether you succeed or not. Hopefully, in most applications, if people are miserable, then you don't succeed. I could see some sadistic applications where it's the opposite.
A great example of that would be online retail. It is very easy to demonstrate that when your retail website has better responsiveness, things are just quicker. When you're browsing, when you're checking out, it actually works; you don't end up with a spinning wheel for three minutes, and it won't say “Don't switch web pages”. If you do that, people buy more. From the same site with the same content, people buy more, and that directly results in revenue from the website, revenue from the store. That's probably the best place where you see it. It's why we play nice music in stores: we want to make people happy, and when they feel happier they buy more. That's a good example there.
Another example perhaps would be ad placement: the quality of the ad that you get to see when you browse, and when I say quality of the ad, it's a quality that matters both to whoever is advertising and to you. People are generally much more accepting of advertisements that are relevant to them. I don't want things thrown at me that I don't care about, but if it's something I care about, I might actually appreciate having it in front of my eyes. There's a time budget in which we have to choose the right thing to show people, and however much we can do within that time budget, from a quality perspective, will make it a better choice. So being able to complete a more accurate match in a given amount of time creates extreme value for advertising. We see that a lot as well.
Well, there are so many fallacies.
Robert: Two or three.
I run into a lot [of them]. In fact, a pattern in many of my talks is to point out some of the common ones, or the shocking ones, or the ones I see people do all the time. I'll pick a couple. One key example: there is a human need to think that if we've looked at a distribution of data, it gives us a feel for how the data behaves as a whole. People will tend to look at a lot of latency results and hope that the set of results has some sort of common behavior, like a Gaussian distribution with a mean and a standard deviation. And often people will look at the data, summarize it with some basic statistical numbers, and ignore the actual shape. It turns out that when you actually plot the shape of latency measurements, not one measurement but a million of them of the same thing, latency or response time almost never behaves like a normal distribution.
The actual behavior is typically a strongly multi-modal one. There is a good mode and some noise around it. Then there's a bad mode, which is probably still acceptable, and some noise around that. And then there are some terrible modes and some noise around them, but there is nothing in between. So it's not that there is this one center with some noise around it, and you cannot do regular statistical analysis as if it were a simple population. As a result, whenever we try to describe latency in those terms we're mischaracterizing it, and then acting on a mischaracterization. I call that latency wishful thinking. We wish it behaved in a certain way. We summarize it with numbers that would only be valid if it behaved that way. We throw away all the data, and then we make business decisions based on what we did, and they're almost always wrong.
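As a sketch of the point, here is a synthetic, entirely made-up multi-modal latency sample: a good mode, a bad-but-acceptable mode, and a terrible mode. The mean describes a latency almost no request actually experienced, while a few percentiles expose the modes:

```python
import math

# Hypothetical latency samples (milliseconds): 980 "good", 15 "bad", 5 "terrible".
samples = [1.0] * 980 + [50.0] * 15 + [500.0] * 5

def percentile(sorted_data, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    k = max(math.ceil(p / 100.0 * len(sorted_data)), 1)
    return sorted_data[k - 1]

data = sorted(samples)
mean = sum(data) / len(data)

print(f"mean  = {mean:.2f} ms")                 # 4.23 ms: a value no mode contains
print(f"p50   = {percentile(data, 50):.1f} ms")   # 1.0 ms (the good mode)
print(f"p99   = {percentile(data, 99):.1f} ms")   # 50.0 ms (the bad mode)
print(f"p99.9 = {percentile(data, 99.9):.1f} ms") # 500.0 ms (the terrible mode)
```

Summarizing this data as "mean 4.23, some standard deviation" would miss all three modes; the percentiles recover them.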
Well, the other fallacy is the belief that when we measure latency, with either a monitoring system or a benchmark or a load generator of sorts that is testing our system to see how it behaves, the numbers we are actually getting have anything to do with latency or response time. This is not a truly universal statement, but unfortunately it's very common. If I were a betting man, I would take the bet any time that your numbers are wrong as opposed to right, and I would probably win more than 90% of the time.
Specifically, a lot of times when we see the words latency or response time in a report, what is actually being measured is the service-time component of response time. When I talk about service time, there's a queuing-theory definition of what that is, but there's a really simple way to explain it with coffee. In a coffee shop, we can measure how long it takes from the time I ask for a cup of coffee until I get the cup: the person making the coffee takes a certain amount of time to process the order and hand me the cup. That is service time. But when I go to the coffee shop across the street because I need a cup of coffee, the time it takes me to get the coffee is response time, and it starts when I get to the shop and stand in line. If I get to a shop that has nobody in it, and there's nobody in line, then response time and service time are very similar: I'm not waiting for anything, so all that gates me is how long somebody takes to do the operation. But the minute we have any amount of load, where two things could happen at the same time and ten people could be waiting in line, a queue builds up, and response time becomes completely separated from service time.
Unfortunately, the tools that we use today to measure and display latency and response time, the ones that report those numbers, are basically only reporting the small number that is the service time, out of the time you would experience if you actually wanted that service. And the difference is huge; the effect on misery and happiness in that difference is huge. People get a false sense that things are good when they are terrible. If a thousand people went to the Starbucks across the street and wanted a cup of coffee, the service time would not change. It would still be however many seconds it takes to make one cup. The fact that there are a thousand people in line, who are probably walking away and going to the competing coffee shop across the street, does not show up in that metric.
So if we look at a breakdown from a queuing-theory point of view, you could say response time is wait time plus service time. That's the simple view if you have one service and one queue. In the real world, we often have multiple services and multiple queues, and what appears to be the response time at one layer, which includes that layer's service time and wait time, becomes the service time seen from the next layer out. Let's use a concrete example. I can measure the service time of a database that might be going to disk to do some operation: I make the query, it goes and looks at some things on disk and gives me the answer. That's its service time, and if a lot of people are waiting, my response time also includes queuing time. But that database is in turn using a disk, and the disk has its own service time, which is how long the transfer takes once the head is in the right place, plus the time spent waiting for the head to move, which is itself a form of queuing or delay.
So at every layer, the service time of one layer can include the queuing time of the layer below it. Whenever you're measuring or reporting response time or service time, there are two basic ways to look at it. One of them is: how long did it take me to do this? The other is: how long does somebody take to get this from me? Those are the only two ways to look at it, and which one you are measuring depends on what you want to measure. "How long does it take me to do this" is always only an internal concern. Nobody outside of your system cares how long it takes you to do something if it takes them a long time to get it.
A common example of something with this queue/service-time duality is a piece of wire. It takes light a nanosecond to travel 30 centimeters through air, and in a wire it's a little longer. You could think of that as service time, meaning how long it takes to move light across the wire, or you could think of the wire as a queue: you enter here, you leave there, and it takes this long. That duality exists for almost any service-time-versus-wait-time question, depending on whether you're looking from the inside or the outside of the thing.
When I look at fallacies, though, the issue is almost always people looking at the number that is service time and thinking that's what people will observe. In my coffee example, it's not that service time is necessarily short. The person making the coffee, the barista, might take a lunch break, and that one cup of coffee might take an hour. We will know that there's one cup of coffee that took an hour; that would not be hidden. There will be a very long service time for one cup of coffee. But it would be one cup of coffee, not the 300 other cups of coffee that are also waiting for that cup to complete. So the effect of one long service time on response time is pretty dramatic. The thing that builds up queues is response times that are longer than the gap between incoming requests.
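The coffee-shop arithmetic above can be sketched as a tiny single-server FIFO simulation (all numbers are made up for illustration): one long service time near the front of the line inflates the response time of everything queued behind it, even though those requests' own service times stay short.

```python
def simulate(arrival_interval, service_times):
    """Single-server FIFO queue; returns the response time of each request (ms)."""
    free_at = 0.0            # when the server next becomes free
    response_times = []
    for i, svc in enumerate(service_times):
        arrival = i * arrival_interval
        start = max(arrival, free_at)        # wait in line if the server is busy
        free_at = start + svc
        response_times.append(free_at - arrival)  # wait time + service time
    return response_times

# 200 requests, one every 10 ms; each takes 1 ms to serve,
# except request #10, which stalls for 1000 ms (the barista's lunch break).
service = [1.0] * 200
service[10] = 1000.0
rts = simulate(10.0, service)

print(max(rts))                           # 1000.0: the stalled request itself
print(sum(1 for r in rts if r > 10.0))    # 110: requests whose RESPONSE time was long
```

A service-time-only metric here reports one 1000 ms outlier and 199 fast requests; the response-time view shows 110 requests stuck waiting behind it.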
I think that when people build systems with strong latency requirements, the requirements should cover the whole spectrum of latency. I have not yet met a system that can explain why it requires a median latency of X and nothing more; that's not a complete set of requirements. And conversely, most systems do not just have "I need a maximum of this and I don't care about anything else". I've met some that do: there are hard real-time systems like that. A heart pacemaker needs to beat at a certain rate, and the maximum has to be X; nothing else matters.
But in most systems, when we look at latency requirements, we have a spectrum. How many are allowed to be this big? How many are allowed to be this big? How many are allowed to be this big? And what is the number that nothing is allowed to be bigger than? Depending on your application, there may or may not be an absolute hard limit. If you're in advertising, what's going to happen if you don't serve 0.001% of requests? You could have made 0.001% more money, and you didn't. Big deal, right? But if it were 10%, you might care a lot.
So "what percent of what you do is allowed to be how bad" is usually how you should look at latency requirements. I like to use percentiles for that because they are intuitive to most people, all the way to non-technical people, finance people, CEOs and the rest, and they are an easy way to describe a desired behavior without using a million numbers. Two, three, maybe four percentile levels, and how big each one is allowed to be, are enough to state a latency requirement, I believe. Whether that requirement ends up being in microseconds, milliseconds, seconds or hours doesn't matter; that's the application. How many steps there are and what they are is the requirement.
So if I look at the latency spectrum as a whole (this is a very generic statement, but you can check it against experience), I usually look first at the range I think of as the common case. The common case is probably the median, maybe the 90th or 99th percentile; those are all common. That part of the spectrum tends to be dominated by what we generally think of as speed: the common-case speed. How long does it actually take to compute this when there is no contention and nothing is in the way? I have to do a billion operations; it's going to take this long. I need to go do this; it's going to take that long.
Then we have what happens beyond the common case, and that has multiple different causes. The first set comes from the fact that computers are not the continuous, "running all the time, the same way" things that we wish they were. Computers are interrupted all the time. They change their speed all the time. They lose attention, look at something else, and then come back to the shiny thing again. That's how computers actually work. Between instructions we stall, between context switches we stall, when we have a timer interrupt we stall, when we have a garbage collection event we stall, when we need to go and re-index the database we stall. That's a natural effect at every scale, and every time your application stalls, it's causing not just a single operation to be long but, if you're under load, a queuing-up of operations. So the fact that the actual pace of the machine is not constant, that what it can do is not constant, creates these effects, which are either small and momentary or, at higher and higher loads, large and slow to bleed down.
I usually look for what could have caused those events: the weird thing, not the common thing, that sometimes happens and creates them. And those weird things tend to get worse and worse with load, either bigger or more frequent or both. I obviously spend a lot of time on garbage collection because it's our specialty; it's what we spend a lot of time with and improving, and it's why we started measuring all of this. But it's not just garbage collection that does this. Power saving, context switches and, as I said, re-indexing the database are all good examples of things that cause weird or outsized behaviors. The high-level way I think of all of those as a category is as accumulated debt that has to be paid.
Our common operations seem to go at a certain speed, but most of the things we do carry a little bit of cost that we did not yet pay, that we've deferred to the future, and at some point that cost will be collected. How it gets collected is a matter of design, but in many systems it gets collected by stalling. For example, I might have been lucky enough to have the CPU all to myself for the last ten milliseconds; now it's somebody else's turn for another ten milliseconds, and I just don't get the CPU for that amount of time. It's a complete stall. Your turn, my turn, your turn, my turn. I'm fast, I do nothing. I'm fast, I do nothing. That's one way.
Similarly, there are things that are a lot more incremental, where I need to stop to do some bookkeeping, like taking a timer interrupt every few hundred microseconds to see if there's anything that needs to happen now. It's a tiny operation; you don't even feel it, but it does have a pretty big effect on what you do. And database re-indexing is a good example of a great accumulation: you've been doing a lot of things, the quality of your indexes has been diminishing, and it is time to re-index the whole database. Whether you do this as a phased operation or as a background operation that simply takes away resources is a matter of design. But in either case, you're paying for having been fast in the past.
And very commonly that's where those glitches come from. Not all of them; some are just noise, and I think what we call noise is what we don't understand or don't have an explanation for. But in many cases we're paying for something, we cannot continue without paying it, and we're being called on it. We have to do it, and then we can go fast again. Garbage collection is a great example: you've run really, really fast and got those nice cheap allocations; now pay by cleaning everything up, and then you can go fast again.
Robert: Many of the things you describe are unavoidable properties of computer systems. If we're using Java, it's going to pause for garbage collection. What do you do if the pauses introduced by the physical properties or inherent built-up tasks of the systems you're using exceed the SLAs for latency that you need to deliver?
There are a couple of things in that question. The first is what you do if the properties of the system exceed the SLAs, but the other is whether those pauses are actually inherent. Azul is actually in the business of changing the answer to that question. We do not believe that it is inherent for Java to pause. We think it's just a 20-year-old mistake that nobody fixed, and it's not that hard to fix. We fixed it. It works. And in the universe of people who run their Java on Zing, our really cool JVM, that question is nullified at the core: no, Java doesn't pause, and therefore the question isn't relevant anymore. But some people have to run Java on inferior JVMs, some people have to run on computers that have other reasons for pausing, and some SLAs are so stringent that, Java or not, pauses or not, there are other things that will happen.
As to the core part of the question, what do you do if your SLA requirements are more stringent than what your physical infrastructure can deliver? That's where a lot of fun, hard design comes in at the system level. The first thing to do, and I actually recommend this to people a lot, is to go and check your premise of whether this is actually required. If you went to whoever told you that they really want you to answer every single request in two milliseconds or less and said, "I could do that, but I'm going to need three years and one hundred million dollars; can you relax that a little bit?", you might be surprised. They might say that 20 milliseconds every once in a while is fine, and you can change your requirements.
That's a very useful thing to do, especially with business requirements, which usually start off as "do really, really well all the time" until you give them a bill, and then they say, "Okay, we could do a little worse." But assuming the requirements are real and you've verified them, you're left with making design choices that work around whatever limitations you have. For example, you might find that your SLAs fail above a certain load; if you don't carry that much load, they're still not perfect, but they've met your requirements. That's actually a very common way to do capacity planning. It's not how many widgets per second my machine can do; it is how many widgets per second my machine can do without breaking my requirements, and that is the way to answer how many machines I need.
Very commonly the way to address that is to use a lot more machines to do the same work, so that the latencies shift toward the part of the spectrum you want and maybe you stop breaking your requirements. That's one way. It doesn't always work; it usually doesn't help with the maximums, but it certainly helps shift the percentiles around. There are other design techniques at the system-architecture level: using redundant paths, using timeouts and retries. Those are all ways to work around issues. Say you need an answer in two seconds but, unfortunately, while 99% complete in under two seconds, 1% take three seconds or more.
What you could do is set a timeout at half a second and try again; you have another half second to do it, and another half second. If 99% are done within half a second and only 1% take more, then 99.99% will be done in one second and 99.9999% will be done in a second and a half. You're up to eight nines at the two-second mark, which is probably pretty good, and you might meet your requirements that way: if I'm not going to finish, start over. Then there are the redundant-path techniques, like idempotent operations where you send multiple requests and whatever comes back first wins. As long as not everybody stalls at the same time and there's some decoupling, I will get my answer in time. Those are all examples of working around the problem at the individual physical layer by building a more resilient or redundant system around it.
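The retry arithmetic above can be computed directly. Assuming each attempt independently completes within its half-second timeout with probability 0.99 (independence is an idealization; correlated stalls, where all retries hit the same pause, would weaken it):

```python
# Probability that a single attempt misses its 0.5 s timeout.
p_fail_one = 0.01

# After n back-to-back attempts, at least one has completed with
# probability 1 - p_fail_one**n.
for attempts, elapsed in [(1, 0.5), (2, 1.0), (3, 1.5), (4, 2.0)]:
    p_done = 1 - p_fail_one ** attempts
    print(f"within {elapsed} s ({attempts} attempts): {p_done:.8f}")
# within 0.5 s (1 attempts): 0.99000000
# within 1.0 s (2 attempts): 0.99990000
# within 1.5 s (3 attempts): 0.99999900
# within 2.0 s (4 attempts): 0.99999999
```

This reproduces the figures in the text: four nines at one second, six nines at a second and a half, eight nines at two seconds.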
So, last takeaways on the theme: re-examine your beliefs, and don't believe those nice pictures you see on the monitoring screens. It is sad to say, but the millions of monitoring dashboards we look at are actually designed to take our minds off the bad stuff, to hide the bad stuff somewhere else and tell us only good things. That is fine as a marketing technique; it's fine as a way of papering over problems. But if what you want is to know how your system behaves so you can improve it, or to know when something is bad so you can react to it, then you should be focusing on the bad parts, not the good parts. Unfortunately, we have a human tendency to summarize the good and hope that the bad looks like it. So try to stop doing that. That's my soapbox preaching moment on latency.
Robert: Gil Tene, thank you very much.