Transcript
Pepperdine: I'm going to tunnel through a very specific problem, and that's just memory. Did you guys get to the keynote this morning? I thought that keynote rocked. It is probably one of the better wake-up calls I've seen at a keynote in quite some time. Everything this guy was saying was spot on. There are a few areas I wished he had gone deeper into, but I'm not going to complain about that. It was really awesome. If you were at Brian Goetz's talk, you could see that a portion of his talk also touched on this memory issue. The memory issue is actually very broad. I'm just going to narrow in on one part of it, but that's generally what we're going to talk about here.
That's my marketing slide that we have to put in here. We've co-founded jClarity, where we're producing what we call a diagnostic engine. We're just trying to bring predictability into the performance diagnostic process. I co-founded JCrete which we call the "hottest unconference on the planet." It's every July in Crete, and you can imagine the weather there is nice and hot. Yes, we do have sessions on the beach in case you're interested.
What is Your Performance Trouble Spot?
The question is, what is your performance trouble spot? What do you guys think is your primary source of performance problems? Just think about that yourself. This is a survey that a friend of ours did a couple of years ago, and it's probably still valid. What are we looking at here? We're looking at the answers to that question, what is your performance trouble spot? If you look at it, you're probably going to recognize things. Slow database queries - everybody who thinks slow database queries are the problem? I guess you have to ask first, who's using databases still? They're still useful, apparently.
You see there are three big bubbles really about inefficient application code. The point is, there are all these really big bubbles around all these different performance problems. We get data from tons of systems across all industries from all corners of the planet. What do we find as the biggest problem? It's this thing down here, excessive memory churn. We actually see that 70% of Java applications are bottlenecked on memory churn.
The question is, if my application is bottlenecked on memory churn, why is it not showing up in this chart? This is really a statement on the state of observability of our systems today. The observability in our systems is highly biased. We've always had problems with database queries, so what do we do with this problem? We put a lot of instrumentation around it, and we log. We use these wonderful things called logs, and then the monitors tell us, "We got a slow database query," so we need to go back in there and retune that bit.
Then maybe something else happens, and you get a regression, and after however long it takes you to go through the whole process, you end up going, "That's the problem. Let's put logging around that." Then you find the next problem, "Ok. Let's put logging around that." Then you end up with this highly biased view of your system which is based on what your historical performance problems have been, but as you can notice, the observability doesn't actually give you that much insight into what's happening in the here and now. That's a real problem.
The other problem is that if you apply an instrument to a system, it's always going to tell you something, and you're probably going to act on whatever it is that it's telling you. The question is, is that really the problem? The other problem in this case, as you can probably notice, is with memory churn. Does anyone have an idea of how you might measure memory churn in your application?
Participant 1: Page fault delta.
Participant 2: Profiling in a production system.
Participant 3: On Sirius.
Pepperdine: Profiling for what, though? You're going to profile for problems that you think you have. You're not generally profiling for problems you don't believe you have. That's another issue.
Participant 4: Profiling for target performance.
Pepperdine: Yes.
Participant 5: What you think the system should be getting.
Pepperdine: We're getting a lot of answers here. I'll give you one of our key performance indicators in this particular area, it’s just simply garbage collection logs. That's it, just doing a simple analysis on the garbage collection logs will give you an inference. It's not an exact figure, but it'll give you some idea of what's happening in terms of memory churn inside your application. It's as simple as that, it's an analysis people don't do, so they absolutely don't see it.
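To make the inference concrete: the amount allocated between two young collections is roughly the Eden occupancy just before a collection minus whatever was left after the previous one, divided by the time between them. Here is a minimal sketch of that arithmetic - the numbers are made up, standing in for values you would pull out of a real GC log:

```java
// Minimal sketch of the inference; the numbers are hypothetical, not from a real log.
public class ChurnEstimate {
    public static void main(String[] args) {
        double edenAfterPreviousGcMb = 12.0;    // Eden left over after the previous young GC
        double edenBeforeThisGcMb    = 2060.0;  // Eden occupancy just before this young GC
        double secondsBetweenGcs     = 0.8;     // delta between the two log timestamps

        double allocationRateMbPerSec =
                (edenBeforeThisGcMb - edenAfterPreviousGcMb) / secondsBetweenGcs;
        System.out.printf("~%.0f MB/s allocated between these two young collections%n",
                allocationRateMbPerSec);
    }
}
```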
The point is, this is what we're seeing. When we get garbage collection logs, since I tune a lot of garbage collectors - I think this year was low, it only got to about 1,000; in previous years, I'd go through 2,000 to 3,000 different JVMs a year - you get a lot of data from these things. This is a problem that once you start seeing it, you just can't stop seeing it. It just shows up absolutely everywhere.
Garbage collection is not really at fault here, this is really just your application code. Then you get a lot of contributions from different things. For instance, you use Spring Boot, Cassandra, or any other big NoSQL solution, Spark as an example, and all the derivatives from that, or frameworks like Log4J, or any Java logging framework for that matter. Of course, I bet you nobody is using this, [JSON] or basically, any marshaling protocol that you're going to use, and I'm going to include SQL statements and things like that as a marshaling protocol, albeit an accidental one. Also caching products, Hibernate, and so on.
You can see there are a lot of these things that we actually just embed into our application that are just going to magically drop you into the 70% without you actually even realizing what's going on. These things can actually make a difference. We all have war stories - here are the war stories, you can read them. I don't really want to go through them too much, the point is that we've gotten some tremendous performance improvements simply by looking at the memory efficiency, the algorithms that people are using, identifying hot allocation sites in the code, and doing something to basically destroy them, or reduce them, or remove them from the code.
It's like anything else. Once you can see these things, then you can design some tests to figure out where this stuff is happening. After that, we can flip from diagnostic mode into thinking creatively about how we can solve these problems. Generally, a lot of these problems are very easily solved. A lot of this work that was done where we're getting massive improvements in performance were happening within a period of less than eight hours of work - real work - believe it or not. They can happen very quickly once you see where the problems are.
Allocation Site
The question is, what does an allocation site look like? If I have this piece of code here, and I decompile it, you can actually see the bytecode that's actually generated. These allocations are mostly going to occur in Java heap and they can occur in a couple of different ways. Generally, what we have in the JVM are these things called allocators, and the allocators are going to look at the allocation itself and make some decisions as to how I'm actually going to perform that allocation. I can go down a slow path, fast path. It's even possible, if I have small objects of the right shape, that the optimizers are going to come along and just basically get rid of the allocation site on its own. We don't even have to do anything. What's going to happen is that it's just going to say, "I know what to do with this, an on-stack allocation," and then we're not doing an allocation out in heap.
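As a minimal illustration of what that looks like - Foo here is a placeholder class, and the javap output shown is what you would typically see rather than what was on the slide:

```java
public class AllocationSite {
    Object makeFoo() {
        Foo foo = new Foo();   // the allocation site
        return foo;
    }
}

class Foo { }

// javap -c AllocationSite typically shows makeFoo() compiled to:
//   0: new           #2   // class Foo
//   3: dup
//   4: invokespecial #3   // Method Foo."<init>":()V
//   7: astore_1
//   8: aload_1
//   9: areturn
```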
Since we're talking about Java heap, I just want to very briefly give you a review of what this tends to look like. Of course, this does change with Shenandoah GC and G1GC, but it's approximately the same. We're going to have some nursery where the allocations are going to occur. We're going to have some intermediate area where we can just store things temporarily so that we can figure out if they're going to be long-lived or short-lived. Then we're going to have this tenured space where we're going to put the data that is going to hang around for a long period of time.
Each of these spaces contributes to a different set of problems, and all are going to affect GC overheads. That's one of the problems that we're going to have. When we actually do an allocation here, we're going to have this thing called a top of heap pointer. When we go to execute some code, what we're going to see is we are going to create a new Foo, a new Bar, and a new byte array. Essentially, what we're doing is we're just going to say, "Foo is this big" plus a header. We're going to grab that much space by doing a bump and run on this top of heap pointer. We just move the pointer forward and there is where our allocations occur.
This creates a hot pointer. What we've tended to do here is we said, "Instead of just getting enough memory for the single allocation, let's go out and get a large chunk of memory, and then I'm just going to do my allocations in this buffer. Then once I filled the buffer, then I'll go get another one." This is known as a thread-local allocation block. These things start off as being one megabyte in size, and they'll be adjusted depending upon some heuristics about how your thread is allocating or not allocating, as the case may be.
If you look at it this way, pretty much the same thing is happening. You have a TLAB and you have a TLAB pointer. Then as we allocate, and we have more threads allocating, you can see the Foo and the Bar are going to go into the TLAB. The byte array doesn't fit into a TLAB, so that goes into a global heap allocation. This does affect the speed of the allocation, so the Foo is going to go in quickly, the Bar is going to go in quickly, the byte array is going to take a bit longer. There are also some failure conditions here that you have to consider, like what happens when we get to the end of the buffer.
We're going to have a TLAB waste percentage, which says, "If I've allocated beyond that line, don't bother trying to use the rest of the buffer. Just go and get a new one." That adds a little bit of overhead, but that's not as bad as, "We're below the buffer and next allocation is going to basically give us a buffer overflow, which means now I have to go and do more work." It's the case of, "If we can predict something, that's going to be cheaper than if we have to react to something." It's always cheaper to predict than to react. This is a somewhat expensive failure path, it means, "Go get another TLAB and start allocating in the TLAB."
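Here's a toy model of that bump-pointer logic. This is not HotSpot's implementation, just the shape of the idea, with the size and the waste threshold picked purely for illustration:

```java
// Toy model of TLAB-style bump-pointer allocation; not HotSpot code.
public class TlabSketch {
    static final long TLAB_SIZE = 1 << 20;           // starts around 1 MB (illustrative)
    static final long WASTE_LIMIT = TLAB_SIZE / 64;  // illustrative waste threshold

    long top;   // next free position in the current TLAB
    long end;   // end of the current TLAB

    long allocate(long size) {
        if (top + size <= end) {
            long address = top;   // fast path: bump the pointer
            top += size;
            return address;
        }
        if (end - top < WASTE_LIMIT) {
            refillTlab();         // waste the little that's left, grab a fresh TLAB
            return allocate(size);
        }
        return allocateInSharedEden(size);  // slow path: go outside the TLAB
    }

    void refillTlab() {
        top = requestChunkFromEden(TLAB_SIZE);
        end = top + TLAB_SIZE;
    }

    // Placeholders standing in for the shared, more expensive allocation paths.
    long requestChunkFromEden(long size) { return 0; }
    long allocateInSharedEden(long size) { return 0; }
}
```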
When we get into the generational collectors - probably not as much in use anymore as people have moved on to G1, and hopefully we'll leave all of these older collectors behind - everything tends to happen in the free list. The allocation is going to happen from the free list, and recovery is going to happen to the free list, and there's going to be a lot of free list management here. I don't want to go too deeply in it, but essentially the garbage collector threads, when they copy data into tenured space from the young generational space, they're going to use this thing called PLABs which effectively work the same way as TLABs. So they're going to have multiple threads acting in the space by basically segregating the space up between the threads.
The problem with the free list maintenance that you have to do when you're garbage collecting is that it increases the cost of the allocation and the memory management by a factor of approximately 10X. You don't really want to be working in this space too frequently, which is why we have a young generational space.
Problems
Now, we get into problems here that we see when we have high memory churn rates. We're going to get many temporary objects, and that's going to quickly fill Eden. That's going to increase the frequency of the young GC cycles. It actually has this really strange effect, if you look at data and you ask, "What is the age of data?", the age of data right now is calculated by the number of GC cycles it survives. If your GC cycle frequency is increased, all of a sudden, the aging of your data has increased. Now, you have this other problem, which is almost orthogonal, basically, your application is going to create a piece of data and it's going to hold on to that piece of data for a period of time - wall clock time.
You see where we're going with this, I'm still probably going to hold on to the data for approximately the same amount of time, but since the GC frequency is increased, all of a sudden, that data which may have been garbage collected in young generational space is now going to end up in tenured space more frequently. You can see it has this whole funny effect, strange effect, on what your garbage collection overheads are going to look like. You're going to get premature promotion as another problem, because the data is aging too quickly. Also, you're creating more of it, so you're going to fill the buffers. You have to clean them out more, which means more data moves into tenured. You get more activity up into the expensive spaces to maintain. You get all of that added cost, plus you get all the copy costs and everything like that, and also increased heap fragmentation.
You get all of these funny downstream costs that you don't even think about. In terms of the allocation, it's still quick. If the objects die very quickly, there's zero cost to collect them, so that's true. That's what garbage collection people have been telling you all the time, "Go, don't worry about it. Just create objects. It's free to collect them." It may be free to collect them, but quick times a large number does equal slow. If you have high creation rates, it's not free to create. It may be free to collect, but it's not free to create at the higher rate.
We actually have this curve here, and we've worked this out by looking at a lot of systems over time. I have these bands here, the red band is approximately one gigabyte per second. Think of it this way, I'm consuming memory at approximately 100 bytes at a time. If I can reduce that allocation rate, you can see that we have all kinds of benefits. First off, we get rid of the allocation costs altogether, and we basically reduce the costs from garbage collection. These are the bands that we'd like to hit. Ideally, I'd like to be below 300 megabytes per second. If I'm anywhere below a gig, though, I'm probably happy. I'm not going to really spend much time in this area. If I'm above a gig, I know that I can get significant improvements in performance just by looking for allocation hotspots and trying to improve the memory efficiency, or reduce the memory complexity, of this particular application.
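If you want a quick, rough read on your own churn without doing a full GC log analysis, HotSpot exposes per-thread allocation counters through the com.sun.management extension of ThreadMXBean. A minimal, HotSpot-specific sketch - the workload here is just a stand-in:

```java
import java.lang.management.ManagementFactory;

public class AllocationRateProbe {
    public static void main(String[] args) {
        // HotSpot-specific: the platform ThreadMXBean also implements the
        // com.sun.management extension with per-thread allocation counters.
        com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();

        long bytesBefore = threads.getThreadAllocatedBytes(tid);
        long startNanos = System.nanoTime();

        // Stand-in workload: lots of ~100 byte allocations.
        byte[][] churn = new byte[100_000][];
        for (int i = 0; i < churn.length; i++) {
            churn[i] = new byte[100];
        }

        double seconds = (System.nanoTime() - startNanos) / 1e9;
        long allocated = threads.getThreadAllocatedBytes(tid) - bytesBefore;
        System.out.printf("~%.1f MB/s allocated on this thread%n",
                allocated / seconds / 1_000_000.0);
    }
}
```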
Another problem that we run into sometimes is just simply large live data set size. What's a live data set size? That's the amount of data that's consistently live in your heap. What we actually see is that with the high allocation rates, we're going to get inflated live data set sizes. There's also another reason we can have a large live data set size, and that's loitering. If you think of things like HTTP session objects, people attaching data to them, they hang around for 30 minutes. That's data that might not be semantically useful to the application, but it's just cached there and it's just sitting in memory, doing nothing. That's what I call loitering.
You have these things loitering around for longer, and they're also going to bring a lot of costs with them. You get inflated scan-for-roots times - that's where we need to find the things that are, by definition, live. When we start the garbage collection cycle, we need to do the scan for roots. Having all of this extra live data actually means that our scan-for-roots times can sometimes be the dominant cost in the collection. We get reduced page locality which comes with its own costs, inflated compaction times, increased copy times, and you're likely to have less space to copy to and from. If you've done a Windows defrag on a half-empty disk, it's easy, and when the disk gets full, it gets harder and takes much longer. Same type of problem here.
If you look at the pause time versus occupancy, the chart on the left has a lot of noise and stuff in it, but just look at the red dots and imagine the line that they form. That's the first thing we're looking at, that's the heap occupancy. The dots on the other chart are pause times. You can see the clear correlation between heap occupancy and pause times. This makes sense, the more live data I have, the more work the garbage collector has to do. Since the dead stuff is free and there's no free lunch, the cost has to come from someplace, and so it's coming from the live stuff in this case.
Of course, just for slight completeness, not complete completeness, an unstable live data set size, or what we call a memory leak, obviously leads to the bad result of having out of memory errors where the JVM will terminate, if you're lucky. Does anyone know the definition of an out of memory error? It's more than 98% of recent execution time spent in garbage collection with less than 2% of the heap recovered. You can end up in these weird conditions someplace where your JVM is just chugging along really slowly. You don't get an out of memory error, and you don't get the out of memory error because maybe you're collecting 3% of heap, not 2%. In some cases, we just said, "It's just better to throw the out of memory error, so we'll just go and tweak the definition." And we'll just say, "If you don't get 5% back, then we're going to throw the out of memory error." You can do that, you can even have fun with your colleagues.
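The knobs being joked about here are HotSpot's GC overhead limit flags. As far as I know they look like the following for the throughput and CMS collectors - treat the exact names and defaults as something to verify against your own JVM version:

```
# Defaults, roughly: throw OOME when >98% of time is in GC and <2% of heap is recovered.
java -XX:+UseGCOverheadLimit -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2 ...

# "Have fun with your colleagues": fail earlier by demanding 5% of the heap back.
java -XX:GCHeapFreeLimit=5 ...
```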
Eventually, you run out of heap space and stuff like that, and that's a completely different talk, to talk about how to diagnose and stuff like that, so we're really not going to cover much of that here. I'll just mention that's another issue. The loitering actually gets you closer to these limits. If you have data that you just can't get rid of in heap, then it's going to fill the heap up, obviously, and then you're just going to have less space to work with. Getting rid of some of these reductions, reducing these cached objects and stuff like that, can have some really huge benefits.
Escape Analysis
Now I'm going to talk about this little thing about escape analysis, mostly because it gets important when we start looking for hot allocation sites. Does anyone have an idea of what escape analysis is?
Participant 6: The lifetime of variables.
Pepperdine: Lifetime, yes. We'll just go over this very quickly. Escape analysis is a test I'm going to apply to a piece of data, and I'm going to say, "What is the visibility of this data? How is it scoped?" If it's scoped locally, that means that only the thread that created it can actually see it. If I happen to pass that data off into a method call, then that's going to be called a partial escape. Of course, a full escape is if I'm in a situation where the data is scoped in a way that multiple threads can actually see it. Static is an obvious example; if I declare things static, then I'm in a situation where multiple threads have visibility to this particular piece of data.
The partial escape is the interesting one, because you might think, "Too bad. I just pass it off as a method call into this other thing, and it comes back, and it's still all local." Unfortunately, the JIT compiler hasn't really figured that out. If you want to go see a better escape analysis, then I would suggest you go see the Graal talk. Graal actually goes the extra step and said, "If it's only one method down, and it looks like it's passed the visibility test, I'm going to say there's no escape here." The point is, in order for me to apply an optimization to how a variable is allocated, I have to know what its scoping is, I have to know what its thread visibility is. Escape analysis is going to give me that information. As I said, this will become important when we actually look at finding the hotspots and doing the diagnostics to figure out what's going on.
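To make the three cases concrete, here is a small sketch - the class and field names are purely illustrative:

```java
public class EscapeExamples {
    static Point lastSeen;              // anything stored here fully escapes

    int noEscape(int x, int y) {
        Point p = new Point(x, y);      // visible only to this thread, this method:
        return p.x + p.y;               // a candidate for elimination/scalar replacement
    }

    int partialEscape(int x, int y) {
        Point p = new Point(x, y);      // passed down a call: a partial escape,
        return taxiLength(p);           // typically rescued only if the callee is inlined
    }

    void fullEscape(int x, int y) {
        lastSeen = new Point(x, y);     // visible to other threads: stays on the heap
    }

    private int taxiLength(Point p) {
        return Math.abs(p.x) + Math.abs(p.y);
    }

    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }
}
```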
Demo
Let's have some fun with demos. I have this application here, it's just a funny little MasterMind, kid's game, you might have played it. Instead of using colors, I use numbers. Indirectly I gave it three numbers, and I'm asking the program to guess what three numbers I gave it. What's happening along the way is that I'm scoring each of the guesses, and as I score the guess, the computer will use that information to refine its guess. If you played the MasterMind game as a kid, then you know it's a color-guessing game, so the partner you're playing with has to score your guesses to give you more information to refine your guess. It's the same idea here.
In this case, I gave it the answer 0, 1, 50,000, and 3, which runs in approximately 12 seconds. The question is, what's the bottleneck in this application now? What's the thing that's preventing us from finding the answer faster? If you look at the GC log, there's not a lot of data here. I don't want to go through a whole analysis here, but I'm going to look at the GC log. Look at this, application throughput 98.9%, that means GC is taking 1.1% of the time. You might look at this and say, "The garbage collector is not really an issue here." You can look at the overall pause times and the pause times are all well below 10 milliseconds. If I was to tune all of the GC out of this problem, I would save 22.6 milliseconds. Not a lot of time.
The point is, there's not really an indication here that we have a problem until we actually get into allocation rates. If we look at the allocation rates and you use the guideline that I gave before, the magic numbers, you can see that we have this downward trend, so as things warm up, there's something happening here, but essentially we settle out at about 2.5 gigs per second. That's what I call a hot allocating application. The question is, what are we going to do to find it?
We could use some sort of memory profiler, so let's just bring that to bear. Here's our MasterMindGUI. I'm going to go over to this funky little profiler here, I'm going to go and check the filters and all the settings. There's a reason why you want to check the filters on your profilers. Do you know what the best use of the filter is? Filter the bottleneck out of your profile. If you want to say, "There's no bottlenecks in my code," make sure you set your filter sets properly. Make sure you understand how your profiler works, make sure that you understand what your filter sets are, what's going to be profiled, what's not, and everything like that.
Since I'm clueless, I'm going to set my filter set up in the Clueless mode. If you're not clueless, then you can take a chance and set it up in the Not Clueless mode if you like, but I'm just going to say I'm clueless. Then it's really easy, this is just profiling, which is really boring. Now we've got all these columns and numbers, and it just looks like a bad pizza night. How are we going to make sense of this?
Hot allocation rates - rates is frequency - so I want a frequency measurement out of this bloody thing. What I'm going to do is I'm going to take allocated objects, that's my frequency measure. I'm just going to focus on that one. I don't care about size, I do care about frequency. I could allocate 100 bytes, I could allocate a megabyte, it's the same amount of time. This is not a size issue, it really is frequency. Now we can go back to our application. This is a good thing about performance tuning, you don't have to do a lot. It looks like we’ve got a runaway train here on Score. Let's take a snapshot and dig in. There's our Score constructor, Board.isPossibleSolution.
Let's take a look at the Score object and see what's going on here. You can see it's got a couple of ints and it's got this jackpot boolean thing. In other words, we won. It all looks pretty simple, we just go over to this Board thing again. We look at it and say, "Wait a second. Something's going on here if I look at this code." You probably have to do something like a score.reset here. There you go, we got rid of the allocation, but have we gotten rid of the allocation here fully? Or did we actually need to do that? I think this is a better question. Let's answer that question by applying an escape analysis test on the original code. How many threads can actually see the Score object?
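We don't have the demo source in front of us, but the shape of the code and of the hand edit being described is roughly this - a hypothetical reconstruction, with the real scoring logic reduced to a trivial comparison:

```java
// Hypothetical reconstruction of the demo code, not the actual MasterMind source.
class Score {
    int exactMatches;
    int partialMatches;   // the real scoring would use this too
    boolean jackpot;

    void reset() {
        exactMatches = 0;
        partialMatches = 0;
        jackpot = false;
    }
}

class Board {
    // Original shape: a fresh Score per candidate guess, created and used
    // entirely inside this method.
    boolean isPossibleSolution(int[] guess, int[] candidate) {
        Score score = new Score();      // the allocation the profiler flags as hot
        for (int i = 0; i < guess.length; i++) {
            if (guess[i] == candidate[i]) {
                score.exactMatches++;
            }
        }
        score.jackpot = (score.exactMatches == guess.length);
        return score.jackpot;
    }

    // Hand-edited shape: reuse a single Score via reset() instead of allocating.
    private final Score scratch = new Score();

    boolean isPossibleSolutionReusingScore(int[] guess, int[] candidate) {
        scratch.reset();
        for (int i = 0; i < guess.length; i++) {
            if (guess[i] == candidate[i]) {
                scratch.exactMatches++;
            }
        }
        scratch.jackpot = (scratch.exactMatches == guess.length);
        return scratch.jackpot;
    }
}
```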
Participant 7: One.
Pepperdine: One. Is it being passed to another method? No. That seems to be the condition where it passes escape analysis. So, that allocation should actually be eliminated by the compiler because it doesn't escape. It's one of these small objects, two ints and a boolean, which means that we can actually do an on-stack allocation of this really quickly. The question is, why is it showing up in the profiler then? Maybe it's not being eliminated?
I'll bring another tool to bear on this one, which is called JITWatch. It was written by Chris Newland, it's an awesome program. He's got all kinds of material on the web describing it. We don't have enough time to redo what he's done, it's really awesome. Essentially, what I did is I told the JIT compilers to tell me what they're doing, we load that up in this tool, and one of the things it does is it picks out allocations. If I go to com.kodework.mastermind, JITWatch is telling me that the Score has actually been eliminated, that allocation site has been eliminated from the code. I can view the bytecode, and we can go through it.
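For reference, "telling the JIT compilers to tell me what they're doing" usually means producing a compilation log that JITWatch can load; on JDK 8 that looks something like the following (the main class name is a guess, and newer JDKs move some of these to -Xlog, so check the JITWatch documentation for the exact set):

```
java -XX:+UnlockDiagnosticVMOptions \
     -XX:+TraceClassLoading \
     -XX:+LogCompilation \
     -XX:LogFile=mastermind_compilation.log \
     com.kodework.mastermind.MasterMindGUI
```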
You can see it's saying, "There's our allocation site. It's been eliminated, it doesn't exist anymore." Why is the profiler telling us that we have this hot allocation site when the JIT compiler is telling us that we don't have it? This is one of the ways that profilers actually lie to you, they go in and they disturb the runtime. What happens is that the profiler is going to do bytecode injection, and it's going to wrap the allocation site in a method call so that it can track it, and it'll probably put it into something like a phantom reference, which disturbs the runtime even more. The point is, I can adjust the code and I can run it, and you're going to see no difference in the runtime because that allocation site is already going to be eliminated by the optimizers, by the JIT compiler.
Participant 8: The profiler will tell you it doesn't know that.
Pepperdine: Actually, no, it won't. I can run it and show you, but the point is the profiler is picking up this allocation site because it does a partial escape. When it does a partial escape, it also exposes it to another thread. Now, the effect of profiling has changed the runtime, so the optimizers have to react differently to this particular situation. Therefore, this is lying to you. If you want to know the real hot allocation site from the profile, it's actually not here. It's actually going to be this object which is going to be wrapped in this little int array down here.
The long and short of it is that when we actually do this work - we've recognized, "Ok, we've got this problem," and we come around to actually fix it - we also have to understand how our tools are interfering with our ability to understand what's happening in the runtime, so that when we go around to make the fixes, we're fixing the right stuff. Given the deployment cycles to get this back out into production, that's where you're actually going to see the real picture - you're not going to see it in your test environment. Forget that, that's not happening. Otherwise, we spin around and see what's happening in the real environment, and it's, "No difference," and you're left there scratching your head going, "Why?"
Questions and Answers
Participant 9: Can you explain how you get to the point when you know that it's the integer array?
Pepperdine: If you look at allocated objects and stuff like that, we're probably going to go after this object array, and we'll just do a study on it. I'm just going to go top to bottom down the list and try to figure out basically what this thing is involved with. Is this allocation actually being eliminated or not? And just answering those types of questions. We can probably take a look at what this is involved with already in our snapshot.
This looks like an artifact of profiling to me. You can't hide from the profiler in this case. That's what that actually looks like. If we get into the next one, indexing, we can see that now we're back into application code, and we can actually take a look at what this particular piece of logic does to see what we can do to change it. In this case, I think I was using BigInteger, which is absolutely horrible. If you do the math, you can see that I could just replace this with a primitive long, and of course, you're going to get the corresponding huge performance improvements when you make that change.
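The shape of that change, as a hedged sketch - the actual indexing code isn't shown here, so this is just the general pattern of swapping BigInteger arithmetic for a primitive long:

```java
import java.math.BigInteger;

public class Indexing {
    // Before: every step of the loop allocates new BigInteger objects.
    static BigInteger indexOfBig(int[] digits, int radix) {
        BigInteger base = BigInteger.valueOf(radix);
        BigInteger index = BigInteger.ZERO;
        for (int digit : digits) {
            index = index.multiply(base).add(BigInteger.valueOf(digit));
        }
        return index;
    }

    // After: the same arithmetic in a primitive long, allocation-free,
    // assuming the result fits in 63 bits.
    static long indexOfLong(int[] digits, int radix) {
        long index = 0;
        for (int digit : digits) {
            index = index * radix + digit;
        }
        return index;
    }
}
```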
Participant 10: You gave an absolute number for garbage collection.
Pepperdine: The allocation rates?
Participant 10: Yes. If you're talking about something like Spark, and very large machines, do you have any comments on how big a JVM should run?
Pepperdine: No, I don't. These numbers seem to hold no matter what piece of hardware we're running on. We've actually used some machine learning, or AI, or whatever you want to call it. That's not my specialty, I have a guy who has actually been doing that for us. He's validated these numbers running a whole bunch of different experiments, and this is just numbers that we've pulled out of production environments in general.
Participant 10: It doesn't matter if you have a 32 CPU, or...?
Pepperdine: No. Actually, I think the message from the keynote and what Brian [Goetz] said today is that with CPUs, it doesn't really matter if we make them faster or not. They're saying that they can't make them faster, that's not the issue. The issue is that if I need to go fetch data from memory, I have stalled the processor for up to 300 clocks to wait for that particular data. A typical application is going to go 40 clocks active, 300 clocks wait, 40 clocks active, 300 clocks wait, and so on - something along those lines. If you think about where we're going to get the biggest performance gains going forward, it's filling in that 300 clocks.
Now, we used to have speculative execution, but we had a conversation about that. The side-channel attacks and everything are slowly wiping that off the table. The gap between CPU speed and memory speed has been increasing at about 8% per year. Some problems are only just getting worse over time. Fortunately, as we've been told, that should tail off, and hopefully we can start focusing on the peripherals, the things that are much slower than the CPU. We can fill in the gaps naturally by having faster memory, faster buses, and things like that. We can just feed the CPU. I think that's where we need to focus our energies going forward. Then these problems become less apparent, I think. We didn't see this problem so much 10, 15 years ago, mostly because the difference in speed between CPUs and memory was not as great. You just didn't pick it up as much, but over the years it's just become a bigger problem.
Participant 11: When I think about one of my applications, I try to make it where the application state, the objects, are immutable. Then when we receive some sort of event that requires us to mutate the state, we copy-on-write. We're necessarily producing a lot of garbage, and we do a lot of allocations there. How do you reconcile that with the desire for immutability?
Pepperdine: Mutable shared state is evil. Immutability is evil, as you can see. Really, you need to find some place where you have mutable state that's not shared. That'd be a sweet spot, wouldn't it? If you make the design decision to automatically make everything immutable for me, then you've handcuffed me. I've got two choices. I can either not use your stuff, or I'm just going to take the memory penalty for using it. That's really the only choice I have.
If you make the state mutable but not shareable, then I think you are probably in a better world. That just comes from better design, and understanding how to design things so that instead of exposing mutable state, we're actually containing it properly so that we can actually control things like actor models. We use Vert.x in our projects, Vert.x is great for doing this. We maintain stateless parsers, imagine that as an example. We don't have to worry about it.