Bio Monica Beckwith is a Java Performance Consultant. Her past experience includes working with Oracle/Sun and AMD, optimizing the JVM for server-class systems. Monica was voted a Rock Star speaker at JavaOne 2013 and was the performance lead for the Garbage First Garbage Collector (G1 GC). You can follow Monica on Twitter @mon_beck
Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
Yes. Hi. I am Monica and I am a Java, JVM and GC performance consultant. I have worked with AMD, Sun and Oracle. I started as a compiler engineer, moved on to JIT compilers, did some JVM heuristic optimizations and then finally moved on to GC. I have worked mostly with server-class GCs, but I have also done some client work in the past.
Robert: I believe you were the performance lead for a new Garbage Collector in the JVM. Talk about that a little bit.
Yes. When I was working for Oracle, I was the performance lead for the Garbage-First Garbage Collector. This was not a new collector to us, G1 GC had been in the works for a long time, but my job was to figure out certain heuristics. For example, when you are running it with, say, Fusion Middleware applications: how can we make sure that running with minimal options, for example just a heap size and a pause time goal, gets G1 GC to work and meet the pause time SLAs? So things like the initial and maximum nursery bounds were part of the performance work that I did, along with learning about mixed garbage collections and educating people on them: how they are different and what the tuning knobs are. The aim was first to understand it myself and then to get at least the default values to be good enough that it works for Fusion Middleware and for applications that may require even more than 4GB. So when I was working for Oracle, I worked with heaps from as low as 2GB up to terabytes. The whole goal was to learn and educate and get the defaults right.
It is a good question. First of all, it has a lot to do with the application. Working on a benchmark is different from working on a customer application. With a benchmark, I already know what the benchmark is about and which metrics I am targeting. In the case of a customer application, I need the customer to talk to me. So more often than not we start with simple questions and exchanges about what we need: what is the worst pause time that you cannot exceed at all, what are your SLAs? The first thing you want is a defined SLA for your system, for the performance work that you want to do. So those are the things I start with, simple questions. I do not try tuning right off the bat. Now that they have said what they want, let us just run a baseline and see where we are at, and then we can look at the logs, do application profiles and a lot of other things. But getting a baseline to see what we want and where we are now, that is the way the work starts.
Very good question. How much can we get from tuning what we have rather than moving away from it, knowing that the other collector is going to get us there? So when do you give up? Is that what you are saying? For that, you need some basic knowledge of the different collectors out there. For example, CMS (the mostly concurrent mark-sweep collector) is prone to fragmentation, and it can also end up with a concurrent mode failure. You know these things about the concurrent collector. If the application's mutation rate is so high that CMS cannot keep up, you will get these concurrent mode failures, and the other part of it is fragmentation. So yes, you can do so many things. People have probably worked with CMS for such a long time that they know these bounds and they have tuned the application so that it can work within them. But then things change. Maybe you are moving your application to a new system. There are so many things that could change, and now you go back to tuning CMS and you realize you hit those issues, you know you have reached your maximum bound and that you need to move. Of course, depending on your SLAs, you can choose a different GC. Many people who want to move away from CMS go directly to G1, which is a good thing. But the way I like to think about it, and I am not talking about low latency applications here, is that when people want to move they should give the throughput collector a try first. That way, especially if you have smaller heap sizes, you know how much time the monolithic collection of the old generation is going to take for your application. With G1 GC the old generation collection, which is called mixed collection, is incremental, so if you look at the pause distribution you would see a plateau, maybe from the 99.9th percentile onwards.
So G1 GC is more stable when it comes to the pause distributions because it has incremental compaction.
4. Just now you mentioned pause distributions, a moment ago you talked about worst case pause – when you are working with customers and they have requirements, what type of pause requirements do customers typically express to you?
For the customers that I worked with, when I was at Oracle or even now as a consultant, the worst case falls around two seconds. Some of them are OK with 5 seconds, some of them request less than 1 second, but most of them settle on a 2-second worst case.
That is the education part of it. We tell people what they are interested in. They tell me that their worst case is 2 seconds and then I go study the workload, because I cannot just say “OK. Your worst case is 2 seconds. I am going to give you a 5-9s of x.” I have to look at the application to be able to say that. So that is my job. When I study the application by looking at the GC logs, I understand the application patterns a lot better and I can tell that I can tune xyz, and when I do go ahead and carry out the tunings, I provide everybody with the 5-9s (the 99.999th percentile), the worst case and all these metrics. What I also like to do is a comparison. You were asking me earlier how you move and when you decide to move to a different GC. In most of my work I work with all three GCs and I provide customers with a comparison. Take the G1 GC case: the difference between your worst case and your 5-9s is not that much. When you look at CMS, the difference is large, because if CMS has a concurrent mode failure or a full GC for compaction purposes, your worst case is very different from your 5-9s. So I always like to provide that information to the customer so that when they are selecting the GC algorithm they need, they do not just look at the worst case or the average, they also get to see how the application, and thus the end user, sees the application's responsiveness. So they get to choose which one they want, and how they feel about a particular garbage collector, based on the information that I provide them.
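To make the gap between a high percentile and the worst case concrete, here is a minimal Java sketch that computes nearest-rank percentiles over a list of pause times. The class name `PauseStats` and the sample data are hypothetical and made up for illustration; they do not come from the interview or from any real GC tool.

```java
import java.util.Arrays;

// Hypothetical helper, not part of any GC tool: summarizes pause times taken from a GC log.
final class PauseStats {

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static double percentile(double[] pausesMs, double p) {
        if (pausesMs.length == 0 || p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = pausesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up data: nine short pauses and one long outlier (for example a full GC).
        double[] pauses = {100, 100, 100, 100, 100, 100, 100, 100, 100, 2000};
        System.out.println("90th percentile: " + percentile(pauses, 90.0) + " ms");
        System.out.println("worst case:      " + percentile(pauses, 100.0) + " ms");
    }
}
```

With a single long outlier in the data, the worst case sits far above the 90th percentile, which is the kind of difference a side-by-side collector comparison is meant to expose.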
It is both actually. Of the ones that I have worked with, there is the Deterministic GC from JRockit, but I have not worked with it for a long time now. The one that I work with now, G1 GC, does have this tuning knob, MaxGCPauseMillis, with a default of 200 milliseconds. But it is a soft real-time goal. Like you said, it depends on various other things. Basically, it depends on the application patterns. There is prediction logic inside G1 GC that predicts how much time the next pause is going to take. For example, when you are doing a mixed collection, which happens after concurrent marking, you have the liveness information. There are various other tuning knobs for mixed collections, but in the simplest form, if the prediction logic thinks that you still have room within your soft pause time goal, it will add more old generation regions to its collection set. So yes, it is both, like I said, because it is not a hard real-time goal.
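The idea of adding old regions while the predicted pause still fits the goal can be sketched as a simple greedy loop. This is a toy model under stated assumptions, not G1's actual implementation: the `Region` type, the per-region cost predictions and the ordering by reclaimable bytes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: the names and the cost model are assumptions, not HotSpot code.
final class MixedCollectionSketch {

    static final class Region {
        final int id;
        final long reclaimableBytes;   // garbage found by concurrent marking
        final double predictedCopyMs;  // predicted cost of evacuating this region

        Region(int id, long reclaimableBytes, double predictedCopyMs) {
            this.id = id;
            this.reclaimableBytes = reclaimableBytes;
            this.predictedCopyMs = predictedCopyMs;
        }
    }

    /** Adds old regions, most garbage first, while the predicted pause stays within the goal. */
    static List<Region> chooseOldRegions(List<Region> candidates,
                                         double youngPredictedMs,
                                         double pauseGoalMs) {
        List<Region> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingLong((Region r) -> r.reclaimableBytes).reversed());

        List<Region> chosen = new ArrayList<>();
        double predicted = youngPredictedMs; // the young regions are always collected
        for (Region r : sorted) {
            if (predicted + r.predictedCopyMs > pauseGoalMs) {
                break; // no room left under the soft pause goal
            }
            chosen.add(r);
            predicted += r.predictedCopyMs;
        }
        return chosen;
    }
}
```

Because the goal is soft, a sketch like this only bounds the predicted pause. The actual pause can still exceed the goal when the predictions are off.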
Right. There are different GC algorithms and some of them are moving collectors. All HotSpot GCs are generational. So when I am talking about the young generation collector, it is the collector that is used for the young generation: in the parallel throughput collector it is Parallel Scavenge, in CMS it is ParNew. These are parallel collectors and they are copying collectors. That means objects have to be moved. When you are doing that, you do not want the mutators, which are the application threads, to be working at the same time. There is also another reason why you need to stop the mutator threads: when you are trying to figure out what is live, you are going to look at the root set, and that includes the thread stacks. Each thread has its own stack, so you have to go look at the roots there. So you need the threads to be at a safepoint when you are doing that, or the threads can help you themselves. But then again, for that particular moment, the threads have to stop while you are looking at their root set. The way HotSpot works is that it brings all threads to a safepoint for its copying activities, for looking at the root set and for other things. In the case of C4, in the Zing VM from Azul, it only does what is called a checkpoint, I think. Each thread goes through a checkpoint because the collector wants to look at that thread's stack. Once that is done, the thread moves on. So yes, you need all the threads to come to a checkpoint, but once they have visited the checkpoint, they can go and do their work. It does not have to be a global safepoint, it is individual threads coming to a checkpoint. So different GC algorithms have different ways of handling stopping the threads.
8. I have been to some talks at QCon this week in which server implementers have discussed the choice of language. Some are advocating non-garbage collected for server applications that need to run in low latency or predictable latency environments. What is your opinion about that debate? Do you think people choosing C or C++ are going in the right direction?
That is their choice. I know certain things about people writing low latency applications. The way they work with a Java application is that it is not used to its full potential, so they restrict it: they restrict allocations, they restrict garbage creation, they restrict the use of certain APIs, they keep their own pools outside the heap area. Even though it is a Java application, so to speak, there is so much restriction brought in, and rightfully so. You want to be able to control the amount, because when you are using HotSpot, it all comes back to the same stop-the-world pause. It is not just the full collection or the old generation collection. The young generation collections that I spoke about are stop-the-world too. And those pauses are linear in your live objects, right? So that is what scares people: not being able to get the response back on time. When a garbage collection pause happens, not being able to tell how long that pause is going to take, that is what they are scared of in low latency applications. So, yes, people have their own ways to work around it even with Java applications. If somebody mentioned they have been using C/C++, I can understand why they want to do that, because they probably want to use the system resources to the maximum. Remember, Java has the heap. With low latency applications, people do not want the pauses to be frequent. The frequency of pauses depends on the size of the heap, but the pause duration itself depends on your live data. So you can be in a situation where you grow the heap, the young generation, but your live data is growing as well. Then, even though the pauses are far apart, each pause may take a longer time. So if you are that into micro-managing your allocations and your deallocations and stuff like that, then yes, maybe that would be a good option for you. But again, I know of lots of people using Java for their low latency applications.
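The relationship between heap size, pause frequency and pause duration can be written down as a first-order model. The Java sketch below is only back-of-the-envelope arithmetic under simplifying assumptions (a constant allocation rate and a constant copy rate); the class and parameter names are hypothetical.

```java
// First-order model only; real pause times depend on many more factors.
final class YoungGenModel {

    /** Time between young collections: the young generation fills at the allocation rate. */
    static double secondsBetweenPauses(long youngGenBytes, long allocBytesPerSec) {
        return (double) youngGenBytes / allocBytesPerSec;
    }

    /** Rough pause estimate: copying cost scales with the bytes that are still live. */
    static double pauseMillis(long liveBytes, long copyBytesPerMs) {
        return (double) liveBytes / copyBytesPerMs;
    }

    public static void main(String[] args) {
        long mib = 1L << 20;
        // Doubling the young generation doubles the time between pauses...
        System.out.println(secondsBetweenPauses(1024 * mib, 256 * mib) + " s between pauses");
        System.out.println(secondsBetweenPauses(2048 * mib, 256 * mib) + " s between pauses");
        // ...but if the live data doubles as well, each pause takes about twice as long.
        System.out.println(pauseMillis(100 * mib, mib) + " ms per pause");
        System.out.println(pauseMillis(200 * mib, mib) + " ms per pause");
    }
}
```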
9. If I understand your response, if you have a very strict concern about pause duration and you choose to program in Java, you would adopt certain programming techniques to manage memory in such a way that the garbage collector will not see it?
Yes, sometimes people do that. It is true.
OK. Yes. We are talking about modern collectors, the garbage collectors available right now. I know about the HotSpot GCs. I am going to go back to the young collection pause because that is easier to explain. I can talk about the old generation, but it is different for each collector, so it is simpler with the young generation. When you have a young generation collection pause, there are multiple parallel GC threads employed. These are called the GC workers and they do work stealing, they do everything that you would do with respect to work queues and stuff like that. But not all work is parallel or parallelizable. Some of it is done serially. For example, with G1 GC, the main part of the parallel work would be something like object copying, which takes up most of the time. Then there is the external root scanning and all those things, but after the parallel work is done, there is some serial work, which means it has to happen one step after the other. But within the serial work, you can have multiple threads as well. So, yes, Amdahl's law applies because of the nature of the work that has to be done. But sometimes you can have more threads doing that serial work as well. Sometimes you cannot, but sometimes you can. Does that answer your question?
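Amdahl's law puts a number on this: if a fraction s of the pause is serial, the speedup from n GC worker threads is 1 / (s + (1 - s) / n), which can never exceed 1 / s. A minimal Java sketch follows; the serial fractions used are illustrative values, not measurements of any collector.

```java
// Amdahl's law applied to a GC pause; the serial fractions below are made up.
final class Amdahl {

    /** Speedup over a single thread when serialFraction of the work cannot be parallelized. */
    static double speedup(double serialFraction, int threads) {
        if (serialFraction < 0.0 || serialFraction > 1.0 || threads < 1) {
            throw new IllegalArgumentException("need 0 <= serialFraction <= 1 and threads >= 1");
        }
        return 1.0 / (serialFraction + (1.0 - serialFraction) / threads);
    }

    public static void main(String[] args) {
        // With 10% serial work, the speedup approaches but never reaches 10x.
        for (int threads : new int[] {1, 8, 16, 32, 48}) {
            System.out.println(threads + " threads -> " + speedup(0.10, threads) + "x");
        }
    }
}
```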
11. To some extent, if you have more cores, you can devote more of those cores to doing the garbage collection work so your system as a whole is able to do more work, but does Amdahl's law give any kind of hard limit? Can Java use 16 cores, 32 cores, 48 cores? Is there any practical guide to machine right sizing so you really cannot use all the cores when you have more than a certain number?
Well, it depends on you, as an application writer. Take simple benchmarks, for example: when you have a pause for doing GC work, you want the pause to be done as quickly as possible, so you want to employ as many threads as possible. But within the same pause there is some serial and some parallel work. Again, the goal of the GC work is to get done as soon as possible. There is a setting called ParallelGCThreads which you can set on the command line. Its default is calculated using, I think, a factor of 5/8, but I would have to go look it up. The default calculation does not use all the CPU cores, because you may need CPU cores for other, non-GC work. But again, as an end user you can override that default and give as many cores as you want to the parallel GC work.
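As a rough sketch of the 5/8 default mentioned above: to the best of my understanding, HotSpot has commonly used one GC thread per CPU up to 8 CPUs, and 5/8 of a thread for each additional CPU beyond that. Treat the formula below as an assumption rather than a specification, since it can vary by platform and JVM version; running `java -XX:+PrintFlagsFinal -version` shows the value a given JVM actually picks for `ParallelGCThreads`.

```java
// Assumed heuristic, not authoritative: verify against your JVM with -XX:+PrintFlagsFinal.
final class GcThreadDefaults {

    /** Estimated default for -XX:ParallelGCThreads given the number of available CPUs. */
    static int defaultParallelGcThreads(int cpus) {
        final int switchPoint = 8; // up to this many CPUs, use one GC thread per CPU
        if (cpus <= switchPoint) {
            return cpus;
        }
        return switchPoint + ((cpus - switchPoint) * 5) / 8; // integer division, as in the heuristic
    }

    public static void main(String[] args) {
        for (int cpus : new int[] {4, 8, 16, 32, 48}) {
            System.out.println(cpus + " CPUs -> " + defaultParallelGcThreads(cpus) + " GC threads");
        }
    }
}
```

Overriding the default is a matter of passing `-XX:ParallelGCThreads=<n>` on the command line.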
Now comes the point where I should introduce the old generation collection in the HotSpot collectors, the OpenJDK HotSpot collectors. When I was talking about the throughput collector, CMS and G1, I was concentrating on the young generation because it is easier to explain GC with respect to the young generation. But when you look at these collectors, the main differences arise in the way the old generation is handled. For example, in Parallel GC you can now employ parallel GC threads to do a monolithic mark-sweep-compact. CMS does not do compaction. It does in-place deallocation, so you could end up with fragmentation. Once fragmentation creeps in, there is a failsafe called the full GC, which is basically the one that is used for Serial GC, so it is a serial mark-sweep-compact. In G1, what has been introduced is incremental collection, and those collections are called mixed GCs. Whenever a certain threshold, called the marking threshold, is crossed, a marking cycle starts. During the marking you get the liveness information for each region in G1, and with that information you can collect incrementally so that the regions with the most garbage are collected first. Then you have thresholds saying “I do not want to collect my regions if they are x expensive” or “I do not mind some of my heap being wasted because I want the collections to happen faster”, because remember, it is stop-the-world. A mixed GC is also stop-the-world. So with this knowledge, you could almost always tell that G1 may be better suited for larger heaps, simply because of its incremental compaction and because you are able to control the heap waste percent, the liveness threshold and the marking threshold itself. So you know your live data size, you know your transient live data size, you know what your heap waste percent is. Armed with all this knowledge, you can adjust your, I call it “IHOP”, meaning initiating heap occupancy percent.
You can adjust that so that your marking cycles are started at the appropriate time and your collections happen. So, I guess I would suggest, if that is what you want to go with 244, give G1 a try and surely you will see that G1 does better than the throughput collector, of course, and maybe, in many cases, than CMS too. And actually, whenever I have worked with CMS, I have not found people working with such large heaps at all, because they are all aware of the drawbacks of CMS. So just the knowledge of these collectors helps you understand how much RAM you can work with, because remember, fragmentation is the problem with CMS. If your application or your ecosystem is built such that it can handle fragmentation, and some people's actually are, then you can work with larger heaps too, but fragmentation is still an issue with CMS.
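The three thresholds discussed in this answer correspond to the HotSpot flags `-XX:InitiatingHeapOccupancyPercent`, `-XX:G1MixedGCLiveThresholdPercent` and `-XX:G1HeapWastePercent`. The Java sketch below shows the kind of comparison each one expresses. It is a simplified model with hypothetical method names, and it assumes that all percentages are taken relative to the whole heap or the whole region; G1's real bookkeeping is more involved and has changed between JDK releases.

```java
// Simplified model of three G1 tuning thresholds; not HotSpot code.
final class G1ThresholdSketch {

    /** Marking threshold (IHOP): start a concurrent marking cycle once occupancy crosses it. */
    static boolean shouldStartMarking(long heapBytes, long occupiedBytes, int ihopPercent) {
        return occupiedBytes * 100 > heapBytes * ihopPercent;
    }

    /** Liveness threshold: regions with too much live data are too expensive to evacuate. */
    static boolean isMixedCandidate(long regionBytes, long liveBytes, int liveThresholdPercent) {
        return liveBytes * 100 < regionBytes * liveThresholdPercent;
    }

    /** Heap waste percent: stop mixed collections once the remaining garbage is tolerable. */
    static boolean keepCollecting(long heapBytes, long reclaimableBytes, int heapWastePercent) {
        return reclaimableBytes * 100 > heapBytes * heapWastePercent;
    }
}
```

Lowering the marking threshold starts marking cycles earlier, while raising the tolerated heap waste ends the sequence of mixed collections sooner, at the price of leaving more garbage in the heap.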
13. You mentioned looking at logs to understand garbage collection behavior. Are you mainly looking at log files, or are there other tools or good tooling available for understanding garbage collection behavior?
There are a lot of good tools out there. Personally, I write my own scripts, my own tools, because I do not work with just one version of Java. I have worked with Java 6, 7 and 8, and I work with different print options as well. So I look at the survivors, the thresholds, the reference processing information, different things. So the GC logs I get come with different information in them and I like to parse them the way I want to see the information. So yes, I do look at GC logs, because when I plot them, when I visualize them, I can see some patterns and some anti-patterns. But I also like to look at the logs to see what led to a particular thing that I am seeing, because that does not depend just on that particular moment, it depends on a couple of other pauses before it that build up to it. So I like to understand it, and I am very comfortable looking at logs. I have been looking at GC logs for more than 10 years now. In fact, the more information the logs provide, the better I like it, compared with having just verbose GC output, for example.
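A home-grown log script can start very small. The Java sketch below pulls pause durations out of lines that end in `<seconds> secs]`, the shape used by the simple `-verbose:gc` output of older HotSpot releases. The class name and the sample lines are made up for illustration, and richer formats (for example `-XX:+PrintGCDetails`, or the unified logging of later JDKs) would need their own patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mini-parser for simple "-verbose:gc" style lines; not a general GC log parser.
final class GcLogPauses {

    // Captures the decimal number directly before " secs]".
    private static final Pattern PAUSE = Pattern.compile("(\\d+\\.\\d+) secs\\]");

    /** Returns the pause of every matching line, converted to milliseconds. */
    static List<Double> pauseMillis(List<String> lines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) { // takes the first match on the line
                pauses.add(Double.parseDouble(m.group(1)) * 1000.0);
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the simple format.
        List<String> log = Arrays.asList(
                "[GC 325407K->83000K(776768K), 0.2300771 secs]",
                "[Full GC 267628K->83769K(776768K), 1.8479984 secs]",
                "a line without any pause information");
        System.out.println(pauseMillis(log));
    }
}
```

The extracted pauses can then be plotted or fed into percentile calculations to look for the patterns and build-ups described above.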
Robert: That was my final question. Monica Beckwith, thank you for talking to QCon SF.
All right. Thank you.