
Java 8 LTS to the Latest - a Performance & Responsiveness Perspective


Summary

Monica Beckwith and Anil Kumar discuss the major changes from Java JDK 8u LTS to the latest JDK 13, their impact on performance and responsiveness, and what has been backported to JDK 11u LTS.

Bio

Java Champion Monica Beckwith works at Microsoft. Prior to joining Microsoft, she was the JVM Performance Architect at Arm. Her past also includes leading Oracle's Garbage First (G1) Garbage Collector performance team. Anil Kumar works as a Performance Architect for Scripting Languages Runtimes at Intel. He is one of the earliest contributors to Java virtual machine GC, large pages, profiling, and more.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Kumar: Let me start the discussion. Monica [Beckwith] will talk about the customer part later. There were definitely three issues. One was related to the licensing terms; I think there is a talk later on that, and I'm not that involved in licensing, there are other experts. The second one was monitoring and observability, which were not great in 7 or 8 and have seen significant improvement; there are other talks about monitoring, so this track does not cover that either. My expertise comes more from the benchmarking area: performance and responsiveness.

I got this email from a developer a month and a half ago or so: "We are evaluating this latest Intel platform, I'm doing these runs, and I'm getting a variability of almost 40%, 50%, 60% from run to run," and I said, "No, I have not seen such things." You can see the two metrics he was talking about. One is the full capacity metric, when the system is fully utilized (only in the case where everything else fails and only a few systems are carrying the load), and there performance is very repeatable. "Where my production actually operates, around 30% to 40% utilization, I see this 60% to 70% variability from run to run." That started to give me some idea of what is going on here, because we don't see such things.

What Test Is Running?

That led me to ask that person the very first question: "What are you running?" This is a pretty large company. They said, "We have around 3,000 to 4,000 applications, some of them with a very small footprint." These are microservices that talk to each other, and some of them are very small, like two gig; they usually run in a two or three gig heap. Very few of them are very large, going up to a 100 gig heap. The problem is, he said, "I cannot take anything from my production environment and have confidence that it will run repeatably on these systems." Then, what can we run, a traditional benchmark? He was using a benchmark and said, "We run our systems in production, so we know the behavior, and we have compared it with the benchmark and they pretty much match, whether it is GC or CPU utilization or network I/O, in most situations. We have created this proxy and that's what we are running." That helps us, because we can run a similar proxy.

Deployment Environment

These are some of the components we ended up running in our environment. The traditional environment could be your app and the JVM; it could be running in a container, and when you launch a process from that container, you could have case number one. You launch the process and it goes to one of the sockets (this is a two-socket system). The process starts on, let's say, one socket and it gets local memory, so in number one you are running on a socket and getting local memory. In number two, you start your process, it starts on one socket, and it does not get its own local memory, it gets memory from the other socket. These are the traditional two-socket cases. In the third case, you launch the process on the second socket and it gets local memory there.

Now let me ask this question: between one and two, which one is going to have better performance? One, because of the local memory, and that part can cause significant variation in certain cases. It will not change the application's total throughput much, because we are talking about memory latency of 100 to 140 nanoseconds, but it can make a big difference in your response time, the latency. Anytime you are sensitive to latency, that path can make a big difference. That is one source of variation in responsiveness, why responsiveness could differ but not the total throughput.

The second part is not covered in that picture. Many of you may be using containers or virtual machines, and when you are setting the thread pool within your application, it depends on how many CPU threads that API call reports. If you're running on a large system, the answer would be all the cores; for example, this system is running 112 threads, so it could be 112. Or, if you're running within a VM and that VM only gives you 16 threads or so, the answer would be 16. In one case you would be setting your thread pool based on 112, in another on just 8 or 16.
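Not from the talk, but as a minimal illustration of the check being described: the same API call reports very different numbers depending on where the process lands, and anything that sizes its pools from it (including the fork/join common pool behind parallel streams) inherits that difference.

```java
import java.util.concurrent.ForkJoinPool;

public class CpuCountCheck {
    public static void main(String[] args) {
        // On a two-socket, 112-thread host this may print 112; inside a VM or a
        // CPU-limited container it may print 16, 8, or fewer. Note that JDK 8
        // only became container-aware in later updates (8u191+), which is itself
        // a source of host-to-host differences.
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("JVM sees " + cpus + " CPUs");

        // The common pool used by parallel streams sizes itself from that value.
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
    }
}
```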

As the app writer, you don't have control over this, but in the deployment scenario you have to look at it. We have seen with thread pools, whether based on fork/join or now parallel streams (I will talk a little bit about that later), that how you do the settings can have an effect on context switches and so on, and result in variability. Another part is that your guest environment also has policies. Some of them, like Docker, allow pinning; other VMs don't allow pinning, and a performance-critical thread moving from one socket to the other can have a significant impact.
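One hedged workaround (not something the speakers prescribe) is to pin the common pool's parallelism explicitly instead of letting it follow the detected CPU count, so the fan-out stays the same wherever the process lands:

```java
import java.util.concurrent.ForkJoinPool;

public class PinnedPoolSize {
    public static void main(String[] args) {
        // Must be set before the common pool is first touched; in practice it is
        // usually passed on the command line instead:
        //   java -Djava.util.concurrent.ForkJoinPool.common.parallelism=16 MyApp
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "16");

        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
    }
}
```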

The last part is the heap memory, the surprising one from the experiments with that person. What is happening is that there was around 256 gig of memory on that system, but typically a system keeps running for two, three, four days without being rebooted, so over time memory fragmentation happens. Even though you're requesting a 20 gig heap, you may not get 20 gig of heap. What we noticed is that sometimes you were getting 18 gig, another time only 8 gig. I'll show some data later.

Even though you give the parameter, "Give me this much heap," it depends on how much fragmentation there has been, and particularly on transparent large pages, which are enabled by default nowadays and which require contiguous chunks. Without large pages, the OS could actually give you that memory much more easily. There is a performance benefit of 10% to 15% from large pages, and that can make a difference on responsiveness as well as throughput. Earlier, you had to request large pages explicitly. Now, in these cases, large pages are on by default and they're transparent, you don't know.
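For reference, these are the kinds of switches involved; explicit large pages have to be reserved in the OS first, while transparent huge pages are a Linux kernel feature the JVM can also opt into for the heap. A hedged sketch of checking what the JVM actually settled on:

```sh
# Explicit large pages (needs hugepages pre-reserved by the OS):
java -XX:+UseLargePages -Xms20g -Xmx20g MyApp

# Ask HotSpot to use transparent huge pages for the heap (Linux):
java -XX:+UseTransparentHugePages -Xms20g -Xmx20g MyApp

# See which large-page flags actually ended up enabled:
java -XX:+UseLargePages -XX:+PrintFlagsFinal -version | grep -i LargePage
```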

I did not mean for today's talk to go over the details of how the process is launched, and so on. The interest was: can moving from JDK 8 to 11 help with reducing the variability? Because these parts are out of your control. As a developer, you're not thinking about where it is deployed or what the policies are, because the policies can change; the deployment might be one thing today, Docker tomorrow, or something else. What we wanted to see is, is the JDK doing anything, and could we make certain changes to help?

Agenda

Here are a couple of things we want to talk about. First, we want to give you some information about the big changes from 8 to 11. This talk is not about those changes themselves; I think you are well aware there are other talks about the API changes or thread pool changes, with data on them. What we wanted to show is how the different defaults change behavior when you actually run certain benchmarks and workloads. That's one part of what we're sharing.

We talk about throughput, responsiveness, variability, and some startup data that we want to share. There is a lot of variation in those areas, so Monica [Beckwith] looked at some of the explanations for why it behaves that way, which will help you understand.

JDK 8 LTS

Let's start with the new use cases. There have been changes in monitoring and observability, but we are not covering those parts. There are new use cases: containers, microservices, Function as a Service, polyglot programming. We will cover Function as a Service slightly, and the impact of JDK 11. Then, on the concurrency side, we are not covering Project Loom or value types, as we're not there yet. We'll talk about one of the benchmarks that uses fork/join and about the impact on the thread pool. The networking part is a benchmark using NIO, using Grizzly but not Netty; the impact is very similar but we are not going into detail on that part. Most of the data you will see from the benchmarks is throughput, responsiveness, variability, or startup time.

Let's start with why we picked JDK 8, not 7, not 6. This is a survey showing that a significant share of deployments are currently at 8, that's what they're evaluating, and they are thinking of moving to 11. Another part is, which workloads did we avoid? We did pick certain workloads, and what we found is that the JMH microbenchmarks in particular, which come with the JDK, have a lot of variability, almost 50% and some even bigger, so we are not using the JMH components even though they come with OpenJDK, and many times you might just look at data coming from OpenJDK. That has a lot of variability and we had to take that part out.

The next one is heap. I talked about it a bit. Even if you ask the process for a certain amount of heap, you don't have any guarantee, so you should always check your GC log output in the deployment environment to find out how much heap you are really getting. The data shown here is for SPECjvm 2008, which has different compute- and memory-bound components, and what we wanted to show is: just look at the 20 gig to 60 gig part. Some components, of course, won't differ if they are not allocating heavily and not affected by that, but there are many components in your environment that can be significantly affected by how much heap you actually get from run to run. That data was just to show that in your own cases, when you have a wide variety of them, you would see big differences.
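A hedged example of the kind of GC logging being suggested; the flag spelling changed between 8 and 9+ with unified logging, and the heap lines in the resulting log show what was actually obtained:

```sh
# JDK 8: classic GC logging flags
java -Xms20g -Xmx20g -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log MyApp

# JDK 9+ / 11: unified logging replaces the old Print* flags
java -Xms20g -Xmx20g -Xlog:gc*:file=gc.log MyApp

# Or ask a running JVM directly (recent JDKs):
jcmd <pid> GC.heap_info
```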

JDK 8 LTS vs 11 LTS vs 12 vs 13

Now, let's look at throughput and performance. SPECjvm 2008 has almost 13 or 14 components. From JDK 8 to 11 and even up to 13, there are almost no cases where performance goes down; there's only improvement. By default, if you just move from 8 to 11 to 12 or 13, then other than very rare cases, you should see improvement just by default.

Now let's talk a little bit more about performance, responsiveness, and variability, because you can find many benchmarks to measure throughput. For performance and variability (Monica [Beckwith] was also part of this, as she and I work in the same committee), we created this benchmark to capture all three components. The benchmark is SPECjbb 2015. In its beginning phase it actually does a binary search, which means it loads the system aggressively: low value, high value, low value, high value. This is very similar to a production environment where you might see surges in the throughput or the load coming in.

At that point, it determines your settled load value, so it tries to give you a rough approximation of what kind of load you can handle when requests come in variably or in bursts. The second part is called the response-throughput curve, where it slowly increases the load and keeps measuring your response time, the 99th percentile response time. From that, it keeps loading the system and you get two metrics. One is max-jOPS, which is the full system capacity: what happens if your system is in failover mode and needs to handle everything. The second one is critical-jOPS, which has an SLA; it is a geometric mean over 5 SLAs: 10 milliseconds, 25 milliseconds, 50 milliseconds, 75 milliseconds, and 100 milliseconds. It checks what your throughput is when you have to meet those SLAs, and that is called critical-jOPS; that is more about responsiveness. We have seen for different scenarios that it lands in the range of 30% to 50% of system utilization, which is where most production systems also operate. These are the three metrics we get from this benchmark.
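Roughly, per the definition just described, critical-jOPS is the geometric mean of the throughput sustained under each of the five SLAs; a small sketch of that calculation with illustrative, made-up numbers:

```java
public class CriticalJopsSketch {
    public static void main(String[] args) {
        // Hypothetical throughput (jOPS) sustained under the 10, 25, 50, 75
        // and 100 ms 99th-percentile SLAs.
        double[] jopsUnderSla = {12_000, 18_000, 24_000, 28_000, 31_000};

        // Geometric mean computed via logs to avoid overflow on large products.
        double logSum = 0;
        for (double j : jopsUnderSla) {
            logSum += Math.log(j);
        }
        double criticalJops = Math.exp(logSum / jopsUnderSla.length);
        System.out.printf("critical-jOPS ~ %.0f%n", criticalJops);
    }
}
```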

Now let's look at some of the data we got. This data is for JDK 8 versus 11, full system capacity and responsiveness. What you will see between 8 and 11 is that full system capacity shows almost no difference, and you may be surprised by the reason. I talked to the person on the team who works on several of the JIT-related changes, even for 11 and 12, and they have been backported to JDK 8. If you pick the latest JDK 8, several of the JIT-related changes are backported, so throughput-wise you would see very similar performance in many situations; that's the reason.

The responsiveness, the critical-jOPS we were talking about, is affected by your response time for each transaction, because that is what those SLAs measure. That is where we see almost a 35% improvement with JDK 11. The reason, when we look into the details, is mostly the G1 GC. JDK 8 by default has the parallel stop-the-world GC, so anytime you are doing a GC, it will end up pausing anywhere from 15 milliseconds to almost 300-400 milliseconds, depending on what kind of GC happened.

On the other hand, G1 GC takes a hit on the full system capacity, which is why the two match there, but it gives you much better critical-jOPS, where the response time is never more than 20-30 milliseconds. Monica [Beckwith] will cover in more detail later why and how that happens, but that is the main reason why the responsiveness is much better on JDK 11: not any other component, but G1 GC by default.
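For reference, the default changed (Parallel GC in 8, G1 from 9 onwards), but both collectors are available in both releases, so the comparison can also be made explicit on the command line; a sketch:

```sh
# Throughput collector, the JDK 8 default:
java -XX:+UseParallelGC -Xmx28g MyApp

# G1, the default from JDK 9 onwards, with its pause-time goal (200 ms default):
java -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xmx28g MyApp
```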

Let's look at the variability part. How are we defining variability? We do 10 runs or more and then we look at how the total maximum throughput and the critical-jOPS under the SLAs change from run to run. That is very similar to taking your test case and launching it 10, 20, or 30 times on the same system. That's what we do here as well: we do some other work in between, then launch the system, do other work, and launch it again. What we are finding from the variability perspective is that JDK 8's standard deviation on the throughput is around 2%, but the standard deviation for critical-jOPS, the responsiveness, is almost 10%, so very high and really bad. On the other hand, due to G1 GC, JDK 11 has a predictable response time.

The Parallel GC pauses, on the other hand, are not predictable. They can be anywhere from very small to very large, and that's what causes the variability. Again, the variability improvement is also related to the G1 GC improvements. That is the main component we are finding, in addition to, of course, the API changes you get. By default, the variability and responsiveness gains are mostly coming from the G1 GC part.

Now, let's talk a bit more about the startup side I mentioned. When we go to Function as a Service, we are talking about something that finishes in a few milliseconds to, say, one or two minutes. So far, every workload we were discussing runs at least 10 minutes; 6 to 8 minutes to almost 2 hours. SPECjbb 2015 runs 2 hours; in SPECjvm 2008, each iteration runs for 6 minutes, so overall it runs for 2 to 3 hours. The DaCapo benchmarks have improved their repeatability; I still don't really like it, so we are doing several runs. What you can see in the top graph is that several components are barely 500 milliseconds per iteration, so it's small.

Similarly, if tomorrow you start writing Function as a Service, the service needs to take a call, get the work done, and be out. In that case, between 8 and 11, I was surprised: so far we had been seeing that JDK 11 is better in everything compared to 8, similar or better throughput, much better variability, and much better responsiveness. Here I was surprised to see that in several of the components, JDK 11 does worse. Note that higher is not better in this chart; I'm dividing JDK 11 execution time by the JDK 8 execution time, so lower is better.

For this one I discussed with Monica [Beckwith] and she looked into the logs. This time G1 GC is not doing well, and she will explain why, particularly for these kinds of instances. There are situations with G1 GC, for this short-startup or Function as a Service type scenario, that we plan to investigate more and feed back to OpenJDK. I think that's the part I wanted to do: share the data. I work with Monica [Beckwith] and she has explanations for several of the things we worked on.

GC Groundwork

Beckwith: All the good fun stuff was covered by Anil [Kumar], so what I'm going to do here is provide a little bit of groundwork on garbage collection. For certain observations that Anil [Kumar] had, I'm going to provide explanations as well. Here is some very basic stuff with respect to heap layout. Usually we talk about the heap as a contiguous chunk like that. With the newer garbage collectors, you would also see something called regions, so that's a regionalized heap basically, and these are all virtual space.

Then there's also the concept of generations, so you have the young generation and the old generation. For example, in G1 GC's case, you will have the generations as well as the regions, and that's a typical configuration of the heap when we are trying to explain the basic heap layout. For now, Z GC and Shenandoah are not generational, while G1 GC is generational and also has a regionalized heap. Also, because we are comparing JDK 8 and JDK 11 or 13, I've provided the numbers for Parallel GC as well. Parallel GC used to be the default for JDK 8, and starting with 9, the new default GC is G1, so that's why I wanted to provide comparative numbers.

Parallel GC is not regionalized. I will go into the details of this later when we talk about DaCapo, but something to realize is that when we talk about a generational heap, Eden and Survivor regions form the young generation, and Old regions as well as Humongous regions are part of the old generation. This is very important to know, and I'll go into why. For a user, or even for a GC person, it all boils down to occupied and free regions. If you are generational, your young generation gets filled and then you either promote objects or age them in the Survivors, and eventually you just have a bunch of occupied regions, basically long-lived objects.

Again, all the free regions are maintained in a list, and any of the occupied regions could be young, old, or humongous, the last being allocated out of the old generation.
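For context, G1 normally derives the region size from the heap size at startup (aiming for roughly 2048 regions, rounded to a power of two between 1 MB and 32 MB), and it can also be set explicitly; a sketch:

```sh
# Let G1 derive the region size from the heap (default behaviour):
java -XX:+UseG1GC -Xmx4g MyApp                       # ends up with ~2 MB regions

# Pin the region size explicitly (power of two, 1m to 32m):
java -XX:+UseG1GC -Xmx4g -XX:G1HeapRegionSize=8m MyApp
```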

GC Commonalities

I want to quickly compare the different garbage collectors, and the reason I wanted to talk about the other two, Shenandoah and Z GC, is because that's the future, that's where we're headed, so you'll see this trend, which I'll cover soon. The entire thing boils down to copying; I'm going to emphasize that here. We also know it as a compacting collector, or we call it evacuation as well; it's all similar.

Your heap has a From space and a To space, and as the From space gets filled, it's time for you to do marking. To do marking, you go find the GC roots, which could be static variables, the stack, any JNI references. You identify them in your From space area, then you walk the live object graph, and eventually you move the live objects to the To space and reclaim the From space. Eventually, the To space turns into the From space and you start allocating into it, so it just goes back and forth.

GC Differences

This is a simple concept, but it gets a little more complicated when you have a generational GC, or when you are doing concurrent compaction or concurrent marking and things like that. I will not have time to go into details, but I wanted to quickly highlight the differences. As I mentioned, Parallel GC is not regionalized but it is generational, just like G1 GC. Compaction does happen in Parallel and G1, and they just use the forwarding address in the header.

Because Parallel GC is throughput-driven, the goal is to have higher throughput, and there's no pause time target per se as long as we keep pushing the throughput to the max, so everything is stop-the-world in Parallel GC. G1 GC does have a target pause time goal. I'm not going to go into details, but basically it's like, "I hope I can achieve this goal," and then the collection set of regions is what gets adjusted, and it also learns how expensive a particular region turns out to be during a collection.

Parallel GC does not have any concurrent marking at all, everything is stop-the-world, like I mentioned, but G1 GC does, and both G1 and Shenandoah use a Snapshot-at-the-Beginning algorithm, while Z GC does striping, which I'm not going to go into here. There's also the concept of colored pointers in Z GC, and their target pause times are actually slightly smaller than G1 GC's, because both Shenandoah and Z GC are targeting the low pause time market.
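Where your JDK build includes them, the low-pause collectors mentioned here are opt-in; a hedged sketch of the flags (availability varies by version and vendor):

```sh
# ZGC, experimental in JDK 11-14 (Linux x64 at first):
java -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xmx28g MyApp

# Shenandoah, in mainline OpenJDK from 12, earlier in some vendor builds of 8/11:
java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -Xmx28g MyApp

# G1 with a tighter pause-time goal:
java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -Xmx28g MyApp
```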

Performance

I quickly ran jbb with about 28 gigs. This was explained by Anil [Kumar] earlier; max throughput is basically the whole system fully loaded, that's the system capacity. There's the response-throughput curve that Anil [Kumar] was talking about, that's the max-jOPS metric, and then the responsiveness, which he mentioned as the critical-jOPS metric. Everything is normalized to Shenandoah's max throughput. The thing I want you to take away from here is that in Parallel GC everything is stop-the-world, so of course it gives you the maximum throughput. That's the way the GC is designed, generational and stop-the-world, so it gives you the maximum throughput.

As you go down to G1 GC, what I've done is make some adjustments to the pause time goal, I relaxed it a bit, so that's why you see your throughput get better while your critical-jOPS stays pretty consistent; the last three are basically G1 GC. That's where the repeatability metric that you were talking about, Anil [Kumar], comes into play. The third and the fourth are Parallel GC; a slight change in the nursery produces a lot of variation in Parallel GC's output, with respect to throughput as well as responsiveness.

Shenandoah and Z GC have an issue here because they are not generational yet. They achieve copying compaction while concurrently moving objects from the From space to the To space. They are trying to achieve a better critical response time, which is what you see with Z GC, so it's at 56, whereas anything and everything that G1 GC could achieve was about 49 to 50. That's the target, that's the goal, that's the design of these GCs; they're headed towards providing you much better responsiveness.

It's a lot of information there but the trend that I'm trying to show here is that there is more effort put into getting better responsiveness going from JDK 8 to 11 to 13.

G1 GC and Humongous Objects

Going back to the DaCapo case, I want to quickly talk about G1 GC and humongous objects. G1 GC is regionalized, and each region gets assigned a size at JVM start. In DaCapo's case, it was four megs. As objects get allocated, they end up in Eden if they meet certain criteria, and if it's a humongous object, it gets allocated out of the old generation.

The threshold is based on the size of the object. If the object is greater than or equal to 50% of the region size, so in DaCapo's case two megs or larger, then it is considered a humongous object. Less than 50%, and it is allocated out of Eden; greater than or equal to 50%, and it comes out of the old generation; and anything greater than the region size needs multiple humongous regions, so that has to be a contiguous space right there.

It's the same information, just expressed differently here. Anything less than half the region size is not humongous; everything else is humongous. If you need more than one region, then it needs contiguous regions and it's also humongous. With DaCapo, one of the things it does, and I guess it differs across the benchmarks, is set up objects at the start. Those objects are long-lived, and because of the four-meg region size, anything that's two megs or above becomes a humongous object. When DaCapo allocates these objects, which it expects to be regular sized, because of the small region size we have, they are all humongous objects.
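A minimal illustration of that threshold, assuming G1 with a 4 MB region size as in the DaCapo runs described above:

```java
public class HumongousSketch {
    public static void main(String[] args) {
        // Run with: java -XX:+UseG1GC -XX:G1HeapRegionSize=4m HumongousSketch
        byte[] regular     = new byte[1 * 1024 * 1024]; // 1 MB: under 50% of a region, goes to Eden
        byte[] humongous   = new byte[3 * 1024 * 1024]; // 3 MB: >= 50% of a region, allocated as a
                                                        // humongous object out of the old generation
        byte[] multiRegion = new byte[9 * 1024 * 1024]; // 9 MB: larger than one region, so it needs
                                                        // contiguous humongous regions
        // Debug-level GC logs will show these as humongous allocations.
        System.out.println(regular.length + " " + humongous.length + " " + multiRegion.length);
    }
}
```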

When you're doing a system GC, what's happening is you're trying to move these humongous regions. Remember that we're trying to have contiguous regions? You're trying to move them, because they are live, from the From space into the To space, and if you keep doing that over time, finding contiguous regions gets difficult because of fragmentation issues. That's why G1 GC showed reduced performance.

AOT Groundwork

I wanted to talk about AOT because that's the direction we are headed: more responsiveness. We're going to have a talk later today about future compilation directions from Mark [Stoodley], but this is something that has been available since JDK 9 and 10, and it's gotten better over time. One of my colleagues has a very good article, so I'm going to reference it here. Prior to tiered compilation, before JDK 7-ish, the first execution would end up in the interpreter, and then eventually you would get adaptive JIT-ing based on the profiles of critical hotspots. You have thresholds; thresholds are crossed, some things get JITed; there was the concept of the server and client compiler, also known as C2 and C1, and then tiered compilation happened.

Tiered compilation was fully supported in JDK 8, I think, and before that, it was experimental. With tiered compilation, the C1 and C2 concepts are still there but the profiling concept is different, so you have limited profiling as opposed to full profiling. There's a good explanation in that link over there. All you have to think about is basically from interpreter to C1 or C2 based on the profiling and the different thresholds and then, of course, there's a deoptimization path as well.

With AOT, what happens is that after the first execution, we do not go to the interpreter, we actually go to AOT code, and we can do C1 and C2 based on a full profile, so when you have C1 with a full profile, you can go to C2, and any deoptimization goes back to the interpreter. To get into the details, please go ahead and read that article; there are many great articles out there on AOT as well.
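For context, the jaotc tool that ships with JDK 9+ (on supported platforms) is what produces the AOT library; a hedged sketch of the workflow, with HelloWorld standing in as a hypothetical class:

```sh
# Compile a class ahead of time into a shared library:
jaotc --output libHelloWorld.so HelloWorld.class

# Or allow the AOT code to collect profiles and hand off to the tiered JIT later:
jaotc --compile-for-tiered --output libHelloWorld.so HelloWorld.class

# Run with the AOT library loaded:
java -XX:AOTLibrary=./libHelloWorld.so HelloWorld
```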

Performance

What I did is take the SPECjvm 2008 startup components and run them without AOT, then with AOT, and with AOT with tiered. The difference between AOT and AOT with tiered is that when you create the dynamic library, you can say, "I would like it to be able to use the tiered compilation path," or you can say, "No, I don't need the tiered compilation path." As you can see, higher is better, so most of the time you'll see AOT just giving you a straight-up win for most of the startup workloads I have covered here, and I apologize again for the labels getting cut off like that. What was interesting is, if you look at the last three here, the blue one is without AOT, and after that, the performance drops.

I was using the same workload as the previous one there. You saw AOT and AOT with tiered actually giving you a benefit, so these two workloads are the same, but the runtime is different: the previous one is measured just at startup, and this one is measured after it's warmed up and trying to reach steady state. The reason why AOT with tiered gives you the worst performance is that it hasn't crossed the threshold; when you think about these different compilation modes, the thresholds change as well.

For example, the tier three invocation threshold for AOT with tiered is 10X or 100X different from without tiered, and so on, so the thresholds change completely. For the last one here, to achieve a similar result, it would actually have to run more times; basically, if I ran it more times, it would have crossed the threshold and you would have seen better optimized code, it would have gone to tier four. That's one of the reasons for the reduced performance there.

Summary

Kumar: I think the main key takeaway from JDK 8 to 11, or even up to 13, is that if you are running long, steady-state, mostly throughput-bound workloads, you may not see a big difference between 8, 11, 12, and 13. There could be some cases where you do better, but usually we did not find a big difference across these benchmarks. If you are talking about workloads with responsiveness requirements, where you have SLAs and are looking at low pause times, you would see a significant improvement in your responsiveness and also a significant improvement in your variability.

The thing we found this mostly relates to is the change from Parallel GC to G1 GC, because those pause times give you more consistency and better end-to-end response times. If that is your goal for your front-end application, then it is worth moving to, or at least evaluating, 11 or even higher for the latest GCs coming in.

The last one is, when you have short-running workloads, as we saw with DaCapo, you may want to check, because there can be issues with G1 GC. We plan to give this feedback to OpenJDK to see if they can be addressed; as Monica [Beckwith] just pointed out, the humongous object sizing causes the issue, and you see worse performance in that case.

We know that containers are being used very heavily, almost 40%-50%, and the next growth area is Function as a Service in many domains. I think that scenario will happen, so it would be good to have that covered, but definitely right now you need to watch out for G1 GC with regard to workloads and situations with short startup times.

As for the AOT that Monica [Beckwith] was just talking about, you have to watch there as well. If it is long-running, then you might be better off without it, but if it is a short Function as a Service type thing, then AOT definitely gives you faster response times.

We definitely want to know more, because we are also on the SPEC committee working on making changes, as you saw. Earlier, the workloads were all like SPECjvm 2008, just throughput: you do something, compute, memory, and throughput. Then we changed to SPECjbb 2015, more response-time oriented, where you can see the difference in response time and repeatability. We are planning similar changes next, and we are looking into parallel streams and other changes coming in, but if you have use cases from your area, "These are my problem points and these are the use cases," we definitely want to reflect them in the benchmarks; that is one of our goals. They can then be used for evaluation. As you saw, at least three or four large customers I've talked to have 4,000 applications but can't use them for testing or evaluation.

Questions and Answers

Participant 1: We use G1 GC with JDK 8 itself as it's recommended. Why should we move to JDK 13?

Beckwith: I think you should go to Gil's [Tene] talk, that would be helpful. In this track today we did not have an exhaustive list, but there are lots of improvements that moving to 11 LTS would bring, and even after that, probably the MTS releases, as Gil [Tene] will talk about. Right now at Microsoft we have a lot of internal customers who are still on JDK 8. We're trying to help them move to 11, and we made a list of pros and cons and how it can affect them, like multi-release JARs and other things that may be helpful to them.

It's a very case-specific analysis that we have to do. I can just mention all the features and all the benefits you can get, but do they apply to your use case? Probably not all of them. You have to do an evaluation or cost-benefit analysis for moving to 11. Remember, any of these low-latency garbage collectors, or even AOT for that matter, those benefits you will only get when you move to this side of 9.

Participant 2: You're talking about how a lot of the newer Garbage Collectors divide the memory into regions. I was just wondering, is that something that's configurable, that we should be thinking about, like what size region should we use, or is that something that's just handled by the GC itself?

Beckwith: It should be handled by the GC. Z GC is adaptive, so it should not be a problem. Unfortunately, right now it can be, and that's what I was showing you with respect to DaCapo. We definitely want to change that, so it shouldn't be something you have to worry about. That was just a bad choice in the runs we started with. We'll work on changing that, so you shouldn't worry about those things.

Participant 3: [inaudible 00:44:10]

Beckwith: JDK 8?

Participant 3: Yes.

Beckwith: You do get Shenandoah with JDK 8. I'm not sure what the status of that is right now.

Participant 4: Only the Red Hat.

Beckwith: There's a page maintained by Aleksey, you could probably go check it out but you'll have to use their bits for JDK 8.

Participant 5: Thanks for the excellent metrics, it was very useful. I have two questions out of this talk. One is, I was also in the GraalVM talk where they were talking about the JIT and AOT with GraalVM, and I see an AOT capability in Java that starts with JDK 9, and with GraalVM they also talked about it being used by Oracle with Oracle Cloud. Can you help us decide when to really use the GraalVM AOT versus this AOT, and whether we should even start to look into GraalVM or not?

Beckwith: I'm not the best person to talk about that. If anybody else here would like to chime in, I would appreciate that. Did you say when you want to evaluate Graal?

Participant 5: Yes, is there a difference between the AOT that comes with Java versus what GraalVM is offering?

Beckwith: The jaotc tool that I use to generate the dynamic library is from the Graal JIT. You're talking about VM-level differences, but this is more of a JIT-level thing. The Graal JIT is supposedly the future, which means that eventually C2 will be replaced by the Graal JIT, not the VM. I think this is too big a question for me to answer. I'm not the right person to talk about Graal, because I don't want to provide any information that may not be helpful.

The JIT is different, though, I just want to clarify that. When people talk about GraalVM, unfortunately, it's similarly named, but the VM has a lot of different benefits. With JDK 9, I think JVMCI, the compiler interface, was introduced; it takes advantage of the Graal JIT and that's how we get AOT. It's from the same source, I would say, but I have not done any performance analysis on the Graal JIT AOT side of that.

Mark Stoodley: I'll be talking a little bit about that point in my talk later today. That talk isn't specifically about Graal, I'm not from the Graal team, I don't represent them, but my talk is about some of the trade-offs in how you choose whether to use a JIT, AOT, or any of the other technologies. I'll touch on it a little bit, so I encourage you to attend.

Beckwith: I was going to mention that. Mark [Stoodley] will talk about the future, the JIT directions, so that would be a good talk for you to attend to understand that AOT is just the start; there's more. The lady asked us, "Why should I move?" Another reason to move is all the improvements that will happen, and they will mostly be on the JDK 9 code base.

Kumar: Another answer to "why move?" is that we have seen in SPECjbb 2015 that fork/join is not easy to use because of the different things we had to do, but parallel streams have improved things a lot. One of the things you may want to look at is, anytime you have a thread pool within microservices, or other situations where you want auto-balancing, it really requires some very well-thought-out settings; with parallel streams in 9 or higher, you might get better jOPS if you do need load balancing through a thread pool.

It is a tricky issue, because if you are within a VM you might have only 8 cores, while on your whole system you might have 256 cores, and you don't know how your application will behave at those two extremes. There may be some considerations, but there's no perfect answer to this; you have to try it and see how it behaves.
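A small sketch of what that means in code: by default a parallel stream fans out on the common pool, whose size follows the detected CPU count, so the same code behaves differently on an 8-core VM and a 256-core host. Submitting the stream from an explicitly sized ForkJoinPool is a widely used (if unofficial) way to take back control:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class ParallelStreamSizing {
    public static void main(String[] args) throws Exception {
        // Default: runs on the common pool (parallelism tracks availableProcessors()).
        long sum = LongStream.rangeClosed(1, 10_000_000).parallel().sum();

        // Same work, submitted to an explicitly sized pool so the fan-out does
        // not depend on where the process happens to be deployed.
        ForkJoinPool pool = new ForkJoinPool(8);
        long sum2 = pool.submit(
                () -> LongStream.rangeClosed(1, 10_000_000).parallel().sum()
        ).get();
        pool.shutdown();

        System.out.println(sum + " " + sum2);
    }
}
```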

 


 

Recorded at:

Jan 28, 2020
