
Maximizing Performance with GraalVM



Wuerthinger: Welcome to my talk about maximizing performance with GraalVM. I'm Thomas, I'm working as a researcher at Oracle Labs. I'm a compiler constructor; I did a lot of compiler work during my career, both on HotSpot and the Maxine Research VM, but also on the V8 JavaScript engine at Google. For about seven or eight years now, I've been working as the project lead for GraalVM, which is a new virtual machine that tries to execute many languages faster. What I'm saying in this talk today comes from the perspective of a compiler constructor. I'm not a GC expert or a runtime expert, but I'm trying to convey, from the perspective of a compiler guy, what is possible and how you maximize performance there.

When we talk about performance, there are many different ways we can talk about it and many different ways to measure it. The main thing people usually think about when they talk about performance is peak throughput. This means: when your system reaches a steady state, how many operations per second will it be able to serve? This is one of the dimensions you can optimize for, but there are several other dimensions. A second one is startup time: how fast your application reaches a certain number of requests per second, or how fast the application is in the first few seconds, maybe the first few minutes, before it reaches its steady state. As this diagram here suggests, this is already a little bit of a trade-off. You can have better startup time or more peak performance depending on what you care more about.

That's not everything; there is also memory footprint. It's becoming more important over time, because if you look at your cloud provider, for example, the main thing they are charging for is memory. Using half as much memory usually gives you half the price on any cloud provider. In some specific applications that scale very well, maybe you don't care about your peak throughput, but you care about your peak throughput per memory, meaning how much throughput you can achieve with a certain amount of memory available. This is a consideration you need to think about when it comes to what you actually mean by performance.

There's more; there are more secondary characteristics people care about. One is max latency. This is a very hot characteristic in garbage collection, where you care about your maximum latency for certain requests, and your choice of garbage collector, etc., heavily influences whether you care more about peak performance versus max latency.

Another secondary item is packaging size. How much do you care about the size of the package you're distributing to your users or to your servers? For server-side applications, that's usually not that important. It's a big consideration for mobile phone applications, for example, because if the download size of your iOS app is half of what it was before, a lot more people will download it and use your app. This is another dimension you need to think about. This talk is about how you can trade off these things, and how you can basically maximize the items you want. It's not like you can only pick one. Usually you can pick one, two, three out of those five and get them to optimal values.

When it comes to GraalVM, GraalVM itself is a virtual machine that can run many languages. It can run specifically any Java-based language, and it can run on top of OpenJDK. It can run not just Java, Scala, Kotlin, and other JVM-based languages; it also has the capability to run other languages like Ruby, Python, R, and JavaScript. These are all scripting languages GraalVM supports in addition to the JVM-based languages. We also support running C, Rust, and C++ types of applications.

Usually, we would run these applications on top of OpenJDK, which gives you a familiar setup with a Java Virtual Machine. We also have other environments that we can run GraalVM in, because GraalVM is very embeddable. It's embeddable in multiple environments. We can embed it in the Node.js platform; this means you can run all the GraalVM languages also when you run a Node.js application. We can embed it into the Oracle database, where you can run it in stored procedures. The final way you can run GraalVM is standalone, Ahead-of-Time compiled. This is where we create a small packaged binary for a certain application.

GraalVM comes in two editions: a community edition and an enterprise edition. The community edition is free for production use; we maintain about a couple million lines of open source around GraalVM on GitHub. The enterprise edition requires a license. I said GraalVM can run these languages, Java, Scala, Ruby, Kotlin, on top of OpenJDK in your usual way. You download the GraalVM installation, you run java MyMainClass, and you run with GraalVM. The main difference from a normal OpenJDK installation is that it runs with the GraalVM Just-In-Time compiler, which is a new Just-In-Time compiler that's written in Java and that replaces the C2 Just-In-Time compiler that is usually run on top of the HotSpot virtual machine. This is a very familiar setup for every Java developer nowadays.

GraalVM has a second way to run things: it can run things Ahead-of-Time compiled. There, you take the main class of your application; there's a native-image command that takes the main class as an argument and produces a new executable binary, and that binary is all you need to further run your application. This binary includes all the runtime, includes all the code pre-compiled, etc. These are the main two modes you can run GraalVM with: you can run GraalVM in the JIT mode, or you can run it in the AOT mode. There are tradeoffs between these two modes. In this talk, I will talk more about when to use which mode and what the specifics of these tradeoffs are. The AOT mode is something that only GraalVM provides; the JIT mode is something you can get with other OpenJDK distributions as well.
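As a concrete sketch, the two modes might be invoked like this on the command line (the jar and class names are hypothetical, and this assumes a GraalVM installation with the native-image tool on the PATH):

```shell
# JIT mode: run on HotSpot with the GraalVM Just-In-Time compiler,
# exactly like any other JDK distribution.
java -cp myapp.jar com.example.MyMainClass

# AOT mode: compile the application ahead of time into a standalone
# executable named "myapp", then run that binary directly.
native-image -cp myapp.jar com.example.MyMainClass myapp
./myapp
```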

GraalVM AOT for Native Images

This Ahead-of-Time compiled mode is a way where we package and compile the whole Java application down into a single executable. How do we do that? We take everything that is consumed by the Java application, meaning the libraries, the JDK libraries, and also resources, and we do a closed-world analysis. It's an analysis of what is reachable from the main class of the application, because one problem is that when you create this packaged binary, you don't want to include every JDK class there is, and you don't want to include every library class there is. You want to package exactly what is used by your application, so we do a flow-sensitive analysis of what is reachable from the entry point of the application. We also run some of the static initializers, specifically the ones that we can prove are safe to run at image generation time.

Then, we snapshot the current state of the Java application as it is and put everything together in a single binary. We put all the code, Ahead-of-Time compiled, into the code section, and we also write down a snapshot of the Java application, as it was initialized, into the image heap. This allows you to also pre-initialize certain values or configuration parameters of the application, so you don't need to load them again when you start this binary.

AOT vs JIT: Startup Time

What's the benefit of this? What's the benefit of AOT? The problem in the JIT configuration is that there's a lot of work going on when you start the application. First of all, when the JVM executable starts executing, you load the classes, the Java bytecodes, from your disk, then you have to verify the bytecodes, then you start interpreting your bytecodes, which is usually about 50 times slower than your final machine code would be. This interpretation of bytecodes and running of initializers as the application starts up is about 50 times slower.

You run the static initializers as usual in interpreted mode, and then what you do on the Java HotSpot virtual machine is create a first-tier compilation. You use a fast compiler, which is C1, the client compiler of HotSpot, to create your first machine code. You do that because you want to speed up your application as fast as possible. You don't want to wait for the final machine code to arrive, because there's such a big difference between the interpreter speed and the final machine code speed that you want an intermediate solution. This intermediate solution is provided by the C1 client compiler, and this intermediate solution specifically has something in it that gathers profiling feedback, that gathers information about how the application is running the code.

Profiling feedback includes loop counts; it includes, for every branch, the counts of how often the branch went one way or the other; it includes information about the actual concrete classes that occur at certain places, meaning at instanceof checks or at virtual calls; and it includes some other minor information about the execution. Gathering this profiling feedback is not free. It slows down your code, because while you start up, you need to do this extra bookkeeping for the profiling feedback.

Then, when a method gets really hot, meaning you've had the first compile and you've gathered some profile, the method is scheduled for the second compiler. This is when the heavyweight compiler, either the C2 compiler or the GraalVM compiler, comes in and uses all the information gathered during the startup sequence of the application to create the final, hopefully very good, machine code, and then you execute faster. As you can see, there's a long sequence here, and this is the main reason why an application that is running on the Java Virtual Machine starts up slowly.

When we do Ahead-of-Time compilation, we don't need to do a lot of these things, because we did the compilation to machine code during the Ahead-of-Time compilation step, and then when you start the app, everything is ready. You do not need to interpret, you don't need to gather profiling feedback, nothing of that sort. You immediately start with the best machine code. Plus, if you have snapshotted configuration files, you might even avoid loading configuration files that you otherwise would load at runtime. Startup time is probably the area where Ahead-of-Time compilation beats JIT compilation by the largest margin.

We can see that in the numbers. We have here measurements of a couple of popular web frameworks, and we measured the GraalVM JIT configuration versus the GraalVM AOT configuration in startup time. Startup time here is effectively from the start of the application until the first request can be served on that server. The gains here are huge, about 50X on average, and it's not surprising, because I just showed you before how much is going on, on the one hand, versus how little is going on, on the other hand. This is opening a new way you can run a Java-based web server, and it suddenly changes the equation a little bit: "Well, if I can start up in these 16 milliseconds, I don't need to keep the process around all the time." When the web application is idle, you just shut down the process. You can almost start a new process per request if you want with this type of speed.

Whether the compilation time is relevant for your application or not, you can find out with the profiling tool called Java Flight Recorder. Who of you has been using Java Flight Recorder? About a dozen. Who of you has ever looked at the compilation times with Java Flight Recorder? One. I also didn't know how awesome this feature of Java Flight Recorder is, but I want to present it to you here.

In a Java Flight Recorder recording, when you go to the Java application and look at how the threads are doing, you can actually select the compiler threads. In the Graal case, these are called JVMCI native threads; in the normal OpenJDK C2 case, they're called C2 threads. When you select one of these threads, you will see here a little diagram in green and yellow, where yellow means that you have a compilation running, the compiler is doing something.

This is an interesting workload: this is compiling Apache Spark, running the Scala compiler on the JVM. It runs for about three minutes, and what we can see is that we have three compiler threads here that are all busy all the time. After these three minutes, I'm done compiling my Apache Spark, and at the same time I've produced tons of machine code that is useless now, because I'm done after three minutes. This is a typical workload where the compilation speed of the JIT compiler is very important. You can figure out whether the compilation speed of your JIT compiler is important or not by selecting the compiler threads right here and seeing if they are green, which means they're not doing anything, or if they are yellow, which means they are compiling.

You can even go down to individual compilations. Selected here is a single compilation, of a certain weird-looking Scala method, that took 700 milliseconds. You can actually go down to individual compilations, see how long they took, and see whether there is a problem with a specific one. This was just to show you one of the things you can do with Java Mission Control that is less well known. For a workload like this, getting a faster compiler might speed up the whole thing, or doing Ahead-of-Time compilation might give you even more.

AOT vs JIT: Memory Footprint

The second area of Ahead-of-Time versus JIT compilation where the Ahead-of-Time compiler has an advantage is memory footprint. The reason is that in the JIT compilation mode of GraalVM on HotSpot, you need to keep a lot of things in memory. Obviously, you need to load parts of the JVM executable, and you get the application data. That's clear, but then there is a lot of additional data hanging around. One part is, of course, the loaded bytecodes and class files. Then, you have reflection metadata, because you need to be always ready to do a certain reflection call or run a debug session on your JVM, for example. You get the code cache; this is where the Just-In-Time compiled code that is created on the fly gets installed. As shown in the previous workload, this can be a lot of generated code. Actually, on that workload you can also see the compiled code size, down on the second-to-last line: it's 35 kilobytes for the method.

Then, you have here profiling data. You need to keep the data in memory that was used to feed the JIT compiler, and then finally, the JIT compiler itself, which hopefully, when the application reaches the steady state, will be no longer relevant because hopefully, no compilations are going on anymore when your application runs for a very long time. During the startup and during the warm-up period, there'll be a lot of compilations that also take memory.

In the Ahead-of-Time compiled case, much less needs to be kept in memory. You basically load the application executable and you load the application data. The savings you get here depend on how much payload data the application has, because the application payload data is the same in both configurations. If your application payload data is 32 gigabytes, you won't save anything; if your application payload data is 500 megabytes, you might save a lot. The typical lambda configuration on cloud providers is 500 megabytes for your whole application. These are scenarios where the metadata matters a lot. You can go to a lower configuration on your cloud provider if you go with Ahead-of-Time compilation.

It depends on how much application code is executed, meaning, do you have an application that's very complex, loads a lot of configuration files, and has a lot of hot code, or do you have a simple application? We have some numbers here on Ahead-of-Time versus JIT memory footprint with various web frameworks: Helidon, Micronaut, Quarkus. The gains are still substantial. This is to serve a single request. The gains are about 3X to 5X, but it depends on how much application payload you have versus how many classes you load or how much code you execute.

An interesting tool, by the way, to measure performance, specifically memory footprint, is psrecord. psrecord measures CPU consumption and resident set size over time. This is important because a lot of Java Virtual Machine memory tools only show you the heap, the Java heap, and they basically say, "Well, this is your Java heap consumption. This is how many bytes you use in the Java heap," but that's not everything that's relevant. All of the metadata structures, compiler data structures, etc., are not included in that measurement. It's usually better to measure the memory consumption of a process by looking at the resident set size.

One thing about the resident set size is that it's highly variable, because it depends on your operating system's heuristics as it swaps pages in and out, and on many other factors. This is why I would not recommend measuring the resident set size at a specific point, but rather measuring it over time and then looking at the overall graph, because if you measure at a specific point you might just get lucky or unlucky; you get a lot of variability in the measurements.
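As a sketch of this workflow (the pid and jar name below are placeholders), psrecord can either attach to a running JVM or launch the process itself:

```shell
# Install the tool (a small Python utility).
pip install psrecord

# Attach to an already-running JVM by pid, sample every 0.5 seconds,
# and plot CPU usage and resident set size over time.
psrecord 12345 --interval 0.5 --plot jit-footprint.png

# Or launch and monitor the process in one step (hypothetical jar name).
psrecord "java -jar myserver.jar" --interval 0.5 --plot jit-footprint.png
```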

This is GraalVM running in the JIT configuration up to the first request. What we see is that it takes a lot of real memory here on the operating system, about 350 megabytes. What we also see at the beginning is that during the warm-up of the application, we use a lot of CPU time, because this is where the JIT compilation happens; you see all these red dots there. Compared to that, the Ahead-of-Time compiled mode looks very different. First, the real memory usage here is only 12 megabytes. That's about 30 times less. But the CPU usage is also a lot less, even during startup. During startup it's just a small peak, nothing else.

Here I'm doing two requests, I'm sending two requests to this server. It's just a small little spike, and then it's immediately over, whereas on the JIT side, a request could trigger additional compilations, and suddenly you have a lot of CPU usage for this one request. Of course, as you progress, if you run an application for hours and millions of requests, your JIT compiler hopefully will not matter anymore and things will look different, but in your first couple of minutes or your first couple hundred thousand requests, these are big differences. This is using psrecord to plot the CPU consumption and the memory usage over time.

When it comes to throughput and JIT compilation, I have a little quiz for you. I have here three ways to implement the negation of a value. Negate1 is return -a, negate2 is adding a local variable and doing some computation, and negate3 is using the full power of the Java object model to get there. Which of these versions is fastest?

Participant 1: All of them.

Wuerthinger: All of them, correct. Specifically, if you use a modern compiler like C2 or GraalVM, all three of these will compile to the exact same machine code. In the second scenario, of course, it's the usual canonicalization the compiler does, and in the third scenario it's boxing elimination, which the compiler does and which the GraalVM compiler is probably specifically good at. For all three, the compiler can prove and understand what the program does, and reason about it in a way that produces the same machine code. Now, you would say, "Why would that ever matter? I'm not writing this type of code." Yes, hopefully you're not writing this type of code, but on the other hand, if you use very good abstractions with inlining and utility methods, these types of patterns often occur inside the compiled code, because all of these might be hidden in a couple of utility methods that you use at a high level of abstraction, and then the compiler can fold them all.
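The transcript doesn't include the slide's code, but the three variants described might look roughly like this (a hedged reconstruction, not the original slide):

```java
// Hypothetical reconstruction of the three negation variants from the quiz.
public class Negate {
    // negate1: the obvious version.
    static int negate1(int a) {
        return -a;
    }

    // negate2: a local variable and some redundant arithmetic;
    // canonicalization folds this back into a simple negation.
    static int negate2(int a) {
        int result = 0;
        result = result - a;
        return result;
    }

    // negate3: goes through the boxed Integer object model; escape
    // analysis / boxing elimination removes the allocation entirely.
    static int negate3(int a) {
        Integer boxed = Integer.valueOf(a);
        return -boxed.intValue();
    }
}
```

All three return the same value; the point of the quiz is that a modern JIT also emits the same machine code for all three.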

These things occur quite often in our JIT-compiled code. The JIT compiler itself has literally 1,000 different optimizations. I've seen a lot of talks about, "Well, I'm comparing this JIT compiler versus the other JIT compiler. It's a race between JIT compilers, etc." The truth is, JIT compilers literally have 1,000 different canonicalization optimizations in them. It's very hard to compare them, specifically if you only look at one tiny example, because where it matters most is on a big, complex graph of methods. This is where it actually matters most. Whether the JIT compiler is optimizing this tiny little micro-benchmark over there might not be that relevant.

What is interesting to look at, and can be relevant, is the hottest loop in the application. A lot of real-world applications, however, have literally hundreds of thousands of lines of hot code, and the JIT compiler will get the performance only by optimizing everything together. It's harder to do these comparisons. Actually, the third version here will run slower if you run with the client compiler only in HotSpot, because the client compiler, C1, the faster compiler, does not do the same escape analysis, so it will run that version slower. You might be tempted to speed it up and write this type of function here.

Let's just cache something, because caches are good. The fun thing is, this fourth example here will run faster on the client compiler, but it will run a lot slower on the C2 compiler and the GraalVM compiler, because suddenly, for us as a compiler, this is what we call an escaping value. This is what the compiler really hates; as a compiler constructor it hurts me, because this is like somebody writes the value to somewhere where I can no longer reason about it. Suddenly we can no longer remove the object allocations, because another thread could see that value, and you would need a much cleverer escape analysis to figure out that maybe no other thread can see the value.

In Ahead-of-Time compiled mode, maybe we'll also be able to optimize the last one at some point, but the last one is not recommended. This is a general rule of thumb in terms of getting maximum throughput: keep everything local. Local data structures are practically free as long as the compiler can prove that they stay local. Be very careful about what data you write globally that can be seen by other threads, because a Just-In-Time compiler will always try to optimize in a way where, if it can prove that only one thread sees the value, it can do much better.
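A hedged sketch of the fourth, cached variant described above (hypothetical code, not the slide's exact version): writing the boxed value to a global field makes it escape, so the compiler can no longer prove that only one thread sees it and must keep the allocation.

```java
// Hypothetical "let's cache it" anti-pattern from the talk.
public class NegateCache {
    // Global, potentially visible to any thread: this is the escape.
    static Integer lastResult;

    static int negate4(int a) {
        Integer boxed = Integer.valueOf(-a);
        lastResult = boxed;        // the value escapes here
        return boxed.intValue();
    }
}
```

Functionally this still negates correctly; the cost is purely in what the optimizer is now forbidden to do.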

Performance is hard to measure. This is a screenshot from my internal performance tool on a certain Scala benchmark. Notice, this is running the benchmark to peak performance, so this is not warm-up; this is peak performance of the benchmark run. We see that this seems to vary a lot, and this is running on every check-in of the compiler. I get thousands of check-ins, but it is still showing a lot of variability. The problem is that there's a lot of variability in the way a JIT compiler reaches peak performance, because the profiling feedback that it has as an input is actually gathered in a non-thread-safe way and has a lot of variability in it. Then, when the JIT compiler gets a good profile, maybe it gets lucky and produces a version of the machine code that is a little better, or maybe it gets unlucky, and then you're a lot worse.

We see quite a few benchmarks in our internal system where you have two states and you're alternating between them. You're either fast or you're slow, but when you're slow, you keep being slow, and when you're fast, you keep being fast. It's not variability between individual iterations of the benchmark; it's really the final steady-state peak performance the benchmark reaches. It's very hard for us. This is now zoomed out over one and a half years on that benchmark. This is the work we do to improve things. Over time, we seem to be getting better here, which is good news, but it's still very hard for us to make improvements that are just a 1% improvement, for example, because it's hard for us even to measure whether we actually improved something.

That's a pretty big problem, because if you improve 1% every week, you will double your performance over one and a half years. For us, it's a big challenge to get good measurements on this. This applies to you as well: when you measure performance, you need to be careful about repeating your experiments, getting error bars, etc., because one run might not show anything. We got a really bad run here somewhere six months ago; maybe it was a bad check-in, or the machine was doing something else, I don't know. We got a couple of really good runs here, so you should analyze what that was. Generally, when you measure performance, make sure it's repeatable and you measure a couple of times, because there's so much variability in the system.

AOT vs. JIT: Throughput

On AOT versus JIT throughput: this is GraalVM in AOT mode, and the top one, the blue one, is GraalVM in JIT mode. This is running a web server again. What we see is that the JIT compiler produces, in the end, the better machine code. However, as you can see here, this only pays off at about 100,000 requests. Up to about 100,000 requests, the Ahead-of-Time compiled code is still faster, but then the JIT compiler takes over. The JIT compiler does better because it has knowledge about how the application functions; it has the profiling feedback, and it can be more aggressive with that. One of the things you can do with Ahead-of-Time compilation is profile-guided optimizations, also in Ahead-of-Time compiled mode. It's a little bit harder to set up, so from a manageability perspective, it's harder.

What you can do, also with GraalVM, is run native-image with --pgo-instrument. It gives you an instrumented binary; you run that on a couple of example workloads. This gives you profiling data, you feed that profiling data into another run of native-image, and then you get your final, fully profiled, fully optimized executable. How good your final performance will be depends on whether your workloads are relevant or not, but the good news is the profile doesn't need to be 100% accurate. It's ok to be approximately ok-ish. You laugh, but that's important for us, because even when we are running in JIT compilation mode, it's important for us not to trust the profiles 100%, because the profiles are just an approximation. If you went 100% on the profiles, you would see even more random weirdness in performance, so as a JIT compiler we actually need to treat them as, "Yes, maybe that's a recommendation," and be relatively careful not to base too much on the profiles.
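The PGO workflow described above might look like this on the command line (jar and class names are hypothetical; the flag spellings match recent GraalVM releases, where the instrumented run writes a default.iprof profile on exit):

```shell
# Step 1: build an instrumented binary.
native-image --pgo-instrument -cp myapp.jar com.example.MyMainClass myapp-instr

# Step 2: run it on representative workloads; this gathers the profile.
./myapp-instr

# Step 3: feed the gathered profile into the final, optimized build.
native-image --pgo=default.iprof -cp myapp.jar com.example.MyMainClass myapp
```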

With the Ahead-of-Time compiled version of this, PGO makes the throughput a lot better with GraalVM AOT, and it basically moves the point where HotSpot JIT with GraalVM overtakes GraalVM AOT to about a million requests. We then go a little bit down; that has to do with the AOT binary not having such a good garbage collector, and at the later stages of the application the performance of the garbage collector becomes a problem in the AOT mode. Generally, with PGO you can get relatively close to your JIT performance.

AOT vs JIT: Peak Performance

To compare AOT versus JIT on peak performance: the JIT has an advantage because of this profiling at startup. The other thing it can do is make optimistic assumptions. It can make the assumption that, for example, a certain path is not taken, and later bail out of it. This is what's called deoptimization in HotSpot, where you optimize but then you deoptimize if it turns out that your optimization was just not accurate. This is something the JIT compiler can do because it can re-compile. The Ahead-of-Time compiler cannot do that. It needs to handle all cases; except for when it can prove statically that something cannot occur, it needs to be ready for all theoretically possible cases. The JIT compiler can be much more aggressive.
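As an illustration of the kind of speculation described here (a simplified sketch, not actual GraalVM internals): if profiling shows the rare branch is never taken, a JIT can compile the method without it and guard the assumption with a deoptimization trap, while an AOT compiler must emit code for both paths up front.

```java
// Illustrative only: what a JIT does with this method is invisible in
// the source; the comments describe the speculation it may perform.
public class Speculate {
    static int process(int value, boolean rareCondition) {
        if (rareCondition) {
            // If the profile says this branch never happens, a JIT may
            // replace it with a trap that deoptimizes back to the
            // interpreter should the assumption ever fail.
            return slowPath(value);
        }
        // The hot, speculated-on path gets the aggressive optimization.
        return value * 2;
    }

    static int slowPath(int value) {
        return value * 2 + 1;
    }
}
```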

With AOT we need to handle all cases. Profile-guided optimizations help us a little bit because we can then also optimize better. The advantage AOT has here, and this is a pretty significant one, is that your performance is more predictable, because one of the downsides of the JIT is that the profile it takes is from the startup of the application. It always tries to optimize for the common case, but what if I care about the exceptional case?

Let's say I have a trading application, and the common case is nothing happens, and then suddenly, boom, the market crashes. In this scenario, the exceptional case happens, my code is not optimized for that, I might go back to the interpreter, and it might be specifically slow in that exceptional case. In a scenario where you want to optimize for the exceptional case, AOT with profile-guided optimization is actually significantly better, because you can train on your exceptional case, get the profile, and then compile with that profile instead of training just for the common case.

Generally, I like benchmarks and I think we should have more. The main reason for that is that optimizing a compiler for a small number of benchmarks is the equivalent of over-fitting in a machine learning algorithm. Your compiler will be really good on the benchmarks that you trained it on for years, decades, but then you give it a slightly different program and suddenly, boom, the heuristics no longer work. We really need to be careful as compiler constructors to avoid overfitting. Ideally, maybe we can learn here from the machine learning community and have a test set and a validation set, and hide the validation set from the compiler optimizers.

This is quite important, and this is why we actually started a little project with some academic collaborators to create more benchmarks, specifically for modern Java workloads. We call it the Renaissance Benchmark Suite. My personal opinion is that all benchmark data is useful; be careful with the conclusions, of course, as always, because first of all, you might not measure what you think you measure, and second, it might not mean what you think it means. Still, it's important to get the data, and as much data as possible, even if the conclusions might not be as definitive as you might want them to be. I won't go into details on this; you can look up the website. This is a benchmark suite that has quite some Scala code in it, Kafka and Spark workloads, and workloads that are typically underrepresented in traditional benchmarks like SPECjvm2008 or SPECjbb2015.

AOT vs JIT: Max Latency

AOT versus JIT, max latency. Max latency is usually more of a garbage collection domain. That's why I don't talk about it in that much detail. For latency, you usually go to the garbage collector first, because that's often your biggest source of latency problems. There are a couple of low-latency options in HotSpot: G1, CMS, ZGC, Shenandoah. In AOT compiled code at the moment, we only have a regular stop-and-copy collector. This works well if your heap is very small, if you're in 100-megabyte or 200-megabyte heap configurations for your web server or for your Lambda functions. It doesn't work well when the heap is big, because you get a lot of latency. One thing here is, because your startup is so fast, you can actually use this to shut down your process instead of doing a GC, if you have a configuration with a load balancer where you can immediately just start a new process and give the work to the new process. It's not clear that in all setups the GC is that important.

The other thing here is that with native image, you don't need to put a lot of applications onto the same application server; you can have one native image per app, which reduces the influence the apps have on each other. Latency sometimes has tradeoffs versus peak performance. One is loop safepoints: when you have a loop, we usually remove the safepoint in the loop that would allow a thread to stop for garbage collection. This gives us better peak performance, and we do that if the loop is counted, meaning we can prove that you will at some point exit the loop. This is an optimization that helps peak performance, but can introduce more latency, because the threads take longer until the garbage collection can finally happen.
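A small sketch of the counted-loop distinction (illustrative code; whether a given compiler actually elides the poll is an implementation detail): the int-indexed loop below is provably finite, so its per-iteration safepoint poll can be removed, while the data-dependent loop keeps one so GC can stop the thread promptly.

```java
public class Loops {
    // Counted loop: bounded int induction variable, so the compiler can
    // prove the loop exits and may drop the per-iteration safepoint poll
    // (better peak throughput, worse time-to-safepoint).
    static long sumCounted(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Uncounted loop: the exit depends on the data, so a safepoint poll
    // stays in the loop body.
    static long sumUntilNegative(int[] data) {
        long sum = 0;
        int i = 0;
        while (true) {
            if (i >= data.length || data[i] < 0) {
                return sum;
            }
            sum += data[i++];
        }
    }
}
```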

Parallel stop-the-world GC is usually best for peak throughput: if all you care about is the final answer and you don't care about latency, parallel stop-the-world GC is probably the best choice. One more thing about latency: GC is important, but once you have a low-latency garbage collector, your compiled code performance suddenly matters again. It doesn't help me that I never stop for garbage collection if my request still takes 100 milliseconds. In the end, both need to be optimized.

AOT vs. JIT: Packaging Size

Finally, AOT versus JIT packaging size. On the JIT side, with a new JDK you can use an awesome tool called jlink for smaller packaging of your Java applications, and you can run the JIT compiler on a lightweight Docker image. In JIT mode, you have a big constant overhead for packaging, because you need to include the class libraries from the JDK, the binaries for HotSpot, and so on. On the AOT side, you can have everything in a single binary and run it on a bare-bones Docker image; you can run it on a very slim image. We get total Docker image sizes for smaller web servers of around 7 megabytes, so the constant overhead is smaller. If the application is super complex, though, AOT mode starts to look a little worse. If you have hundreds of megabytes of JAR files, the final machine code might also be quite big, because machine code, depending on your optimizations, is larger than the bytecode. But it depends on many factors, including how long your class names are.

On peak versus packaging trade-offs: when we create the native image, we can ask the compiler to produce a bigger binary for better performance or a smaller binary for less performance. That is because many compiler optimizations, in the end, boil down to duplicating code: the model for optimization is to duplicate code and optimize it for a specific context or path. In this peak-versus-packaging trade-off, inlining and code duplication help with peak but increase the size of your packaging, or in JIT mode increase the amount of memory occupied by just-in-time compiled code in the code cache. In JIT mode, this is a trade-off of memory footprint versus peak; in AOT mode, it is a trade-off of packaging size versus peak.
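As a hand-written sketch of what this duplication means (hypothetical methods; the compiler performs this on its intermediate representation, not on source code), inlining a method into call sites with known arguments lets each copy be specialized, trading code size for speed:

```java
public class DuplicationSketch {
    // One shared copy: compact, but every call pays for the branch.
    static int scale(int x, boolean doubleIt) {
        return doubleIt ? x * 2 : x * 3;
    }

    // After inlining into a call site where doubleIt is known to be true,
    // the compiler can fold the branch away, leaving a specialized copy:
    static int scaleDoubled(int x) {
        return x * 2;
    }

    // ...and a second specialized copy for the other call site. Two copies
    // of the logic now exist, which is exactly the size-versus-speed trade-off.
    static int scaleTripled(int x) {
        return x * 3;
    }
}
```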

To summarize and conclude: GraalVM JIT versus GraalVM ahead-of-time. At the moment, you use GraalVM JIT when you're interested in peak throughput; specifically, the Graal compiler Enterprise Edition is totally optimized towards producing the best possible machine code at the cost of some compilation time. If you're interested in max latency, use GraalVM JIT, because you can use the garbage collectors from HotSpot and you don't need any configuration. In our ahead-of-time compilation mode, depending on the reflection use of your application, you might need some configuration to create the ahead-of-time compiled image. That is because in ahead-of-time mode we need to guide the closed-world analysis towards being able to produce the final packaging.

With GraalVM AOT, you use it if you care about startup time and memory footprint. For example, maybe we are 20% slower in throughput, but on throughput per memory we might be a lot better, and depending on your deployment, that might be what you actually care about. Packaging size is usually smaller. This is all about trade-offs, but the question is: can we get AOT better? There are a couple of things we're currently exploring to improve AOT in the areas where the JIT compiler is currently better. One is profile-guided optimizations, where we collect the profiles up front. Another is a low-latency GC option for native images, which puts the two versions at least at the same level on max latency. Then, we are continuously working to improve the way you create native images, for example with a tracing agent that automatically creates your configuration. Currently, if you want to use native images, the best bet is to use one of the frameworks that has support for it, meaning Helidon, Micronaut, or Quarkus, because those frameworks help you with some of the configuration you need.
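As a sketch of what that configuration looks like (the class name is hypothetical), a `reflect-config.json` entry tells the closed-world analysis to keep a reflectively accessed class in the image:

```json
[
  {
    "name": "com.example.MyService",
    "allDeclaredConstructors": true,
    "allDeclaredMethods": true
  }
]
```

The tracing agent mentioned above can generate files like this automatically by observing a regular JIT run of the application, via a flag along the lines of `-agentlib:native-image-agent=config-output-dir=<dir>` in current GraalVM releases.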

GraalVM Can Do Much More…

GraalVM can do much more. This talk was just about JIT versus AOT, but GraalVM itself can do much more than that. It can run many languages. This is a Node.js application using Java and R to plot something. If you're interested, I recommend the "Top 10 Things to Do with GraalVM" blog post on our Medium site, which has all the variations you can do. It's a big project; it started as a research project, but now we are also in production as an actual product. You can still feel a little bit that it is a research project, because there are many additional areas we are continuing to explore with GraalVM.

GraalVM is an ecosystem with a multiplicative value-add. You can run many languages; you get language-independent tooling, interoperability, and optimizations; and you can run them on the HotSpot JVM, in Node.js, as a standalone AOT binary, and so on. We absolutely welcome anyone to add your own language, add your own embedding, or otherwise make use of GraalVM. It's a big community. On the website, you find documentation, downloads, and so on. We are on GitHub; our main repository has all the optimizations in it, and you see many commits from our core team. We welcome anybody to join the community on Twitter, GitHub, or the website.

Questions and Answers

Participant 2: Have you considered using Epsilon in the benchmarking? I guess the other question when it comes to garbage collection is, for AOT, why would you not use a parallel collector, because the other options for those size heaps are much more expensive in my experience?

Wuerthinger: The question was whether we have considered using Epsilon or also the parallel GC. We have not yet done Epsilon, but I think it's an excellent suggestion; it would totally make sense. Epsilon is the garbage collector that doesn't do any garbage collection at all. It can absolutely make sense, because startup is fast and you can just shut down the process. I think we should certainly add the Epsilon option. On the parallel GC: for max throughput, yes, but I'm not sure whether in a typical native image it would make that much of a difference. We are currently exploring getting a Shenandoah-style garbage collector into native image. You don't like that.

Participant 2: No. The thing is, if you put in a concurrent garbage collector, you take a throughput hit on the allocators, which you don't take with the parallel collector, and with small heaps you don't take the pause-time hit that you do with large heaps. So with G1, you're going to get about the same pause-time characteristics, but about a 10% to 15% hit in throughput, meaning you take a throughput hit without really benefiting in the pause-time characteristics of the whole thing, and you have to run extra threads to do things like remembered-set refinement, which means you get yet another power hit in terms of processor.

Wuerthinger: You're right. Yes.

Participant 2: So for a heap that small, we still highly recommend people use a parallel collector over anything else.

Wuerthinger: Yes. That's a good suggestion, because if the heap is small enough, the parallel collector, even if it stops everything, will not introduce too much additional latency. Yes, that's a good point.

Participant 3: On the same topic, are we basically saying ahead-of-time compilation might not make sense for large heaps, or do we think we could also find a place for large heaps with AOT?

Wuerthinger: Yes, the question is whether ahead-of-time compilation makes sense for large heaps or not. I think you can at least make sure that for large heaps you don't have a disadvantage versus JIT compilation. The thing is, large heaps are typically used by Java applications that run for a very long time. When you have a large heap, your application payload data is so big that saving some memory through ahead-of-time compilation is not that big a win. For a typical application that uses a huge heap and runs for long, I don't think there's a lot of reason to ahead-of-time compile, and therefore we can make it give practically the same characteristics, but it also doesn't give you the benefits.




Recorded at:

Jul 22, 2019
