
Graal: Not Just a New JIT for the JVM


Summary

Duncan MacGregor takes a look at the differences between C2 and Graal, what this can mean for the performance of code, and what else is possible with this new JIT.

Bio

Duncan MacGregor has worked on language implementation for several years, including porting GE's Smallworld GIS system to work on the Java virtual machine. He currently works for Oracle Labs on the TruffleRuby project.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

MacGregor: This talk is Graal: Not just a new JIT for the JVM. Apart from the safe harbor statement, I think the best way to start this talk is by asking what the problem is. Why do we need a new JIT for the JVM? What problem are we trying to solve that HotSpot, and C2, and other JITs that exist are not good enough for?

And the problem is bits of code like this. It's very nice code. It's easy to read. You can see that it takes an array from something, turns it into a stream, performs several map operations, and produces a result. It's got lots of good properties. It's hard to get wrong. This sort of code does not mess with the underlying data structure that it's streaming over. It can be made parallel very easily, and it can be decomposed into multiple methods, or composed easily as well. So if you know that the arrays, the data structures you're going to be going over, are large, or the mapping operations you're going to perform are complicated, you can just put one method call in there and make it parallel. And that's a really big advantage.
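
The slide itself isn't reproduced in the transcript, so here is a minimal sketch, assuming an int array and a sum at the end, of the shape of code being described:

```java
import java.util.Arrays;

public class StreamShape {
    static int process(int[] values) {
        return Arrays.stream(values)       // stream over the array
                .map(x -> x + 1)           // several map operations...
                .map(x -> x * 2)
                .map(x -> x - 3)
                .reduce(0, Integer::sum);  // ...and a final result
    }
    // Making it parallel really is one method call:
    // Arrays.stream(values).parallel().map(...)...
}
```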

So why don't we use code like this everywhere? The number one reason is that we want better performance than we can get from this code on current JITs. Running it involves creating a lot of temporary objects, and that imposes a cost. It imposes a cost as well because we're using lambdas, and lambdas work very well with JITs in some circumstances, but the strategy you use for dealing with them can make them work much less well in others. But, as I say, it's really easy to write this and to get it to do what we want.

What's Going on in this Code?

What's going on in this code? Arrays.stream creates a thing called a spliterator. It's an iterator that can be split across parallel cores; it can do all sorts of clever stuff. But in this case we just want to iterate over an array. Calling map on it and passing in that lambda creates a new object, which is also a stream. That has to be created, it holds the lambda, and it's going to be calling that lambda. Ditto for the next two calls of map; they are also going to create stream objects under the hood. Then finally, at the end of all this, we call reduce, which is what the JDK internally calls a terminal operation, and that actually performs the loop over the stream to produce a final value in some way.
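
To make those temporaries visible, here's the same pipeline with each intermediate stream given a name; a sketch of what the chained calls allocate, not JDK internals:

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class StreamDesugared {
    static int process(int[] values) {
        IntStream s0 = Arrays.stream(values); // wraps a spliterator over the array
        IntStream s1 = s0.map(x -> x + 1);    // each map allocates another stream object
        IntStream s2 = s1.map(x -> x * 2);    // ...which holds on to its lambda
        IntStream s3 = s2.map(x -> x - 3);
        return s3.reduce(0, Integer::sum);    // the terminal operation runs the actual loop
    }
}
```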

Here we go. What does the JIT need to be able to do to make this sort of code run fast? It needs to be able to inline methods. These temporary objects are all implemented as interfaces, and they've got loads of different implementations optimized for different things. So we really want to be able to take all those virtual method calls and boil them down to well-known, single-dispatch, direct calls that don't involve an indirect jump in the machine code.

If we can do that, then we know a lot more about how this code is working, and we can start to do something called escape analysis. Escape analysis is kind of what it sounds like. We're looking at all the temporary objects being used by a bit of code, and we're trying to understand which ones are going to get out of that code and be seen by the rest of the system, and which ones really are temporary. Most of the time we can avoid allocating the truly temporary ones entirely. They'll just be broken down to their individual fields and stored on the stack in some clever way. You can do even better: you can do partial escape analysis, where you know that on the normal path the object does not escape, but maybe if there's an exception, then the object escapes. And if you can do that, then you can start to reify objects only under those exceptional circumstances. Most of the rest of the time you just don't create them at all.
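
A hand-written illustration of the idea, assuming a small value class of my own; neither allocation needs to happen if the JIT can prove the objects never escape:

```java
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

public class EscapeSketch {
    static int distanceSquared(int x1, int y1, int x2, int y2) {
        Point a = new Point(x1, y1); // neither Point is visible outside this method,
        Point b = new Point(x2, y2); // so escape analysis can replace both objects
        int dx = a.x - b.x;          // with their fields in registers or on the stack
        int dy = a.y - b.y;
        return dx * dx + dy * dy;
    }
}
```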

How Well Do C2 and Graal Do at this?

How well does C2 do running this little bit of code? I ran a small benchmark to do with streams, pretty much the same sort of thing you saw on that first slide. I looked at the assembler output and various bits from the compilers to see how it was doing. C2 does pretty well on inlining. It actually did a bit more than I expected, so in this small case it's actually doing quite well with how the lambdas are used. C2 likes to inline code starting from the smallest method and working its way out. That means it really wants a method to always be calling the same callee.

That's broken by lambdas, because we've often got a vast number of different lambdas funneled through a single method, and that can start to break the inlining heuristics that our JITs have used. But in this small example, C2 does pretty well. It also manages to do pretty well at escape analysis. It doesn't remove all the temporary objects. You can still see some allocations of them if you read the assembly code, and you've got to be a bit of a masochist to want to do that, but some of us do every so often. So, it's done quite well. It's removed quite a lot of temporary objects, but it's still creating some, and you can see that in the assembly code, and you can see it if you instrument how many objects are being created. The thing it doesn't manage to do very well is turn this into a simple loop, which is the thing we would really, really like, because that's nice and efficient for this small case.
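
If you want to do that kind of inspection yourself, HotSpot has diagnostic flags for it (PrintAssembly additionally needs the hsdis disassembler library installed); MyBenchmark here is a placeholder for your own class:

```sh
# Watch what gets compiled and how the inlining decisions go.
java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyBenchmark

# Dump the generated machine code (requires the hsdis plugin).
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyBenchmark
```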

How well does Graal do? Graal is more aggressive in the way it does inlining, and it has some more strategies for trying to do it than C2 does, so it manages to inline more of this code. As a consequence of that, it manages to do much better escape analysis. It removes almost all the temporary objects in this benchmark. So, it's mostly looping over the array and adding up the numbers. Because it's mostly doing that, it is able to turn this sort of small thing into a tight loop that actually executes pretty efficiently. Not quite as efficiently as an array implementation, but not bad.

What Effect Does That Have?

What's the effect of all that? This chart shows this benchmark being run on two things. The blue line labeled HotSpot is the standard C2 compiler. The red line labeled GraalVM is a JVM that's been built with the Graal compiler, and we'll talk more about that later. You can see that the peak performance is pretty good; we went more than twice as fast compared to C2 on running these bits of code with streams in. But you may have noticed there's a bit of a problem over on the left-hand end of the graph, which is that it's taken us quite a while to get there, and that's not really what we want.

What problems do we have here? We've got that warmup, as I mentioned. The warmup is partly because Graal is written in Java. Now, this causes a lot of people to blink every so often when we say it, but lots of compilers for languages are written in the language they compile. As long as you can bootstrap it, you're okay. In Java, we've got an interpreter, so we can bootstrap our JIT. It would be slow if we always ran it in the interpreter, so a lot of that warmup is that we have to just-in-time compile our just-in-time compiler. This can go as many layers deep as you want, but it's got some other properties that are not entirely desirable.

We're running our JIT in Java, which means we're sharing the heap with your application. If you've sized your application carefully to fit in a specific amount of space, that's going to have an undesirable effect. Every time you compile, there's a bunch of objects that get created to represent all the information the compiler uses internally, and then they're going to be garbage collected away again. If you've been tuning for low-pause garbage collection and things, you've probably been very careful to reduce the amount of garbage you produce as much as possible, and you don't want the compiler's garbage stopping your Java application.

We're also polluting the type information that the compiler gathers about your program. If you are careful to only use collections in particular ways, you may be getting some very good performance from the JVM, because the compiler can reduce virtual calls down to direct calls and inline things. But if we're running in the same JVM, then that may no longer be the case, because inside Graal we'll be using a load of collection methods and classes and things like that, and iterating over graphs, and that may start to affect things in an undesirable way.

Will it affect you? You can try this stuff at home. If you've got an up-to-date JDK, and JDK 11 is a good one to pick because it's out and it's the most recent supported version, you can pass -XX:+UnlockExperimentalVMOptions and -XX:+UseJVMCICompiler, and you can use Graal instead of C2, and it will just work, and you can see whether it makes a difference for your application. But you can also see some of the problems. If you specify -XX:+BootstrapJVMCI on the command line, then you can see how long it actually takes to compile the JIT itself. On my machine it's about eight seconds. And normally we want startup time to be reduced, and that's really not helping us.
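
Concretely, the command lines look something like this; MyApp is a placeholder for your own main class or JAR:

```sh
# Swap C2 for Graal on a stock JDK 11 (experimental, so it has to be unlocked).
java -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler MyApp

# Additionally compile the JIT itself up front, to see the bootstrap cost.
java -XX:+UnlockExperimentalVMOptions -XX:+UseJVMCICompiler -XX:+BootstrapJVMCI MyApp
```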

How Can We Have Graal without the Downsides?

There we go. So now we have a new problem. How can we use Graal and get all the benefits of it without having those downsides? What can we do to achieve this? Graal isn't just a JIT. The difference between a compiler for something like C, which does everything ahead of time, and a compiler for Java like Graal, which is doing everything just in time, is not as big as you might think. A lot of the components are used in common. Indeed, Graal can be used to do ahead-of-time compilation, and it can already be used in the OpenJDK to do some. That's the tool that's in Java 11, and I think some earlier versions, but I haven't checked, called jaotc, the Java ahead-of-time compiler.
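
Basic jaotc usage, along the lines described in JEP 295, looks roughly like this:

```sh
# Ahead-of-time compile a class into a shared library.
jaotc --output libMyClass.so MyClass.class

# Tell the JVM to use the precompiled code at runtime.
java -XX:AOTLibrary=./libMyClass.so MyClass
```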

But ahead-of-time compilation can mean quite a lot of different things. Let's start with what just-in-time compilation on the JVM looks like. On the right-hand side we have the VM internals, which are pre-compiled C++ and C. You've got your garbage collector, you've got the underpinnings of class loading, and you've got a compiler interface, because there are multiple compilers inside the JVM. There's C1 and C2, and there's a more general compiler interface called JVMCI, which you saw mentioned in those command line options, and which is what Graal uses to interface with the VM. That's all built in advance and lives in various shared libraries that are linked into the java executable.

On the left-hand side we've got our JVM bytecode. We've got some classes of ours, MyClass, MyOtherClass. We've got a module of Graal in this case. Officially it's called jdk.internal.vm.compiler in the OpenJDK, if you look at the module names, I think. We've got things like the java.base module, and everything that you need for your code to run. So ahead-of-time compilation for this looks something like this. We're not touching the VM itself, but over on the left-hand side, we're turning some of our classes into shared libraries, so .so files on Linux or, whatever they're called, .dylib files on macOS, and things like that.

This has some tradeoffs. We've got to do everything up front and ahead of time. We don't get to do all the tricks that a just-in-time compiler gets to do. We can't make an optimistic assumption about a particular aspect of the code and deoptimize when that ceases to be true. We have to make slightly more conservative assumptions, so that we've got something that will work all the time. And there are some limitations as well with this approach. If you do this thing of creating shared libraries from JARs, those have to be compiled with the same JVM options that you're going to be using at runtime. So you're baking in stuff about instrumentation, and garbage collection, and all sorts of things like that. It's useful, but it's not necessarily what we're after.

There's another option you can have for ahead-of-time compilation, which is to say, "I want to boil everything down to a single executable." I'm not sure this is supported in OpenJDK yet, but again, we'll talk about this more later in the talk. And the idea of this is that you keep only the minimal amount of VM infrastructure you need. For example, maybe I don't need any of the other runtime stuff apart from a garbage collector, my classes, and probably some more bits of the standard library over on the left. That's another way we can do ahead-of-time compilation.

Do Either of These Really Help with Graal?

So does either of these options really help us with Graal? The shared library option doesn't help us in several ways. We've got the limitations that I mentioned of having to run with the same options and things like that, but we've also not solved the problem of running in the same heap. We're still using the Java heap, and that's a problem. If we build it as a standalone executable, then we haven't exactly got a VM that we've got a JIT for. We've got a JIT sitting on its own, so that doesn't look like it really helps us either. The question is, "Can we come up with some middle way that's going to have the good parts of a standalone executable, but is going to be usable inside the JVM?"

And yes, we can do something. We can compile stuff to a shared library. In this case we have a VM that looks mostly similar on the right-hand side. It still has the GC, it still has all the class loading stuff. It's got a slightly new compiler interface, because we're no longer expecting Graal to talk to something running on the main JVM; we're expecting the main JVM and Graal to talk in some other way. So, we've got a slightly different compiler interface. And we've got this thing at the bottom called libgraal.so. Technology, isn't it wonderful? Clickers never quite work.

If we expand out that VM, libgraal is a shared library that contains its own garbage collector, so it's not using the Java heap. It's got all the bits of runtime it needs just to be able to run, and it's got all the internals of Graal compiled. This fits our needs. It's no longer using the main Java heap. It doesn't even have to use the same sort of garbage collector; we can choose something completely different, specifically designed for the types of tasks we expect libgraal to be doing. It's no longer going to be polluting the type information, and it doesn't have to JIT itself at the start. So that seems like a really good thing to be able to do.

How Do We Turn a Java Library into Something We Can Use?

How can we turn this into something we can use? How are we going to do this? This is a research project inside Oracle Labs called Substrate VM. It is a whole different lightweight VM for running existing code, with some limitations. The idea is you take an existing Java application on HotSpot, and you transform it into an executable or a shared library. When you do that, you stop running it on HotSpot and you're running it on this tiny custom-built VM, most of which is written in Java, because we do things like that. So we take your application, we take the JDK, we take Substrate VM, we do a load of static analysis, and we work out what things are reachable in your application. If code can't be called, we're not going to include it. Like I said, there are some limitations around things like reflection. You can't necessarily call stuff unless you describe in advance that that sort of thing is going to be needed. We boil that down to some executable code, plus a sort of serialized version of the Java heap on disk, because you need a lot of that stuff ready to go to run things at startup. We package it into an executable or a shared library. You do that once, and then you run it as many times as you want.
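
In the GraalVM distribution this workflow is exposed through the native-image tool. A sketch of its use, with com.example.Main and the file names as placeholders:

```sh
# Build a standalone executable from a JAR.
native-image -cp myapp.jar com.example.Main

# Or build a shared library instead of an executable.
native-image --shared -cp myapp.jar

# Reflection targets have to be declared in advance, e.g. in a JSON config file.
native-image -cp myapp.jar com.example.Main \
    -H:ReflectionConfigurationFiles=reflect-config.json
```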

Can We Build More with This?

So that's all very useful. That gets us an interesting new JIT written in Java into the JVM, and hopefully without the problems of warmup, or type pollution, or using memory that you don't want it to. But can we build something more with this? I mean, just having things faster is good. As Martin mentioned at the start, Twitter have been picking this up because Chris Thalinger really wanted to use it, and he's been contributing stuff to Graal as well. And they've had very good outcomes from this. They're talking about 20% speed improvements and things on tweets. So that's great.

But what else can we build? We can build a lot of interesting things, and they collectively go under a banner called GraalVM. GraalVM, at its heart, is the Java VM and the Graal compiler. And we can use that for running Java code, and Scala code, and things like that, any JVM language. Hopefully, we'll get better performance than we did out of HotSpot's C2. But on top of it, we can build whole interesting new things.

This is a framework called Truffle. And Truffle is designed for implementing interpreters for languages. The idea of this is that you don't have to write a complicated compiler for your language. You can write a simple interpreter for it, and it is surprisingly simple. As long as you follow some rules, the framework that runs your interpreter will be able to understand when parts of code are run frequently, and it will be able to optimize them surprisingly well. We've got various languages implemented on top of this. We've got Ruby, in the form of TruffleRuby, which I work on most of the time; it's my day job. We have an implementation of R called FastR. We have Graal.js, which is an implementation of JavaScript, and I think it's ECMAScript 6 or one of the recent versions; they've been keeping up, and we've got an implementation of Node.js working with that. We have a Python port.
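
To give a flavor of those rules, here is a minimal sketch of a Truffle-style node using the Truffle DSL (com.oracle.truffle.api); real language nodes carry more machinery, so treat this as the shape of the pattern rather than a working language:

```java
import com.oracle.truffle.api.dsl.Specialization;
import com.oracle.truffle.api.nodes.Node;

// An AST node for "+" in some dynamic language. The Truffle DSL's annotation
// processor generates the concrete subclass that picks a specialization.
public abstract class AddNode extends Node {

    public abstract Object execute(Object left, Object right);

    // Fast path used while both operands stay ints; after partial evaluation
    // this boils down to a plain machine add (with an overflow check).
    @Specialization
    protected int addInts(int left, int right) {
        return Math.addExact(left, right);
    }

    // Fallback for other combinations of operand types.
    @Specialization
    protected Object addGeneric(Object left, Object right) {
        return left.toString() + right.toString();
    }
}
```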

Some of these languages, in fact, all of these languages, have one thing in common. They've traditionally depended on extensions to their systems written in C. We have something that can interpret LLVM bitcode, so you can run your Clang and tell it to output this bitcode rather than compiling down to your native platform. We can interpret that in the same context as the languages, and you can make calls into that C and back again into the language. And this is no longer an optimization barrier. Something simple done in a C extension for Python or Ruby is something a just-in-time compiler for Python would traditionally never be able to deal with; now it can. If it can see that you're just taking an object, doing something simple to it, and returning that new thing, then it can optimize that.
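
Assuming GraalVM's tooling is on your path, the workflow looks roughly like this: compile to bitcode with Clang, then run it with the lli launcher that GraalVM ships (extension.c is a placeholder):

```sh
# Emit LLVM bitcode instead of native machine code.
clang -c -emit-llvm -o extension.bc extension.c

# Run the bitcode on GraalVM, in the same context as the other languages.
lli extension.bc
```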

We can use this in a lot of interesting ways. Combined with the Substrate VM stuff I talked about, we can take language runtimes and we can put them in places we haven't before. So yes, we can run stuff on the OpenJDK. We can run a JavaScript implementation inside Node. We can run this stuff inside the Oracle database. We can run it inside MySQL. So if you were thinking of doing a large database upgrade that involved a load of Ruby code, or something that really wanted to run as close to the database as possible, that sort of thing starts to become a possibility. You can run standalone executables, which is very useful, and we'll come to that in a minute. There's a fully open source implementation of this in the oracle/graal repository, if you want to build it yourself or look at things. And we have versions you can download. We have a community edition, and we have an enterprise edition which does more optimization.

How does this trick with Truffle work? It's a trick called partial evaluation. The idea of this is that you run an interpreter over your program, and it's not the normal sort of interpreter that you think of if you've ever seen a bytecode interpreter for the JVM, because partial evaluation is essentially running your program with every input that it could have at the same time and figuring out what's constant in that. If you have a loop that always counts to 10, partial evaluation should figure that out and be able to produce a flattened loop that always counts to 10. But equally, if you've inlined something whose boolean argument is known, it will only need to compile that one version. It can do quite a lot of tricks, and it's working very effectively for us in these dynamic languages.
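
A hand-written illustration of that, not actual compiler output: a trivial "interpreter" plus a constant input that partial evaluation can fold away entirely:

```java
public class PartialEvalSketch {
    // A tiny "interpreter": treats each array element as an "add" instruction.
    static int interpret(int[] program) {
        int acc = 0;
        for (int op : program) {
            acc += op;
        }
        return acc;
    }

    public static void main(String[] args) {
        // If the program is a known constant, partial evaluation can unroll the
        // interpreter loop and fold the whole call down to the constant 10,
        // just like the "loop that always counts to 10" above.
        int[] constantProgram = {1, 2, 3, 4};
        System.out.println(interpret(constantProgram)); // prints 10
    }
}
```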

Graal works very effectively for things like Scala. I said streams are a good target, and Scala uses a lot of streams and things internally, so we get good speed-ups on things like Scala. We can also apply Substrate VM to things like this. We can take short-running jobs, like the Scala compiler itself, and substantially improve their performance at compiling things, because we're removing that initial warmup and startup time. A Substrate VM process has already done a bunch of initialization. When you're running it, it's very quick; it's a couple of milliseconds at most to start, normally, on a bunch of these things, because it's already done a bunch of work and it's got a heap ready to go. So if you're looking at reducing build times and things, then this is a viable strategy. And maybe it will be a strategy for even the Java compiler going forward. Who knows?

We're also getting excellent performance out of the other language implementations. We're comparing to other JVM-based systems for this, and certainly for things like Ruby and R, we're doing a lot better than other implementations. This is a strategy that seems to be working well. For the JavaScript side of things, it's doing a lot better than Nashorn or Rhino, which is good, since Nashorn's been deprecated, so we'll need a replacement script engine for that at some point. You can try using Graal.js as a script engine in OpenJDK 11, if you download it. It's on Maven. You can try it in GraalVM. So this opens up a lot of possibilities for things like that, to have script engines that are truly performant on the JVM.
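
A minimal sketch of using it through the standard javax.script API, assuming the Graal.js jars from Maven are on the classpath and that the engine registers under the name "graal.js":

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class JsEngineDemo {
    public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("graal.js");
        // Evaluate some modern JavaScript from Java.
        System.out.println(engine.eval("[1, 2, 3].map(x => x * 2).join(',')"));
    }
}
```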

On the Ruby side, we're a lot better, especially on small benchmarks, than any other Ruby implementation out there. We're scaling up at the moment. Ruby's got a lot of challenges to doing optimization, but now we've got a JIT and a framework, and we can start to approach those challenges. Up to now we've been improving compatibility, but we're getting to the stage where we're starting to scale up and run real Ruby applications, and we're going to be looking at getting that performance as good as we possibly can.

Try This out for Yourselves

You can try this stuff at home. On OpenJDK, you can add -XX:+UnlockExperimentalVMOptions and -XX:+UseJVMCICompiler, and that will allow you to use Graal right out of the box. It's still experimental, so maybe you don't use it in production unless you've got a team that can support it, but you can try it on OpenJDK 11. You can also get GraalVM. If you go to graalvm.org, then you'll find the downloads for the community edition and the enterprise edition, and we're doing release candidates at least monthly at the moment. You'll find a load of stuff in the GraalVM docs about how to use all of this, and how to get started with the various languages. Download them, give them a try, and use them with Substrate VM, or build Substrate VM versions of your own applications.
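
Once you've downloaded GraalVM from graalvm.org, the extra languages install through its gu (GraalVM Updater) tool; a quick sketch:

```sh
gu install ruby    # TruffleRuby
gu install python  # the Python implementation
gu install R       # FastR
gu list            # show what's installed
```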

People have been trying this with Netty. They've been trying it with Spring. They've been submitting patches to those frameworks to ensure that they work in this sort of environment. If you're aiming at something like AWS Lambda, or any serverless case where you want to start up quickly, then this is the thing to look at. You can also follow GraalVM on Twitter, and there are Twitter accounts for the individual languages as well. There's TruffleRuby, and you'll find many of the team members like myself on there as well. I'm @aardvark179 on Twitter.

Graal is not a small project. It's a large team and it's a great collaboration between Oracle Labs and various universities. It's finally making its way towards being part of the OpenJDK. Any questions?

Questions & Answers

Participant 1: Has anybody looked at GraalVM targeting WebAssembly?

MacGregor: Targeting?

Participant 1: WebAssembly, as in the binaries that you can put into the browser.

MacGregor: I'm not sure if they have. I'd have to check on that.

Participant 1: It would be an interesting combination.

Participant 2: Is the work done in OpenJDK for the Graal code base, or is it developed elsewhere and then merged into OpenJDK? And if the latter, what sort of frequency do you have drops within the OpenJDK for that?

MacGregor: It is developed in its own repository at the moment. That version is structured slightly differently because it's a multi-release JAR, so it will run on a modified JDK 8, but also on JDK 9 and upwards. The drops, I couldn't swear to how often they are done at the moment. They have some irregularity in their frequency, because some changes take longer and need more time to bed down in the Graal repo than others. So libgraal, for example, is a project that requires changes to JVMCI and large internal changes to Graal. Something like that takes a while to land, so sometimes differences build up. Important bug fixes, however, do tend to get ported across very quickly.

 


 

Recorded at:

May 06, 2019
