Transcript
Beckwith: I'm Monica Beckwith, a Java champion. I work on optimizing the JVM at Microsoft. I'm going to talk about our journey with enabling Java on Windows and ARM64 systems.
Outline
I'm going to start with an introduction to OpenJDK and ARM64. I will then talk about our port, providing some background, a few nuances that we gathered along the way, and the timeline. I will then jump to testing and benchmarking, and then talk about next steps. Finally, I'm going to provide a quick demo.
What Is The OpenJDK Project?
Let's first talk about the OpenJDK project itself. The JDK in OpenJDK stands for Java Development Kit. OpenJDK is a free and open source reference implementation of Java SE. It's licensed under the GNU GPL version 2 with the Classpath Exception. Let me provide a quick timeline with respect to OpenJDK becoming open source. Many of you may know that OpenJDK used to be the Sun JDK. Back in 2006, Sun open sourced the Java virtual machine, which is called HotSpot. Then in 2007, Sun open sourced almost all of the JDK itself. In 2010, 100% of the JDK was open sourced. 2010 was also when Oracle acquired Sun.
The OpenJDK Community
Let's look at the same timeline, but with respect to community involvement. By 2007, we saw Red Hat signing what is now known as the OCA, the Oracle Contributor Agreement. Back then it was called the Sun Contributor Agreement. There's also something called the TCK, the Technology Compatibility Kit; Red Hat signed that as well. Next came the Porters Group. This is very important because the Porters Group helps with bringing OpenJDK to newer architectures and operating systems. That group was formed in 2007. By 2010, we had IBM, SAP, Apple, everybody being involved in the OpenJDK Community. In 2013, a star was born, and that star is Microsoft. Microsoft collaborated with Azul Systems to provide the best experience for users on Azure.
What Is ARM?
ARM is what we know as a RISC architecture; RISC is short for Reduced Instruction Set Computer. Basically, RISC provides a highly optimized instruction set. It also has a large number of registers. The one thing that's important to know about ARM is the load-store architecture. For commodity hardware out there, you may be familiar with x86-64. When we're trying to access memory on an x86-64 architecture, we have data processing instructions that can directly access the memory. That's not the case with load-store architectures. With load-store architectures, you have to access memory via specific instructions, called memory access instructions. Usually, you load your data from memory into processor registers, operate on it there, and then store it back into memory. It would look something like this. An example would be LDR Rt, [address], which loads the value at that address into the register. Here Rt is an integer register.
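To make the load-store idea concrete, here is a minimal sketch, not from the talk's slides: a simple increment of a shared counter in C++, with comments showing the kind of instruction sequences a compiler typically emits on each architecture.

    // counter.cpp - minimal sketch of what "load-store architecture" implies
    int counter = 0;

    void increment() {
        counter = counter + 1;
        // x86-64 (typical):  add dword ptr [counter], 1   ; data-processing instruction
        //                                                  ; operating directly on memory
        // AArch64 (typical): ldr w8, [x9]                  ; load from memory into a register
        //                    add w8, w8, 1                 ; compute in the register
        //                    str w8, [x9]                  ; store the result back to memory
    }

    int main() {
        increment();
        return counter;   // returns 1
    }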
What Is ARM64 aka AArch64?
Many of you may know ARM64 also as AArch64. In the Windows world, we call it ARM64 because it's the new 64-bit ISA defined by ARM. What that means is that your integer registers, your data, and your pointers are all 64 bits wide. Many of you may know ARM as having a weaker memory model; with ARM64, multi-copy atomicity was introduced. ARM64 started with the ARMv8 ISA. Prior to ARMv8, only single-copy atomicity was guaranteed, which means that all threads are not guaranteed to see a write simultaneously. With these weaker memory models, if you need to enforce the order of operations, then you need barriers and fences. Basically, barriers and fences are needed for access ordering. An example would be the instruction synchronization barrier. An ISB instruction flushes the CPU pipeline as well as various buffers.
The other thing that I wanted to highlight about ARM64 is the release consistency model, which means that it provides one-way barriers such as load-acquire and store-release. It's an optimization. Here you see a code snippet from the code base where we're trying to provide ordered access. You can see the load-acquire right here, the store-release right here, and then sequential consistency. To read more about sequential consistency, please do look up the ARMv8 ISA.
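The slide itself isn't reproduced in this transcript, but a minimal C++ sketch of the same idea, assuming std::atomic, looks like this: the release store pairs with the acquire load to order the surrounding accesses, and on AArch64 these typically compile to the STLR and LDAR one-way-barrier instructions mentioned above.

    #include <atomic>
    #include <thread>

    std::atomic<int>  data{0};
    std::atomic<bool> ready{false};

    void producer() {
        data.store(42, std::memory_order_relaxed);
        ready.store(true, std::memory_order_release);    // store-release: typically STLR on AArch64
    }

    int consumer_value() {
        while (!ready.load(std::memory_order_acquire))   // load-acquire: typically LDAR on AArch64
            ;                                            // spin until the flag becomes visible
        return data.load(std::memory_order_relaxed);     // guaranteed to observe 42
    }

    int main() {
        std::thread t(producer);
        int v = consumer_value();
        t.join();
        return v == 42 ? 0 : 1;
    }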
ARM64 ISA and Third-Party Systems
Let's move on to the test systems that we used during our development as well as for CI/CD. I'm going to provide a quick timeline with respect to the ARM ISA. ARM64 started at ARMv8, and for that we have a system called the eMAG. eMAG is an Ampere Computing product; it's also known as AppliedMicro's X-Gene 3, Skylark. Next in the ISA timeline comes ARMv8.1. With respect to v8.1, we have the ThunderX2 systems. For those that do not know anything about ThunderX2 systems, let me assure you that it's quite a fascinating system: it has 256 hardware threads. ThunderX2 is a Cavium product; Cavium is now owned by Marvell Technology. Next comes ARMv8.2, where we have our very own Surface Pro X systems. As many of you may know, Microsoft and Qualcomm worked together to bring us the SQ1 and SQ2 processors. Right now, many of the systems that are based on v8.3 architectures are either in development or have been announced and will be available soon.
What Is an OpenJDK Port?
Let's jump into our port. I'll provide background, a few nuances, as well as the timeline. What is an OpenJDK port? Let's start there. Whenever you have a newer platform, especially a newer architecture or a new OS offering, we have to make sure that Java is available on this new platform. In order to be able to run Java applications on this new platform, we need to make sure that we have something called the Java Runtime Environment available. Similarly, in order to be able to develop on this new platform, we need to make sure that the JDK, the Java Development Kit, is available on this new platform. Let's look at a JRE first. A Java Runtime Environment consists of your virtual machine as well as your class libraries. The virtual machine itself consists of the runtime as well as the execution engine. The class libraries consist of the UI toolkit as well as base, lang, util, and so on. A JDK is a superset of a JRE; it additionally provides tools and utilities for a Java developer, including for debugging.
Overview of HotSpot
Let's look at OpenJDK HotSpot, the virtual machine. I'm going to provide an overview from the point of view of the repo itself, the source repo in OpenJDK. What is very interesting is the way the code is organized within the HotSpot directory. Most of the code, as you may know, is neither OS specific nor architecture specific. For example, your code for memory management, your JIT compilers, your metadata, all of that resides in a shared folder. The rest is organized as follows. We have the CPU folder for anything that's architecture specific. For example, for us, we were looking at the AArch64 directory.
One thing I wanted to highlight here, and also give a shout out to, is the Red Hat OpenJDK Community. The Red Hat team was the first one to port OpenJDK to ARM64. They did that for Linux. We were lucky to be able to follow in their footsteps. It was easier for us because they had already paved the way. Big shout out to Red Hat for all that work. Similarly, there is the OS directory that has OS-specific code; you'll see Windows and Linux in there. This time, I want to give a shout out to Oracle because of the amazing directory setup, the modular organization of the code base. Finally comes the OS-and-architecture-specific code, and that resides in the OS_CPU directory. Most of the changes that we made were, of course, in the Windows and AArch64 directories.
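To make that layout concrete, the relevant directories in the OpenJDK source tree look roughly like this (the windows_aarch64 entry is where most of our OS-and-architecture-specific changes live):

    src/hotspot/
        share/                     shared, platform-independent code: GC, JIT compilers, metadata, runtime
        cpu/aarch64/               architecture-specific code
        os/windows/                OS-specific code, alongside os/linux and others
        os_cpu/windows_aarch64/    OS-and-architecture-specific code, alongside os_cpu/linux_aarch64 and others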
What Is a Runtime?
Let's quickly dive into the runtime. There's one goal for the runtime, and that is turning bytecode into native code. Many of you may know the runtime as the interpreter, but the runtime also performs various other functions such as class loading, synchronization, and thread management. Our changes with respect to the runtime were in JVM construction and destruction. We wanted to make sure that it understood the structured exception handling that is needed on Windows.
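HotSpot's actual implementation has more moving parts, but as a rough illustration of what Windows structured exception handling looks like at this level, a vectored exception handler registered during VM construction lets the runtime intercept hardware faults, for example an access violation used for an implicit null check, before default handling tears the process down. A hypothetical minimal sketch:

    #include <windows.h>
    #include <cstdio>

    // Hypothetical handler: a real VM decides whether the faulting address belongs
    // to compiled Java code and, if so, redirects execution to a stub that raises
    // an exception instead of crashing the process.
    static LONG WINAPI vm_exception_filter(PEXCEPTION_POINTERS info) {
        if (info->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
            std::printf("access violation at %p\n",
                        info->ExceptionRecord->ExceptionAddress);
            // A real handler might patch the thread context here and return
            // EXCEPTION_CONTINUE_EXECUTION.
        }
        return EXCEPTION_CONTINUE_SEARCH;   // otherwise let the next handler look at it
    }

    int main() {
        // Registered first in the chain as part of VM construction.
        AddVectoredExceptionHandler(1 /* call first */, vm_exception_filter);
        // ... VM runs here ...
        return 0;
    }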
What Is An Execution Engine?
Next, let's look at the execution engine. When we talk about the execution engine, we mean your JIT compilers and your memory management units. HotSpot has two JIT compilers: C1, also known as the client compiler, and C2, known as the server compiler. HotSpot provides tiered compilation, so you move through various stages, gathering profile-guided information, until finally you reach C2. Many of our changes were in the execution engine, to enable ARM64-specific, Windows-specific, and Windows-on-ARM-specific behavior.
Windows and AArch64 Specific Learnings
The first and foremost thing that I wanted to highlight is an OS-specific nuance: the way register R18 is handled. R18 is a platform register, which means that it's reserved; it has special meaning in user and kernel mode, so we had to treat it as reserved. If you remember the existing Linux and ARM64 work that was already in the HotSpot directory, this particular nuance is specific to Windows. We also found out that it's applicable to Mac as well, so we could take this learning into the macOS port. The next thing is the ability to invalidate the instruction cache. There is a method that we can use on Windows by invoking the process thread API; it's called FlushInstructionCache, right here.
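For the instruction-cache part, the Win32 call looks roughly like this; a JIT would invoke something along these lines right after writing new machine code into a code buffer and before executing it. The helper below is a simplified sketch, not the port's actual code:

    #include <windows.h>

    // Simplified sketch: after the JIT has written machine code into 'code'
    // of length 'size', make sure the CPU will not execute stale instructions.
    void flush_icache(void* code, size_t size) {
        // On AArch64 the instruction cache is not kept coherent with newly
        // written data, so this call is required before jumping to the code.
        FlushInstructionCache(GetCurrentProcess(), code, size);
    }

    int main() {
        static unsigned char buffer[64];   // stand-in for a JIT code buffer
        flush_icache(buffer, sizeof(buffer));
        return 0;
    }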
On the Windows and ARM64 platform, we could put in many optimizations for copies and byte swaps. The other thing we could do was identify features that are very CPU specific, such as the AES instructions and CRC32. This is a snippet of how we did that in the code base, right here.
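The slide snippet isn't reproduced here, but on Windows such feature checks can be expressed with IsProcessorFeaturePresent. A minimal sketch along those lines, using the real Win32 feature constants but otherwise simplified:

    #include <windows.h>
    #include <cstdio>

    int main() {
        // Query Windows for optional ARMv8 instruction-set extensions.
        bool has_crypto = IsProcessorFeaturePresent(PF_ARM_V8_CRYPTO_INSTRUCTIONS_AVAILABLE) != 0;
        bool has_crc32  = IsProcessorFeaturePresent(PF_ARM_V8_CRC32_INSTRUCTIONS_AVAILABLE) != 0;

        // A JIT would use these answers to decide whether to emit AES and CRC32
        // instructions or to fall back to portable code.
        std::printf("AES/crypto: %s, CRC32: %s\n",
                    has_crypto ? "yes" : "no",
                    has_crc32  ? "yes" : "no");
        return 0;
    }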
Windows and MSVC Specific Learnings
Moving on to Windows and MSVC. There are a lot of intrinsics that your static compiler will offer. We worked with the MSVC team to learn more about those intrinsics. I already talked about the read-write barriers. We also incorporated a few built-ins, for example __nop() here. As you can see, we check for the compiler, and then we ask it to use the MSVC intrinsic for __nop(). Finally, and most intriguing of all the changes, was the LP64 versus LLP64 change. Let me take some time to explain what that means. Basically, we had to change code that assumed 64-bit longs and pointers to use 64-bit long-long ints and pointer-sized types, because that's what the Windows platform needs. Within the LP64 and LLP64 models, if I had to do a table, it would look something like this. As you can see, long on LP64 is 64-bit, whereas long on LLP64 is only 32-bit. Long-long is 64 bits on both LP64 and LLP64. Why does it matter to us? Because, again, if you remember, Red Hat already had a Linux port for ARM64, and that, of course, follows the LP64 model. Because we introduced Windows to ARM64, and Windows follows the LLP64 model, we encountered a lot of issues at the very beginning. We had to make sure that we were updating all the pointers and ints accordingly.
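To make the data-model difference concrete, here is a small standalone check, a sketch rather than anything from the port itself. On Linux AArch64 (LP64) the bare long comes out as 8 bytes, while on Windows ARM64 (LLP64) it is only 4, which is exactly why code that assumed a 64-bit long had to be updated:

    #include <cstdint>
    #include <cstdio>

    int main() {
        // LLP64 (64-bit Windows): int and long are 32-bit; long long and pointers are 64-bit.
        // LP64  (64-bit Linux):   long, long long, and pointers are all 64-bit.
        std::printf("sizeof(long)      = %zu\n", sizeof(long));        // 4 on LLP64, 8 on LP64
        std::printf("sizeof(long long) = %zu\n", sizeof(long long));   // 8 on both
        std::printf("sizeof(void*)     = %zu\n", sizeof(void*));       // 8 on both

        // Portable code therefore uses fixed-width and pointer-sized types such as
        // int64_t and intptr_t rather than assuming that a bare 'long' can hold a pointer.
        static_assert(sizeof(int64_t) == 8, "int64_t is 64-bit on every data model");
        static_assert(sizeof(intptr_t) == sizeof(void*), "intptr_t matches pointer width");
        return 0;
    }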
Meet Us and Our Port - The Trifecta
Let's jump to the timeline. Here is where you get to meet the team. First comes me. Back in February, I started working on this port part-time, just trying to investigate the amount of work that might be needed. I started fiddling with an eMAG system that had Windows on it. I was able to get the Java version string printed, and right after that, there was a core dump. After that, Ludovic joined the team. At the same time, we started interacting with the MSVC team and trying to learn about the intrinsics. Ludovic also helped with bringing up JTReg testing, while I was trying to get bits and pieces of the benchmarks working, especially JMH, the Java Microbenchmark Harness, which I'll come back to when I talk about testing and benchmarking.
Right around mid to end of April, we got our ThunderX2 and Surface Pro X systems. That helped us a lot with doing some scaling tests, because remember, I mentioned the ThunderX2 systems having 256 hardware threads. There were some problems that this concurrency beast would be able to highlight that the eMAGs were not able to, and the Surface Pro Xs could highlight some other issues. At the beginning of May, we were able to get some of the SPEC SERT benchmarks working. SPEC SERT is a suite of benchmarks. This suite not only provides scores with respect to operations per second, it also provides the power consumption for that server-class system. It's very helpful for us to be able to measure the power gains, especially given that we are on an ARM64 system. This was one of the benchmarks that we were targeting for our port.
By May, we had C1 and parallel GC enabled. We went with parallel GC because we encountered a bug in G1. Around the same time we realized that we needed to make some benchmark modifications as well as enable more than 64 cores. When we tried this benchmark on the ThunderX2 system, this is how the CPU utilization looked. Like I said, we have 256 hardware threads, but we could only fire up 64 of them. By mid-May, thanks to Ludovic, we had C2 fully functional, and I made some changes to SPEC SERT to make sure that all the JNI code as well as the more-than-64-core identification was complete. We started full-scale testing. By the end of all the changes, this is how the CPU utilization looked while running the SPEC SERT benchmark, right here.
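The 64-core ceiling comes from Windows processor groups: by default a process is confined to one group of at most 64 logical processors, and work has to be spread across groups explicitly. A hedged sketch of the kind of query involved, not the actual benchmark or JVM code:

    #include <windows.h>
    #include <cstdio>

    int main() {
        // Windows partitions logical processors into groups of at most 64.
        WORD  groups = GetActiveProcessorGroupCount();
        DWORD total  = GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
        std::printf("processor groups: %u, logical processors: %lu\n",
                    (unsigned)groups, (unsigned long)total);

        for (WORD g = 0; g < groups; ++g) {
            std::printf("  group %u: %lu logical processors\n",
                        (unsigned)g, (unsigned long)GetActiveProcessorCount(g));
        }
        // To use more than 64 hardware threads, worker threads must be assigned
        // to specific groups, for example with SetThreadGroupAffinity.
        return 0;
    }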
By June, Bernhard joined our team. We started a dialogue with the Red Hat team. At that point, we not only wanted to socialize our patches, we also wanted to understand what they thought about them and what would be a complete set of test systems and combinations. If you remember the G1 GC bug that I mentioned, Bernhard actually helped fix that. The bug itself was not in G1 GC but in MSVC, so we introduced a workaround in our port. By the end of June, under the guidance of Red Hat, we surfaced our patches. By now we had divided our port into incremental patches that could be applied on tip. When we started the port, we were at JDK 15, and now we were at JDK 16; this way we could easily release them. We also released our early access binaries on our GitHub repo, and we started testing Swing and Java 2D. That's when we found the bug with respect to the Surface Pro X. The bug was very simple, and thanks to the good relationship that we had with the MSVC team, we had a quick fix for it.
Let's look at the JEP process. JEP is the JDK Enhancement Proposal, and it's a very important step for contributions of this magnitude. We started our JEP process by drafting the JEP in July. The JEP quickly became a candidate, thanks to the great collaboration offered to us by Oracle; the reviews and everything were done very quickly. At the same time, the SPEC SERT changes were also approved by the SPEC committee. By August, all three of us were already committers in the OpenJDK ARM64 project. By the end of August, we had enabled all the garbage collectors, and we did scaling tests on ThunderX2 with respect to all three: G1 GC, ZGC, and Shenandoah. By the end of September, our port was targeted for JDK 16. Now we are already integrated, and we are in OpenJDK.
Testing Setup and CI
Let's jump to testing and benchmarking. We divided our changes into incremental patches, and we wanted to make sure that those patches applied cleanly to tip. In order to be able to provide a complete set of test results, we wanted to make sure that other platforms did not see any regression. We tested on Linux ARM64, of course the Windows ARM64 platform, and Windows and Linux x86-64 as well. We enabled CI for JTReg tests, which is the regression test harness. This is a look at how we enabled each phase of it; we're at the last phase right now, the Adopt QA tests that we still have to enable.
Here's a quick matrix of the SPECjbb 2015 benchmark and how we tested it on different systems, right here. Here's a quick overview of our workload status and the benchmarking matrix. This is only a subset; to find the entire set, please check out our GitHub page. You can see we've enabled a few of those. Some of them had problems with some very architecture-specific code.
Next Steps
Our journey has just begun. We not only want to keep our port up to date, we also want to work very closely with the Windows memory management team at Microsoft. We already see certain API-level changes that we can bring to certain garbage collectors that can benefit from them. We will continue working with the MSVC team and make sure that OpenJDK benefits from any of the optimizations there. I briefly mentioned the macOS port; that's something else that we are contributing to. You can find the RFE right here. We've brought our learnings not only with respect to the register R18 that I mentioned, but also to CPU feature detection. We also recently added JVMCI and AOT support to our code. We are in the process of backporting to the JDK 11 update release. We'll provide this backport for both Windows and macOS.
Demo
I have a demo right here. This is a Surface Pro X; you can see the processors there. I'm now going to modify the code, which says, "Howdy from Spring Boot using Java running on Windows and ARM64." I close that. Now I'll go look at the Java version. Yes, it's showing Windows and ARM64. That's great. Awesome. I now build and run the package. That looks like a good start. I open a browser and look at localhost, port 8080. There you see, "Howdy from Spring Boot using Java running on Windows and ARM64." That's it for the demo.
Resources
I have provided some links here. Please go ahead and check these out. Our PR that got merged is right here, along with a few announcements with respect to the Windows and macOS ports that we have helped with. Go ahead, download the binaries, and take them for a spin on your Surface Pro Xs.