HSA Foundation Targeting Heterogeneous GPU-CPU Execution for Java Virtual Machines by 2015
Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms where large blocks of data are processed in parallel. But despite the popularity of OpenCL, moving work between the CPU and GPU is still difficult in practice.
The HSA Foundation was founded last year by AMD, Qualcomm, ARM Holdings, Samsung and others to tackle this problem at both the hardware and the software level.
From a hardware point of view there are a number of things that a combined CPU-GPU system must have in order to act as a heterogeneous compute layer. These are neatly summarized by Joel Hruska writing for ExtremeTech:
The CPU and GPU must share a common set of page table entries, they must allow both CPU and GPU to page fault (and use the same address space), the system must be able to queue commands for execution on the GPU without requiring the OS kernel to perform the task, the GPU must be capable of switching tasks independently, and both devices must be capable of addressing the same coherent block of memory.
There is also a software problem. GPUs vary, so you need some common way of addressing them. To deal with this, the HSA Foundation is working on a new intermediate language called HSAIL. HSAIL is conceptually similar to Java's bytecode or .NET's Intermediate Language: it allows you to write code in your language of choice (C++, Java and other languages), compile that code to HSAIL, and run it on whatever GPU is available in the system. As with Java, there is likely to be a small performance overhead for using an intermediate language, but there are some major advantages. Gary Frost, a Software Fellow at AMD who is working on Sumatra and related technology, told InfoQ:
By designing the JVM to generate HSAIL rather than vendor specific GPU ISA, the JVM compiler can concentrate on creating solid, portable HSAIL code. The vendor supplied ‘finalizer’ (the name used by HSA foundation for software which turns HSAIL into device specific ISA) can then happily apply ISA specific optimization techniques for the given target device at runtime. This approach also allows us to decouple the deployment cycle between the JVM and the device & finalizer. If for example a JVM capable of emitting HSAIL is made available and some time later a new HSA enabled accelerator device and finalizer is released, then the JVM can work with this new device, without requiring a JVM update.
For Java to be able to target HSAIL, the JVM itself needs to be capable of targeting a graphics card. That part of the problem is being worked on by Oracle, AMD and others through OpenJDK Project Sumatra, illustrated on the right.
Announced last summer, Sumatra makes use of Java 8's lambda expressions and new Stream API to enable both CPU and GPU computing. A Sumatra-enabled JVM will extend traditional JVM JIT capabilities to include HSAIL as a target. It will also be the first time that the JIT is responsible for generating ISA both for the host (the platform running the JVM) and for another device/accelerator which specializes in data-parallel execution.
Java 8's lambda-based stream APIs will already require developers to indicate which parts of a stream pipeline can be executed in parallel, by adding a stream.parallel() call to the pipeline. Java 8 needs this in order to dispatch such workloads across multiple threads.
At runtime, a Sumatra-enabled Java Virtual Machine may choose to dispatch these parallel constructs to HSA-enabled devices. The intention is to build on the work done for the new java.util.stream package in Java 8, which uses a fluent-style API. Frost told us:
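As a minimal sketch of what that looks like in plain Java 8, the pipeline below squares each element of an array and marks the work as parallelizable with parallel(). The class and method names are hypothetical, chosen only for illustration. On a standard JVM this runs across CPU threads; whether a Sumatra-enabled JVM would offload this particular pipeline to a GPU is an assumption, not something confirmed here.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration class; not part of Sumatra or the JDK.
class ParallelSquares {
    // Squares every element of the input array.
    // parallel() tells the runtime the pipeline may be split across workers.
    static int[] squares(int[] input) {
        return IntStream.range(0, input.length)
                .parallel()
                .map(i -> input[i] * input[i]) // reads shared data, writes nothing shared
                .toArray();                    // results keep index order
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        System.out.println(Arrays.toString(squares(in))); // prints [1, 4, 9, 16]
    }
}
```

The lambda only reads from the input array and returns a value, so each index can be processed independently of the others.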
That stream.parallel() call is enough for us; it's the same contract. If the user specifies that the stream can be handled in parallel then they are extending a trust to us. They are telling us that it is OK to run in parallel (on the GPU or across multiple CPU threads) and that there is no race condition in the code.
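The contract Frost describes can be illustrated with two versions of the same computation in standard Java 8. This is a sketch with hypothetical class and method names: the first method keeps no shared mutable state, so it is safe to run in parallel; the second updates a shared variable from many threads without synchronization, which is exactly the kind of race the caller is promising not to have.

```java
import java.util.stream.IntStream;

// Hypothetical illustration class; not part of Sumatra or the JDK.
class ParallelContract {
    // Honors the contract: each element is mapped independently and
    // sum() combines partial results as a reduction.
    static long sumOfSquares(int n) {
        return IntStream.rangeClosed(1, n)
                .parallel()
                .mapToLong(i -> (long) i * i)
                .sum();
    }

    // Breaks the contract: every worker does a read-modify-write on the
    // same unsynchronized cell, so updates can be lost and the result
    // may differ from run to run.
    static long racySumOfSquares(int n) {
        long[] total = {0};
        IntStream.rangeClosed(1, n)
                .parallel()
                .forEach(i -> total[0] += (long) i * i);
        return total[0];
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000));     // prints 333833500
        System.out.println(racySumOfSquares(1000)); // not guaranteed to match
    }
}
```

Both methods compile and both call parallel(); nothing in the API distinguishes them. That is why the call amounts to a statement of trust from the developer rather than something the runtime checks.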
If possible the API will be left unchanged, but the empty parentheses in the stream.parallel() call could be used to provide further hints if needed, such as the number of cores (GPU or CPU) you want to use for the operation.
The feasibility of this approach has been demonstrated by Sumatra's forerunner Aparapi, a runtime capable of converting Java bytecode to OpenCL. Developers have taken Aparapi and successfully used it with Scala, according to Frost.
Aparapi is being updated to compile to HSAIL. Frost stated that the implementation
...uses lambdas, and will borrow a lot of the APIs that have come in in Java 8. Though of course they won't be the official OpenJDK APIs; we'll recreate them from our own side, to allow people who want to get some of the benefits of HSA from Java on top of Java 8 via Aparapi.
In addition, the HSA Foundation already has a number of tools available on GitHub, including an HSAIL Instruction Set Simulator, and tools for parsing, assembling, and disassembling HSAIL.