# TornadoVM: Accelerating Java with GPUs and FPGAs

### Key Takeaways

• TornadoVM is a programming and execution framework for offloading and running JVM applications on heterogeneous hardware (multi-core CPUs, GPUs and FPGAs)
• TornadoVM extends the Graal JIT compiler with a new backend for OpenCL
• Applications written for TornadoVM are single-source: the same code is used to express the host code and the accelerated code

Last March, I gave a talk at QCon London about TornadoVM, in which I introduced the project and explained how it works. In this article, I expand on that talk and show in more detail how developers can benefit from TornadoVM by automatically running Java on heterogeneous hardware.

Firstly, I will provide a general overview of the TornadoVM project and architecture. Secondly, I will explain different parts of TornadoVM with a practical example.

## Why do we need something like TornadoVM?

There is no single computer architecture that is best for executing all types of workloads efficiently. This has led to the proliferation of heterogeneous hardware in recent years, which means that every system we program is likely to have a mix of computing elements.

Each of these elements has different hardware characteristics. Hardware heterogeneity enables programmers to improve the performance of their applications while decreasing energy consumption.

These new heterogeneous devices for computing include multi-core CPUs, Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). This diversity is great, but we need a way to efficiently program these new devices.

The two most popular heterogeneous programming languages, CUDA and OpenCL, are a prime example. However, they expose several low-level features in the API, which makes them difficult for non-expert users to use. As an example, I highlight the following quotation from the OpenCL 3.0 standard:

The target of OpenCL is expert programmers wanting to write portable yet efficient code. [...] Therefore, OpenCL provides a low-level hardware abstraction plus a framework to support programming, and many details of the underlying hardware are exposed.

The previous statement also applies to CUDA and similar parallel programming models. Instead of using low-level programming languages, in industry and academia developers tend to use higher-level, object-oriented programming languages, typically executed on managed runtime environments, such as Java, R, Python, and JavaScript. Although many programmers might expect that such programming languages would have already been adapted for transparent execution on heterogeneous hardware, the reality is that their support is either very limited or absent.

In this article, we explore TornadoVM, an alternative to low-level parallel programming languages for heterogeneous computing. We show how developers can make use of multi-core CPUs and GPUs without any required knowledge about the parallel computing architectures or parallel programming models.

In a nutshell, TornadoVM is a parallel programming framework for JVM languages that can transparently and dynamically offload Java bytecodes into OpenCL, and execute the generated code on heterogeneous hardware. Additionally, TornadoVM integrates an optimizing runtime that can reuse device buffers and save data transfers across devices, and a novel dynamic application reconfiguration component to perform live task migration across computing devices.

## Let's get started!

The following figure shows a high-level overview of the TornadoVM project. As we can see, TornadoVM has a layered, microkernel-style software architecture, in which the core component is the TornadoVM execution engine. At the top level, TornadoVM exposes an API to developers. This is because TornadoVM currently does not detect parallelism (auto-parallelization). Instead, it exploits parallelism that the developer has identified. Therefore, TornadoVM needs a way to identify which methods or functions are candidates for running on GPUs and FPGAs.

Additionally, TornadoVM contains a core runtime, which is divided into several components: a) the data flow optimizer with a new bytecode generator; b) a small bytecode interpreter to run the new bytecodes; and c) the JIT compiler and memory management. In this article, I will focus on the API, the runtime, and a general overview of the JIT compiler.

Finally, as the previous figure points out, TornadoVM currently supports Java 8, using the latest JDK (u242) and JVMCI, and OpenJDK 11 via GraalVM 19.3.0. TornadoVM is also compatible with OpenCL 1.2, which allows it to run on a broad set of devices such as GPUs (AMD and NVIDIA), FPGAs (Xilinx and Intel), integrated GPUs (such as ARM Mali and Intel HD Graphics) as well as multi-core CPUs.

Let's get into the details with a practical example. Below, I show how to program and run matrix multiplication with TornadoVM on multi-core CPUs, GPUs and integrated GPUs. Matrix multiplication is a simple example that illustrates different concepts in TornadoVM, and it constitutes the core of many machine learning and deep learning applications.

Note: although TornadoVM is programmed in Java, compute kernels can be exposed to other JVM languages via the Polyglot programming framework from GraalVM (Truffle).

The following code snippet shows the matrix multiplication programmed in Java:

    class Compute {
        public static void matrixMultiplication(final float[] A, final float[] B, final float[] C, final int size) {
            for (int i = 0; i < size; i++) {
                for (int j = 0; j < size; j++) {
                    float sum = 0.0f;
                    for (int k = 0; k < size; k++)
                        sum += A[(i * size) + k] * B[(k * size) + j];
                    C[(i * size) + j] = sum;
                }
            }
        }
    }


The code snippet shows the classic, canonical matrix multiplication example for GPU computing. To accelerate this code snippet with TornadoVM, we first have to annotate the loops that can be parallelized. In this case, we can fully parallelize the two outermost loops, in which there are no dependencies between iterations. We annotate the code by using the TornadoVM @Parallel annotation as follows:

    class Compute {
        public static void matrixMultiplication(final float[] A, final float[] B, final float[] C, final int size) {
            for (@Parallel int i = 0; i < size; i++) {
                for (@Parallel int j = 0; j < size; j++) {
                    float sum = 0.0f;
                    for (int k = 0; k < size; k++)
                        sum += A[(i * size) + k] * B[(k * size) + j];
                    C[(i * size) + j] = sum;
                }
            }
        }
    }


The @Parallel annotation is used as a hint by the TornadoVM JIT compiler (which transforms Java bytecode into OpenCL).

The TornadoVM JIT compiler does not force parallelization. Instead, it checks whether the annotated loops can be parallelized, and it replaces the for-loops with the equivalent parallel indexing in OpenCL (get_global_id(dimension)). If the for-loops cannot be parallelized, TornadoVM bails out and executes the sequential code.
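To make the idea of parallel indexing concrete, here is a minimal sketch in plain Java that emulates the two-dimensional index space on a CPU with parallel streams. It is not how TornadoVM works internally, and the class and method names (ParallelIndexSketch, indexSpace) are made up for this illustration. The point is only that each (i, j) pair can be computed independently of the others, which is the property the @Parallel hint relies on.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration class; not part of TornadoVM.
class ParallelIndexSketch {

    // Sequential reference: same loop nest as Compute.matrixMultiplication.
    static void sequential(float[] a, float[] b, float[] c, int size) {
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a[(i * size) + k] * b[(k * size) + j];
                }
                c[(i * size) + j] = sum;
            }
        }
    }

    // Each (i, j) pair becomes one independent work-item, in the spirit of an
    // OpenCL thread identified by get_global_id(1) and get_global_id(0).
    static void indexSpace(float[] a, float[] b, float[] c, int size) {
        IntStream.range(0, size * size).parallel().forEach(id -> {
            int i = id / size;                 // role of the outer loop index
            int j = id % size;                 // role of the inner loop index
            float sum = 0.0f;
            for (int k = 0; k < size; k++) {   // the reduction loop stays sequential
                sum += a[(i * size) + k] * b[(k * size) + j];
            }
            c[(i * size) + j] = sum;           // every work-item writes a distinct element
        });
    }

    public static void main(String[] args) {
        int size = 64;
        float[] a = new float[size * size];
        float[] b = new float[size * size];
        for (int n = 0; n < a.length; n++) {
            a[n] = n % 7;
            b[n] = n % 5;
        }
        float[] seq = new float[size * size];
        float[] par = new float[size * size];
        sequential(a, b, seq, size);
        indexSpace(a, b, par, size);
        System.out.println("results match: " + Arrays.equals(seq, par));
    }
}
```

Because no work-item reads a value written by another, the order in which the (i, j) pairs run does not affect the result. A loop with a dependency between iterations would not have this property.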

Additionally, developers must identify which Java methods to accelerate. To do so, TornadoVM exposes a lightweight task-based API that sets the list of methods to be accelerated, where each method corresponds to a task. Developers can create a group of tasks via a task-schedule. The following code snippet shows how to create a task-schedule for the matrix-multiplication example:

    TaskSchedule t = new TaskSchedule("s0")
        .task("t0", Compute::matrixMultiplication, matrixA, matrixB, result, size)
        .streamOut(result);


We create a task-schedule object (t). In its constructor, we pass a name for the task-schedule (here "s0"). It could be any name. This name is useful for changing the device on which all tasks are going to be executed. Then we define a set of tasks. In this example, we only have one, but it could be any number of tasks.

The parameters for each task are as follows: a name (in this case "t0") and a reference to the method we want to accelerate (in this case it points to the method matrixMultiplication from the Java class Compute). The rest of the parameters correspond to the actual set of parameters for the method.

Finally, we indicate which variables, or arrays, we want to synchronize with the host (the CPU). This is needed because GPUs and FPGAs usually do not share the same memory as the CPU. Therefore, the TornadoVM runtime will allocate space for all the variables on the target device, and it will perform a data transfer from the host (CPU) to the device (e.g., a GPU). To obtain the result on the host, we then synchronize the list of variables through the TornadoVM API call streamOut.

So far we have declared our tasks, and we have placed them in the code where the parallelization can be performed. To execute the application with TornadoVM, we need to call the execute() method on the TaskSchedule object.

This is a blocking call that will create all OpenCL buffers, create a graph of execution, compile all tasks from Java bytecode to OpenCL, and finally execute the generated OpenCL program on the target device.  Additionally, TornadoVM can combine many methods to be compiled together in a single compilation unit, and be executed on the same device (e.g., on the same GPU). This creates an opportunity for optimizing data transfers between host and heterogeneous devices since they usually do not share the memory with the primary host (unless the device is an integrated GPU, such as AMD APU, ARM Mali or Intel HD Graphics GPUs).

Note that we do not set any device-specific information in the source code, and we share the same code for running on multi-core CPU, GPUs and FPGAs. The TornadoVM runtime and JIT compiler will automatically optimize the code depending on the architecture.

So, let’s run our code example. I will first show you how to set up the TornadoVM environment. There is a repository on GitHub with all these examples.

### Running Matrix Multiplication: Setting up TornadoVM

We are going to run TornadoVM using Graal 19.3.0 as a JDK. Note that we update the Graal version frequently. The integration of Graal 20.x into TornadoVM is scheduled for the end of this year. To execute the code, we assume that OpenCL is installed. See all prerequisites here.

    $ mkdir -p TornadoVM
    $ cd TornadoVM
    $ wget https://github.com/graalvm/graalvm-ce-builds/releases/download/vm-19.3.0/graalvm-ce-java11-linux-amd64-19.3.0.tar.gz
    $ tar -xf graalvm-ce-java11-linux-amd64-19.3.0.tar.gz
    $ export JAVA_HOME=$PWD/graalvm-ce-java11-19.3.0
    $ git clone --depth 1 https://github.com/beehive-lab/TornadoVM
    $ cd TornadoVM
    $ export PATH=$PWD/bin/bin:$PATH
    $ export TORNADO_SDK=$PWD/bin/sdk
    $ export CMAKE_ROOT=<SET YOUR PATH TO CMAKE ROOT>
    $ make graal-jdk-11
    $ export TORNADO_ROOT=$PWD

Now we download the repository with the examples.

    $ git clone https://github.com/jjfumero/qconlondon2020-tornadovm
    $ cd qconlondon2020-tornadovm/
    $ export JAVA_HOME=/path/to/graalvm-ce-java11-19.3.0
    $ export PATH="${PATH}:${TORNADO_ROOT}/bin/bin/" ## Defined previously
    $ export TORNADO_SDK=${TORNADO_ROOT}/bin/sdk
    $ export CLASSPATH=target/tornado-1.0-SNAPSHOT.jar
    $ mvn clean install

Now we have everything ready to execute the examples. We can start by exploring which devices are available and visible from TornadoVM.

    $ tornado --devices
    Total number of devices  : 3

    NVIDIA CUDA -- GeForce GTX 1050
        Global Memory Size: 3.9 GB
        Local Memory Size: 48.0 KB
        Workgroup Dimensions: 3
        Max WorkGroup Configuration: [1024, 1024, 64]
        Device OpenCL C version: OpenCL C 1.2

    Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
        Global Memory Size: 31.0 GB
        Local Memory Size: 32.0 KB
        Workgroup Dimensions: 3
        Max WorkGroup Configuration: [8192, 8192, 8192]
        Device OpenCL C version: OpenCL C 1.2

    Intel(R) OpenCL HD Graphics -- Intel(R) Gen9 HD Graphics NEO
        Global Memory Size: 24.8 GB
        Local Memory Size: 64.0 KB
        Workgroup Dimensions: 3
        Max WorkGroup Configuration: [256, 256, 256]
        Device OpenCL C version: OpenCL C 2.0


In my case, I have three devices available on my laptop: an NVIDIA GPU, an Intel multi-core CPU and an Intel HD Graphics (integrated GPU). TornadoVM selects device 0 by default. However, we can change the device by associating tasks to devices. Let’s start with the default configuration.

    $ tornado qconlondon.MatrixMultiplication 512 tornado

This program executes the Matrix Multiplication method 100 times and reports the total time per iteration. This method is a simple example to demonstrate what’s happening. Later on we’ll do a proper performance comparison using JMH.

    $ tornado qconlondon.MatrixMultiplication 512 tornado
    Computing MxM of 512x512
    Total time: 77568790 (ns), 0.0776 (s)
    Total time: 3133182 (ns), 0.0031 (s)
    Total time: 3126146 (ns), 0.0031 (s)
    …


Note that the first iteration takes longer than the rest of the iterations - this is due to JIT compilation warmup and will disappear as an effect when we’re using JMH.

The first time we execute a task-schedule, TornadoVM invokes the OpenCL JIT compiler to optimize and generate OpenCL C code from Java bytecode. Then, once the code is generated, TornadoVM installs the generated code in a code-cache, and the binaries can be reused if the same task is executed again at any point during runtime. To ensure that TornadoVM is running on the GPU (device 0), we can enable debug-information as follows:

    $ tornado --debug qconlondon.MatrixMultiplication 512 tornado
    Computing MxM of 512x512
    task info: s0.t0
        platform          : NVIDIA CUDA
        device            : GeForce GTX 1050 CL_DEVICE_TYPE_GPU (available)
        dims              : 2
        global work offset: [0, 0]
        global work size  : [512, 512]
        local work size   : [32, 32, 1]

That’s great, TornadoVM is running our Java code for matrix multiplication on an NVIDIA GTX 1050. For reference, let’s also run the sequential application. This is done without invoking the TornadoVM JIT compiler to accelerate the code. We pass an extra parameter to our program to indicate this:

    $ tornado qconlondon.MatrixMultiplication 512 sequential
    Computing MxM of 512x512
    Total time: 259398036 (ns), 0.2594 (s)
    Total time: 247857535 (ns), 0.2479 (s)
    ...

What we see is that, even when the time spent in the TornadoVM JIT compiler is included, the first iteration is 3.3x faster than the sequential one. From the second iteration onwards, we get roughly an 80x speedup over the Java sequential code. Nevertheless, take this number with caution. In the next section, we introduce a performance comparison using Java JMH.
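As a quick sanity check, the speedups quoted here can be reproduced from the timings printed by the two runs. The following sketch simply divides the sequential time by the TornadoVM time for the first and second iterations; the class name SpeedupSketch is hypothetical and the numbers are copied from the output shown earlier.

```java
// Hypothetical helper for checking the quoted speedups; not part of the example repository.
class SpeedupSketch {

    // Speedup = baseline time / accelerated time (both in nanoseconds).
    static double speedup(long baselineNs, long acceleratedNs) {
        return (double) baselineNs / acceleratedNs;
    }

    public static void main(String[] args) {
        // Timings reported by the sequential run.
        long sequentialFirst = 259_398_036L;
        long sequentialSecond = 247_857_535L;
        // Timings reported by the TornadoVM run on the GTX 1050.
        long tornadoFirst = 77_568_790L;   // includes JIT compilation
        long tornadoSecond = 3_133_182L;   // code already in the code-cache

        System.out.println("first iteration speedup:  " + speedup(sequentialFirst, tornadoFirst));
        System.out.println("second iteration speedup: " + speedup(sequentialSecond, tornadoSecond));
    }
}
```

The first ratio is about 3.3 and the second about 79, which the text rounds to 80x.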

### How to change the device?

We can change the device on which the application runs from the command line. For example, to run on the Intel Integrated Graphics, we can execute with the following options:

    $ tornado -Ds0.t0.device=0:2 --debug qconlondon.MatrixMultiplication 512 tornado
    Computing MxM of 512x512
    task info: s0.t0
        platform          : Intel(R) OpenCL HD Graphics
        device            : Intel(R) Gen9 HD Graphics NEO CL_DEVICE_TYPE_GPU (available)
        dims              : 2
        global work offset: [0, 0]
        global work size  : [512, 512]
        local work size   : [16, 16, 1]

The syntax is as follows:

    -D<taskScheduleName>.<taskName>.device=0:<deviceIndex>

### Performance of TornadoVM for MxM running on Dell XPS 15 laptop

With these options, we can easily start getting some performance results. The following figure shows the speedup of TornadoVM on different OpenCL devices over the Java sequential implementation (the higher, the better). The speedup reported corresponds to the average value using the Java JMH framework for benchmarking. Note that the y-axis uses a logarithmic scale due to the high speedups. All benchmarks using JMH are included in the same repository with the examples.

As we can see, running on a multi-core CPU with TornadoVM can achieve a speedup of up to 3.6x compared to Java on HotSpot. When running on GPUs, we can achieve up to 39x and 270x compared to Java for Intel HD Graphics and the NVIDIA 1050 respectively.

## Execution Model and Compilation

So far, we have briefly explained the TornadoVM API and how to run applications with TornadoVM at the user level. Let’s now go a bit deeper and see how TornadoVM executes code on the target device.

The following figure shows a representation of the execution flow between the JVM and TornadoVM. The definition of the task-schedule and the invocation of the execute method from the TornadoVM API run on a single Java thread (e.g., the master thread). The execute method is a blocking call: when it returns, the execution on the parallel device is guaranteed to have finished.

When the execute method is invoked, TornadoVM first builds a data flow graph that represents how data are communicated across different tasks within a task-schedule. This graph is used to optimize data transfers. Then, TornadoVM generates new bytecodes (simple instructions to orchestrate the execution on the target devices, such as COPY_IN, LAUNCH, COPY_OUT, BARRIER, etc.).

When the code is launched the first time (via the LAUNCH bytecode), TornadoVM invokes the OpenCL JIT compiler and transforms the input Java bytecodes from each task (each Java method to be accelerated) into optimized OpenCL C code. TornadoVM specializes the OpenCL C code depending on the target device, which means that the code generated for a GPU is different from the code generated for CPUs and FPGAs. This is because OpenCL code is portable across devices, but its performance is not. Therefore, TornadoVM increases performance by specializing and applying different optimizations per device.

NOTE: The TornadoVM JIT compiler runs in a single thread, so there are concerns about potential exhaustion of compiler resources under heavy load, just as we see in HotSpot.

The final step of compilation is performed through an OpenCL driver invocation to compile from the optimized and specialized OpenCL C code to the target platform. For example, if the application is executed on NVIDIA GPUs, this step generates the corresponding PTX code.

Once the OpenCL code is generated and compiled, TornadoVM launches the application on the target device. To do so, TornadoVM deploys many threads for running the kernel. The number of threads to deploy depends on the input sizes of the application and the hardware characteristics. For instance, the matrix multiplication example we showed earlier is deployed on the GPU using a block of 512 by 512 threads, all derived from the single-threaded Java application that was programmed.

If the target device is a multi-core CPU, TornadoVM deploys as many threads as there are CPU cores available. Once the execution on the parallel device finishes, TornadoVM copies the results to the Java heap (to make them visible to the host side through the bytecode COPY_OUT), and finally it returns control to the master thread in the JVM.

We can query the bytecodes that TornadoVM generates for each application. For instance, the following code snippet shows a simplified output when running the matrix multiplication with debug information for TornadoVM’s bytecodes:

    $ tornado --printBytecodes qconlondon.MatrixMultiplication 512 tornado

    vm: COPY_IN [F@3e694b3f on NVIDIA -- GeForce GTX 1050
    vm: COPY_IN [F@397fbdb on NVIDIA -- GeForce GTX 1050
    vm: COPY_IN [F@33d512c1 on NVIDIA -- GeForce GTX 1050
    vm: LAUNCH task s0.t0-matrixMultiplication on NVIDIA -- GeForce GTX 1050
    vm: STREAM_OUT_BLOCKING [F@33d512c1 on NVIDIA -- GeForce GTX 1050


The matrix multiplication method we introduced earlier receives three parameters (matrices A, B and C). For each variable, TornadoVM performs a data transfer from the host to the device (COPY_IN). Then it runs the application by using the LAUNCH bytecode.

Just to recall, the first time the LAUNCH is executed, TornadoVM invokes the OpenCL JIT compiler, in which the code is specialized and optimized per computing device. Finally, TornadoVM performs a copy (STREAM_OUT_BLOCKING) from the device to the host to obtain the results.
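The bytecode sequence can be pictured with a toy model in plain Java. The sketch below is only an illustration under simplifying assumptions: device memory is modelled as copies of the host arrays, "compilation" is just an entry in a set, and all class and method names (BytecodeSketch, copyIn, launch, streamOut) are invented for this example rather than taken from TornadoVM. It shows the order of operations and why compilation happens only on the first LAUNCH.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: names and structure are hypothetical, not TornadoVM internals.
class BytecodeSketch {
    final List<String> trace = new ArrayList<>();
    private final Set<String> codeCache = new HashSet<>();                // tasks already "compiled"
    private final Map<float[], float[]> device = new IdentityHashMap<>(); // host array -> device copy

    // COPY_IN: transfer a host array to the (simulated) device.
    void copyIn(float[] host) {
        device.put(host, host.clone());
        trace.add("COPY_IN");
    }

    // LAUNCH: compile on first use, then run the kernel on the device copies.
    void launch(String task, float[] a, float[] b, float[] c, int size) {
        if (codeCache.add(task)) {
            trace.add("COMPILE " + task);   // happens only on the first launch
        }
        trace.add("LAUNCH " + task);
        float[] dA = device.get(a);
        float[] dB = device.get(b);
        float[] dC = device.get(c);
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += dA[(i * size) + k] * dB[(k * size) + j];
                }
                dC[(i * size) + j] = sum;
            }
        }
    }

    // STREAM_OUT_BLOCKING: copy the device data back into the host array.
    void streamOut(float[] host) {
        System.arraycopy(device.get(host), 0, host, 0, host.length);
        trace.add("STREAM_OUT_BLOCKING");
    }

    // Mirrors the bytecode sequence printed for s0.t0.
    void execute(float[] a, float[] b, float[] c, int size) {
        copyIn(a);
        copyIn(b);
        copyIn(c);
        launch("s0.t0", a, b, c, size);
        streamOut(c);
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {1, 0, 0, 1};   // identity matrix, so the result equals a
        float[] c = new float[4];
        BytecodeSketch vm = new BytecodeSketch();
        vm.execute(a, b, c, 2);
        vm.execute(a, b, c, 2);     // second run reuses the cached code
        vm.trace.forEach(System.out::println);
    }
}
```

In this model the host array c only changes when streamOut runs, which mirrors why the result has to be synchronized explicitly in the task-schedule.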

## Analyzing the OpenCL Generated Code

Let’s dig into the OpenCL kernel that TornadoVM generates. With TornadoVM, we can debug and check the generated kernel by using the --printKernel flag as follows:

    $ tornado --printKernel qconlondon.MatrixMultiplication 512 tornado

TornadoVM generates one kernel per task within a task-schedule. Additionally, it generates a kernel called lookupBufferAddress, which is executed during bootstrap of the VM. The reason behind this kernel is that TornadoVM only allocates one big buffer that acts as a heap on the target device. To do so, it needs a valid pointer that will be used as a base address on the target device, relative to which TornadoVM can perform data transfers. The lookupBufferAddress kernel returns this base pointer.

The second kernel corresponds to the OpenCL code for the Java method we accelerated. The following code snippet shows a simplification of the generated kernel with comments on the main points from the Java and OpenCL code. Note that the generated kernel might differ depending on the target architecture. Note also that TornadoVM generates OpenCL C code from a Static Single Assignment (SSA) representation, in which each variable is assigned exactly once. This is because TornadoVM extends the Graal IR, which works in an SSA representation (as does HotSpot’s mainstream JIT compiler, C2).

    __kernel void lookupBufferAddress(...parameters) {
        __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];
        _frame[0]  =  (ulong) _heap_base;
    }

    __kernel void matrixMultiplication(...parameters) {
        // Variables declaration …

        __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];
        // Access elements within the stack-frame
        ul_0  =  (ulong) _frame[6];   // base address of input matrix A
        ul_1  =  (ulong) _frame[7];   // base address of input matrix B
        ul_2  =  (ulong) _frame[8];   // base address of output matrix C
        i_3  =  get_global_id(1);     // Parallel OpenCL indexing (2nd dimension)
        i_4  =  i_3;
        for(;i_4 < 512;)  {
            i_5  =  get_global_id(0);   // Parallel OpenCL indexing (1st dimension)
            i_6  =  i_5;
            for(;i_6 < 512;)    {
                i_7  =  i_4 << 9;
                f_8  =  0.0F;
                i_9  =  0;
                for(;i_9 < 512;)      {
                    i_10  =  i_9 + 1;
                    i_11  =  i_7 + i_9;
                    l_12  =  (long) i_11;
                    l_13  =  l_12 << 2;
                    l_14  =  l_13 + 24L;                  // Skip Java object header
                    ul_15  =  ul_0 + l_14;
                    f_16  =  *((__global float *) ul_15); // Load element from matrix A
                    i_17  =  i_9 << 9;
                    i_18  =  i_17 + i_6;
                    l_19  =  (long) i_18;
                    l_20  =  l_19 << 2;
                    l_21  =  l_20 + 24L;
                    ul_22  =  ul_1 + l_21;
                    f_23  =  *((__global float *) ul_22); // Load element from matrix B

                    f_24  =  fma(f_16, f_23, f_8);        // Computation (fused multiply-add)
                    f_8  =  f_24;
                    i_9  =  i_10;
                }
                i_25  =  i_6 + i_7;
                l_26  =  (long) i_25;
                l_27  =  l_26 << 2;
                l_28  =  l_27 + 24L;
                ul_29  =  ul_2 + l_28;
                *((__global float *) ul_29)  =  f_8;    // Store the result in Matrix C
                i_30  =  get_global_size(0);
                i_31  =  i_30 + i_6;
                i_6  =  i_31;
            }
            i_32  =  get_global_size(1);
            i_33  =  i_32 + i_4;
            i_4  =  i_33;
        }
    }
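
Two patterns in the generated kernel are worth spelling out. The sketch below, in plain Java, re-derives them under the assumption that the constants in the listing mean what the comments suggest: a shift by 9 is a multiplication by the matrix size 512, a shift by 2 is a multiplication by the 4-byte size of a float, and 24 is the array header that is skipped. The class and method names (KernelIndexSketch, byteOffset, visitCounts) are hypothetical. The second method emulates the loop shape "start at get_global_id, step by get_global_size" and counts how often each index is visited.

```java
// Hypothetical illustration of the index arithmetic in the generated kernel.
class KernelIndexSketch {
    // Taken from the "+ 24L" in the generated kernel (skips the Java array header).
    static final long ARRAY_HEADER_BYTES = 24L;

    // Byte offset of element [row * 512 + col], computed the way the kernel does it.
    static long byteOffset(int row, int col) {
        int rowStart = row << 9;             // i_7  = i_4 << 9   (row * 512)
        int index = rowStart + col;          // i_11 = i_7 + i_9
        long bytes = ((long) index) << 2;    // l_13 = l_12 << 2  (index * 4 bytes per float)
        return bytes + ARRAY_HEADER_BYTES;   // l_14 = l_13 + 24L
    }

    // Emulates: for (i = get_global_id(d); i < limit; i += get_global_size(d))
    // for every simulated thread, and counts how many times each index is visited.
    static int[] visitCounts(int limit, int globalSize) {
        int[] visits = new int[limit];
        for (int globalId = 0; globalId < globalSize; globalId++) {
            for (int i = globalId; i < limit; i += globalSize) {
                visits[i]++;
            }
        }
        return visits;
    }

    public static void main(String[] args) {
        System.out.println("offset of [0][0]: " + byteOffset(0, 0));
        System.out.println("offset of [1][0]: " + byteOffset(1, 0));
        int[] visits = visitCounts(512, 100);   // fewer threads than iterations
        boolean once = true;
        for (int v : visits) {
            once &= (v == 1);
        }
        System.out.println("every index visited exactly once: " + once);
    }
}
```

With this loop shape, the kernel stays correct whether the device launches one thread per element, as in the 512 by 512 configuration shown earlier, or fewer threads that each process several elements.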


## How is TornadoVM being used?

In this article, we have focused on a simple example, matrix multiplication, to easily show different parts of the TornadoVM runtime and JIT compiler. However, with TornadoVM, you can program more than just a single task with simple data types. TornadoVM has been used to accelerate SLAM (Simultaneous Localization and Mapping) applications with the Microsoft Kinect Fusion, reaching up to 90 frames per second on NVIDIA GPUs. This application contains around 7k lines of Java code that are accelerated with TornadoVM, and it highlights the complexity of the Java constructs for which TornadoVM is able to generate code.

In general, TornadoVM is suitable for accelerating workloads that follow the SIMD (Single Instruction Multiple Data) pattern, and pipeline applications. Surprisingly, this categorization includes a wide variety of applications such as deep learning, machine learning, mathematical and physics simulations, computational photography, computer vision, financial applications, signal processing and chemistry.

Additionally, developers can invoke TornadoVM from Python, R, Ruby, JavaScript or any other language on top of GraalVM (as I showed at QCon London for accelerating a Node.js application).

TornadoVM was born in academia (and is currently under development at the University of Manchester), but there are already some companies using TornadoVM for accelerating deep learning applications.

One example is Exus Ltd., a tech company based in London that is currently improving the UK NHS (healthcare) system to predict the number of patients’ hospital readmissions. By using TornadoVM, Exus has improved the performance of the training phase on a data set of 2 million patients by 14x.

Another example of early adoption of TornadoVM in the industry is Luxembourg’s NEUROCOM, who are using TornadoVM on GPUs to accelerate some key calculations used in natural language processing by 10x and 28x (specifically, the Levenshtein distance and hierarchical classification using cosine similarity metric algorithms respectively).

## Summary

TornadoVM is a plugin to OpenJDK and GraalVM that allows developers to offload JVM applications onto heterogeneous hardware, including multi-core CPUs, GPUs and FPGAs. Additionally, TornadoVM performs live task migration between devices to maximize the performance of the overall application. This article explored the functionality of TornadoVM through an example. We explored how TornadoVM executes code, and we saw what the generated code looks like.

This article only scratches the surface of what TornadoVM is and what it can do. There are many important topics that we couldn’t cover in this introductory article, for example the compiler specializations per architecture, how to efficiently run reduce computations, the FPGA compilation pipeline and live task migration. You can find further information about some of these topics by following these links:

## Acknowledgments

The TornadoVM development is partially supported by the European Union’s Horizon 2020 E2Data 780245.

Juan Fumero is a postdoc at the University of Manchester. His research topics are Heterogeneous High-Level Languages Virtual Machines, GPGPUs, and distributed computing. Currently, he is working as part of the TornadoVM and E2Data European projects for bringing automatic GPU and FPGA JIT compilation and execution for Java programs. He received a Ph.D. degree from The University of Edinburgh on Accelerating Interpreted Programming Languages on GPUs for Java, R, and Ruby. Additionally, he has also worked as an intern at Oracle Labs and CERN, implementing compilers, and evaluating parallel techniques for multi-core systems.
