### Key Takeaways

• Heterogeneous devices are now present in almost every computing system.
• Programmers need to handle a broad and diverse set of devices, such as GPUs, FPGAs, or any other hardware that is coming.
• TornadoVM can be seen as a high-performance computing platform for Java and the JVM that works in combination with existing JDKs.
• With TornadoVM, the same source code can be executed in parallel, taking advantage of the device's capabilities, such as CPUs, GPUs, or FPGAs.
• TornadoVM's APIs allow non-experts to take advantage of parallel computing while at the same time enabling CUDA and OpenCL code to be ported to Java and TornadoVM.

At QCon Plus, Juan Fumero spoke about TornadoVM, a high-performance computing platform for the Java Virtual Machine (JVM). It allows Java developers to automatically run programs on GPUs, FPGAs, or multi-core CPUs.

Heterogeneous devices such as GPUs are present in almost every computing system today. For example, mobile devices contain a multi-core CPU plus an integrated GPU; laptops usually have two GPUs: one integrated into the main CPU and one dedicated, usually for gaming. Even data centres are integrating devices such as FPGAs. Heterogeneous devices are here to stay.

All of these devices help to increase performance and run workloads more efficiently. Programmers of current and future computing systems need to handle program execution on a broad and diverse set of computing devices. However, many of the parallel programming frameworks for these devices are based on the C and C++ programming languages. Support for programming such systems from managed, high-level languages such as Java is almost absent. That's why we introduced TornadoVM.

In a nutshell, TornadoVM is a high-performance computing platform for Java and the JVM that offloads Java code, at runtime, to heterogeneous hardware accelerators.

TornadoVM offers a Parallel Loop API and a Parallel Kernel API. In this post, we explain each of them, together with some performance benchmarks, and then how TornadoVM translates the Java code for the actual parallel hardware. Finally, we show how TornadoVM is being piloted in the industry, including some use cases.

## Fast path to GPUs and FPGAs

How is heterogeneous hardware accessed from high-level programming languages today? The following image presents some examples of hardware (CPUs, GPUs, FPGAs) and high-level programming languages such as Java, R or Python.

If we look at Java, we see that it executes on top of a virtual machine. Among others, OpenJDK, GraalVM, and Corretto are virtual machine (VM) implementations. Essentially, the application is translated from Java source code into Java bytecode, and then the VM executes this bytecode. If the application is executed frequently, the VM can optimise the execution by compiling frequently-run methods into optimised machine code – but only for CPUs.

If developers want to access heterogeneous devices, such as GPUs, or FPGAs, they usually do it through a Java Native Interface (JNI) library.

Essentially, programmers have to import a library and invoke that library through JNI calls. Note that, by using these libraries, programmers might have an application optimised for one particular GPU. But if the application or the GPU changes, the application may have to be rebuilt, or the optimisation parameters readjusted. Similarly, this also happens with different FPGA vendors or even other models of GPUs.

Thus, there are no complete JIT compilers and runtimes that work with heterogeneous devices in the same way as CPUs, in the sense that they can detect frequently executed code and produce optimised code for heterogeneous hardware. That's where TornadoVM comes into the picture.

TornadoVM works in combination with an existing JDK. It is a plugin to the JDK that allows programmers to run applications on heterogeneous hardware. Currently, TornadoVM can run on multi-core CPUs, GPUs and FPGAs.

## Hardware characteristics and parallelism

The next question that arises is, why all of this hardware? Three different hardware architectures are being considered: CPU, GPU, and FPGA. Each architecture is optimised for different types of workloads.

For example, CPUs are optimised for low-latency applications, while GPUs are optimised for high throughput. FPGAs combine both: they can usually achieve very low latency and high throughput because applications are physically wired into the hardware.

Let’s map these architectures to existing types of parallelism. In the literature, we can find three main types of parallelism: task parallelisation, data parallelisation, and pipeline parallelisation.

Usually, CPUs are optimised for task parallelisation, meaning that each core can run different and independent tasks. In contrast, GPUs are optimised for running data parallelisation, meaning that the functions and kernels executed are the same but take different input data. Lastly, FPGAs are very suitable for expressing pipeline parallelisation, in which the execution of different instructions overlaps across the different internal stages.

Ideally, we want a high-level parallel programming framework that can express the different types of parallelism to maximise performance for each device type. Now, let’s look at how TornadoVM is built and how developers can use it to express different kinds of parallelism.

TornadoVM is a plugin to the JDK (Java Development Kit) that allows Java developers to automatically execute programs on heterogeneous hardware. The key contributions of TornadoVM are as follows:

It has an optimising JIT (Just In Time) compiler that specialises the code per architecture. This means that, for example, the code generated for GPUs is different from the code generated for CPUs and FPGAs, to maximise performance for each architecture.

TornadoVM performs dynamic task migration between architectures and between devices. For example, it can run the application on a GPU for a while and later migrate the execution onto another GPU, an FPGA, or a multi-core CPU, as necessary and without restarting the application.

TornadoVM is fully hardware agnostic: the source code of the application to be executed on heterogeneous hardware is the same for running on GPUs, CPUs, and FPGAs.

Finally, it can be used with multiple JDK vendors. It is open-source (available on GitHub), and Docker images are also available to run on discrete NVIDIA GPUs and Intel integrated GPUs.

Let’s look at TornadoVM’s system stack. At the top level, TornadoVM exposes an API. This is because it exploits parallelism but does not detect it. Thus, it needs a way to identify where parallelisation is employed in the program’s source code.

TornadoVM offers a task-based programming API in which each task corresponds to an existing Java method. Thus, TornadoVM compiles code at the method level, like the JVM, but into efficient code for GPUs and FPGAs. Annotations can also be used to indicate parallelism within methods. Additionally, tasks can be grouped and compiled together in one compilation unit. This compilation unit is called the Task-Schedule: a Task-Schedule has a name (for debugging and optimisation purposes) and contains a set of tasks.

The TornadoVM engine takes its input expressions from the bytecode level and automatically generates code for different architectures. It currently has three backends that generate OpenCL, CUDA, and SPIR-V code. Developers can select which one to use. Alternatively, TornadoVM will select a default backend.

## A blur filter as an example

We will now see how TornadoVM can accelerate Java applications with an example: a blur filter. Essentially, we have an image, and we want to apply a blur effect in that image.

Before going into the details of how it's programmed, let’s look at the performance of this application running on heterogeneous hardware. The image below shows benchmarks for four different implementations. The reference is a sequential implementation in Java, and the Y-axis represents the performance gain compared to this reference, so the higher, the better.

The first two columns from the left represent CPU-based executions. The first uses standard parallel Java streams, whereas the second uses TornadoVM on multiple CPU cores, yielding speed-ups of 11x and 17x, respectively. TornadoVM produces a better result because it generates OpenCL for the CPU, and OpenCL is very good at vectorising code to use vector units. If the application is run on integrated graphics, we can get up to 19x the performance of the sequential Java implementation. If we run the application on a discrete NVIDIA GPU (2060), we can get up to 340x (using the OpenCL backend of TornadoVM). Compared with the parallel Java streams version, which is what plain Java offers today, TornadoVM on the NVIDIA GPU is up to 30x faster.
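As a quick sanity check of that last figure, using only the rounded speed-ups quoted above (11x for parallel streams and 340x for the discrete GPU, both relative to sequential Java):

$$
\frac{S_{\text{GPU}}}{S_{\text{streams}}} = \frac{340}{11} \approx 30.9
$$

which is consistent with the roughly 30x gain over parallel streams.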

## Implementing the blur filter example

The blur filter is a map operator that applies a function (the blur-effect filter) for every input image pixel. This pattern is great for parallelisation because every pixel can be computed independently of any other pixel.

The first thing to do in TornadoVM is to annotate the code within each Java method to tell TornadoVM how to parallelise them.

Since each pixel’s computations can occur in parallel, we add the @Parallel annotation to the two outermost loops. This signals TornadoVM to compute these two loops fully in parallel. Code annotations define the data parallelisation pattern.

The second thing is to define the tasks. Since the input is an RGB image, we can create one task per colour channel: red, green, and blue. The blur filter is therefore processed per channel. A TaskSchedule object that contains three tasks is used for this purpose.
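The loop structure described above can be sketched in plain Java as follows. This is a minimal illustration, not the code from the talk: the class name, method names, the row-major `int[]` channel layout, and the border clamping are all assumptions. The sketch compiles without TornadoVM, so the annotation appears only in a comment marking where it would go.

```java
class BlurFilterSketch {

    /** Builds a square box filter of the given width with equal weights (assumed filter shape). */
    static float[] boxFilter(int filterWidth) {
        float[] filter = new float[filterWidth * filterWidth];
        java.util.Arrays.fill(filter, 1.0f / (filterWidth * filterWidth));
        return filter;
    }

    /** Blurs one colour channel stored row-major in an int array. */
    static void blurChannel(int[] in, int[] out, int w, int h, float[] filter, int fw) {
        // With TornadoVM, the two outermost loop variables would be annotated,
        // i.e. "for (@Parallel int r = 0; ...)" and "for (@Parallel int c = 0; ...)".
        for (int r = 0; r < h; r++) {
            for (int c = 0; c < w; c++) {
                float result = 0.0f;
                // The inner loops walk the filter window and stay sequential.
                for (int i = 0; i < fw; i++) {
                    for (int j = 0; j < fw; j++) {
                        // Clamp coordinates at the image border.
                        int rr = Math.min(Math.max(r + i - fw / 2, 0), h - 1);
                        int cc = Math.min(Math.max(c + j - fw / 2, 0), w - 1);
                        result += in[rr * w + cc] * filter[i * fw + j];
                    }
                }
                out[r * w + c] = Math.round(result);
            }
        }
    }
}
```

Each iteration of the two outer loops writes a single, distinct output element and reads only the input, which is what makes the map pattern safe to run in parallel.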

Additionally, it is necessary to define which data will be transferred in and out from the Java heap to the device (e.g., a GPU). This is because discrete GPUs and FPGAs don't usually share memory. Therefore, we need a way to tell TornadoVM which memory regions (arrays) need to be copied in and out of the device. That's done through the streamIn() and streamOut() functions.

Then the set of tasks is defined, one per colour channel. Each task is identified by a name and is composed of a reference to the method to be executed together with its parameters. This method can now be compiled into a kernel.
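The shape of these task definitions (a name, a method, and its parameters, one per channel) can be mimicked with plain JDK classes. The sketch below is only an analogy and does not use the TornadoVM API: `ChannelTasksSketch`, `runAll`, and the simplified one-dimensional `blur` stand-in are hypothetical names, and the tasks run on CPU threads rather than being compiled for an accelerator.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class ChannelTasksSketch {

    /** Stand-in for the per-channel method: a 1D three-point average with clamped borders. */
    static void blur(int[] in, int[] out) {
        int n = in.length;
        for (int i = 0; i < n; i++) {
            int left = in[Math.max(i - 1, 0)];
            int right = in[Math.min(i + 1, n - 1)];
            out[i] = (left + in[i] + right) / 3;
        }
    }

    /** Creates one named task per colour channel, starts them all, and waits for completion. */
    static Map<String, int[]> runAll(Map<String, int[]> channels) {
        Map<String, int[]> results = new LinkedHashMap<>();
        Map<String, CompletableFuture<Void>> pending = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : channels.entrySet()) {
            int[] out = new int[e.getValue().length];
            results.put(e.getKey(), out);
            // Task = name (map key) + method to run + its parameters.
            pending.put(e.getKey(), CompletableFuture.runAsync(() -> blur(e.getValue(), out)));
        }
        // Block until every channel has been processed.
        pending.values().forEach(CompletableFuture::join);
        return results;
    }
}
```

Because the channels share no data, the three tasks are independent and can run in any order or concurrently.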

Finally, the execute function is called to run the tasks in parallel on the device. Now let’s take a look at how TornadoVM compiles and executes code.

## How TornadoVM launches Java Kernels on parallel hardware

The original Java code is single-threaded, even though it has received @Parallel annotations. However, when the execute() function is called, TornadoVM starts to optimise the code.

First, the code is compiled into an intermediate representation for optimisation (TornadoVM extends the Graal JIT Compiler; all optimisations occur at this level). Then, TornadoVM translates the optimised code into efficient PTX, OpenCL, or SPIR-V code.

At this point, the code is executed, which causes hundreds or thousands of threads to be launched. The number of threads run by TornadoVM depends on the application.

In this example, the blur filter has two parallel loops that iterate over one image dimension each. Therefore, TornadoVM creates a grid of threads with the same dimensions as the input image during runtime compilation. Each grid cell – in other words, each pixel – is mapped to one thread. For instance, if the image has 2000 x 2000 pixels, TornadoVM launches 2000 x 2000 threads on the target device (e.g., a GPU).
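The arithmetic behind this mapping can be sketched as follows. The class and method names are hypothetical, and the row-major pixel layout is an assumption; the sketch only shows how a grid position relates to a pixel, not how TornadoVM schedules threads internally.

```java
class ThreadGridSketch {

    /** Total number of threads for a 2D grid with one thread per pixel. */
    static long threadCount(int width, int height) {
        return (long) width * height;
    }

    /** Row-major index of the pixel handled by the thread at grid position (x, y). */
    static int pixelIndex(int x, int y, int width) {
        return y * width + x;
    }
}
```

For the 2000 x 2000 image mentioned above, `threadCount(2000, 2000)` gives 4,000,000 threads, one per pixel.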

TornadoVM can also enable pipeline parallelisation, which is done primarily on FPGAs. When we select an FPGA to run on, or TornadoVM selects it for us, TornadoVM automatically inserts information into the generated code to pipeline instructions. This strategy can double performance in comparison with the previous parallel code.

## The Parallel Loop API vs the Parallel Kernel API

Let’s talk now about how compute kernels can be expressed in TornadoVM. TornadoVM has two APIs: a Parallel Loop API, as described in our blur filter example, and a Parallel Kernel API. TornadoVM’s Parallel Loop API is annotation-based. With this API, developers have to reason about their sequential code, provide a sequential implementation, and then think about where to parallelise the loops.

On the one hand, development is accelerated because developers can just add annotations to existing Java sequential code to obtain parallel code. The Parallel Loop API is appropriate for non-expert users, who don’t need to know the details of GPU computations or which hardware should be used.

On the other hand, the Parallel Loop API is limited in the number of patterns it can express. With this API, developers can run applications using the typical map/reduce pattern. However, other parallel patterns, such as scans or complex stencils, are hard to implement with it. In addition, because the API is hardware agnostic, it doesn't let developers control the hardware, and some developers need that control. Finally, it can be difficult to port existing OpenCL and CUDA code to Java with this API.

To overcome these limitations, we added the Parallel Kernel API.

## Implementing the blur filter using the Parallel Kernel API

Let's go back to our previous example: the blur filter. We have two parallel loops that iterate over both image dimensions and compute the filter. This can be translated into the Kernel API.

Instead of having two loops, we introduce implicit parallelism through a kernel-context. A context is a TornadoVM object that gives the user access to the thread identifier for each dimension, as well as local/shared memory, synchronisation primitives, and so on.

In our example, the filter’s X and Y-axis coordinates are retrieved from the context’s globalIdx and globalIdy attributes, respectively, and are used to compute the filter as usual. This programming style is closer to the CUDA and OpenCL programming models.

As a side note, with the Kernel API TornadoVM cannot determine the necessary number of threads at runtime. The user needs to configure it instead by using a worker-grid.

In this example, a 2D worker grid is created with the image’s dimensions and associated with the function name. When the user’s code calls the execute() function, the grid is passed, and the filter is executed accordingly.
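The kernel style can be emulated in plain Java to show how it differs from the loop version. This is a sketch under stated assumptions, not TornadoVM code: `Ctx` is a hypothetical stand-in for the kernel-context that carries only the two thread identifiers (field names borrowed from the attributes mentioned above), `launch` emulates a 2D worker grid with a parallel stream on the CPU, and an unweighted box average is assumed for the filter.

```java
import java.util.stream.IntStream;

class KernelStyleSketch {

    /** Hypothetical stand-in for a kernel context: only the 2D thread identifiers. */
    static final class Ctx {
        final int globalIdx;
        final int globalIdy;

        Ctx(int globalIdx, int globalIdy) {
            this.globalIdx = globalIdx;
            this.globalIdy = globalIdy;
        }
    }

    /** Kernel body: computes exactly one output pixel; there are no loops over the image. */
    static void blurKernel(Ctx ctx, int[] in, int[] out, int w, int h, int fw) {
        int x = ctx.globalIdx;
        int y = ctx.globalIdy;
        int sum = 0;
        for (int i = 0; i < fw; i++) {
            for (int j = 0; j < fw; j++) {
                // Clamp coordinates at the image border.
                int yy = Math.min(Math.max(y + i - fw / 2, 0), h - 1);
                int xx = Math.min(Math.max(x + j - fw / 2, 0), w - 1);
                sum += in[yy * w + xx];
            }
        }
        out[y * w + x] = sum / (fw * fw);
    }

    /** Emulates launching a w x h grid: one kernel invocation per grid cell. */
    static int[] launch(int[] in, int w, int h, int fw) {
        int[] out = new int[w * h];
        IntStream.range(0, w * h)
                 .parallel()
                 .forEach(id -> blurKernel(new Ctx(id % w, id / w), in, out, w, h, fw));
        return out;
    }
}
```

The two image loops have disappeared from the kernel body; the grid dimensions passed to `launch` now decide how many times it runs, which is why the user has to supply them.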

But if the Parallel Kernel API is closer to low-level programming models, why use Java instead of OpenCL or CUDA directly, especially if there is existing code?

TornadoVM also has other strengths, such as live task migration, automatic memory management, and transparent code optimisation, so the code is specialised depending on the architecture.

It also runs on FPGAs with a fully transparent and integrated programming workflow. You can use your favourite IDE, for example, IntelliJ or Eclipse, to run code on an FPGA.

It can also be deployed on the cloud, for example, Amazon instances. You get all these features for free by porting that code into Java and TornadoVM.

## Performance

Let’s talk about performance. TornadoVM can be used for more than just applying filters for computational photography. For example, it can be used in FinTech for math simulations such as Monte Carlo or Black-Scholes. It can also be used for computer vision applications, physics simulation, and signal processing, among many other domains.

The graph in the previous figure compares different application executions on distinct devices. Again, the reference is a sequential execution, and the bars represent acceleration factors, so the higher, the better.

As we can see, it is possible to achieve very high speed-ups; for example, signal processing or physics simulation can be up to four thousand times faster than a sequential execution in Java. For a detailed analysis of all of these results, you can check the list of academic publications.

Some companies in the industry are also piloting TornadoVM. The figure above shows two different TornadoVM use cases being worked on.

One use case comes from Neurocom Company in Luxembourg, which runs a natural language processing algorithm. So far, they have achieved a 30x performance increase by running their hierarchical clustering algorithms on GPUs.

Another use case comes from Spark Works Company, based in Ireland, which processes information coming from IoT devices. A powerful GPU, GPU100, is used to run that post-processing. They can get up to 460x performance compared to Java, which is quite good.

You can visit the TornadoVM website for a complete list of use-cases.

## Summary

Heterogeneous devices are now present in almost every computing system. There is no escape. They are here, and they will stay.

Therefore, programmers of current and future computing software systems need to handle the complexity of having a broad and diverse set of devices, such as GPUs, FPGAs, or any other hardware that is coming. They can program those devices through TornadoVM.

TornadoVM can be seen as a high-performance computing platform for Java and the JVM that works in combination with existing JDKs, for example OpenJDK.

This article has introduced TornadoVM, what it is, and briefly explained how it works. Additionally, it showed how developers can benefit from heterogeneous hardware execution through a computational photography example implemented in Java. We explained the two APIs for heterogeneous programming in TornadoVM: the Parallel Loop API, suited to non-experts in parallel computing, and the Parallel Kernel API, suited to expert developers who already know CUDA and OpenCL and want to port existing code to TornadoVM.

• ##### Loom interface

by Eric Bresie,

Curious, are there potential overlaps with project Loom and/or Fibers?

• ##### JEP 419: Foreign Function & Memory API

by Eric Bresie,

Any overlap with

JEP 419: Foreign Function & Memory API - openjdk.java.net/jeps/419

• ##### New Vector API

by Eric Bresie,

• ##### Re: Loom interface

by Juan Fumero,

TornadoVM deploys native threads on the accelerator (e.g., OpenCL threads or CUDA threads). We haven't tested it, but we believe having Fibers or Java threads should be transparent for TornadoVM, unless fibers can run natively on accelerators.

• ##### Re: JEP 419: Foreign Function & Memory API

by Juan Fumero,

In fact, we have a prototype with the project Panama for better memory management within TornadoVM:

www.research.manchester.ac.uk/portal/files/2110...

We saw a performance increase using the Panama project and native buffers. We look forward to the Panama project being upstreamed.

• ##### Re: New Vector API

by Juan Fumero,

And it is at a much higher level compared to the Vector API. Potentially, they could be merged, but it would require a lot of changes either in the TornadoVM Vector API or the OpenJDK Vector API.

Besides, the level of granularity is different. TornadoVM accelerates at the method level (whole Java methods), while the vector API accelerates operations associated with the input vectors. From our view, to accelerate the vector operations using the vector API, memory must be shared across the GPU memory and the CPU memory (e.g., by using the Intel HD graphics) to overcome overheads during memory copies across PCI-e.

• ##### Re: JEP 419: Foreign Function & Memory API

by Juan Fumero,

It seems that the link is broken. You can access the paper using the following links:

A) jjfumero.github.io/publication/2022-02-13-VEE22
B) www.research.manchester.ac.uk/portal/files/2110...
