TornadoVM 2.0 Brings Automatic GPU Acceleration and LLM support to Java

The open-source TornadoVM project, which aims to provide a heterogeneous hardware runtime for Java, recently reached version 2.0, a major milestone. This release is likely to be of particular interest to teams developing LLM solutions on the JVM.

The project automatically accelerates Java programs on multi-core CPUs, GPUs, and FPGAs. It does not replace existing JVMs; instead, it adds the capability to offload Java code to hardware backends, manage memory between Java and the accelerators, and run the resulting compute kernels. This makes TornadoVM a key building block for modern cloud and ML workloads.

InfoQ has previously covered the project in 2020 and 2022.

TornadoVM compiles Java bytecode at runtime (by acting as a JIT compiler) to one of three backends: OpenCL C, NVIDIA CUDA PTX, and SPIR-V binary. Developers can choose which backends to install and run depending on their specific systems.

Note that not every sort of Java computation is amenable to being offloaded to TornadoVM. Workloads with for-loops whose iterations do not depend on each other, however, are very good candidates, as the iterations can run in parallel.
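
As a minimal illustration in plain Java (the method and variable names are illustrative), the first loop below has independent iterations and is a good offloading candidate, while the second carries a dependency between iterations and is not:

static void candidates(float[] in, float[] out, int n) {
    // Independent iterations: each writes a distinct element and reads
    // no other iteration's result -- easily run in parallel.
    for (int i = 0; i < n; i++) {
        out[i] = in[i] * 2.0f;
    }

    // Loop-carried dependency: each iteration reads the previous result
    // (a prefix sum), so it cannot be parallelized naively.
    for (int i = 1; i < n; i++) {
        out[i] = out[i - 1] + in[i];
    }
}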

In particular, matrix-based applications such as machine learning and deep learning are good candidates. Other good examples of this pattern are physics simulations (e.g., N-body particle computation), financial applications such as Black-Scholes, and a range of applications in computer vision, computational photography, natural language processing, and signal processing.

TornadoVM offers two complementary ways to express parallelism: the Loop Parallel API, which uses Java annotations such as @Parallel and @Reduce to parallelize loops, and the Kernel API, which uses a KernelContext for explicit GPU-style programming (with concepts such as thread IDs, local memory, and barriers), similar to CUDA, OpenCL, and SYCL.

The Loop Parallel API can be as simple as annotating the loop variable:

import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

// @Parallel marks the loop's iterations as independent, so TornadoVM can offload them.
public static void vectorMul(FloatArray a, FloatArray b, FloatArray result) {
    for (@Parallel int i = 0; i < result.getSize(); i++) {
        result.set(i, a.get(i) * b.get(i));
    }
}
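
The @Reduce annotation mentioned above covers reductions such as sums. The following sketch follows the reduction pattern shown in the TornadoVM documentation, with illustrative names:

// @Reduce marks the output as a reduction target, so TornadoVM generates
// a parallel reduction instead of a racy sequential accumulation.
public static void sum(FloatArray input, @Reduce FloatArray output) {
    output.set(0, 0.0f);
    for (@Parallel int i = 0; i < input.getSize(); i++) {
        output.set(0, output.get(0) + input.get(i));
    }
}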

Regardless of which API is used, the work is then explicitly assembled into a TaskGraph as a Java object; the Loop Parallel version looks like this:

import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;

// Describe the work: what to copy to the device, which task to run, and what to copy back.
var taskGraph = new TaskGraph("multiply")
      .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
      .task("vectorMul", Example::vectorMul, a, b, result)
      .transferToHost(DataTransferMode.EVERY_EXECUTION, result);

// Snapshot to an immutable graph, then execute it via an execution plan.
var snapshot = taskGraph.snapshot();
new TornadoExecutionPlan(snapshot).execute();
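
For comparison, a sketch of the same multiplication in the Kernel API style follows. It uses TornadoVM's documented KernelContext, WorkerGrid1D, and GridScheduler types, but the method name vectorMulKernel and the task wiring are illustrative:

public static void vectorMulKernel(KernelContext context, FloatArray a, FloatArray b, FloatArray result) {
    int idx = context.globalIdx;  // this thread's global index
    result.set(idx, a.get(idx) * b.get(idx));
}

// The KernelContext is passed to the task as a regular argument; a GridScheduler
// (keyed by "taskGraphName.taskName") launches one thread per element.
var context = new KernelContext();
var kernelGraph = new TaskGraph("multiply")
      .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
      .task("vectorMul", Example::vectorMulKernel, context, a, b, result)
      .transferToHost(DataTransferMode.EVERY_EXECUTION, result);

var workerGrid = new WorkerGrid1D(result.getSize());
var gridScheduler = new GridScheduler("multiply.vectorMul", workerGrid);
new TornadoExecutionPlan(kernelGraph.snapshot())
      .withGridScheduler(gridScheduler)
      .execute();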

The team also ships GPULlama3.java, a complete LLM inference library built with TornadoVM that runs inference on GPUs in pure Java, without external dependencies.

The just-shipped v0.3.0 release of GPULlama3.java brings significant performance and usability improvements:

  • ~30% performance boost on NVIDIA GPUs (tokens/sec)
  • Optimized FP16 and Q8 kernel generation
  • Easier setup thanks to the new TornadoVM SDKs -- no complex GPU configuration
  • Runs across the NVIDIA PTX and OpenCL backends, with early Apple Silicon support
  • Enhanced Quarkus support
  • Integration with LangChain4j (see the sketch after this list)
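
Thanks to the LangChain4j integration, a TornadoVM-accelerated model can sit behind LangChain4j's standard chat-model abstraction. The sketch below is illustrative only: GpuLlama3ChatModel is a hypothetical adapter name standing in for the integration's actual entry point, while ChatLanguageModel and generate come from LangChain4j's (pre-1.0) chat API:

import dev.langchain4j.model.chat.ChatLanguageModel;

// GpuLlama3ChatModel is hypothetical; the integration's real adapter class,
// constructor, and configuration options may differ.
ChatLanguageModel model = new GpuLlama3ChatModel("Llama-3.2-1B-Instruct-FP16.gguf");

String answer = model.generate("Summarize TornadoVM in one sentence.");
System.out.println(answer);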

GPULlama3.java currently supports several FP16 (16-bit floating point) and 8-bit quantized models, in the single-digit billions of parameters range:

  • Llama 3.2 (1B) – FP16
  • Llama 3.2 (3B) – FP16
  • Llama 3 (8B) – FP16
  • Mistral (7B) – FP16
  • Qwen3 (0.6B) – FP16
  • Qwen3 (1.7B) – FP16
  • Qwen3 (4B) – FP16
  • Qwen3 (8B) – FP16
  • Phi-3-mini-4k – FP16
  • Qwen2.5 (0.5B)
  • Qwen2.5 (1.5B)
  • DeepSeek-R1-Distill-Qwen (1.5B)

Depending on the selected model, a different execution plan is built to match that model's architecture.

The project is led by the Beehive Lab, part of the Advanced Processor Technologies Group at the University of Manchester, which specializes in hardware/software co-design.

The team has also developed TornadoInsight, a plugin for IntelliJ IDEA that enhances the developer experience when working with TornadoVM.

Future work on the roadmap includes making TornadoVM available via SDKMAN! and migrating the JNI components in the codebase to the newer Foreign Function & Memory (FFM) API.
