Transcript
Fumero: Do you know that many software applications are under-using hardware resources? That means that many software applications could potentially run faster while consuming less energy. This is because, today, computing systems include many types of hardware, like multi-core CPUs, GPUs, FPGAs, or even custom-designed chips like Google's tensor processing units. You have all of these available for increasing performance. Expert programmers know this. What they normally do is they take existing applications, and they rewrite portions of them in low-level languages like CUDA, OpenCL, or even SYCL, the new standard. Those are the languages that allow you to run on heterogeneous hardware. This is a very tedious process. First of all, because we now have more than one type of hardware. The developers need to know which portion of the code corresponds best to each type of hardware. This is because there is no single type of hardware that best executes all types of workloads.
The programmer has to know which portions are best suited for each one. Then the programmers have to know architecture details, for example, how to tune the task scheduling and the data partitioning. All of these tricks can help you to increase performance. Plus, if you want to further increase performance, you have to go deep into the architecture. For example, GPUs have different memory tiers. You have to know that one level is L1 cache, and you can copy data there if you want to, but the cache is not coherent. Barriers are up to the programmer. When a new generation of GPUs, FPGAs, or other accelerators comes along, you have to repeat this process again, not from scratch, but you have to change your code. This is a very tedious process. I hope you feel my pain.
Instead of doing that, let's imagine the following. Let's imagine that we have a software system that can automatically take a high-level program and automatically execute it on heterogeneous hardware. High-level doesn't mean only C or C++, especially in this community. We know that programmers can also come from other communities like Java, R, Ruby, Python. Wouldn't that be great? Because we're in this dreamy mode, let's also imagine that we can perform task migration across devices. That would be cool. I have just defined TornadoVM. This is exactly what TornadoVM does. TornadoVM is a plugin to OpenJDK and Graal that allows you to run Java, R, Groovy, Python, Scala programs on heterogeneous hardware without changing any line of code. Even more, with TornadoVM we can dynamically perform task migration across devices without restarting the application and without any knowledge, from the programmer's perspective, about the actual hardware.
I will explain TornadoVM and some background. If you haven't heard about GPUs that much, or FPGAs are something new, don't worry at all. I will explain the basics and how we use them from the TornadoVM perspective. Later, I will introduce how you can use Tornado, how you can execute it. Then I will show you some internals. I'm a compiler engineer. I like to know everything inside. I would like to share that passion with you as well, and show how we can compile code at runtime. Basically, the internals of the JIT compilation. I will also show how we can migrate execution at runtime and some demos. Hopefully, I can convince you that this type of technology is useful, in general, for managed runtime languages.
I'm Juan Fumero. A postdoc now at the University of Manchester. I'm currently the Lead Developer of the TornadoVM project.
Why Should We Care About GPUs/FPGAs?
Why should we care about heterogeneous devices? It's something important. To motivate the answer, I'll show you three different microarchitectures. An Intel microarchitecture on the left-hand side, a GPU, and an FPGA. Let's focus on the Intel one. This one is the Ice Lake microarchitecture. It's one of the latest by Intel. This one has 8 physical cores, plus AVX instructions. It has a GPU inside. It's called an integrated GPU. If you run on this one, and you use all of these available resources, you can get up to 1 Teraflop of performance. Let's look at the GPU. This one is the Pascal microarchitecture. It's already two generations old for NVIDIA. This one is 16nm technology. Here, instead of 8 physical cores, we have 3500 physical cores that you can use. This gives you up to 10 Teraflops of performance. It's much higher than a single CPU. A similar situation applies for FPGAs. With this one by Intel, you can get up to 10 Teraflops of performance.
What Is a Graphics Processing Unit (GPU)?
I will be talking a lot about GPUs and FPGAs. GPU stands for graphics processing unit. At the beginning it was mainly used for rendering and computer graphics. However, a few years ago, researchers realized that some of the stages used for rendering can also be used for general purpose computation. The GPU implements some stages like computing textures, volumes, vertices, and so on. That's where CUDA and OpenCL come from. We can use the GPU, not only for computing graphics, but also for general purpose computation: physics, machine learning, deep learning, Bitcoin. This one is the Pascal microarchitecture. I want to highlight two things from here. Apart from the programming model, which you have to learn if you want to use it, whether that is OpenCL, CUDA, or any other, you have to know architecture details in order to use these devices efficiently. That could be a handicap for many users. You shouldn't have to be an expert to use it. You could be a biologist, a psychologist. Why not?
They actually have the need to run on those devices. Perhaps, we need high-level abstractions.
What Is a Field Programmable Gate Array (FPGA)?
Are you familiar with FPGAs? How many of you have heard about FPGAs? FPGA stands for Field Programmable Gate Array. Basically, it's a piece of hardware that is empty after manufacturing. It's up to the programmer what to run in there. In some sense, it's like physically wiring your applications into hardware. To do so, the FPGA provides logic slices, with lookup tables, flip flops, programmable memory, and DSPs. DSPs are specific blocks to perform math operations. That could give you a lot of performance and a lot of energy saving, because you just run what you need, basically. However, the programmability here is a big issue. Normally, you program in VHDL, very low-level stuff. More recently, you can program using OpenCL. I'm telling you this because Tornado targets FPGAs. Tornado targets FPGAs at the method level, which means that we can physically wire your Java methods into hardware. How cool is that?
Ideal System for Managed Languages
I have been talking about pure hardware. But we need a way to program it. If you want to use GPUs and FPGAs, you might target something like CUDA, OpenCL, SYCL, C++. We know that there are a lot of developers working in other languages. If you want to use these devices from Java, Python, or Ruby, for example, you have to plug in an external library right now. There is no virtual machine for this. There is nothing that can automatically take a Java or Python program and run it directly on heterogeneous hardware without any knowledge of that hardware.
That's what we propose. That's what we call a heterogeneous virtual machine. Basically, it's a synonym of TornadoVM. With that, you can target Java, but also other languages. We released a new version. Actually, at the beginning, we only ran Java, now we can run more than Java. With this strategy, you can run on any type of hardware.
Demo: KinectFusion with TornadoVM
I want to show you a benchmark first, then I will show you the details. This one is called KinectFusion. Do you know the Kinect, the Microsoft camera? The Kinect is recording a room. The goal of the application is to render the whole room in real-time. What do we mean by real-time? Humans perceive real-time at around 30 frames, 30 images per second. That's the quality of service. The whole application is written in Java. It's around 7000 lines of Java code. It's open source.
First of all, I'm going to run in pure Java. There is no acceleration underneath, it's just OpenJDK 8. On the left-hand side, you're going to see the input. On the right-hand side, you're going to see the output. That's the input and different setup for the KinectFusion, like dark scene, light scene. The application is already running. It is around 1.5 frames per second. It is extremely slow. What I'm going to do now is I'm going to stop the application. I'm going to reset it. I'm going to use Tornado. You can run on different devices. I'm going to first set to run on a multi-core. When we choose to run Tornado on an Intel, or an AMD CPU, we could run a multi-core configuration. Hopefully, we can see that something is faster. It's hard to see but it's actually around 4 frames per second. That was recorded on my laptop. It is a 4 core machine. It's not that bad. What I'm going to do now is switching. I'm going to stop. I'm going to reset. I'm going to now use the NVIDIA GPU that is available on my laptop, the 1050. In a few seconds, you get the whole rendering of the whole room. It's just Java. We just offload the Java code onto the GPU.
Now I'm going to show you how this is done. I'm going to give you an overview of TornadoVM. TornadoVM has a layered architecture plus a microkernel architecture. We're talking about the software side. On the top level, we have an API. Why do we have an API? What we do is we exploit parallelism. We don't detect parallelism. Detection is a very hard problem. We need a way to identify which code regions you want to offload. This is done through a task based model. Each task corresponds to an existing Java method. We can combine many tasks in a single compilation unit. That's a task schedule. Then we expose two annotations, @Parallel and @Reduce, to identify which loops you want to parallelize. However, these annotations have what we call relaxed program semantics. That means that even if the user annotates the code, Tornado will double check that the code can be parallelized. Otherwise, it will just bail out and execute the sequential code. We don't force parallelism.
Then we have a runtime system that, first of all, will optimize how data flows across the tasks. Normally, GPUs and FPGAs don't share memory with the main CPU. We need to first allocate the data over there, and then do the data transfer. That takes time, especially going through PCI Express. If we can save data transfers, we can get speedup. That's the goal of the data flow optimizer. Once we have the data flow optimized, we generate new bytecodes. Those bytecodes are executing on top of Java bytecodes. Because we have our own bytecodes, we need a bytecode interpreter to run them. That's a very simple way, actually, to orchestrate the execution. One of those bytecodes is launch. Launch this method on this device. The first time we launch, we call the JIT compiler, and say, now compile this method. For that we extend the Graal JIT compiler to generate OpenCL. With this strategy we can currently target NVIDIA GPUs, AMD GPUs, Intel integrated graphics, and FPGAs by Intel and Xilinx. We can run on top of OpenJDK 8 and GraalVM 19.3.
Let's go deep. I'm going to start with the API. I show you here a typical, very easy matrix multiplication. A Java class called compute, and one method called MXM. Then I show you the sequential code to run matrix multiplication. The first thing we do with Tornado is to annotate the sequential code with the @Parallel annotation. With this, the user tells Tornado, these two loops might be parallel. The fact that the user annotates the code doesn't mean it's going to be parallelized later on. This is just a hint for the compiler, where to go. That's the first thing we do.
The second thing is we build the task schedule. For doing that, we have an object called TaskSchedule, and we pass a name. It could be any random name: foo, bar. We use that name at runtime to change the device, for example. Then we call task.task.task. Each task is a reference to an existing Java method. Here we say class compute, method MXM. Then the rest are the normal parameters for the method invocation. Then we have another call called StreamOut. This is because we don't normally share memory. The GPU has its own memory. We need a way to synchronize data with the host again. We do that through the StreamOut operation. Then we call execute. That's all. We can add as many tasks as we want. In this case, I showed you just one. You can have 20, 100 tasks. To run this, we just type tornado and your class. In fact, this is because we are lazy people in our team. Tornado is an alias to Java plus all the parameters to enable Tornado, basically. This is just Java.
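To make the annotation step concrete, here is a minimal sketch of what the annotated matrix multiplication could look like. The `Parallel` annotation below is a local stand-in declared in the same file so the example compiles without TornadoVM on the classpath; the class name `Compute`, the method name `mxm`, and the flattened row-major layout are assumptions for illustration, not the code shown on the slide.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

// Local stand-in for the loop annotation described in the talk.
// It carries no behavior here; it only marks loops as candidates.
@Target(ElementType.LOCAL_VARIABLE)
@interface Parallel {}

class Compute {
    // Multiplies two size x size matrices stored as row-major 1D arrays.
    // The two outer loops are marked as "might be parallel"; the inner
    // reduction loop over k stays sequential.
    static void mxm(float[] a, float[] b, float[] c, int size) {
        for (@Parallel int i = 0; i < size; i++) {
            for (@Parallel int j = 0; j < size; j++) {
                float sum = 0.0f;
                for (int k = 0; k < size; k++) {
                    sum += a[i * size + k] * b[k * size + j];
                }
                c[i * size + j] = sum;
            }
        }
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {5, 6, 7, 8};
        float[] c = new float[4];
        mxm(a, b, c, 2);
        System.out.println(java.util.Arrays.toString(c));
    }
}
```

Without a runtime that understands the annotation, this is plain sequential Java, which matches the point made above: the annotation is a hint, and the code stays valid if the hint is ignored.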
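The shape of that builder chain (name, tasks, stream out, execute) can be sketched with a toy class. `ToySchedule` below is not the TornadoVM API: it is a hypothetical stand-in that runs every task sequentially on the CPU, and the "schedule.task" naming used by `qualifiedTaskNames` is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the task-schedule idea; NOT the TornadoVM API.
class ToySchedule {
    private final String name;
    private final List<String> taskNames = new ArrayList<>();
    private final List<Runnable> tasks = new ArrayList<>();
    private final List<Object> outputs = new ArrayList<>();

    ToySchedule(String name) {
        this.name = name;
    }

    // Registers a task; in the talk each task references an existing method.
    ToySchedule task(String taskName, Runnable body) {
        taskNames.add(taskName);
        tasks.add(body);
        return this;
    }

    // Records which arrays must be visible to the host after execution.
    // In this toy everything already lives in host memory, so nothing is copied.
    ToySchedule streamOut(Object... arrays) {
        for (Object array : arrays) {
            outputs.add(array);
        }
        return this;
    }

    // Runs the tasks in registration order.
    void execute() {
        for (Runnable task : tasks) {
            task.run();
        }
    }

    // Names such as "s0.t0", which a runtime could use to address one task.
    List<String> qualifiedTaskNames() {
        List<String> result = new ArrayList<>();
        for (String taskName : taskNames) {
            result.add(name + "." + taskName);
        }
        return result;
    }
}

class ToyScheduleDemo {
    static void scale(float[] in, float[] out, float factor) {
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * factor;
        }
    }

    public static void main(String[] args) {
        float[] in = {1f, 2f, 3f};
        float[] out = new float[3];
        ToySchedule schedule = new ToySchedule("s0")
                .task("t0", () -> scale(in, out, 2.0f))
                .streamOut(out);
        schedule.execute();
        System.out.println(schedule.qualifiedTaskNames() + " -> " + java.util.Arrays.toString(out));
    }
}
```

The point of the fluent style is that the whole schedule is known before `execute` is called, which is what gives a runtime the chance to analyze data flow across tasks.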
Demo: Running Matrix Multiplication
Now you know everything about Tornado at the user level. I'm going to show you a live demo now, to run the matrix multiplication. I'm going to run with Tornado command. I'm going to first run the sequential code. I have a flag to indicate, don't build the task schedule, just run the code with OpenJDK. I run this code multiple times, 100 times. Don't take this as a benchmark program. It's just to show you a quick demo. Let's see the time. I'm going to track the time for each iteration. I'm going to run the matrix multiplication, 100 times. First of all, the sequential. This is the size of the matrix. It's taking around 240 milliseconds per iteration. I'm not going to wait 100 iterations.
I'm going to enable Tornado. The default device is going to run on the GPU that I have here on my system. If I run Tornado, each iteration takes around 4 milliseconds. You can say, "I don't trust you." You can tell Tornado to give me debug information. That will tell me in which device we're running. I will run the debug info. It's telling me, you're running on NVIDIA 1050. It's NVIDIA that I have on my laptop. In fact, we can change the device. Before changing the device, let me show you, we generate OpenCL underneath for you. You can enable this with printKernel. Let me pipe this to this. This is the OpenCL kernel we generate for you. It looks very ugly. This is because we generate code from Graal. Graal is SSA representation. It will generate code based on SSA. It's legal OpenCL code. You don't have to go that deep if you don't want to. If you want to tune the code afterwards, you can produce the code, manually tune it, and plug in later on. That's actually the process we do when we build the compiler.
Let me now show you how we can run on another device. If I run this command, tornado --devices, it will tell me I have four devices available. The default one is the NVIDIA 1050. I have a multi-core. I have another multi-core but using different OpenCL drivers. Here I have Intel integrated graphics. Let's do that. Let me remove the pipe. Remember that I gave names to the task schedule, and that's why they are useful. I can say, this task with this name runs on device 03, which is the integrated graphics device. I'm running also with the debug info. Now I'm running on the Intel integrated graphics. If I had an FPGA plugged in here, that would be cool. For now, there are no laptops with an FPGA as far as I know. I could even run it there.
This is a very general overview. You annotated code. You build a task schedule. There is a magic box there. Then that magic box will generate an OpenCL code, which means that we need another JIT compiler later on to generate the actual binary for this.
And with this strategy, you can run on any other device. In fact, we can plug in other languages apart from Java. We can plug in Node.js, Python, R, Groovy, JavaScript, Scala. To do so we go through GraalVM. Tornado is written in Java. We have the .class files here. There is a component in Graal called Truffle. It's a framework that allows you to run other languages on top of Graal. There is a component called polyglot that talks to the other languages. That component is going to call some of the Java classes here. These Java classes are expressed with Tornado. That's why we can go from Node.js through to the GPU.
Demo: Node.js Example
In fact, I'm going to show you an example with Node.js and the Mandelbrot. I'm going to actually run in Docker. That's the command. Let me show you the code first, actually. This is the class Mandelbrot. It's just Java, where we have the two annotations. That's the code to run the Mandelbrot computation. We have one method called compute, where we build the task schedule. Then we have another method called sequential, which calls the sequential code. Then we have our Server.js. I'm using the Express module. On the entry point we'll print a bunch of messages. The interesting part is here. I am calling a Java type. This is because I'm using Truffle and the polyglot engine. I'm calling the Java type Mandelbrot.compute, which is the one that builds the task schedule to run on the GPU, potentially. I'm printing the time, and that code will actually generate an image. I'm going to print the image.
If I type /Java, I'm going to do the same, but I'm going to run the sequential code. Let's do that. I already started the server. If I go to that address, there is the Mandelbrot. It might not be that impressive to you. It takes around 1.3 seconds to compile and compute this image. If I refresh the browser, now it's going down to 0.1 seconds, because once we have compiled the code, we just get it from the code cache. Just out of curiosity, let's run it with the sequential one. Any guesses about the time? I'm running now with /Java with OpenJDK. 5 seconds, 10 seconds, 20 seconds? I think you're close. Still running, 17 seconds. I can refresh it, and it can go down as the JIT compiler kicks in. It is not going below 15 seconds, compared to 1.4 seconds, including JIT compilation, on the GPU. Perhaps you can plug in your video game engine now.
We have this blue box, where all the magic happens. Let's open the blue box. That's what we find. We have our data flow analyzer, our runtime, and we have a big component here. It is not the biggest but it takes a significant amount of time for us to build. It is the JIT compiler. Basically, we extend Graal. It has different IR levels: a high-level IR for architecture independent optimizations, then memory optimizations, and architecture dependent optimizations. Basically, what we have is a control flow graph. There are many nodes in there, for loops, data dependencies, and so on. What we do is node replacement depending on the optimizations we want to do. For example, you want to target the GPU, and you have a for loop. Remove the for loop, introduce the get_global_id. We actually have, between Graal and Tornado, 170 optimizations in the process. At the end of the process, we have the OpenCL C code. That means that we need another compiler afterwards. We do this just by calling the actual driver. If you're using the NVIDIA GPU, you just call the NVIDIA driver to get the PTX back. If you're using the Intel FPGA, we just call the Intel driver, and it will give the bitstream, the configuration file, back.
TornadoVM JIT Compiler Specializations
One of the things we do is compiler specializations, and actually, that's one of the things that takes most of our time. Let me tell you why we need this. We generate OpenCL underneath. OpenCL is a standard. That means that code is portable, but OpenCL performance is not portable, which means that if we don't massage the code we generate, we might not get the performance we want. I show you here one type of specialization we do for loops. This is the input code. This is the graph form. Then we say, you're targeting GPUs. The loop is going to be removed from the code, so we get fine-grained parallelism, basically. Each thread is going to compute its own element. If we target multi-cores, each thread is going to compute a range of elements. I show you this in Java code for simplicity, but we do this transformation directly in the IR. We have many optimizations for that.
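The two loop shapes described above can be written out in plain Java. This is only an illustration of the idea under stated assumptions: the method names are hypothetical, the thread identifier is passed in as a parameter rather than obtained from the device (on OpenCL it would come from `get_global_id`), and the threads are simulated with ordinary loops in `main`.

```java
class LoopSpecialization {
    // Input code: one sequential loop over all elements.
    static void addSequential(int[] a, int[] b, int[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // GPU-style: the loop is gone. Each thread computes exactly one
    // element, selected by its global thread id.
    static void addGpuStyle(int globalId, int[] a, int[] b, int[] c) {
        if (globalId < c.length) {
            c[globalId] = a[globalId] + b[globalId];
        }
    }

    // Multi-core style: each thread computes a contiguous range of elements.
    static void addCpuStyle(int threadId, int numThreads, int[] a, int[] b, int[] c) {
        int chunk = (c.length + numThreads - 1) / numThreads; // ceiling division
        int start = threadId * chunk;
        int end = Math.min(start + chunk, c.length);
        for (int i = start; i < end; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int n = 10;
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }

        int[] gpu = new int[n];
        for (int id = 0; id < n; id++) {          // simulates n device threads
            addGpuStyle(id, a, b, gpu);
        }

        int[] cpu = new int[n];
        int threads = 4;
        for (int t = 0; t < threads; t++) {       // simulates 4 CPU threads
            addCpuStyle(t, threads, a, b, cpu);
        }

        System.out.println(java.util.Arrays.equals(gpu, cpu));
    }
}
```

Both variants compute the same result; what changes is the granularity of work per thread, which is the part that has to match the target device.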
FPGA Specializations
Let me show you what we do for FPGAs. Actually, this is very important, because if we don't specialize the code for CPUs or GPUs, that's fine. You're going to get performance anyway, not that good, but still much higher than HotSpot. If we don't do optimizations for FPGAs, we are most likely going to get slowdown, not even speedup, not even 1x. Let me show you what we do to get speedup. That's the Graal IR style. One of the things we do is we introduce thread scheduling at the IR level, which means that the IR has the knowledge of how many threads we want to execute. For example, a block of 32 by 32, 64 by 64, things like that. Then we can tune the loop unroller. Graal has a good one, but it's good for CPUs. For FPGAs, it's not that good. We tune it. Just by introducing a new loop unroller, we can actually save some physical space in FPGAs. That can give you a better speedup. We have a bunch of optimizations. Just by doing that, we go from slowdown to up to 240x speedup, as we have seen in our benchmarks. The server we executed on was a 4 core machine. Now you might want to use TornadoVM for some workloads.
TornadoVM: VM in a VM
I'm going to switch the context a bit. I want to prepare the background I need to explain to you how we can perform live task migration, which is the other big part that Tornado can do. Many times we define Tornado as a VM in a VM, like Inception. If you execute your Java program, but you don't have your task schedule defined, you might get something like OpenJDK or Graal. That's fine. If your code gets hot, potentially, you're going to reach the specialized code, the compiled code for CPUs. If you have your task schedules, we have something like this. We are going to trigger the Tornado compiler. We have our data flow analyzer and optimizer. Then we generate new bytecodes. Because we are in this mode, we manage memory. We manage execution. We manage compilation. We manage task migration. That's why we say that we have a VM in a VM. That's why we say actually, it's a plugin to OpenJDK. With this strategy, we can do task migration across many devices.
We had your class compute, for example, and you have map and reduce. We build a task schedule. In this case, we have two tasks. One pointing to map method, the other pointing to the reduce method. Then I want to highlight that we pass input/output. The output of the first is the input of the second. We don't use it anymore, which means that we could potentially optimize it. That's what we do in the graph analyzer. When we do the graph analysis here, of the data flow. Basically, we have the data coming into the first method, then we produce a result. We don't need the result coming from here, we just can keep it on the device. We have the second method. Finally, we synchronize with the host again. We send the results back.
Once we have the graph optimized, we generate bytecodes. The bytecodes are pretty simple. Everything is enclosed between begin and end. Plus, we pass an index. This is just the initial hint, the initial device index to run on. It could be a GPU. We can change this at runtime; by default we take a decision. It's a simple process, we just traverse the graph. We say COPY_IN the first variable. Then allocate space for the second. We don't need to copy anything. Many of the bytecodes are non-blocking, which means we need a way to block until the data transfer is finished. Then we launch the first method. The first time we execute this bytecode we call the Tornado compiler, and we compile from bytecode to OpenCL, and so on, until we get the final binary.
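The map and reduce example can be turned into a small sketch of how such a bytecode sequence might be derived from the data flow. Everything here is illustrative: the class `ToyBytecodeGen`, the textual instruction names other than the ones mentioned in the talk, and the rule "copy in only what is not already on the device" are assumptions, and synchronization barriers are left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy generator for a TornadoVM-like instruction sequence (illustrative only).
class ToyBytecodeGen {
    static final class Task {
        final String name;
        final List<String> reads;
        final List<String> writes;

        Task(String name, List<String> reads, List<String> writes) {
            this.name = name;
            this.reads = reads;
            this.writes = writes;
        }
    }

    // Walks the tasks in order and emits transfers only when needed:
    // an input produced by an earlier task is already on the device,
    // so no COPY_IN is emitted for it.
    static List<String> generate(int deviceIndex, List<Task> tasks, List<String> streamOut) {
        List<String> code = new ArrayList<>();
        Set<String> onDevice = new HashSet<>();
        code.add("BEGIN " + deviceIndex);
        for (Task task : tasks) {
            for (String variable : task.reads) {
                if (onDevice.add(variable)) {
                    code.add("COPY_IN " + variable);
                }
            }
            for (String variable : task.writes) {
                if (onDevice.add(variable)) {
                    code.add("ALLOC " + variable);
                }
            }
            code.add("LAUNCH " + task.name);
        }
        for (String variable : streamOut) {
            code.add("COPY_OUT " + variable);
        }
        code.add("END " + deviceIndex);
        return code;
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        tasks.add(new Task("map", java.util.Arrays.asList("in"), java.util.Arrays.asList("tmp")));
        tasks.add(new Task("reduce", java.util.Arrays.asList("tmp"), java.util.Arrays.asList("out")));
        for (String instruction : generate(0, tasks, java.util.Arrays.asList("out"))) {
            System.out.println(instruction);
        }
    }
}
```

In this sketch the intermediate array `tmp` never crosses back to the host: it is allocated on the device for `map` and reused by `reduce`, which is the saving the data flow optimizer is after.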
The Tornado bytecode is a very simple way to orchestrate execution on heterogeneous devices. Why? Because we can do more complex things. Let's imagine this scenario. This is very typical. We want to process 16 Gigabytes of data on a GPU that only has 1 Gigabyte. This is very typical. GPUs have only a limited amount of memory. The way we do that is through an API call called batch. We can batch execution. We have three arrays. We say batching in 300 Megabytes each, so 900 Megabytes. They fit into 1 Gigabyte, we can process it. Underneath, what's happening is, we unfold all the bytecodes in batches, first batch, copy in execution, copy out, second batch, and so on. We haven't changed the compiler. We haven't changed the runtime. Nothing. Just how the bytecodes are running. That's all. That's a very powerful capability.
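The arithmetic behind batching is just splitting a large buffer into fixed-size pieces, with a shorter last piece if the size does not divide evenly. The sketch below shows that split; `BatchPlanner` and its method are hypothetical names, and the 16 Gigabyte per-array size used in `main` is an assumption for illustration.

```java
class BatchPlanner {
    // Splits totalBytes into consecutive batches of at most batchBytes.
    // Each row of the result is {offset, length}.
    static long[][] plan(long totalBytes, long batchBytes) {
        if (totalBytes < 0 || batchBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int count = (int) ((totalBytes + batchBytes - 1) / batchBytes); // ceiling division
        long[][] batches = new long[count][2];
        for (int i = 0; i < count; i++) {
            long offset = (long) i * batchBytes;
            batches[i][0] = offset;
            batches[i][1] = Math.min(batchBytes, totalBytes - offset);
        }
        return batches;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[][] batches = plan(16L * 1024L * mb, 300L * mb);
        System.out.println("batches per array: " + batches.length);
        long[] last = batches[batches.length - 1];
        System.out.println("last batch length in MB: " + last[1] / mb);
    }
}
```

For each batch, the runtime would then repeat the same copy in, execute, copy out sequence over the corresponding slice of every array.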
Dynamic Reconfiguration
Let me now switch to the task migration. I have all the ingredients that I need to explain this. Because we have the ability to compile and run for many devices, what we are going to do is we are going to spawn a set of Java threads. Each thread is going to target one particular device. Each thread is going to compile and run. We also keep another thread to run the sequential application. This is because we want to switch only if the application runs faster on the device, only. If it's not faster, just keep with HotSpot. They do a very good job. We keep one thread to run the sequential with HotSpot. Then we have a component here to make the decision. In fact, what each thread is running is an instance of the bytecodes, which is copy in execution, copy out for all the methods that you have. Then the TornadoVM will tell us, I know you're running with the sequential thread, but now if you switch to this device, at this time, you're going to run faster than it used to be.
How is this decision made? We introduce policies. We have three for now. We focus on performance. We have end-to-end performance, peak performance, and latency. The end-to-end policy takes into account JIT compilation and execution, including data transfer. Peak performance does not take JIT compilation into account, only execution. Then we make the decision. These two modes wait for all the devices in order to make a decision. That could be very slow. That's why we introduce a third policy called latency, which means, I spawn a set of Java threads, and the first one to finish, we just take it.
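As a rough sketch of the first two policies, the decision can be modeled as picking the candidate with the lowest cost, where only the cost function changes. The class `PolicyDemo`, the `Measurement` fields, and the timing numbers in `main` are made up for illustration; the latency policy is not modeled because it depends on which thread finishes first at run time.

```java
import java.util.ArrayList;
import java.util.List;

class PolicyDemo {
    enum Policy { END_TO_END, PEAK_PERFORMANCE }

    // One candidate: a device (or the sequential HotSpot thread) with its
    // measured JIT compilation time and execution time. Execution time is
    // assumed to already include data transfers.
    static final class Measurement {
        final String device;
        final double compileMs;
        final double executionMs;

        Measurement(String device, double compileMs, double executionMs) {
            this.device = device;
            this.compileMs = compileMs;
            this.executionMs = executionMs;
        }
    }

    // End-to-end counts compilation plus execution; peak counts execution only.
    static double cost(Policy policy, Measurement m) {
        if (policy == Policy.END_TO_END) {
            return m.compileMs + m.executionMs;
        }
        return m.executionMs;
    }

    // Returns the name of the cheapest candidate under the given policy.
    static String choose(Policy policy, List<Measurement> measurements) {
        if (measurements.isEmpty()) {
            throw new IllegalArgumentException("no candidates");
        }
        Measurement best = measurements.get(0);
        for (Measurement m : measurements) {
            if (cost(policy, m) < cost(policy, best)) {
                best = m;
            }
        }
        return best.device;
    }

    public static void main(String[] args) {
        List<Measurement> measurements = new ArrayList<>();
        measurements.add(new Measurement("hotspot", 0.0, 240.0));     // made-up numbers
        measurements.add(new Measurement("gpu", 1300.0, 4.0));        // made-up numbers
        measurements.add(new Measurement("multicore", 200.0, 60.0));  // made-up numbers
        System.out.println("end-to-end: " + choose(Policy.END_TO_END, measurements));
        System.out.println("peak: " + choose(Policy.PEAK_PERFORMANCE, measurements));
    }
}
```

With these made-up numbers, a single short run favors staying on HotSpot under the end-to-end policy, while the peak policy favors the GPU because compilation cost is ignored.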
Demo: Live Task Migration - Server/Client App
We have another call in the API that we can switch the device. That's what I'm going to show you. The internal logic is the same. Let me start with the code. I have a server client application, .java. The server will open a socket. The server will try to run one task on the GPU. It is very simple, it's just vector addition. The run method will wait for the client to tell in which device we want to run. Then the client can change the device at any point at runtime. The server will still be open. All the examples I've shown are available on GitHub, you can easily reproduce and read the code.
On the left-hand side, I'm going to run this server. That's the flag I use. I'm going to print the kernel. I'm going to run the debug just to tell me which device is running. On the right-hand side, I'm going to run the client. It stopped there to wait for a device which would run the task. Let's run in device 0. I would select the GPU, compile and run. Now I select the second device. It is an Intel CPU multi-core. Then the next one is a multi-core but with different OpenCL drivers, compile and run for all of them. Finally, the integrated graphics. I have been switching without restarting the application. In fact, now I can switch. Let's run on 0 again, on NVIDIA. I didn't compile. I just run it. The code is already in the code cache. Obviously, binaries are different because devices are different. I can run this one on device 1, this one on 0 again, 0 again. This one on the Intel graphics. This one on the NVIDIA. This one on the multi-core, and so on. I haven't restarted the application.
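The behavior seen in this demo, compile on the first run for a device and reuse afterwards, can be sketched as a cache keyed by task and device. `ToyCodeCache` is a hypothetical class, and the "binary" is just a string standing in for the device-specific code the driver would return.

```java
import java.util.HashMap;
import java.util.Map;

// Toy code cache: one compiled "binary" per (task, device) pair.
class ToyCodeCache {
    private final Map<String, String> cache = new HashMap<>();
    private int compilations = 0;

    // Returns the binary for this task on this device, compiling it only
    // the first time the pair is seen.
    String launch(String task, int device) {
        String key = task + "@" + device;
        String binary = cache.get(key);
        if (binary == null) {
            compilations++;
            binary = "binary(" + key + ")"; // placeholder for real device code
            cache.put(key, binary);
        }
        return binary;
    }

    int compilations() {
        return compilations;
    }

    public static void main(String[] args) {
        ToyCodeCache codeCache = new ToyCodeCache();
        int[] requestedDevices = {0, 1, 2, 3, 0, 1, 0};
        for (int device : requestedDevices) {
            codeCache.launch("vectorAdd", device);
        }
        System.out.println("compilations: " + codeCache.compilations());
    }
}
```

Because the key includes the device, switching back to a device that was used before costs no compilation, while each new device gets its own binary.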
New Compilation Tier for Heterogeneous Systems
In fact, what we propose is something like this. Are you familiar with HotSpot? How HotSpot compiles code? You might have seen this. For example, OpenJDK has a few compilers, C1 and C2, and by default the top one is C2. You might have the Graal compiler. As soon as you reach the maximum level, C2 or Graal, you cannot get faster. That's not possible. However, if we plug in the dynamic reconfiguration, we might get something faster after all. That's the whole idea. TornadoVM will say, I know you're running with the code that C2 produced. If we switch to multi-core, you're going to get faster. Then again, if you switch to a GPU, you're going to get even faster. Why this switching? Because you might run an application with different input sizes, for example. Depending on the input size, you might switch devices. Don't run on a GPU if you only have a few elements to process. You need at least 1000 threads to run on a GPU, otherwise the GPU is on holiday.
Just to mention a few related works in the context of Java. There are plenty out there. You might hear about Aparapi project, or IBM J9, they support only GPUs. Aparapi supports multi-core as well because they target OpenCL. There are a few other projects. This one actually is my PhD thesis. Tornado is the one that supports more devices because we can support GPUs, FPGAs, and multi-core. We can perform task migration. As far as I know, there is no other one that can do that. We can perform specializations at runtime. I know that we're doing that for the project. We can target also dynamic languages. I put grCUDA here for reference. grCUDA is a framework on top of Truffle, but the code you'll write is CUDA, not JavaScript or R. There are a lot of differences between grCUDA and TornadoVM. With Tornado you just write Java, which I believe is much higher level.
Let me show you performance in a real setup. I want to show you the dynamic reconfiguration in action in a server, and how to read this graph. X-axis shows input size, y-axis shows speedup against HotSpot. I have two applications, DFT and NBody. The dots or the squares is I run Tornado on that device without switching. I don't care. I just run on this, no matter what. The line is, I run Tornado with the ability of doing the task migration. Let's focus on the DFT end-to-end policy. Tornado will say, for small input sizes stay with HotSpot. They do a very good job. As soon as the data increases, TornadoVM will switch execution from HotSpot to the GPU. We'll continue from there. A similar situation applies for other benchmarks.
We can run on many devices. In fact, we can also run on AMD GPUs. One important thing I want to highlight, Tornado doesn't get performance for all benchmarks. Take, for example, SAXPY, where you have a huge amount of data to copy to the device, one or two operations to do on the device, and a huge amount of data to copy back. That doesn't work for GPUs or FPGAs. In fact, Tornado will get slowdown. For other types of applications, Black-Scholes, Mandelbrot, NBody for physics, matrix multiplication for deep learning, DFT, Tornado is a very good fit for running on heterogeneous devices.
If you're interested, we have a bunch of papers. Those are available in our GitHub repository. Actually, there are more coming.
Limitations
We have limitations. Most of the limitations come from the programming model we use underneath. We use OpenCL underneath. OpenCL doesn't support, for example, recursion. We don't support recursion on the device. We don't support objects in general. We support some types of objects, the objects for which we know the data layout. For example, in the matrix multiplication, remember the matrix 2D, 3D types, those are objects. We support those. We don't support dynamic memory allocation. In some cases, we do. Why? Because we compile at runtime. At runtime, if we know the size of the array, we can simulate dynamic allocation, but it's not really dynamic. We don't support exceptions. We might support exceptions in the future. This is a bit controversial. These devices are not designed for running this stuff. If we want to force this, we might not get the speedup we want. The philosophy we took with TornadoVM is, run on the device when your workload makes sense to run there. Otherwise, just stay with HotSpot. They do a very good job. Rather than competing against HotSpot, it is a complement to HotSpot or Graal.
Future Work
For future work, we have many things in progress. We are trying to integrate more specializations into the JIT compiler. GPUs have different memory tiers, like constant memory, local memory, global memory, and private memory. What other projects like Aparapi do is expose a set of API calls for each individual region. We have a PhD student working on that. What he's doing is, automatically, in the IR, working out which memory levels to use and generating the code for that. That will give you very good speedups in an automatic manner. The other thing we want to investigate is energy policies. We focus now on performance. Think about this: what if we could switch not because of performance requirements but because of energy requirements, and run this method on this device because it consumes less energy? That would be cool. This is work in progress. We're also working on new backends. For example, we're plugging in a PTX backend underneath, so that instead of OpenCL you run through PTX, on NVIDIA only. Hopefully, we can get slightly better performance.
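To give a feel for what using a faster memory tier means, the following plain-Java sketch compares a naive matrix multiplication with a tiled version that first stages small blocks of the inputs in scratch buffers. This is only a CPU-side analogy written by hand: the scratch arrays stand in for GPU local memory, the tile size of 16 is an arbitrary assumption, and none of it represents the code TornadoVM generates.

```java
import java.util.Arrays;

/** Hypothetical CPU analogy of staging data in a faster memory tier; not TornadoVM code. */
final class TiledMatMulSketch {

    static final int TILE = 16; // arbitrary tile size chosen for illustration

    /** Naive row-major multiply: every operand is read from the large arrays. */
    static void naive(float[] a, float[] b, float[] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0f;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }

    /** Tiled multiply: blocks of a and b are copied into small buffers and reused. */
    static void tiled(float[] a, float[] b, float[] c, int n) {
        float[] tileA = new float[TILE * TILE]; // stands in for local memory
        float[] tileB = new float[TILE * TILE];
        Arrays.fill(c, 0f);
        for (int i0 = 0; i0 < n; i0 += TILE) {
            for (int j0 = 0; j0 < n; j0 += TILE) {
                for (int k0 = 0; k0 < n; k0 += TILE) {
                    int iMax = Math.min(TILE, n - i0);
                    int jMax = Math.min(TILE, n - j0);
                    int kMax = Math.min(TILE, n - k0);
                    // Stage one block of each input in the scratch buffers.
                    for (int i = 0; i < iMax; i++)
                        for (int k = 0; k < kMax; k++)
                            tileA[i * TILE + k] = a[(i0 + i) * n + (k0 + k)];
                    for (int k = 0; k < kMax; k++)
                        for (int j = 0; j < jMax; j++)
                            tileB[k * TILE + j] = b[(k0 + k) * n + (j0 + j)];
                    // Accumulate the partial products for this block into c.
                    for (int i = 0; i < iMax; i++) {
                        for (int j = 0; j < jMax; j++) {
                            float sum = c[(i0 + i) * n + (j0 + j)];
                            for (int k = 0; k < kMax; k++) {
                                sum += tileA[i * TILE + k] * tileB[k * TILE + j];
                            }
                            c[(i0 + i) * n + (j0 + j)] = sum;
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 2;
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {5f, 6f, 7f, 8f};
        float[] c = new float[n * n];
        tiled(a, b, c, n);
        System.out.println(Arrays.toString(c));
    }
}
```

Writing this staging by hand for each memory region is the burden the talk attributes to explicit API calls; the future work described above aims to have the compiler insert it automatically.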
How TornadoVM is Currently Being Used in Industry
TornadoVM is part of a European project called E2Data. What we do is run Apache Flink MapReduce workloads on a heterogeneous cluster, automatically. For that, we plug in Tornado on the final node. The final node will contain a bunch of heterogeneous hardware. Ideally, you will just write your MapReduce computation with Flink. No changes in the programming model, no extensions, and everything will run, if we can, on heterogeneous devices, automatically. This is work in progress.
Then, there is a company called EXUS. It's here in London. What they do is use Tornado for machine learning. The goal of the application is to predict the number of patients who will be readmitted to a particular hospital, in order to plan resources such as medical staff, so doctors and so on. To do that, they trained a machine learning model in Java. The sequential code runs in around 2600 seconds. By using Tornado, they're going down to 188 seconds, so about 14 times faster.
Tornado is also open source. It's fully available on GitHub. Check it out. In the latest release, we introduced support for these dynamic languages with GraalVM. We also have Docker images. In fact, the demo I showed with Node.js runs through Docker. My personal advice is, if you want to give Tornado a try, just use the Docker images, because you don't have to mess around with the driver installation. We have two types of images: one for NVIDIA GPUs and one for Intel integrated graphics.
We have a team. Mostly, we are research and academic staff, and PhD students. They are working on different aspects of Tornado, like the optimizations and the Flink part. We also have a bunch of students now working on how to accelerate deep learning with Tornado. We are also looking for feedback and collaboration. We want to hear what you think about this, and whether you're missing something.
Takeaways
Today's computing devices are heterogeneous. Heterogeneous devices are everywhere; there is no way to escape. The question is how to program them efficiently. I showed one alternative: programming them with Tornado. With that, you can take a program in a high-level programming language, by that I mean Java, R, Ruby, and run it transparently on heterogeneous hardware. I have also shown a way to do task migration. You can get very high speedups. We're not talking about 100x, we're talking about 1000x, 4000x. This is because we exploit the capabilities of the device.
Questions and Answers
Participant 1: I noticed when you ran one of your examples that your computer said your GPU memory is running low. Does it do any automatic cleanup? You're copying code to the GPU array. Does it do cleanup of that as well?
Fumero: We run Tornado in isolation. After running Tornado, we obviously do a cleanup phase.
Participant 2: One of your slides, when you had the matrix comparing all of the different projects. It said that Tornado wasn't production-ready, so not yet(*). What do you need to do?
Fumero: We are pushing for something else. I believe we support more things than others. For now, we are running as part of an academic project. What we need is a use case from a company, and to make Tornado useful for them. One of those companies is EXUS. Hopefully, they can take it to production. For now, because we're running as part of the European project, it's academic. That's what I mean by not yet. For me, I would say yes, of course.
Participant 3: Any thoughts on using TPUs?
Fumero: If the TPU supports OpenCL, yes, why not? We were thinking of introducing LLVM IR, and perhaps going through this project by Google that targets TPUs from this IR. We don't have students working on that right now. That's one of the things we're discussing internally to get more backends. One of the backends we're building right now is the new one, the PTX backend.