Hadoop Jobs on GPU with ParallelX
The MapReduce paradigm is not always ideal when dealing with large computationally intensive algorithms. A small team of entrepreneurs is building a product called ParallelX to solve that bottleneck by harnessing the power of GPUs to give Hadoop jobs a significant boost.
ParallelX is “a GPU compiler that translates the code you’ve written in Java to OpenCL, and executing it on our AWS GPU cloud”, says co-founder Tony Diepenbrock. The end product is a service similar to Amazon’s Elastic MapReduce, except that it will make use of the EC2 GPU instances.
Amazon is of course not the only cloud provider offering GPU servers; companies like IBM/Softlayer and Nimbix also offer servers with NVidia GPUs. When asked whether ParallelX is going to support providers other than Amazon, Tony replies “Not any time soon, but we will have an SDK available for customers with in-house Hadoop clusters. Most of the GPU cloud providers offer GPUs in HPC clouds but we want cheap GPUs in the cloud. After all, that is what Hadoop is designed for–cheap commodity hardware.”
To understand what the ParallelX compiler does, it is important to note that there are different types of GPUs along with different parallel computing platforms such as CUDA or OpenCL. As Tony explains, the ParallelX “compiler will translate the JVM bytecode to OpenCL 1.2 code, which will then pass through the OpenCL compiler to get compiled into shader assembly to execute on the GPU. There is now FPGA hardware that also is capable of running OpenCL code, but support for generalized parallel hardware will be supported in the future.” Although it does not support reflection or native calls in the Java source code, ParallelX aims to ensure that developers have to make as few code changes as possible to their MapReduce jobs.
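To make the constraint above concrete, here is a minimal sketch in plain Java of the kind of map and reduce logic that avoids reflection and native calls. It is not ParallelX or Hadoop code: the class name `TranslatableJobSketch` and its methods are hypothetical, and the assumption that code of this shape is easier for a bytecode-to-OpenCL translator to handle is ours, not a statement from the ParallelX team.

```java
import java.util.Arrays;

// Hypothetical illustration only: plain Java, no Hadoop or ParallelX APIs.
public class TranslatableJobSketch {

    // Map step: one record in, one value out, using only primitive arithmetic.
    static int map(int record) {
        return record * record;
    }

    // Reduce step: integer addition is associative, so partial results
    // can be combined in any order by parallel workers.
    static int reduce(int a, int b) {
        return a + b;
    }

    // Runs the map and reduce steps over all records on the CPU.
    static int sumOfSquares(int[] records) {
        return Arrays.stream(records)
                .parallel()
                .map(TranslatableJobSketch::map)
                .reduce(0, TranslatableJobSketch::reduce);
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new int[] {1, 2, 3})); // prints 14
    }
}
```

Both steps are pure functions over primitives, with no reflection, no native calls and no shared mutable state, which is the restriction the paragraph above describes.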
As the ParallelX team is looking into increasing throughput for I/O-bound jobs, Tony notes that they are “also going to support real-time processing, queries expressed in Pig and Hive code, and streaming of large data sets for I/O bound jobs. Using our pipelining framework, I/O throughput almost reached the computing throughput of the GPUs in our tests.”
Although they are focusing their efforts on the Amazon Hadoop distribution, the team is planning to target other popular Hadoop distributions such as Cloudera's CDH, and it will undoubtedly be useful to take advantage of the many improvements to Hive and Pig offered by these commercial distributions in the context of ParallelX.
The story of ParallelX is one of a kind, and Tony relates the odyssey of this project over the past 2.5 years in an article, starting with a social network for fraternities, then a widget for Facebook, and culminating in a tool to identify plagiarized code. These projects had something in common: graph analytics and algorithms on GPU, which is where the idea for ParallelX came from almost naturally.
There are many different workloads that can be a good fit for ParallelX. The focus is on high-performance computing, for example machine learning and heavy analytics like graph processing. As an illustration of its capabilities, the ParallelX team was able to cluster a large fraternity network in under one second on a single GPU, a task that used to take an hour when parallelized across six computers. But in practice there is no limit, as anything that is written for MapReduce can be compiled to the GPU with ParallelX.
The ParallelX team is planning to publish its data and a whitepaper in the future to demonstrate the performance of its Hadoop-to-GPU compiler on real-world workloads. The community response has been somewhat divided, and some are waiting for this whitepaper before switching gears, as can be seen in the comments when this was posted on Hacker News: “Extraordinary claims require extraordinary evidence.”
You can already get a taste of the power of GPUs on Hadoop by using Aparapi, a Java API that executes specific code fragments on the GPU by converting their Java bytecode to OpenCL. Such fragments can be embedded in any MapReduce job written in Java.
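The sketch below shows the general shape of a data-parallel fragment of the kind described above: a small function that computes one output element from its index. It deliberately uses only the standard Java library and runs on the CPU; it does not use Aparapi's actual API. The names `KernelStyleSketch`, `squareKernel` and `runAll` are hypothetical and chosen for illustration.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration only: standard Java, executed on the CPU.
public class KernelStyleSketch {

    // Kernel-style body: each invocation handles exactly one index (gid)
    // and writes to its own slot, so invocations never conflict.
    static void squareKernel(int gid, float[] in, float[] out) {
        out[gid] = in[gid] * in[gid];
    }

    // Launches one kernel invocation per element, in parallel.
    static float[] runAll(float[] in) {
        float[] out = new float[in.length];
        IntStream.range(0, in.length)
                .parallel()
                .forEach(gid -> squareKernel(gid, in, out));
        return out;
    }

    public static void main(String[] args) {
        float[] result = runAll(new float[] {1f, 2f, 3f, 4f});
        System.out.println(Arrays.toString(result)); // prints [1.0, 4.0, 9.0, 16.0]
    }
}
```

Because every invocation depends only on its own index and writes to a distinct array slot, the work maps naturally onto the one-work-item-per-element execution model that OpenCL uses.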
ParallelX could be a significant step in opening up Hadoop to a research-oriented audience that needs increasingly complex algorithms. Graph analysis algorithms, for example, can get very good performance today using the Bulk Synchronous Parallel (BSP) model popularized by Apache Hama, and if ParallelX can be combined with projects like Apache Giraph, which runs graph algorithms as MapReduce jobs, it could be a worthwhile addition to any data scientist’s graph analytics toolkit.
You can sign up for the beta online by entering your email address. ParallelX is planning to support a freemium plan with limited storage but access to powerful GPUs.