Google Improves Hadoop Performance with New Cloud Storage Connector
Developers do big data processing with Hadoop in a variety of cloud environments, but Google is trying to stand apart through investments in the storage subsystem. With a new connector, it is now possible for Hadoop to run directly against Google Cloud Storage instead of using the default, distributed file system. This results in lower storage costs, fewer data replication activities, and a simpler overall process.
Google has played a role in Hadoop ever since unveiling research on the Google File System in 2003 and MapReduce in 2004. Google’s commercial Hadoop service, called Hadoop on Google Cloud Platform, is described as a “a set of setup scripts and software libraries optimized for Google infrastructure.” Hadoop jobs typically use storage based on local disks running the Hadoop Distributed File System (HDFS), but this new Google Cloud Storage Connector for Hadoop attempts to simplify that. The connector makes it possible to run Hadoop directly against Google Cloud Storage instead of copying data to a HDFS store. Google Cloud Storage – based on Google latest storage technology called Colossus – is a highly-available object storage service for storing large amounts of data in US and EU data centers.
In a blog post announcing the connector, Google outlined a series of benefits of using Google Cloud Storage instead of HDFS for Hadoop processing. These benefits include faster job startup, higher data availability, lower costs due to fewer copies of data, data that survives after the Hadoop cluster is deleted, removal of file system maintenance activities, and better overall performance.
Hadoop performance was the focus of a guest post from Accenture’s Technology Labs on the Google Cloud Platform blog . Accenture took the new Google Cloud Storage Connector for Hadoop for a spin and published their results. They devised a benchmark to compare Hadoop performance on bare metal servers versus cloud environments. Their initial setup on the Google Compute Engine followed Hadoop best practices, but was relatively complex.
In conducting our experiments we used Google Compute Engine instances with local disks and streaming MapReduce jobs to copy input/output data to/from the local HDFS within our Hadoop clusters.
As shown in Figure 1, this data-flow method provided us with the data we needed for our benchmarks at the cost of the total execution time, with the additional copy input and output phases. In addition to increased execution times, this data-flow model also resulted in more complex code for launching and managing experiments. The added code required modification of our testbench scripts to include the necessary copies and extra debugging and testing time to ensure the scripts were correct.
The Accenture team was approached by Google to use the new connector in their tests, and the Hadoop setup was modified to take advantage of it.
Once the connector was configured we were able to change our data-flow model removing the need for copies and giving us the ability to directly access Google Cloud Storage for input data and write output data.
Figure 2 shows the direct access of input data by the MapReduce job(s) and ability to write output data directly to Google Cloud Storage all without additional copies via streaming MapReduce jobs. Not only did the connector reduce our execution times by removing the input and output copy phases, but the ability to access the input data from Google Cloud Storage proved unexpected performance benefits. We were able to see further decreases in our execution times due to the high availability of the input data compared to traditional HDFS access.
Across a range of Hadoop tests, the new connector shaved significant time off of the execution duration. The savings resulted from removing the need to copy input and output data, having consistent data distribution among nodes, and faster data transfer. The last result is fairly surprising as one might not expect remote storage to outperform local storage.
This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.
While counterintuitive, our experiments prove that using remote storage to make data highly available outperforms local disk HDFS relying on data locality.
This Accenture study showed that Hadoop processing on the Google Cloud Platform had a better price-performance ratio than bare metal servers, and that performance tuning of cloud servers is a complex, but valuable exercise.