Greenplum Pivotal HD Combines the Strengths of SQL and Hadoop

EMC Greenplum has announced Pivotal HD, a new Hadoop distribution that includes a fully compliant SQL MPP database running on HDFS, said to be “hundreds of times faster than Hive”.

Pivotal HD contains the usual suspects of a standard Hadoop distribution – HDFS, Pig, Hive, Mahout, MapReduce, etc. – but adds a number of other components shown in the architectural snapshot below:

The main component of Pivotal HD is HAWQ, an MPP (Massively Parallel Processing) relational database that runs directly on HDFS in Hadoop through a dynamic pipelining mechanism and features:

SQL Compliant – supports all versions of SQL (’92, ’99, 2003 OLAP, etc.) and is 100% compatible with PostgreSQL 8.2
Row or column-oriented data storage
Query Optimizer – queries can be run on hundreds of thousands of nodes
Fully ODBC/JDBC compliant
Interactive Query – complex queries on large data sets are solved in seconds or even sub-seconds
Data management – provides table statistics and table security
Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files
Deep analytics – including data mining or machine learning algorithms

Gavin Sherry, Sr. Director of Engineering at Greenplum, demoed (see video at ~42’42”) the following SQL SELECT statement running over 1B rows, totaling several TB of data, on a 60-node HDFS cluster in ~13 seconds, providing close to real-time querying capabilities:

SELECT gender, count(*)
FROM retail.order JOIN customers ON retail.order.customer_ID = customers.customer_ID
GROUP BY gender;
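To make the shape of that query concrete, here is a minimal sketch that runs the same join and aggregation on a handful of rows using Python's built-in sqlite3 module. It is not HAWQ and says nothing about its performance. The table layouts, the sample rows, and the `gender_counts` helper are assumptions made for illustration, and because `order` is a reserved word in SQLite the table name is quoted here.

```python
import sqlite3


def gender_counts(conn):
    """Count orders per customer gender (same join/group-by as the demo query)."""
    return conn.execute(
        '''
        SELECT c.gender, count(*)
        FROM retail."order" AS o
        JOIN customers AS c ON o.customer_ID = c.customer_ID
        GROUP BY c.gender
        ORDER BY c.gender
        '''
    ).fetchall()


# Toy setup: an in-memory database, with a second in-memory database
# attached under the name "retail" so that retail."order" resolves.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS retail")

# Hypothetical schemas; the article does not show the real ones.
conn.execute("CREATE TABLE customers (customer_ID INTEGER PRIMARY KEY, gender TEXT)")
conn.execute('CREATE TABLE retail."order" (order_ID INTEGER PRIMARY KEY, customer_ID INTEGER)')

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "F"), (2, "M"), (3, "F")])
conn.executemany(
    'INSERT INTO retail."order" VALUES (?, ?)',
    [(10, 1), (11, 1), (12, 2), (13, 3)],
)

result = gender_counts(conn)
print(result)
```

The `ORDER BY` clause is added only so the toy output is deterministic; the demoed statement does not include it.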

According to Donald Miner, Solutions Architect at EMC Greenplum, “HAWQ is hundreds of times faster than Hive”, as shown in the next graphic from Greenplum (PDF):

HAWQ solves queries with “sub-second response time, while at the same time running over much larger datasets and processing with the full expressiveness of SQL, in the same engine.” Miner explains how they made it possible:

We have what we call “segment servers” manage a shard of each table. Several segment servers run on each data node of your cluster. This shard of data, however, is completely stored within HDFS. We have a “master” node that has the job of storing the top-level metadata, as well as building the query plan and pushing the node-local queries down to the segment servers.

When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine. HAWQ follows MPP architecture, streaming data through stages in a pipeline, instead of spilling and check pointing to disk (like MapReduce). Also, the segment servers are always running, so there is no spin-up time.
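The division of labor Miner describes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HAWQ's implementation: each "segment" streams its own shard through generator stages without materializing intermediate results, and a "master" merges the per-segment partial results. All function names, the sample data, and the choice to give every segment a full copy of the small customer table are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Order = Tuple[int, int]  # (order_id, customer_id)


def scan(shard: Iterable[Order]) -> Iterator[Order]:
    """Stage 1: stream the rows of one shard, one row at a time."""
    for row in shard:
        yield row


def join_gender(rows: Iterable[Order], customers: Dict[int, str]) -> Iterator[str]:
    """Stage 2: look up each order's customer and emit the gender."""
    for _order_id, customer_id in rows:
        gender = customers.get(customer_id)
        if gender is not None:  # inner join: drop orders with no matching customer
            yield gender


def segment_query(shard: Iterable[Order], customers: Dict[int, str]) -> Counter:
    """Node-local query: the stages are chained generators, so rows flow
    through the pipeline without intermediate results being stored."""
    return Counter(join_gender(scan(shard), customers))


def master_query(shards: List[List[Order]], customers: Dict[int, str]) -> Dict[str, int]:
    """Master: push the same node-local query to every segment, then merge
    the partial counts into the final answer."""
    total: Counter = Counter()
    for shard in shards:
        total.update(segment_query(shard, customers))
    return dict(total)


# Hypothetical data: three shards of the order table, one small customer table.
customers = {1: "F", 2: "M", 3: "F"}
shards = [[(10, 1), (11, 2)], [(12, 3), (13, 1)], [(14, 2)]]

result = master_query(shards, customers)
print(result)
```

In this sketch the segments run one after another in a single process. In a real MPP system they would run in parallel on separate nodes, which is where the speedup comes from.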

Pivotal HD comes in three flavors (PDF): Enterprise, Database Services and a Community Edition for evaluation purposes.
