
Greenplum Pivotal HD Combines the Strengths of SQL and Hadoop

by Abel Avram on Feb 27, 2013

EMC Greenplum has announced Pivotal HD, a new Hadoop distribution that includes a fully compliant SQL MPP database running on HDFS, described as “hundreds of times faster than Hive”.

Pivotal HD contains the usual suspects of a standard Hadoop distribution – HDFS, Pig, Hive, Mahout, MapReduce, etc. – but adds a number of other components shown in the architectural snapshot below:

image

The main component of Pivotal HD is HAWQ, an MPP (Massively Parallel Processing) relational database that runs directly on HDFS in Hadoop through a dynamic pipelining mechanism and features:

  • SQL Compliant – supporting all versions of SQL:  ‘92, ‘99, 2003 OLAP, etc. 100% compatible with PostgreSQL 8.2.
  • Row or column-oriented data storage
  • Query Optimizer – queries can be run on hundreds of thousands of nodes
  • Fully ODBC/JDBC compliant
  • Interactive Query – complex queries on large data sets are solved in seconds or even sub-seconds
  • Data management – provides table statistics, table security
  • Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf, Delimited Text and Sequence Files
  • Deep analytics – including data mining or machine learning algorithms

Gavin Sherry, Sr. Director of Engineering at Greenplum, demoed (see video at ~42’42”) the following SQL SELECT statement running over 1 billion rows totaling several TB of data on a 60-node HDFS cluster in ~13 seconds, providing close to real-time querying capabilities:

SELECT gender, count(*)
FROM retail.order JOIN customers ON retail.order.customer_ID = customers.customer_ID
GROUP BY gender;
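
To make the shape of the demo query concrete, here is a minimal sketch that runs an equivalent join and aggregation against a tiny in-memory SQLite database from Python's standard library. This is not HAWQ and says nothing about its performance. The column layouts and sample rows are invented for illustration, and the orders table is named `orders` here (an assumption) rather than `retail.order`, since the sketch uses no schema and `ORDER` is a reserved word.

```python
import sqlite3


def gender_counts(customers, orders):
    """Count orders per customer gender, mirroring the demo query's join + GROUP BY."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schemas; the article does not show the real ones.
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, gender TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    rows = conn.execute(
        "SELECT gender, count(*) "
        "FROM orders JOIN customers ON orders.customer_id = customers.customer_id "
        "GROUP BY gender"
    ).fetchall()
    conn.close()
    return dict(rows)


# Invented sample data: three customers, four orders.
customers = [(1, "F"), (2, "M"), (3, "F")]
orders = [(10, 1), (11, 1), (12, 2), (13, 3)]
print(gender_counts(customers, orders))
```

The result has one row per gender, each carrying the number of orders placed by customers of that gender.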

According to Donald Miner, Solutions Architect at EMC Greenplum, “HAWQ is hundreds of times faster than Hive”, as shown in the next graphic from Greenplum (PDF):

image

HAWQ solves queries with “sub-second response time, while at the same time running over much larger datasets and processing with the full expressiveness of SQL, in the same engine.” Miner explains how they made it possible:

We have what we call “segment servers” manage a shard of each table. Several segment servers run on each data node of your cluster. This shard of data, however, is completely stored within HDFS. We have a “master” node that has the job of storing the top-level metadata, as well as building the query plan and pushing the node-local queries down to the segment servers.

When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine. HAWQ follows MPP architecture, streaming data through stages in a pipeline, instead of spilling and checkpointing to disk (like MapReduce). Also, the segment servers are always running, so there is no spin-up time.
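
The master/segment split and the pipelined execution Miner describes can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not HAWQ's implementation: the names `shard`, `segment_query` and `master_query`, the choice of four segments, and distribution by `customer_id` modulo the segment count are all hypothetical. The segments also run one after another here, whereas a real MPP engine runs them in parallel.

```python
from collections import Counter

NUM_SEGMENTS = 4  # hypothetical; the article gives no segment count


def shard(rows, key_index, num_segments=NUM_SEGMENTS):
    """Split rows into shards by the join key so matching rows share a segment."""
    shards = [[] for _ in range(num_segments)]
    for row in rows:
        shards[row[key_index] % num_segments].append(row)
    return shards


def scan(rows):
    """Stage 1: stream rows one at a time (stand-in for reading a shard from HDFS)."""
    for row in rows:
        yield row


def local_join(order_stream, customer_shard):
    """Stage 2: join each streamed order against the segment's customer shard."""
    gender_by_id = dict(customer_shard)  # customer rows are (customer_id, gender)
    for _order_id, customer_id in order_stream:
        if customer_id in gender_by_id:
            yield gender_by_id[customer_id]


def segment_query(order_shard, customer_shard):
    """Node-local plan: scan -> join -> partial count, with no intermediate
    results materialized between the stages (generators feed each other)."""
    return Counter(local_join(scan(order_shard), customer_shard))


def master_query(orders, customers):
    """The 'master' pushes the local plan to every segment and merges the partials."""
    order_shards = shard(orders, key_index=1)
    customer_shards = shard(customers, key_index=0)
    total = Counter()
    for order_shard, customer_shard in zip(order_shards, customer_shards):
        total.update(segment_query(order_shard, customer_shard))
    return dict(total)


# Invented sample data; order 15 references a customer that does not exist.
customers = [(1, "F"), (2, "M"), (3, "F"), (5, "M")]
orders = [(10, 1), (11, 1), (12, 2), (13, 3), (14, 5), (15, 9)]
print(master_query(orders, customers))
```

Because each stage is a generator, a row flows from scan to join to aggregate without being written out between stages, which is the contrast with MapReduce-style checkpointing that the quote draws.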

Pivotal HD comes in three flavors (PDF): Enterprise, Database Services and a Community Edition for evaluation purposes.
