Hortonworks Announces Hive 0.13 with Vectorized Query Execution and Hive on Tez
Hive is an open source SQL engine built on top of Hadoop that lets users query big data warehouses by writing SQL queries instead of MapReduce jobs. The latest release introduces significant performance improvements, as well as some new SQL features.
The new release also marks the completion of the Stinger initiative that spanned releases 0.11, 0.12, and 0.13. The initiative’s goal is to increase query performance by a hundred times as compared to Hive 0.10, while allowing Hive to scale to petabytes of data.
Hive on Tez allows users to take advantage of a more efficient execution engine, resulting in faster queries. Tez is an application framework built on top of Hadoop YARN that can execute complex directed acyclic graphs of general data processing tasks. Some optimizations implemented by Tez include:
- Executing MapReduce shuffle in-memory instead of on-disk for small data sets.
- Pipelining multiple Reduce phases together without the need to produce intermediate HDFS files.
- More efficient distributed joins.
- Faster process startup and initialization through container re-use.
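To make the directed acyclic graph idea concrete, here is a minimal Python sketch of running dependent stages in order while handing results from one stage to the next in memory. It is only an illustration of the concept, not Tez's API: `run_dag`, `count_by_key`, and the three stage names are hypothetical, and a real engine distributes the work across containers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def count_by_key(pairs):
    """Sum the values for each key (a tiny stand-in for a reduce phase)."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts


def run_dag(tasks, deps):
    """Run each task after all of its upstream tasks have finished.

    tasks: name -> function taking a list of upstream results
    deps:  name -> list of upstream task names
    Results are passed along in memory; no intermediate files are written.
    """
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in deps.get(name, [])]
        results[name] = tasks[name](upstream)
    return results


# Hypothetical pipeline: one map stage followed by two chained reduce stages.
tasks = {
    "map": lambda _: [("a", 1), ("b", 1), ("a", 1)],
    "reduce_count": lambda inputs: count_by_key(inputs[0]),
    "reduce_top": lambda inputs: max(inputs[0].items(), key=lambda kv: kv[1]),
}
deps = {
    "map": [],
    "reduce_count": ["map"],
    "reduce_top": ["reduce_count"],
}

results = run_dag(tasks, deps)
print(results["reduce_top"])
```

The second reduce stage consumes the first one's output directly, which mirrors the pipelining of multiple Reduce phases described in the list above.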
Vectorized query execution further improves Hive performance. This database optimization technique was popularized by the MonetDB/X100 project, and is explained in detail in the following whitepaper. Vectorized query execution works by processing data in batches of about a thousand rows, instead of one row at a time. This results in more efficient CPU usage and lower deserialization overhead.
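The difference between the two processing styles can be sketched in plain Python. This is a simplified illustration and not Hive's implementation: the function names are hypothetical, and the batch size of 1024 is an assumption chosen to match "about a thousand rows". A real engine works on typed primitive arrays, which is where most of the CPU savings come from.

```python
BATCH_SIZE = 1024  # assumed batch size, "about a thousand rows"


def filter_sum_row_at_a_time(rows, threshold):
    """Evaluate the filter and the aggregate once per row."""
    total = 0
    for price, qty in rows:
        if price > threshold:
            total += price * qty
    return total


def filter_sum_batched(prices, qtys, threshold):
    """Evaluate each operator over a whole batch of column values."""
    total = 0
    for start in range(0, len(prices), BATCH_SIZE):
        p = prices[start:start + BATCH_SIZE]
        q = qtys[start:start + BATCH_SIZE]
        # Filter step: build a selection vector of qualifying positions.
        selected = [i for i, value in enumerate(p) if value > threshold]
        # Aggregate step: touch only the selected positions.
        total += sum(p[i] * q[i] for i in selected)
    return total


prices = list(range(3000))
qtys = [2] * 3000
rows = list(zip(prices, qtys))

row_result = filter_sum_row_at_a_time(rows, 1500)
batch_result = filter_sum_batched(prices, qtys, 1500)
print(row_result == batch_result)
```

Both functions compute the same answer; the batched version simply reorganizes the work so that each operator runs in a tight loop over column slices.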
Hive 0.13 adds native Parquet support. Parquet, a columnar format for Hadoop, is already supported by a number of existing big data technologies, including Apache Drill, Cloudera Impala, and Apache Spark. Parquet’s compressed, efficient columnar data representation lowers the amount of data Hive has to scan through, resulting in faster query execution time.
Hive 0.13 also fixes several bugs related to support of the Optimized Row Columnar (ORC) file format. ORC file format provides an efficient way to store and query data in Hive by compressing the data in a columnar format. It was introduced in Hive 0.11 and provides performance benefits that are similar to the Parquet file format.
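As a rough illustration of why a columnar layout reduces the amount of data scanned, the following Python sketch counts how many values a query touches when summing a single field. The sample table and function names are hypothetical, and the sketch ignores compression and encoding, which Parquet and ORC add on top of the layout.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(rows, field):
    """Row layout: every field of every row is read to reach one field."""
    total = 0
    touched = 0
    for row in rows:
        touched += len(row)
        total += row[field]
    return total, touched


def sum_from_columns(columns, field):
    """Columnar layout: only the requested column is read."""
    values = columns[field]
    return sum(values), len(values)


rows = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "DE", "amount": 20},
    {"id": 3, "country": "US", "amount": 30},
]
columns = to_columns(rows)

row_total, row_touched = sum_from_rows(rows, "amount")
col_total, col_touched = sum_from_columns(columns, "amount")
print(row_total, row_touched, col_total, col_touched)
```

With three columns, the columnar scan touches a third of the values for the same result, and the gap widens as tables get wider.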
Hive 0.13 introduces SQL standard-based authorization, which lets users define their authorization policies in a SQL-compliant fashion. The Apache Hive community extended the SQL language to support grant and revoke on entities. Hive now supports showing roles, user privileges, and active privileges.
Other improvements in Hive 0.13 include:
- Addition of new data types like Date, Timestamp, Decimal, Char, and Varchar.
- Support for unqualified column references in joining conditions.
- Support for subqueries inside IN, NOT IN, EXISTS, and NOT EXISTS clauses.
- Support for Permanent Functions.
- Support for join conditions in the WHERE clause.
- Operator-level cardinality estimation.
Release notes for Hive 0.13 can be found here.