Hadoop and Metadata (Removing the Impedance Mismatch)
Apache Hadoop enables a revolution in how organizations process data. The freedom and scale Hadoop provides enable new kinds of applications that build new kinds of value and deliver results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has, however, posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data which can be stored and processed in any format the user desires? Furthermore, how do we integrate between different tools and with the other systems that make up the data center as a computer?
For a Hadoop user, the need for a metadata directory is clear. Users don’t want to ‘reinvent the wheel’ and repeat the work of others. They want to share results and intermediate data-sets and collaborate with colleagues. Given these needs, the case for a generic metadata mechanism on top of Hadoop is easy to make: registering data assets with a metadata registry makes them visible for discovery and sharing, which means greater efficiency and less work for the user.
Users also want to be able to use different toolsets and systems together, Hadoop and non-Hadoop alike. There is a clear need for interoperability among the diverse tools on today’s Hadoop cluster: Hive, Pig, Cascading, Java MapReduce, and streaming with Python, C/C++, Perl, and Ruby, with data stored in formats including CSV, TSV, Thrift, Protobuf, Avro, SequenceFiles, and Hive’s RCFile, as well as proprietary formats.
Finally, raw data does not usually originate on the Hadoop Distributed Filesystem. There is a clear need for a central point to register resources from different kinds of systems for ETL onto HDFS, and to publish results of analyses on Hadoop onto other systems.
Indeed Curt… HCatalog Really DOES Matter
Curt Monash’s recent blog post titled “HCatalog - yes it matters” hits the mark in many ways and is a recommended read. In his post, Curt draws parallels between a Database Management System (DBMS) and the value HCatalog provides for a Hadoop cluster as a metadata service. While this is spot-on analysis, it is also important to point out that HCatalog is equally important as an interface that exposes Hadoop to the enterprise application ecosystem.
This post extends the great points that Curt makes with some further information about HCatalog's definition, history and usage.
One of the most attractive qualities of Hadoop is its flexibility to work with semi-structured and unstructured data without schemas. Unstructured data represents 80% of the overall data in most organizations and is growing at 10-50x the rate of structured data. Indeed, Hadoop excels at extracting structured data from unstructured data. HCatalog helps Hadoop deliver value from the output of its labor by giving those who would consume it (analysts, systems and applications) access to the mined, structured data.
HCatalog is a metadata and table management system for Hadoop. It is based on the metadata layer found in Hive and provides a relational view, through a SQL-like language, of data within Hadoop. HCatalog allows users to share data and metadata across Hive, Pig, and MapReduce. It also allows users to write their applications without being concerned with how or where the data is stored, and insulates users from schema and storage format changes.
This flexibility ultimately decouples data producers, consumers, and administrators, giving them a well-defined common ground upon which to collaborate. Data producers can add a new column to the data without breaking their consumers' data reading applications. Administrators can relocate data or change the format it is stored in without requiring changes on the part of the producers or consumers. New data-sets can more easily find consumers who are informed of their existence via HCatalog.
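To make the point about added columns concrete, here is a toy sketch in Python. It does not use the HCatalog API; the table, the column names, and the helper functions are all hypothetical. It only illustrates the idea that a consumer which resolves fields by name through a shared schema, rather than by position in the raw record, keeps working when a producer adds a column.

```python
# Toy illustration only: this is NOT the HCatalog API.
# All names (read_records, total_revenue, the columns) are hypothetical.

def read_records(schema, rows):
    """Pair each positional row with the column names from the registered schema."""
    return [dict(zip(schema, row)) for row in rows]


def total_revenue(records):
    """A consumer that looks up fields by name, never by position."""
    return sum(record["revenue"] for record in records)


# Version 1 of a hypothetical table, as the producer first registered it.
schema_v1 = ["user_id", "revenue"]
rows_v1 = [(1, 10.0), (2, 5.5)]

# Version 2: the producer inserts a "country" column before "revenue".
schema_v2 = ["user_id", "country", "revenue"]
rows_v2 = [(1, "US", 10.0), (2, "MX", 5.5)]

# The consumer code is unchanged and returns the same total for both versions.
print(total_revenue(read_records(schema_v1, rows_v1)))
print(total_revenue(read_records(schema_v2, rows_v2)))
```

A consumer that instead read `row[1]` as revenue would silently pick up the country string after the change, which is the kind of breakage a shared schema is meant to prevent.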
HCatalog makes the Hive metastore available to users of other tools on Hadoop. It provides connectors for MapReduce and Pig so that users of those tools can read data from and write data to Hive's relational column format. It has a command line tool for users who do not use Hive to operate on the metastore with Hive DDL statements. It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in Hive’s warehouse.
REST Interface for Hadoop
Templeton is a character in the book Charlotte’s Web. He is a gluttonous rat that provides the protagonist (Pig - Wilbur) help, but only when offered food. In Hadoop, Templeton helps HCatalog by providing a REST interface on top of the metadata. It is a REST API for Hadoop that allows external resources to interact with Hadoop without using Hadoop’s own APIs. This gluttonous rat helps us all by opening up Hadoop via one of the simplest, best understood and most common interfaces. It opens Hadoop to application developers.
Templeton is more than a JDBC connector on top of Hive. A REST interface exposes a shared and dynamic metadata layer to both existing and new applications via HTTP. This opens resources mapped in HCatalog and Hive to anything with an HTTP client.
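As a rough sketch of what "anything with an HTTP client" can look like, the Python below builds the URL for describing a table and wraps the call in a small helper. The path layout (`/templeton/v1/ddl/database/<db>/table/<table>`), the `user.name` query parameter, and port 50111 follow the Templeton (WebHCat) documentation as we understand it, so check them against your own installation. The host, database, table, and user names are placeholders.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def table_url(host, database, table, user, port=50111):
    """Build the Templeton URL that describes one table (assumed path layout)."""
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    return "http://{}:{}{}?{}".format(host, port, path, urlencode({"user.name": user}))


def describe_table(host, database, table, user):
    """Fetch the table description as parsed JSON. Needs a reachable server."""
    with urlopen(table_url(host, database, table, user)) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder values; nothing is sent over the network here.
print(table_url("localhost", "default", "page_views", "analyst"))
```

Calling `describe_table("localhost", "default", "page_views", "analyst")` against a running server would return the table's metadata as a Python dictionary, with no Hadoop client libraries involved.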
HCatalog in Use
There are three basic uses of HCatalog.
1. Communication Between Tools
The majority of heavy Hadoop users do not use a single tool for data processing. Often users and teams will begin with a single tool: Hive, Pig, MapReduce, or another tool. As their use of Hadoop deepens they will discover that the tool they chose is not optimal for the new tasks they are taking on. Users who start with analytics queries using Hive discover they would like to use Pig for ETL processing or constructing their data models. Users who start with Pig discover they would like to use Hive for analytics-type queries. While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. When all these tools share one metastore, users of each tool have immediate access to data created with another tool. No loading or transfer steps are required.
2. Data Discovery
When Hadoop is used for analytics, users extract structured information from raw data. They will often use Pig, streaming, and MapReduce to uncover new insights. The information is valuable, but typically only in the context of a larger analysis. With HCatalog you can publish results so they can be accessed by your analytics platform via REST. In this case, the schema defines the discovery. These discoveries are also useful to other data scientists. Often they will want to build on what others have created or use results as input into a subsequent discovery. Registering data with HCatalog is an announcement that new data is available.
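Publishing a result in this way amounts to registering a table and its schema. The sketch below prepares, but does not send, a REST request that would create such a table. The `PUT` verb, the URL layout, and the JSON fields (`columns`, `format`, `storedAs`) are taken from the Templeton (WebHCat) documentation as we understand it and may differ between versions. The server address, table name, columns, and user are hypothetical.

```python
import json
from urllib.request import Request


def create_table_request(base_url, database, table, user, columns, stored_as="rcfile"):
    """Prepare (without sending) a request that registers a new table."""
    payload = {
        # Each column is a name/type pair, as in Hive DDL.
        "columns": [{"name": name, "type": col_type} for name, col_type in columns],
        "format": {"storedAs": stored_as},
    }
    url = "{}/templeton/v1/ddl/database/{}/table/{}?user.name={}".format(
        base_url, database, table, user
    )
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical result set: top pages per day, produced by an earlier job.
request = create_table_request(
    "http://localhost:50111",
    "default",
    "daily_top_pages",
    "analyst",
    [("page", "string"), ("views", "bigint")],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it to a running server. Once the table exists, other users and tools can find it through the same metadata layer.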
3. Systems Integration
Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools. Hadoop should serve as input into your analytics platform or integrate with your operational data stores and web applications. The organization should enjoy the value of Hadoop without having to learn an entirely new toolset. REST services as provided by Templeton open up the platform to the enterprise with a familiar API and SQL-like language.
HCatalog represents the next logical extension for enterprise-ready Hadoop. And yes, Curt, it does matter... a LOT!
About the Authors
Alan Gates is a co-founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Gates also designed HCatalog and guided its adoption as an Apache Incubator project. Gates has a BS in Mathematics from Oregon State University and an MA in Theology from Fuller Theological Seminary. He is also the author of Programming Pig, a book from O’Reilly Press. Follow Gates on Twitter: @alanfgates.
Russell Jurney cut his data teeth in casino gaming, building web apps to analyze the performance of slot machines in the US and Mexico. Russell is the author of Agile Data (O'Reilly, March 2013). After dabbling in entrepreneurship, interactive media and journalism, he moved to Silicon Valley to build analytics applications at scale at Ning and LinkedIn. Now he preaches Hadoop at Hortonworks. He lives on a cliff above the ocean in Pacifica, California with his wife Kate and two fuzzy dogs.