
YARN Brings New Capabilities To Hadoop


Hadoop 2 is now Generally Available, with YARN bringing the ability to build data-processing applications that run natively in Hadoop.

YARN opens up Hadoop to data processing beyond MapReduce by separating the concerns of cluster resource management from data processing. This makes many new projects possible. Projects such as Stinger and Tez focus on achieving human-interactive response times for certain scenarios, while Storm focuses on stream processing. Spring has already announced the Spring YARN framework for Java developers who want to write their own YARN applications. By leveraging Hadoop's storage and cluster-management platform, data-processing applications can now let users interact with data in multiple ways.
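For developers who want to go beyond MapReduce, the Hadoop 2 client libraries expose the ResourceManager directly. The sketch below is a minimal illustration, not part of Spring YARN or any of the projects above: it uses the YarnClient API to connect to a cluster and list the applications currently running on it (the class name ListYarnApps is our own).

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml from the classpath to find the ResourceManager.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for all applications it knows about,
        // regardless of which framework (MapReduce, Tez, Storm, ...) submitted them.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "\t"
                    + report.getName() + "\t"
                    + report.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}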

We spoke to Rohit Bakhshi, product manager at Hortonworks, about YARN and what it means for Hadoop users.

Rohit shared a glimpse of what YARN enables – 

Hadoop’s momentum has continued and many more enterprises (not just web scale companies) want to store ALL incoming data in Hadoop, and then enable their users to interact with it in a host of different ways: batch, interactive, analyzing data streams as they arrive, and more. And most importantly, they need to be able to do this all simultaneously without any single application or query consuming all of the resources of the cluster to do so.

By turning Apache Hadoop 2.0 into a multi-application data system, YARN enables the Hadoop community to address a generation of new requirements IN Hadoop. YARN responds to these enterprise challenges by addressing the actual requirements at a foundational level rather than being commercial bolt-ons that complicate the environment for customers.

Going forward, enterprises will be able to deploy multi-tenant, multi-purpose Hadoop clusters that meet SLAs across different organizations and application frameworks.

YARN offers binary-level compatibility for applications using the mapred APIs, but only source-level compatibility for applications written against the mapreduce APIs in Hadoop 1.x. Rohit clarified what this means –

In Hadoop 2.0, clients will submit MapReduce applications to the MapReduce v2 framework that runs on YARN. In Hadoop 1.0, clients submitted applications to MapReduce v1.

These APIs refer to the MapReduce framework available to developers to create MapReduce applications. The org.apache.hadoop.mapred API is the original API and is most widely used in creating MapReduce applications. Any MapReduce v1 applications developed with this API can be submitted and executed in MapReduce v2 on YARN. There is no change needed to the MapReduce application in this case.

The org.apache.hadoop.mapreduce APIs are the newer set of APIs for the MapReduce framework. These APIs are not binary compatible between MapReduce v1 and MapReduce v2 on YARN. Existing MapReduce v1 applications that utilize these APIs will need to be recompiled against the Hadoop 2.x jars. Upon recompilation, they can be submitted and executed in MapReduce v2 on YARN.

This is further explained in detail here.
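To make the distinction concrete, here is an illustrative word-count mapper (our own sketch, not taken from the article) written twice: once against the original org.apache.hadoop.mapred API, whose existing binaries run on MapReduce v2 without changes, and once against the newer org.apache.hadoop.mapreduce API, where the same source carries over but must be recompiled against the Hadoop 2.x jars.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// 1) Original org.apache.hadoop.mapred API: binaries built against this
//    interface on Hadoop 1.x can be submitted to MapReduce v2 on YARN as-is.
class OldApiWordCountMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
                    org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            output.collect(word, ONE);
        }
    }
}

// 2) Newer org.apache.hadoop.mapreduce API: the same source compiles on
//    Hadoop 2.x, but Hadoop 1.x binaries must be rebuilt against the
//    Hadoop 2.x jars before running on MapReduce v2 on YARN.
class NewApiWordCountMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}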

Upgrading an existing Hadoop cluster is also supposed to be straightforward –

Hadoop and HDP (including all the related Apache Hadoop components) support an "in-place" upgrade from HDP 1.3 (Hadoop 1.x) to HDP 2.0 (Hadoop 2.x). All existing data is maintained, and metadata is upgraded in place and does not need to be migrated. Configurations have evolved from HDP 1.3 to HDP 2.0, and there will be deprecated configuration properties and new configuration properties. Existing configurations in HDP 1.3 will need to be migrated to HDP 2.0.
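As a small, hedged illustration of the kind of property change involved (hostnames below are placeholders, and in practice these values live in core-site.xml, mapred-site.xml and yarn-site.xml rather than in code), a few names that differ between Hadoop 1.x and Hadoop 2.x can be seen through the Configuration API:

import org.apache.hadoop.conf.Configuration;

// Illustrative only: a handful of Hadoop 1.x property names that are
// deprecated or replaced in Hadoop 2.x.
public class Hadoop2ConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Hadoop 1.x: fs.default.name  ->  Hadoop 2.x: fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // New in Hadoop 2.x: selects the MapReduce runtime; "yarn" means MRv2 on YARN.
        conf.set("mapreduce.framework.name", "yarn");

        // Hadoop 1.x: mapred.job.tracker is gone; clients talk to the YARN ResourceManager instead.
        conf.set("yarn.resourcemanager.hostname", "resourcemanager.example.com");

        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
    }
}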

When asked whether he would worry about companies using Hadoop prematurely on smaller datasets, Rohit replied that he sees it differently –

Hadoop is used in a variety of ways and because it is open source, we see all types of usage. I wouldn't consider this use 'premature'; in fact, many organizations will start with a small cluster of just a few nodes and several terabytes, but eventually these environments grow and grow until they result in a data lake and provide a modern data architecture. Small clusters are not 'premature' – they are seeds.

You can read more about the new release in the official announcement.
