Interview with Arun Murthy on Apache YARN
Apache Hadoop YARN, a new Hadoop resource manager, has just been promoted to a full Hadoop subproject. InfoQ had the chance to discuss YARN with Arun Murthy, founder and architect at Hortonworks.
It was recently announced that Apache Hadoop YARN (Yet Another Resource Negotiator) has become a subproject of Apache Hadoop in the ASF.
“Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the subprojects of Apache Hadoop which, itself, is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and now is poised to stand up on its own as a subproject of Hadoop.”
The initial motivation for YARN was to fix major deficiencies of the MapReduce implementation by improving scalability (support for clusters of 10,000 nodes and 200,000 cores), reliability and cluster utilization.
YARN fulfills these requirements by splitting the two major functions of the Job Tracker, resource management and job scheduling/monitoring, into two separate daemons: a global Resource Manager (RM) and a per-application Application Master (AM). An application is either a single job in the classical MapReduce sense or a directed acyclic graph (DAG) of jobs (compare to Oozie).
Like everything else in Hadoop, YARN’s resource management and execution framework follows a master/slave paradigm: a Node Manager (NM) runs on and monitors every node and reports resource availability to the Resource Manager, which is the ultimate authority that arbitrates resources among all the applications in the system (compare to the HDFS architecture).
Execution of a specific application is controlled by its Application Master, which is responsible for splitting the application into multiple tasks and negotiating with the Resource Manager for execution resources. Once a resource is allocated, the Application Master works with the Node Manager(s) to place, execute and monitor the application’s individual tasks.
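The division of labor described above can be sketched as a toy model. This is a minimal illustration, not the YARN API: the class names mirror the roles in the text, while the method names (`allocate`, `launch`, `run`), the memory-only accounting and the first-fit placement are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: tracks free memory and runs the tasks placed on it."""
    name: str
    free_mb: int
    running: list = field(default_factory=list)

    def launch(self, task: str, mb: int) -> None:
        self.free_mb -= mb
        self.running.append(task)


class ResourceManager:
    """Global arbiter: sees every node's free capacity and grants resources."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, mb: int):
        # First-fit (an assumption): return the first node with enough
        # free memory, or None if the cluster is currently full.
        for node in self.nodes:
            if node.free_mb >= mb:
                return node
        return None


class ApplicationMaster:
    """Per-application coordinator: splits the work and negotiates resources."""

    def __init__(self, rm, tasks, mb_per_task):
        self.rm = rm
        self.tasks = list(tasks)
        self.mb_per_task = mb_per_task

    def run(self):
        placed, pending = {}, []
        for task in self.tasks:
            node = self.rm.allocate(self.mb_per_task)
            if node is None:
                pending.append(task)  # a real system would ask again later
            else:
                node.launch(task, self.mb_per_task)
                placed[task] = node.name
        return placed, pending


# Hypothetical two-node cluster and a four-task application.
nodes = [NodeManager("node1", 2048), NodeManager("node2", 1024)]
am = ApplicationMaster(ResourceManager(nodes), ["map-0", "map-1", "map-2", "map-3"], 1024)
placed, pending = am.run()
```

In this sketch the Resource Manager never learns what a task does; it only hands out capacity, while all application-specific logic stays in the Application Master.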
The YARN driver uses an 'Application Submission Client' to submit an 'Application' to the YARN Resource Manager. Clients use the 'ClientRMProtocol' to first acquire a new 'ApplicationId' and then submit the 'Application' to be run. The application submission contains the information describing the Unix process(es) that need to be launched for the Application Master. This information includes details about the local files/jars that need to be available for your application to run, the actual command that needs to be executed, any Unix environment settings, etc. The details of writing a YARN driver can be found here.
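The two-step submission flow (acquire an id, then submit a description of the process to launch) can be sketched as follows. This is an in-memory stand-in, not the Hadoop client API: `SubmissionContext`, `FakeResourceManager`, their method names, and the example command, jar path and environment values are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class SubmissionContext:
    """What the client tells the Resource Manager about the process to launch."""
    application_id: int
    command: list                                        # command line to execute
    local_resources: dict = field(default_factory=dict)  # name -> file/jar path
    environment: dict = field(default_factory=dict)      # Unix environment settings


class FakeResourceManager:
    """In-memory stand-in for the Resource Manager side of the client protocol."""

    def __init__(self):
        self._ids = count(1)
        self._issued = set()
        self.submitted = {}

    def get_new_application_id(self) -> int:
        app_id = next(self._ids)
        self._issued.add(app_id)
        return app_id

    def submit_application(self, ctx: SubmissionContext) -> None:
        if ctx.application_id not in self._issued:
            raise ValueError("unknown application id")
        if ctx.application_id in self.submitted:
            raise ValueError("application already submitted")
        if not ctx.command:
            raise ValueError("a launch command is required")
        self.submitted[ctx.application_id] = ctx


# Step 1: acquire a new application id. Step 2: submit the application.
rm = FakeResourceManager()
app_id = rm.get_new_application_id()
ctx = SubmissionContext(
    application_id=app_id,
    command=["java", "-Xmx256m", "MyAppMaster"],   # hypothetical command
    local_resources={"app.jar": "/tmp/app.jar"},   # hypothetical path
    environment={"CLASSPATH": "./app.jar"},
)
rm.submit_application(ctx)
```

Separating id acquisition from submission lets the client refer to the application by id before anything has been launched.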
The important thing to notice about YARN is that it does not change the actual MapReduce programming model or the APIs used by application developers. What it provides is a new resource management model and implementation, which is used to execute MapReduce tasks. As a result, in the simplest case, existing MapReduce applications will work as is (after recompiling), but YARN will allow developers to specify execution parameters more precisely.
Alternatively, YARN can be used for the creation of new frameworks and execution models (in addition to MapReduce) that can leverage both the compute power of a Hadoop cluster and its rich data storage models to solve specific new classes of problems. Such new frameworks can leverage YARN’s resource management and provide a new implementation of the Application Master. Moreover, this architecture allows multiple Application Masters to coexist, sharing the same Hadoop cluster and the data residing on it.
InfoQ had the chance to talk with Arun Murthy, founder and architect at Hortonworks, about YARN and its future:
InfoQ: So far, Hadoop has been very different from many other things, especially application servers, in limiting the number of layers. This “thinness” is what makes Hadoop execution so fast. Will the addition of extra layers (containers) start slowing down Hadoop-based execution?
Arun: Not really, I wouldn’t think so. YARN is all about moving distinct functionality to distinct services (or daemons if you will)… for example, global cluster resource management is now purely the function of the ResourceManager, which is very distinct from application life-cycle management, which is the domain of the new ApplicationMaster.
As a result, the services and responsibilities are even simpler and easier to scale.
In fact, performance for MR applications running in YARN is significantly better already and there are considerably more opportunities for further optimizations. Some details can be found here.
InfoQ: One of the strongest features of Hadoop and MapReduce has been their simplicity and strict separation of concerns between application developers and the framework support. When I look at the examples of YARN APIs, they are anything but simple and require a good understanding of many system components. Are you afraid that this complexity will limit the attractiveness of YARN?
Arun: We’ve strived very hard to keep the YARN system itself (i.e. ResourceManager & NodeManager) very simple, as you noticed. This allows for significant scaling (we’ve simulated 10,000 node clusters) and performance. Keeping this interface bare or raw, so to speak, allows for high-performance applications, which will be written by strong technical communities. Also, the target audience for YARN itself is slightly different, especially compared to the target audience for MapReduce applications. As a result we expect more ‘simple’ APIs like MapReduce and MPI to flourish on YARN, thereby continuing to provide end-users with simple interfaces.
YARN offers simple, full-featured APIs for application developers, while MR provides a simple API for end-users.
InfoQ: In the current Hadoop implementations we are rolling out a lot of custom InputFormat implementations in order to better utilize the cluster. Are there any limitations on this flexibility imposed by YARN?
Arun: YARN does not impose any changes on the MapReduce execution. As a result everything that is currently implemented in MapReduce will continue to work.
InfoQ: Are you planning any direct support for C++ Mapper/Reducer implementation? Current Java-centric APIs force everyone to use JNI for any serious C++ calculations, which is not the most convenient approach.
Arun: My sense is that if we can attract a wide community on top of YARN, we will see more folks pick up the baton, so to speak, and we can hope to see other alternatives. That is one of the key goals of YARN, i.e. to allow for innovation on top of the YARN system without the need for the core Hadoop community itself to deliver all possible innovation…
InfoQ: It’s my understanding that YARN is still in beta status. When do you expect it to be production ready?
Arun: Indications are very good from my vantage point. My sense is that late this year, or early next year, is when we can confidently say YARN, and hadoop-2.x itself including HDFS HA, will be production grade. It’s exciting that we are very close! Particularly since I’ve personally spent nearly 2 years in the YARN-shaped cave developing this! :)
InfoQ: What additional resource measures (in addition to memory) are you planning to introduce and when?
Arun: Glad you asked! I already have a patch for adding multi-resource scheduling (currently, as you said, it’s only memory) to YARN.
I should commit that soon! This will allow for scheduling Memory and CPU. After that we will be in a position to add more over time trivially… e.g. disk or network I/O, GPUs etc.
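To illustrate what scheduling on memory and CPU together involves, here is a minimal sketch. It is not the patch mentioned in the interview: the `Resource` type, the `vcores` field and the greedy first-come, first-served policy are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int  # hypothetical unit of CPU capacity

    def fits_in(self, other: "Resource") -> bool:
        # A request fits only if it fits in every dimension.
        return self.memory_mb <= other.memory_mb and self.vcores <= other.vcores

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.memory_mb - other.memory_mb, self.vcores - other.vcores)


def schedule(requests, capacity):
    """Greedy first-come, first-served: grant each request that fits, in order."""
    free, granted = capacity, []
    for name, need in requests:
        if need.fits_in(free):
            free = free.minus(need)
            granted.append(name)
    return granted, free


# Hypothetical node with 8 GB of memory and 4 cores.
capacity = Resource(memory_mb=8192, vcores=4)
requests = [
    ("a", Resource(4096, 1)),
    ("b", Resource(2048, 4)),  # enough memory left, but not enough cores
    ("c", Resource(2048, 2)),
]
granted, free = schedule(requests, capacity)
```

A memory-only scheduler would have granted request "b"; with two dimensions it is refused because only three cores remain. Under this design, adding disk or network I/O would mean adding a field and one more comparison in `fits_in`.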
InfoQ: Which additional frameworks (besides MapReduce) are you currently planning to bring into YARN? Specifically, are there any plans to add an application manager for Apache Hama?
Arun: Oh, there are several open-source projects in various stages of getting ported to YARN. Specifically w.r.t. Hama, I believe the Hama community has already done the necessary work to get Hama working within YARN in trunk (HAMA-431) and we should see a release very soon!
Having said that, our aim, as the YARN folks, is to encourage other communities to port over to YARN, not necessarily do the work ourselves! :)
I believe that opening up Apache Hadoop via YARN will be a significant driver for further innovation in the Big Data Hadoop ecosystem. YARN is particularly attractive since it has the following main strengths:
- It’s co-designed, co-developed and released with HDFS. It thereby opens up all (hundreds of PBs) of the data on HDFS to various applications.
- YARN is also the only large-scale, general-purpose resource management framework I know of which specifically aims at solving the data topology problem which is key for Big Data applications like MapReduce. This allows applications to be simpler and more efficient when it comes to crunching terabytes and petabytes of data – happy to talk more about this specifically.
- Since it’s the processing arm of Hadoop now (HDFS: storage, YARN: processing), it will be present in lots of datacenters, thereby making it fairly ubiquitous and attractive for current and future developers as a core, open-source platform.
- It’s Hadoop scale, i.e. it will work on thousands of nodes efficiently, which is another key requirement for Big Data.
- YARN itself handles the really tough problems of resource management (where are the free resources in the cluster, who should I allocate them to), fault tolerance (which nodes are up or down etc.) and scaling. This frees up application framework developers to work on their framework rather than worry about the nitty-gritty details. Essentially, it’s the same MapReduce story (i.e. MR is a simple API for end-users), but repeated for application-framework developers.
The emerging YARN use-cases in open-source:
Alternate programming paradigms to MapReduce which are being worked on to integrate with YARN:
- UCB Spark
- Giraph (graph processing based on Google Pregel)
InfoQ: What does Apache Hadoop YARN mean for the Hadoop ecosystem?
Arun: In many ways it's a signal from the Hadoop community that YARN opens up Hadoop beyond MapReduce and that we are confident that we can support other communities who want to base their projects on YARN.
YARN, much like MapReduce before it, does all the heavy lifting of resource management, cluster management, fault tolerance, scheduling etc. and allows the target communities to focus on the details of the application. Several such communities, including MPI, Apache Giraph, Apache Hama, Spark and many more, see the value and are in various stages of integrating with YARN. Furthermore, it will open up the whole Hadoop community to them too, i.e. allow them access to data which is already in HDFS: a win-win for both Hadoop and those communities.
About the Interviewee
Arun C. Murthy is VP, Apache Hadoop at the Apache Software Foundation, i.e. Chair of the Apache Hadoop PMC, and has been a full-time contributor to Hadoop since the project’s inception in 2006. He is also the lead of the MapReduce project and has focused on building NextGen MapReduce (YARN). Prior to co-founding Hortonworks, Arun was responsible for all MapReduce code and configuration deployed across the 42,000+ servers at Yahoo!. In essence, he was responsible for running Apache Hadoop’s MapReduce as a service for Yahoo!. Also, he jointly holds the current world sorting record using Apache Hadoop. Follow Arun on Twitter: @acmurthy