Arun Murthy, the Yahoo! Hadoop Map-Reduce development team lead, recently announced and presented a redesign of the core map-reduce architecture for Hadoop to allow for easier upgrades, larger clusters, fast recovery, and support for programming paradigms beyond map-reduce. The redesign splits the engine into a resource manager that supports a variety of cluster computing paradigms and makes map-reduce a user library, allowing an organization to run multiple versions of map-reduce code in the same cluster. The new design is quite similar to the open source Mesos cluster management project; both the Yahoo! and Mesos teams commented on the differences and opportunities.
The main benefits of the new approach are:
- Scalability: to support clusters of 6,000 machines (each with 16 cores, 48 GB of RAM, and 48 TB of disk), 100,000 concurrent tasks, and 10,000 concurrent jobs
- Availability: the JobTracker is currently a single point of failure, and upgrades require stopping and upgrading the whole cluster
- Agility: the new design makes map-reduce a user library, allowing jobs that require different map-reduce versions to run on the same cluster
- Lower Latency: the new design should allow for faster response, especially for smaller-scale tasks
- Better Utilization: a more sophisticated resource and scheduling model reduces wasted resources without introducing contention
- Alternative Programming Models: Murthy noted there is a lot of demand for other paradigms at Yahoo!, such as MPI
The core element of this redesign is to split responsibilities into a generic cluster resource-management system and a separate application master for map-reduce, or indeed any other programming model; together these replace the JobTracker and TaskTracker. The resource-management system consists of the following cluster-wide controllers (a code sketch of this division of labor follows the figure below):
- a ResourceManager, which performs cluster-wide scheduling of resources such as memory, CPU, disk, network, etc.
- a Scheduler plug-in, which can implement different policies for the ResourceManager (akin to the current scheduler API, but with a different interface and requiring new implementations)
- one ApplicationMaster per application (such as a map-reduce job), which requests resources, tracks progress, handles failures, and can keep state about computations
Next Generation MapReduce Architecture. Source: Yahoo
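Since the code had not yet been released at the time of writing, the following minimal Java sketch only illustrates the division of labor the design describes; all class, method, and node names are invented for illustration and are not actual Hadoop APIs. The point it tries to capture is that the ResourceManager arbitrates generic resources and knows nothing about maps or reduces, while all map-reduce-specific logic lives in the per-application master.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ResourceManager/ApplicationMaster split.
// All names are invented for illustration, not actual Hadoop APIs.
public class ResourceManagerSketch {

    // A request for a bundle of node resources, replacing fixed map/reduce slots.
    static class ResourceRequest {
        final int memoryMb;
        final int cores;
        ResourceRequest(int memoryMb, int cores) {
            this.memoryMb = memoryMb;
            this.cores = cores;
        }
    }

    // A granted container: a lease on resources on a specific node.
    static class Container {
        final String node;
        final ResourceRequest grant;
        Container(String node, ResourceRequest grant) {
            this.node = node;
            this.grant = grant;
        }
    }

    // The cluster-wide ResourceManager only arbitrates resources; it knows
    // nothing about maps, reduces, or any other programming model.
    static class ResourceManager {
        Container allocate(ResourceRequest req) {
            // A real Scheduler plug-in would pick the node by policy
            // (capacity, fairness, locality); here one is hard-coded.
            return new Container("worker-01", req);
        }
    }

    // The per-application master tracks its own tasks and failures, and asks
    // the ResourceManager only for raw resources.
    static class MapReduceApplicationMaster {
        final ResourceManager rm;
        final List<Container> running = new ArrayList<>();
        MapReduceApplicationMaster(ResourceManager rm) { this.rm = rm; }

        void runMapTask() {
            // Model the task's actual needs instead of a fixed "map slot".
            Container c = rm.allocate(new ResourceRequest(1024, 1));
            running.add(c);
            System.out.println("Launched map task in a container on " + c.node);
        }
    }

    public static void main(String[] args) {
        MapReduceApplicationMaster am =
                new MapReduceApplicationMaster(new ResourceManager());
        am.runMapTask();
    }
}
```

Because nothing in the ResourceManager is map-reduce-specific, an MPI or Master-Worker framework could implement its own application master against the same allocation call, which is the agility the design is after.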
Distributed across the worker nodes, there will be:
- a shared NodeManager, which gives access to worker node resources (e.g., by authenticating requests and starting tasks)
- per-application Containers (akin to tasks), which use local resources to perform computations
The design also provides:
- high availability, using ZooKeeper to save cluster state and allow fast failover to a backup ResourceManager. Any ApplicationMaster can checkpoint state into HDFS; for example, the map-reduce ApplicationMaster will save state to allow quick recovery if it crashes (see the sketch after this list).
- backwards compatibility through well-defined wire protocols, allowing even the ResourceManager and NodeManagers to be updated incrementally in a cluster, as well as running different versions of map-reduce or other programming paradigms concurrently. Murthy noted that Yahoo! researchers often run MPI, Master-Worker, and iterative models, and that this will allow for innovations in programming models like the Hadoop Online Prototype.
- better utilization, replacing fixed map and reduce slots with explicit modeling of underlying resources such as disk and CPU, to avoid waste and contention.
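To make the checkpoint-and-recover idea concrete, here is a minimal, hypothetical Java sketch of an ApplicationMaster persisting its completed-task list; the names are invented, and a local file stands in for HDFS, where a real master would write:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of ApplicationMaster checkpointing; a local file
// stands in for HDFS, and all names are invented for illustration.
public class CheckpointSketch {

    static final Path CHECKPOINT = Path.of("mr-app-master.ckpt");

    // Record which tasks have completed so a restarted master can
    // skip them instead of re-running the whole job.
    static void checkpoint(List<String> completedTasks) throws IOException {
        Files.write(CHECKPOINT, completedTasks);
    }

    // On restart, recover the completed-task list from the checkpoint.
    static List<String> recover() throws IOException {
        return Files.exists(CHECKPOINT)
                ? Files.readAllLines(CHECKPOINT)
                : List.of();
    }

    public static void main(String[] args) throws IOException {
        checkpoint(List.of("map-0001", "map-0002"));
        // Simulate a crashed and restarted ApplicationMaster.
        System.out.println("Recovered completed tasks: " + recover());
    }
}
```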
In a follow-up discussion Murthy said that the system will not use Linux containers to enforce resource limits, since their tests showed the overhead was too high, but that it will use Linux cgroups to sandbox processes, ensuring container processes adhere to the limits they requested. The system will allow locality-sensitive requests (e.g., to run tasks on a machine or a rack near a file), to maintain locality in map-reduce jobs; this is also desirable for other distributed programming paradigms. To support the map-reduce shuffle phase, the NodeManager will allow remote tasks to issue requests to read data on its local disks. Initially, reduce containers will ask each NodeManager in a single request for the outputs of all map tasks that ran on that node, to improve efficiency. Murthy mentioned that in the future they want to add a coalescing operation that will sort together all the map task data on a single node, and possibly on a single rack, to further improve the efficiency of shuffles.
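The request format was not public at the time of writing, but the following hypothetical Java sketch shows the shape a locality-sensitive request might take: a scheduler that prefers the node holding the data, falls back to the same rack, and only then goes off-rack. The names and the scoring scheme are invented for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of locality-sensitive scheduling; all names and the
// scoring scheme are invented for illustration, not actual Hadoop APIs.
public class LocalitySketch {

    // Preference order: a specific host, then its rack, then anywhere.
    record LocalityRequest(String preferredHost, String preferredRack) {}

    record Node(String host, String rack, boolean hasFreeContainer) {}

    // Pick the best available node, relaxing locality from host to rack
    // to any node, as a Scheduler plug-in might.
    static Optional<Node> schedule(LocalityRequest req, List<Node> nodes) {
        return nodes.stream()
                .filter(Node::hasFreeContainer)
                .min(Comparator.comparingInt(n -> score(req, n)));
    }

    static int score(LocalityRequest req, Node n) {
        if (n.host().equals(req.preferredHost())) return 0; // node-local
        if (n.rack().equals(req.preferredRack())) return 1; // rack-local
        return 2;                                           // off-rack
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("worker-01", "rack-a", false),
                new Node("worker-02", "rack-a", true),
                new Node("worker-09", "rack-c", true));
        // The preferred host is busy, so the scheduler falls back to
        // another node on the same rack: worker-02.
        schedule(new LocalityRequest("worker-01", "rack-a"), cluster)
                .ifPresent(n -> System.out.println("Scheduled on " + n.host()));
    }
}
```

Relaxing locality in this host-then-rack-then-anywhere order is what keeps map inputs close to their HDFS blocks while still letting the cluster make progress when the preferred nodes are busy.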
Murthy noted that the technology is currently at the prototype stage. Yahoo! is eager to deploy it this year to allow larger clusters, although this depends on internal processes for releasing the code and on coordinating an Apache Hadoop release. After Murthy presented the technology at the February Bay Area Hadoop User Group meeting, Ted Dunning of MapR Technologies asked whether Mesos doesn't already do all of this. Murthy responded that running map-reduce on Mesos still needs both a JobTracker and a TaskTracker, whereas this implementation replaces the need for either. We asked the Mesos team to comment; Matei Zaharia responded:
I think it would be best to take advantage of this refactoring of MapReduce to let it support multiple resource managers -- not just for Mesos's sake but also because people naturally want to run Hadoop on things like Sun Grid Engine, LSF, Condor, etc. Resource scheduling for generic applications is a much harder problem than implementing MapReduce, and there's likely to be a lot of experimentation and different systems for this. Yahoo's refactoring will likely make it easier to run Hadoop on Mesos... In any case, we will remain committed to supporting Hadoop on our end.
...We plan on continuing to grow Mesos into a powerful system. One advantage that I think we'll have in this respect is that Mesos is continuing to be built, from the ground up, to run applications beyond just data processing. At Twitter we are running more and more general-purpose "server"-like applications every day.
In response to the question about how the proposed Hadoop architecture differs from Mesos, Zaharia said:
...one of the major differences is state management in the Mesos master. In Mesos, we have designed the system to let the applications do their own scheduling, and have the master contain only soft state about the currently active applications and running tasks. In particular, the master doesn't need to know about pending tasks that the application hasn't launched yet. ... In the Hadoop next generation design, as far as I understand, the master keeps a list of all pending tasks in ZooKeeper, which leads to more state to manage. Apart from this technical issue, Mesos has been designed to support non-MapReduce frameworks from the beginning, and we have been using it for MPI, long running services, and our in-memory parallel processing system called Spark. One last difference from a practical point of view is that Mesos can be used with Hadoop 0.20 (with a patch) instead of requiring one to wait for a new Hadoop release.
InfoQ spoke to Eric Baldeschwieler, Yahoo!'s VP of Hadoop Development, to understand more about the project's origins and its differences from Mesos. Baldeschwieler said there are many projects that provide cluster scheduling capabilities, including Mesos. However, Yahoo! felt that none of the existing options was designed to solve the map-reduce problem as they conceive of it. He noted that some of the options were new and immature, while those that were mature were missing key features. He commented that the Next Generation MapReduce design reflects years of experience running the largest Hadoop installation in the world, and is a natural evolution as Yahoo! works incrementally to make map-reduce better.
With respect to Mesos, Baldeschwieler said that it is good work, but that Yahoo! wanted an integrated component of Hadoop, not a meta-layer. He also noted that the Next Generation MapReduce architecture has been in design for years, that the Yahoo! team has been collaborating with Mesos team members throughout the design process, and that it is watching their work with interest. Baldeschwieler said he would welcome contributions from the Mesos team to the Hadoop project and would be glad to help them better integrate with Hadoop. He noted that in the past Yahoo! has adopted externally developed Hadoop technologies that proved their value by building a community, such as HBase and Hive.
It will be interesting to watch the dynamics as the Hadoop ecosystem evolves toward a world of resource schedulers that support multiple programming paradigms while addressing greater scalability, reliability, and efficiency.