Jim Scott’s new article, A tale of two clusters: Mesos and YARN, starts by describing a situation that is fairly common in many IT shops today, multiple resource islands:
The first cluster is an Apache Hadoop cluster. This is an island whose resources are completely dedicated to Hadoop and its processes. The second cluster is all resources that are not a part of the Hadoop cluster.
This situation stems from the fact that Hadoop manages its own resources with Apache YARN. Although YARN works great for Hadoop clusters, its usage for any non Big Data applications is very limited.
As Scott explains in his article, the issue here is in the way scheduling is implemented in YARN:
When a job request comes into the YARN resource manager, YARN evaluates all the resources available, and it places the job. It’s the one making the decision where jobs should go… YARN is optimized for scheduling Hadoop jobs, which are historically (and still typically) batch jobs with long run times. This means that YARN was not designed for long-running services, nor for short-lived interactive queries…, and while it’s possible to have it schedule other kinds of workloads, this is not an ideal model.
A different scheduling model is implemented by Apache Mesos, which:
… uses a two-level scheduling mechanism where resource offers are made to frameworks (applications that run on top of Mesos). The Mesos master node decides how many resources to offer each framework, while each framework determines the resources it accepts and what application to execute on those resources. This method of resource allocation allows near-optimal data locality when sharing a cluster of nodes amongst diverse frameworks.
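The two-level model described above can be illustrated with a small toy simulation. This is only a sketch under simplifying assumptions and does not use the Mesos API: the `Offer` and `Framework` classes, the `master_offer_round` function, and all node names and sizes are hypothetical. The master (first level) decides which framework sees which offer, and each framework (second level) decides whether to accept it.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Hypothetical resource offer: the free resources on one node."""
    node: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """Hypothetical framework that launches identical tasks."""
    name: str
    task_cpus: float
    task_mem_mb: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, offer: Offer) -> bool:
        # Second level: the framework, not the master, decides
        # whether the offered resources are worth accepting.
        fits = offer.cpus >= self.task_cpus and offer.mem_mb >= self.task_mem_mb
        if self.pending_tasks > 0 and fits:
            self.pending_tasks -= 1
            self.launched.append(offer.node)
            return True
        return False


def master_offer_round(free: dict, frameworks: list) -> None:
    """First level: the master offers each node's free resources
    to the frameworks in turn and deducts whatever is accepted."""
    for node in sorted(free):
        for fw in frameworks:
            cpus, mem_mb = free[node]
            if fw.consider(Offer(node, cpus, mem_mb)):
                free[node] = (cpus - fw.task_cpus, mem_mb - fw.task_mem_mb)


# Illustrative numbers only.
free = {"node-a": (4.0, 8192), "node-b": (2.0, 4096)}
web = Framework("web", task_cpus=1.0, task_mem_mb=1024, pending_tasks=2)
batch = Framework("batch", task_cpus=2.0, task_mem_mb=4096, pending_tasks=2)
master_offer_round(free, [web, batch])
print(web.launched, batch.launched, free)
```

In this run the hypothetical `web` framework places one task on each node, while `batch` places one task on `node-a` and declines the leftover offer on `node-b` because it is too small. A real Mesos master also applies a fairness policy when choosing which framework to offer to, which this sketch leaves out.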
In reality, both Mesos and YARN have an important place in IT infrastructure. But as Scott explains, running them side by side leads to resource fragmentation.
Using Mesos and YARN in the same data center, to benefit from both resource managers, currently requires that you create two static partitions. Using both would mean that certain resources would be dedicated to Hadoop for YARN to manage and Mesos would get the rest.
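A short calculation shows why static partitions fragment resources. The function names and all numbers below are hypothetical and chosen only for illustration; capacity and demand are counted in nodes.

```python
def unmet_demand_static(cap_yarn: int, cap_mesos: int,
                        demand_yarn: int, demand_mesos: int) -> int:
    """Unmet demand when each resource manager owns a fixed partition.
    Idle capacity in one partition cannot serve the other."""
    return max(0, demand_yarn - cap_yarn) + max(0, demand_mesos - cap_mesos)


def unmet_demand_shared(total: int, demand_yarn: int, demand_mesos: int) -> int:
    """Unmet demand when both workloads draw from one shared pool."""
    return max(0, demand_yarn + demand_mesos - total)


# Illustrative: 100 nodes split 60/40, with a Hadoop-heavy burst.
print(unmet_demand_static(60, 40, demand_yarn=80, demand_mesos=10))  # 20
print(unmet_demand_shared(100, demand_yarn=80, demand_mesos=10))     # 0
```

With these made-up numbers, the static split leaves 20 nodes' worth of YARN demand unserved even though 30 nodes sit idle in the Mesos partition, while a shared pool covers everything.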
As Scott points out, a new project called Myriad, produced by a collaboration between eBay, MapR, and Mesosphere, allows YARN and Mesos to work harmoniously for the benefit of the enterprise and the data center.
This open source software project is both a Mesos framework and a YARN scheduler that enables Mesos to manage YARN resource requests. When a job comes into YARN, it will schedule it via the Myriad Scheduler, which will match the request to incoming Mesos resource offers. Mesos, in turn, will pass it on to the Mesos worker nodes. The Mesos nodes will then communicate the request to a Myriad executor which is running the YARN node manager. Myriad launches YARN node managers on Mesos resources, which then communicate to the YARN resource manager what resources are available to them. YARN can then consume the resources as it sees fit. Myriad provides a seamless bridge from the pool of resources available in Mesos to the YARN tasks that want those resources.
Myriad enables the unification of resource utilization and management across a data center using Mesos. YARN workloads, in this case, run on a shared cluster and are more dynamic and elastic compared to a standalone YARN cluster. This approach also makes it easy for a data center operations team to expand resources given to YARN (or, take them away) without ever having to reconfigure it.
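The flow described above can be sketched as a toy model. This is not Myriad's actual code or API: the `ResourceManager` and `MyriadLikeBridge` classes, their methods, and the fixed node manager size are all assumptions made for illustration. The bridge accepts Mesos offers that are large enough, starts a node manager on them, and that node manager registers its capacity with the YARN resource manager, which can then place containers as it sees fit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResourceManager:
    """Toy stand-in for the YARN resource manager."""
    node_managers: dict = field(default_factory=dict)  # node -> free memory (MB)

    def register(self, node: str, mem_mb: int) -> None:
        self.node_managers[node] = mem_mb

    def unregister(self, node: str) -> None:
        self.node_managers.pop(node, None)

    def free_mb(self) -> int:
        return sum(self.node_managers.values())

    def place(self, mem_mb: int) -> Optional[str]:
        # YARN decides placement itself, using only registered node managers.
        for node in sorted(self.node_managers):
            if self.node_managers[node] >= mem_mb:
                self.node_managers[node] -= mem_mb
                return node
        return None


class MyriadLikeBridge:
    """Hypothetical bridge: turns accepted Mesos offers into node managers."""

    def __init__(self, rm: ResourceManager, nm_mem_mb: int) -> None:
        self.rm = rm
        self.nm_mem_mb = nm_mem_mb  # assumed fixed size per node manager

    def on_offer(self, node: str, offered_mem_mb: int) -> bool:
        # Decline offers that are too small or nodes that already run one.
        if offered_mem_mb < self.nm_mem_mb or node in self.rm.node_managers:
            return False
        self.rm.register(node, self.nm_mem_mb)
        return True

    def flex_down(self, node: str) -> None:
        # Hand the node back to Mesos by removing its node manager.
        self.rm.unregister(node)


rm = ResourceManager()
bridge = MyriadLikeBridge(rm, nm_mem_mb=4096)
bridge.on_offer("node-a", 8192)   # accepted: a node manager starts on node-a
print(rm.place(2048))             # YARN places a container on node-a
bridge.flex_down("node-a")        # YARN capacity shrinks without reconfiguring YARN
print(rm.free_mb())
```

The point of the sketch is the elasticity: YARN's capacity grows and shrinks as the bridge accepts offers or gives nodes back, while YARN itself keeps making its own placement decisions.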
Community comments
Any workload can run on YARN not just BigData ones.
by Joseph Niemiec,
I completely disagree with this statement "Although YARN works great for Hadoop clusters, its usage for any non Big Data applications is very limited."
YARN has the capacity to deploy ANY workload you can imagine on top of the Hadoop infrastructure including NON-BigData & NON-Distributed workloads, even ones that require local disk usage. It is a misunderstanding of YARN that leads to believing it can only be used for BigData reasons.
For example, memcached, which has pretty much nothing to do with BigData at all, can run on YARN along with any other workload imaginable. If you can run it in Docker you can build it for YARN.
Native YARN Apps
hortonworks.com/blog/how-to-deploy-memcached-on...
Using Slider as a Framework for YARN.
slider.incubator.apache.org/docs/slider_specs/h...
Take a look at Apache Slider - "Apache Slider is a YARN application to deploy existing distributed applications on YARN, monitor them and make them larger or smaller as desired -even while the application is running." Note that it says existing distributed applications, not just Big Data applications, and of course you can run non-distributed applications as well with the Slider framework.