Lessons Learned from Scheduling One Million Containers with HashiCorp Nomad

The open source release of Docker in March 2013 triggered a major shift in the way the software development industry aspires to package and deploy modern applications. Many competing, complementary and supporting container technologies have followed in Docker's wake, and this has led to much hype, and some disillusion, around the space. This article series aims to cut through some of this confusion, and explains how containers are actually being used within the enterprise.

This article series begins with a look into the core technology behind containers and how it is currently being used by developers, and then examines core challenges in deploying containers in the enterprise, such as integrating containerisation into continuous integration and continuous delivery pipelines, and enhancing monitoring to support changing workloads and potential transience. The series concludes with a look at the future of containerisation, and discusses the role unikernels are currently playing within leading-edge organisations.

This InfoQ article is part of the series "Containers in the Real World - Stepping Off the Hype Curve". You can subscribe to receive notifications via RSS.

 

The goal of schedulers such as Nomad, Mesos, and Kubernetes is to allow organizations to deploy more applications at faster rates while increasing resource density. Instead of running just one application per host, a scheduler separates applications from infrastructure so that multiple applications run on each host. This increases the resource utilization across a cluster and saves on infrastructure costs.

An efficient scheduler has three qualities. First, it can service multiple development teams requesting to deploy applications at the same time. Second, it can rapidly place those applications across a global infrastructure fleet. Finally, it can reschedule applications from failed nodes onto healthy nodes, providing cluster-wide application availability management and self-healing.

HashiCorp's Million Container Challenge is a test of how efficiently its scheduler, Nomad, can schedule one million containers across 5,000 hosts. The goal of the challenge is to observe and optimize the behavior of Nomad at exceptional scale. Modern enterprises run datacenters at unprecedented scale, and it's essential for tools to match the needs of their consumers. By the end of the testing and performance optimizations, Nomad was able to schedule one million containers across 5,000 hosts in Google Cloud in under five minutes. This article outlines the lessons learned from scheduling one million containers.

Optimistically concurrent schedulers can linearly scale scheduling throughput

Schedulers generally fall into one of three categories: monolithic, offer-based, or shared state. Monolithic schedulers have a single, centralized location for scheduling logic, often bound to a single machine. Offer-based schedulers (such as Mesos) also have a single location for scheduling decisions, but can parallelize by offering resources to multiple frameworks that each run their own tasks. Shared state schedulers have multiple locations processing scheduling decisions; unified state is achieved through concurrency control, allowing scheduling to be done in parallel.

Nomad is an optimistically concurrent shared state scheduler, meaning all servers participate in making scheduling decisions in parallel. The leader provides the additional coordination necessary to do this safely and to ensure clients are not oversubscribed. In the Million Container Challenge, five Nomad servers ran on n1-standard-32 instances in Google Cloud, for a total of 160 CPUs in the Nomad server cluster. The leader runs scheduling workers on only half of its cores, so in total 140 scheduling decisions could be made in parallel. This parallelism enabled Nomad to schedule approximately 3,750 containers per second for the full five minutes of the challenge.

A note on terminology before explaining Nomad's design: a job is a declaration of work submitted to the scheduler, and a job is composed of many tasks. A task is an application to run, which in this case is a Docker container running a simple Go service.

Nomad's optimistically concurrent design is inspired by Google's Omega, which introduced parallelism, shared state, and lock-free optimistic concurrency control to solve challenges Google faced with its earlier-generation scheduler, Borg. Parallelism allows a scheduler to scale both in finding placements for large numbers of tasks (containers or applications) and in servicing many development teams. Multiple development teams can submit jobs at the same time, and the scheduler is able to place the jobs' tasks in parallel. With schedulers that cannot schedule in parallel, when one team submits a job it blocks other teams from scheduling jobs at the same time. For companies at Google scale, in terms of both developers submitting jobs and machines receiving them, this type of parallel design is essential. Nomad benefits from this existing modern research and brings a state-of-the-art scheduler to the open source community.
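
To make the idea concrete, here is a minimal sketch of optimistic concurrency applied to placement, written in Go. It is not Nomad's implementation; it is a hypothetical model in which many workers plan placements in parallel against shared cluster state, and a single coordinator accepts or rejects each proposed plan rather than forcing workers to hold a global scheduling lock.

package main

import (
	"fmt"
	"sync"
)

// node is a simplified view of a client machine's remaining capacity.
type node struct {
	freeCPU int // MHz
}

// plan is a worker's proposed placement, computed in parallel against a
// snapshot of cluster state: "put a task needing cpuRequired on nodeID".
type plan struct {
	nodeID      int
	cpuRequired int
}

// coordinator stands in for the Nomad leader in this toy model: it
// serializes plan application and rejects any plan whose assumptions no
// longer hold, instead of making workers lock the cluster up front.
type coordinator struct {
	mu    sync.Mutex
	nodes []node
}

// apply accepts a plan only if the target node still has room; a rejected
// worker would simply re-plan against the newer state and retry.
func (c *coordinator) apply(p plan) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.nodes[p.nodeID].freeCPU < p.cpuRequired {
		return false
	}
	c.nodes[p.nodeID].freeCPU -= p.cpuRequired
	return true
}

func main() {
	c := &coordinator{nodes: []node{{freeCPU: 100}, {freeCPU: 100}}}
	var wg sync.WaitGroup

	// Eight workers plan placements concurrently against the shared state.
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			accepted := c.apply(plan{nodeID: id % 2, cpuRequired: 30})
			fmt.Printf("worker %d placement accepted: %v\n", id, accepted)
		}(w)
	}
	wg.Wait()
}

In this toy model each node can hold only three 30 MHz tasks, so some of the eight concurrent proposals are rejected and would be retried; the important property is that accepted plans never oversubscribe a node, even though planning itself is fully parallel.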

To learn more about Nomad's scheduling design, read the comprehensive architecture description on the open source site.

The tradeoffs of bin packing versus spread scheduling algorithms

There are two main scheduling algorithms: bin packing and spread. Nomad uses a bin packing algorithm, which means it tries to utilize all of a node's resources before placing tasks on a different node. Conversely, spread algorithms try to keep utilization even across all nodes. The benefit of bin packing is that it allows for a larger variety of workloads and maximizes resource utilization and density, which ultimately minimizes infrastructure costs. The benefit of spread algorithms is that they distribute risk and reduce the response and recovery effort when a machine dies. If one machine is running only one or two tasks, finding new placements for those tasks is simple; conversely, if a machine running 200 tasks dies, the scheduler must find new placements for all 200 tasks.

With the bin packing algorithm, as more tasks get placed on a node, that node becomes more attractive for future tasks until the machine’s capacity is reached. That ensures that a node will be fully utilized with tasks before the scheduler moves on to filling up a new node. Additionally, when a task that consumes a large amount of resources needs to be placed, it's easier to find a node with available resources. It is also easier to shut down machines that are not being used.
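
To illustrate the tradeoff, the following sketch (a simplification in Go, not Nomad's actual scoring logic) ranks candidate nodes two ways: a bin-packing score that prefers the busiest node that still fits the task, and a spread score that prefers the emptiest node.

package main

import "fmt"

// nodeStats is a hypothetical summary of a node's CPU capacity and usage.
type nodeStats struct {
	name     string
	totalCPU int // MHz
	usedCPU  int // MHz
}

// binPackScore prefers nodes that are already busy: the higher the
// utilization after placement, the higher the score, so tasks fill one
// node before spilling onto the next. Returns -1 if the task does not fit.
func binPackScore(n nodeStats, taskCPU int) float64 {
	if n.usedCPU+taskCPU > n.totalCPU {
		return -1
	}
	return float64(n.usedCPU+taskCPU) / float64(n.totalCPU)
}

// spreadScore prefers mostly idle nodes: the lower the utilization after
// placement, the higher the score, evening load across the fleet.
func spreadScore(n nodeStats, taskCPU int) float64 {
	if n.usedCPU+taskCPU > n.totalCPU {
		return -1
	}
	return 1 - float64(n.usedCPU+taskCPU)/float64(n.totalCPU)
}

func main() {
	nodes := []nodeStats{
		{"node-a", 4000, 3500}, // nearly full
		{"node-b", 4000, 500},  // nearly empty
	}
	const taskCPU = 200
	for _, n := range nodes {
		fmt.Printf("%s  bin-pack=%.2f  spread=%.2f\n",
			n.name, binPackScore(n, taskCPU), spreadScore(n, taskCPU))
	}
	// A bin-packing scheduler picks node-a; a spread scheduler picks node-b.
}

Given a nearly full node and a nearly empty one, the two policies choose opposite targets, which is exactly the density-versus-blast-radius tradeoff described above.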

Simple job specifications make managing job submissions scalable

Being able to place one million containers is just one part of the challenge. The other part is making it easy for developers to submit jobs at that scale. Nomad's job specification is declarative and approachable.

In the example below, the "bench-docker-classlogger" job declares that it wants Nomad to schedule 20 instances of the "classlogger_tg_1" task group, which contains the task "classlogger_1". That task is a Docker container that requires 20 MHz of CPU, 15 MB of memory, and 10 MB of disk. When this job is submitted, Nomad finds 20 placements with those resources available across the cluster and then runs 20 Docker containers in those locations. If the count is increased from 20 to 50, Nomad simply schedules 30 more tasks.

The job specification also allows developers to constrain what type of hosts the tasks can be scheduled on. For example, a developer might want to constrain certain tasks to Windows machines, machines of a certain size, or distribute all jobs across distinct hosts. Finally, the restart policy section defines how Nomad should restart tasks that fail. In this example, the Nomad job has a restart governor that will prevent a task from being restarted more than three times within five minutes.

job "bench-docker-classlogger" {
  datacenters = ["us-central1"]

  group "classlogger_tg_1" {
    count = 20

    constraint {
      attribute = "${node.class}"
      value   = "class_1"
    }

    restart {
      mode   = "fail"
      attempts = 3
      interval = "5m"
      delay = "5s"
    }

    task "classlogger_1" {
      driver = "docker"

      config {
        image = "hashicorp/nomad-c1m:0.1"
        network_mode = "host"
      }

      resources {
        cpu = 20
        memory = 15
        disk = 10
      }
    }
  }
}

Importantly, the job specification is declarative. It defines the desired state of the submitted job: the user expresses that the job should be running, but not where it should run. Nomad's responsibility is to make sure the actual state matches the user's desired state. This makes maintaining both a large cluster and a large number of submitted jobs much easier, since users don't need to reconcile actual state with desired state themselves; Nomad does that automatically.
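
A toy reconciliation function illustrates the loop. The helper below is hypothetical and models only the count field, but it captures how a declarative specification turns both scaling up and replacing failed tasks into the same "compute the difference and act on it" step.

package main

import "fmt"

// reconcile compares a job's declared count against the allocations that
// are actually running and returns how many placements (or stops) are
// needed to converge. This is a toy model of the desired-state versus
// actual-state loop, not Nomad's implementation.
func reconcile(desired, running int) (place, stop int) {
	if running < desired {
		return desired - running, 0
	}
	return 0, running - desired
}

func main() {
	// The count is raised from 20 to 50: 30 new placements are needed.
	place, stop := reconcile(50, 20)
	fmt.Printf("place %d, stop %d\n", place, stop) // place 30, stop 0

	// A host failure takes three tasks with it: three replacements are needed.
	place, stop = reconcile(50, 47)
	fmt.Printf("place %d, stop %d\n", place, stop) // place 3, stop 0
}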

Scalability needs to be thought of at both developer and machine levels. At the machine level, Nomad's optimistically concurrent design enables scale. At the developer and organizational level, the simple job specification makes it easy to submit jobs that reach scale.

Service discovery is essential when scheduling containers at scale

Scheduling one million containers across 5,000 hosts is only helpful if you can then discover the location of those containers after they have been placed. Manually updating service configurations at this scale would be logistically impossible, so an automated solution is needed.

Service discovery tools such as Consul by HashiCorp or ZooKeeper allow services to register themselves and their location with a central registry. Other services in the cluster then use this central registry to discover the services they need to connect to. For example, if 50 API tasks are scheduled across 30 hosts, each task registers its location and port with the service registry. Other services in the cluster can then query the registry for the location and port of all 50 instances, to be used for configuration and load balancing. If one of the tasks fails, the scheduler may reschedule it in a different location, which then gets updated in the service registry, and that change propagates across the cluster. If a host fails, the scheduler needs to find new locations for all tasks running on that host, and all of those tasks need to update their information in the service registry.
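
As a concrete (and hypothetical) example of what registration looks like, the Go snippet below announces one service instance to a local Consul agent over Consul's HTTP API. The service name, ID, port, and health-check URL are made-up values; in a scheduler-driven workflow they would come from the allocation the scheduler actually made.

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// registration mirrors the payload accepted by a local Consul agent's
// /v1/agent/service/register endpoint (field names per Consul's HTTP API).
type registration struct {
	ID    string `json:"ID"`
	Name  string `json:"Name"`
	Port  int    `json:"Port"`
	Check check  `json:"Check"`
}

type check struct {
	HTTP     string `json:"HTTP"`
	Interval string `json:"Interval"`
}

func main() {
	// Hypothetical values: one instance of the "api" service announcing
	// itself on the port the scheduler assigned to it.
	reg := registration{
		ID:   "api-7",
		Name: "api",
		Port: 20456,
		Check: check{
			HTTP:     "http://localhost:20456/health",
			Interval: "10s",
		},
	}
	body, err := json.Marshal(reg)
	if err != nil {
		log.Fatal(err)
	}

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:8500/v1/agent/service/register",
		bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("registered with Consul:", resp.Status)

	// Other services can now discover every healthy "api" instance via
	// DNS (api.service.consul) or by querying /v1/health/service/api.
}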

Especially at scale, a scheduler-driven application delivery workflow is highly dynamic. Being able to schedule jobs is just one challenge of application delivery; there also needs to be a robust system for service discovery, health checks, and configuration to ensure the system as a whole stays healthy. Service discovery tools like Consul or ZooKeeper are essential to maintain these dynamic systems.

Schedulers enable scale at the application and organization levels

Several technical improvements were made to Nomad during the testing process. We optimized radix trees to squeeze more performance from the scheduler; we added robustness features, such as retries and upper bounds on allocation attempts when Docker crashed; we addressed quirks in libcontainer that surfaced when running a large number of containers on a single node; and we hardened Nomad against these difficult-to-replicate conditions so that any user can recreate the test with the newest version of Nomad. Container technology is still young in the marketplace, which is why Nomad also supports workloads beyond Docker containers, such as virtual machines, chroot(2)-isolated executables, and raw executables.

Rapid scheduling across infrastructure is helpful beyond just scaling cluster size. Most companies have a number of different teams or organizations, often competing for resources. Traditionally this problem has been solved by creating unique, segmented clusters for each group. While this does provide isolation and prevent overlap, it usually leads to underutilization of each team's cluster.

By pooling resources into a single cluster, a company can unify and simplify its infrastructure while also allowing for more flexibility between teams. If one team needs more resources at a given time, its workloads can be rapidly expanded onto unused parts of the cluster. Rapid, cluster-wide deployment of resources makes this possible.

Being able to schedule one million containers in under five minutes addresses the scalability requirements of the vast majority of enterprises. Nomad will continue to receive performance improvements, but the marginal gains there will become smaller and smaller. Beyond performance scalability, organizational scalability is the next challenge. Features that address quotas, security, chargebacks, and more will make a scheduler workflow easier to adopt in large organizations.

About the Author

Kevin Fishner is the Director of Customer Success at HashiCorp. He has extensive experience working with customers across HashiCorp's open source and commercial products. Philosopher by education (Duke), engineer by trade. @KFishner

 
