Safe and Fast Deploys at Planet Scale

Key Takeaways

  • Uber’s infrastructure platform lets thousands of engineers make changes to the system in parallel without sacrificing stability.
  • We scaled the system originally and gradually raised the level of abstraction from individual hosts to regions of multiple zones as the business grew.
  • Abstraction of physical hosts and zones from daily operations lets us greatly reduce the friction incurred when running stateless services at Uber.
  • Automation of the rollout process beyond the simple, happy case is key to large-scale automation of stateless infrastructure overall.
  • Unification into a single, managed control plane greatly improves Uber’s ability to manage stateless workloads across many availability zones efficiently.

At QCon Plus, Mathias Schwarz, a software engineer at Uber, presented safe and fast deploys at planet scale. Uber is a big business with several different products, most of which are deployed to dozens or hundreds of markets all over the world. The biggest of these is the Uber Rides product, which takes you from somewhere in town to somewhere else at the click of a button. Uber completes 18 million trips every day, according to figures from Q1 2020. In addition to trips on the Uber Rides platform, Uber offers Uber Eats for meal delivery, among other products.

Scale - Infrastructure

To handle all these products, Uber has a large set of backend services: about 4,000 different microservices deployed across machines in several data centers.

Scale - Deployment

At Uber we do 58,000 builds per week and roll out 5,000 changes to production every week. With about 4,000 services, that works out to roughly 1.25 production deploys per service per week, so on average every one of Uber's backend services is deployed to production more than once a week.

Since it takes a while to perform an upgrade of a service, it also means that there's never a point in time where the system isn't undergoing some upgrade of at least one of our backend services.

Regions and Zones

When we think of our infrastructure at Uber, we think of it in terms of these layers. At the lowest layer are the individual servers. The servers are the individual machines, the hardware that runs each of the processes. Servers are physically placed within some zone. A zone can either be something we own ourselves, a data center at Uber, or a cloud zone where the machines are part of the GCP public cloud or AWS.

A zone never spans multiple providers; it always belongs to exactly one. A set of zones makes up a region. A region is essentially a group of zones that are physically close to each other, so calls between processes in different zones of the same region have low latency. Combined, these regions make up our global infrastructure. So deploying a new build into production basically means rolling it out globally, to all the relevant servers in all zones of the Uber infrastructure.
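The layering described above can be sketched as a small data model. This is only an illustration, not Uber's actual schema: the class names, zone names, provider labels, and hostnames are all assumptions made for the example. Only the region names DCA and PHX appear elsewhere in this article.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Zone:
    name: str
    provider: str              # a zone belongs to exactly one provider
    servers: Tuple[str, ...]   # hostnames of the machines in this zone


@dataclass(frozen=True)
class Region:
    name: str
    zones: Tuple[Zone, ...]    # zones that are physically close to each other


def deploy_targets(regions: List[Region]) -> List[Tuple[str, str, str]]:
    """List every (region, zone, server) triple a global rollout has to reach."""
    return [
        (region.name, zone.name, server)
        for region in regions
        for zone in region.zones
        for server in zone.servers
    ]


# Hypothetical topology: zone names, providers, and hostnames are made up.
REGIONS = [
    Region("DCA", (Zone("dca-a", "on-prem", ("host1", "host2")),
                   Zone("dca-b", "gcp", ("host3",)))),
    Region("PHX", (Zone("phx-a", "aws", ("host4", "host5")),)),
]

for target in deploy_targets(REGIONS):
    print(target)
```

Flattening the hierarchy like this makes the point of the section concrete: a "global deploy" is a walk over every server in every zone of every region.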

Early Days: Unstructured Deploys

When Uber started building its deploy strategies and deploy systems, it started out the same way as most other companies. Each of Uber’s service teams had a specific set of hosts where they would deploy their new builds. Whenever they wanted to release a change, they would go to these servers, either manually or with a Jenkins script, deploy the build, and make sure that they had upgraded all of their processes to the new build. This approach had several drawbacks. For instance, cleaning up after a server failure was a manual process for the team. Even worse, if there was a bug in the change being rolled out, the team had to clean that up themselves and get the system back to a good state after removing the bad change from production.

Important Deploy System Features

In 2014, we took a step back and began thinking about what it would take to create a deploy system that would automate all these operations and let our engineers keep deploying at a high frequency while also keeping deploys safe. We came up with a list of requirements for the system:

  • We wanted our builds to be consistent. Builds should look the same regardless of what language was used, what framework was used, and what team was building the service. The build should look the same to the deploy system to make it easier to manage them.
  • We wanted all deploys to have zero downtime. When you roll out a build, the system should manage the rollout order to the servers automatically, and it should never stop more processes than the service can lose without affecting the traffic that goes into the service.
  • We wanted to make outage prevention a first-class citizen of the system. Essentially, the system should be able to discover and respond to issues when a new build is rolled out to production.
  • Finally, we wanted the system to be able to get our backend back to a good state. Overall, the idea was that this would let our engineers simply push out new changes and trust the system to take care of the safety of those deploys.
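To make the zero-downtime and "back to a good state" requirements concrete, here is a minimal sketch of a batched rollout. It is not how Uber's deploy system is implemented. The function names, the `max_unavailable` parameter, and the `is_healthy` callback are assumptions for illustration. The idea is that at most `max_unavailable` instances are touched at a time, and the previous state is restored if a batch turns out unhealthy.

```python
from typing import Callable, Dict, List


def plan_batches(instances: List[str], max_unavailable: int) -> List[List[str]]:
    """Split instances into batches of at most max_unavailable instances."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    return [instances[i:i + max_unavailable]
            for i in range(0, len(instances), max_unavailable)]


def rolling_upgrade(running: Dict[str, str], new_build: str, max_unavailable: int,
                    is_healthy: Callable[[str, str], bool]) -> bool:
    """Upgrade batch by batch; restore the captured state if a batch is unhealthy.

    `running` maps instance name to build and is updated in place.
    Returns True if the rollout completed, False if it was rolled back.
    """
    previous = dict(running)  # captured state to roll back to
    for batch in plan_batches(sorted(running), max_unavailable):
        for instance in batch:
            running[instance] = new_build
        if not all(is_healthy(instance, new_build) for instance in batch):
            running.clear()
            running.update(previous)
            return False
    return True


# Hypothetical service with three instances on build "v1".
service = {"i-1": "v1", "i-2": "v1", "i-3": "v1"}
ok = rolling_upgrade(service, "v2", max_unavailable=2,
                     is_healthy=lambda instance, build: True)
print(ok, service)
```

A real system would wait for health signals over time instead of calling a synchronous callback, but the two properties are the same: a bound on how much is stopped at once, and a saved state to return to.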

Structured Deploys With uDeploy

Based on these requirements, we at Uber started building the Micro Deploy (uDeploy) system. Micro Deploy went live in 2014. Over that year, we moved all our backend services to the new platform. In Micro Deploy, we made all our builds Docker images. We did that using a combination of tooling that included Makisu, a build system that we built internally. Combined, this tooling meant that all our Docker images looked the same and behaved the same to the deploy system, which simplified the management of deploys significantly.

Deploy Into Cluster in Zones

At Uber, we also changed the level of abstraction for our engineers. Instead of worrying about the individual servers to deploy to, engineers told us which zones they wanted and how much capacity they wanted in each of those zones. So instead of asking the engineer to find specific servers, we provided capacity in these zones and deployed into it. Whenever a server failed, we would replace it, and the service would be moved to the new server without any human involvement. We did that in uDeploy using a combination of the open-source cluster management system Mesos and a stateless workload scheduler called Peloton, which we built internally at Uber and open-sourced. Today you can achieve something similar using Kubernetes.

Safety - Monitoring Metrics

We also decided to build safety mechanisms directly into the deploy platform to make our deploys as safe as possible. One thing we integrated into the deploy platform is our monitoring system, uMonitor. All our services emit metrics that are ingested by uMonitor. uMonitor continuously monitors these metrics as time series and checks that they stay within predefined thresholds. If we see metrics, for example database metrics, break these thresholds, we initiate a rollback to a safe state, which happens automatically in the Micro Deploy system. Micro Deploy captures the system's previous state, and when the rollback is initiated, it automatically gets the service back to that state.
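The threshold check can be sketched as follows. This is a simplified illustration and does not reflect uMonitor's real interface. The `Threshold` type, the metric names, and the rule of only looking at the latest sample are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Threshold:
    metric: str
    max_value: float


def first_breach(samples: Dict[str, List[float]],
                 thresholds: List[Threshold]) -> Optional[str]:
    """Return the first metric whose latest sample exceeds its threshold, if any."""
    for threshold in thresholds:
        series = samples.get(threshold.metric, [])
        if series and series[-1] > threshold.max_value:
            return threshold.metric
    return None


def next_action(samples: Dict[str, List[float]],
                thresholds: List[Threshold]) -> str:
    """Decide whether the rollout continues or rolls back to the captured state."""
    breached = first_breach(samples, thresholds)
    return "rollback" if breached is not None else "continue"


# Hypothetical metrics and limits.
THRESHOLDS = [Threshold("db_error_rate", 0.01), Threshold("p99_latency_ms", 500.0)]
SAMPLES = {"db_error_rate": [0.001, 0.002, 0.05], "p99_latency_ms": [120.0, 130.0]}

print(first_breach(SAMPLES, THRESHOLDS), next_action(SAMPLES, THRESHOLDS))
```

A production monitor would typically evaluate a window of samples so that a single noisy data point does not trigger a rollback. The latest-sample rule here just keeps the example short.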

Safety - Whitebox Integration

Also, for our most important services at Uber, we have whitebox integration testing. We use a system that we developed internally called Hailstorm. When you roll out the first instances to a new zone, Hailstorm runs whitebox integration tests and load tests against these specific instances in production. This happens in addition to the large sets of integration tests that are run before the code lands.

These integration tests hit the API endpoints of the deployed service and make sure that the API still behaves as we expect. By doing this on the first few instances that roll out to a zone, we can discover issues in production before they hit more than a few of our hosts. If some of these tests fail, we can also roll back to the previously known safe state for the service.
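The gating step described above might look roughly like this. The sketch does not model Hailstorm itself. `gate_zone_rollout`, `first_n`, and the `passes_checks` callback (standing in for the integration and load tests) are hypothetical names.

```python
from typing import Callable, List, Tuple


def gate_zone_rollout(instances: List[str], first_n: int,
                      passes_checks: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Test the first instances in a zone before touching the rest.

    Returns ("rollback", failing_instances) if any of the first instances
    fails its checks, otherwise ("proceed", remaining_instances).
    """
    first = instances[:first_n]
    failing = [instance for instance in first if not passes_checks(instance)]
    if failing:
        return ("rollback", failing)
    return ("proceed", instances[first_n:])


# Hypothetical zone with five instances; only the first two are tested up front.
zone_instances = ["i-1", "i-2", "i-3", "i-4", "i-5"]
print(gate_zone_rollout(zone_instances, 2, lambda instance: True))
print(gate_zone_rollout(zone_instances, 2, lambda instance: instance != "i-2"))
```

The benefit is the blast radius: a bad build is caught after `first_n` instances rather than after the whole zone has been upgraded.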

Safety - Continuous Blackbox

Finally, we have built what we call Blackbox testing. Blackbox testing is essentially virtual trips happening continuously in all the cities where Uber products are live. Blackbox takes these virtual trips, and if we see that you cannot take a trip in one of those cities, an engineer is paged. This engineer then has to decide manually whether to roll back or continue the rollout. They also have to determine which services could have caused trips on the platform to suddenly start failing. Blackbox testing is therefore the issue detection mechanism of last resort.

Micro Deploy gave us safety at scale. It also gave us availability of services despite individual servers failing. A couple of years ago, however, we discovered that we were spending an increasingly large amount of our engineering time managing services. Engineers still had to figure out in which zones to place a service. Would they, for example, want to put a service on AWS or in our own data centers? How many instances would they need? Service management was still a very manual task two years ago.

Efficiency at Scale

Hence, we took a step back again and thought, what would it take to build that system that could automate all of these daily tasks for our engineers and ensure that the platform could manage itself?

We came up with three principles that we wanted to build into our system:

  1. First, we wanted it to be genuinely multi-cloud, meaning that it should make no difference to our engineers whether a service ran on a server that we owned in one of our own data centers or on one of the public clouds. We should be able to deploy anywhere without any hassle.
  2. Secondly, we wanted it to be fully managed, meaning that engineers should only have to worry about making their changes, making sure that these changes work, and rolling them out to production. We no longer wanted them to handle placement in zones, scaling the services, or other manual management tasks.
  3. And finally, we wanted the deploy system's behavior to be predictable: engineers should still be able to understand and predict what would happen to their service. So even if we decided to change the scaling or move a service to a cloud zone, we wanted to tell our engineers what was happening and why.

Efficient Infrastructure with Up

Based on these three principles, we started building our current deploy platform at Uber, called Up. In Up, we took another step up in the level of abstraction that engineers have to care about when they manage their services and roll out their changes. Instead of asking them to care about individual zones, we ask them about their physical placement requirements in terms of regions. So as an engineer using Up, I ask for my service to be deployed into a specific region, and Up then takes care of the rest.

In the service view that our engineers see today, a service is deployed into a canary and into two different regions, in this case called "DCA" and "PHX". We're not telling our engineers whether the physical servers run in cloud zones or in our own data centers. We're just telling them that there are these two regions and how many instances they have in each of them.

When an engineer does a rollout to production, they see a plan that shows how the system will make changes to the service. First, the plan lists the steps that have already been performed, so you can see what has happened so far. Second, it shows what is currently happening: for example, which zone we are currently upgrading for this service, and how far along that upgrade is. Finally, it lists the changes that will be applied after the current one, which makes the whole rollout entirely predictable for the engineer.
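A plan with these three parts can be sketched as an ordered list of steps. This is an illustrative data structure, not Up's actual plan format. The class names, step descriptions, and zone names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    description: str
    done: bool = False


@dataclass
class RolloutPlan:
    steps: List[Step]

    def completed(self) -> List[str]:
        """Steps that have already been performed."""
        return [s.description for s in self.steps if s.done]

    def current(self) -> Optional[str]:
        """The step in progress: the first one that is not done yet."""
        for s in self.steps:
            if not s.done:
                return s.description
        return None

    def pending(self) -> List[str]:
        """Steps that will be applied after the current one."""
        not_done = [s.description for s in self.steps if not s.done]
        return not_done[1:]

    def advance(self) -> None:
        """Mark the current step as done."""
        for s in self.steps:
            if not s.done:
                s.done = True
                return


# Hypothetical plan; the zone names are made up.
plan = RolloutPlan([Step("upgrade canary"), Step("upgrade zone dca-a"),
                    Step("upgrade zone dca-b"), Step("upgrade zone phx-a")])
plan.advance()
print(plan.completed(), plan.current(), plan.pending())
```

Because the steps are ordered and computed up front, an engineer can read off what has happened, what is happening, and what comes next at any point in the rollout.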

Let's Add a New Zone

The one thing that we wanted the Up system to solve for us was to make our backend engineers not have to care about the infrastructure, and specifically about the topology of the underlying infrastructure. As a specific goal, we wanted it to not matter to engineers if we add or remove a zone. Consider what happens when we add a new zone to an existing region.


 
The infrastructure team will set up the physical servers, set up storage capacity, and connect the zone physically to the existing infrastructure. The challenge is then to get 4,000 services and their owners, or at least a fraction of them, moved to the new zone to use the new capacity that we have there. Before Up, standing up a new zone was a highly manual process involving dozens of engineers. It was quite time-consuming, so we wanted Up to automate it for us.

Declarative Configuration

With Up, engineers only configure their capacity in terms of regions, that is, physical placement in the world. For example, they tell us that they want 250 instances in the DCA region and 250 instances in the PHX region. They can also tell the deploy system some basic information about their dependencies on other services and whether they want to use a canary for their service. It then becomes Up's responsibility to look at this configuration and continuously evaluate whether the current placement and configuration are right for each service. Up constantly compares the current topology of the infrastructure to these declarative service configurations and figures out how to place each service optimally.
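As a rough illustration, such a declarative configuration could be modeled like this. Up's real configuration format is not shown in this article, so the field names, the service name, and the dependency names below are assumptions. Only the region names and the 250-instance figures come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ServiceConfig:
    name: str
    instances_per_region: Dict[str, int]   # capacity is expressed per region only
    depends_on: Tuple[str, ...] = ()       # other services this one calls
    use_canary: bool = False

    def total_instances(self) -> int:
        """Total number of instances requested across all regions."""
        return sum(self.instances_per_region.values())


# Hypothetical service; note that no zone or provider appears anywhere.
config = ServiceConfig(
    name="example-service",
    instances_per_region={"DCA": 250, "PHX": 250},
    depends_on=("example-dependency",),
    use_canary=True,
)

print(config.total_instances())
```

The important property is what the configuration leaves out: zones, hosts, and providers are absent, so the platform stays free to change them without the service owner editing anything.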

With this configuration and continuous evaluation loop, what happens in our system when we add a new zone? First, the Up system automatically discovers that a better placement exists: for some specific service, a new zone is available with much more capacity than the existing zones. Then, after evaluating the constraints and deciding that there is a better placement, a component called Balancer kicks off a migration from one zone within that region to another. Engineers no longer have to spend their time manually moving these services, since Balancer does that for them automatically.
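One evaluation step of such a loop could be sketched as below. The actual placement logic of Up and Balancer is not described in this article, so the heuristic here (move instances toward the zone with the most free capacity until free capacity is roughly even) is an assumption chosen for illustration, as are the function name, the `min_gain` parameter, and the zone names.

```python
from typing import Dict, Optional, Tuple


def propose_move(placement: Dict[str, int], free_capacity: Dict[str, int],
                 min_gain: int = 10) -> Optional[Tuple[str, str, int]]:
    """Suggest (source_zone, target_zone, count) within one region, or None.

    placement:      zone -> instances of this service currently running there
    free_capacity:  zone -> free instance slots in that zone
    """
    if not free_capacity:
        return None
    target = max(free_capacity, key=lambda zone: free_capacity[zone])
    sources = [zone for zone, count in placement.items()
               if count > 0 and zone != target]
    if not sources:
        return None
    # Take instances from the most crowded zone the service currently runs in.
    source = min(sources, key=lambda zone: free_capacity.get(zone, 0))
    gap = free_capacity[target] - free_capacity.get(source, 0)
    if gap < min_gain:
        return None  # placement is already good enough; do nothing
    # Moving gap // 2 instances evens out free capacity between the two zones.
    return (source, target, min(placement[source], gap // 2))


# Before the new zone exists, there is nowhere better to go.
print(propose_move({"dca-a": 250}, {"dca-a": 20}))
# After a hypothetical zone "dca-b" is added with lots of free capacity.
print(propose_move({"dca-a": 250}, {"dca-a": 20, "dca-b": 500}))
```

Running this function repeatedly against the current topology is what makes the system continuous: adding a zone changes `free_capacity`, and the next evaluation proposes a migration without any service owner being involved.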

Summary

This article described our journey from a small-scale company where engineers managed individual servers, through the zone-based abstraction of Micro Deploy, where servers were managed automatically but service management was still a daily task for our engineers, to our new Up system, where we have automated everything at the regional level. With this platform, we can safely roll out 5,000 times to production per week and manage a system at the scale the Uber backend has reached. The key to getting this to work in practice is automation, together with abstractions at a level that lets the platform perform the management tasks that engineers would otherwise have had to perform manually. This means that we can leave service management entirely up to machines in terms of placement, choice of host provider, and scaling of the service.

About the Author

Mathias Schwarz has been an infrastructure engineer at Uber for more than 5 years. He and his team are responsible for the deployment platform for stateless services used across all of Uber engineering. Mathias has a PhD in Computer Science from the Programming Languages group at Aarhus University.

Community comments

  • Debugging

    by Benjamin Barnes,

    Hey, thanks for the writeup.

    Hello!

    I work at Amazon and the larger structure of your teams seems a little different than us. We don't have teams that manage the infrastructure and deployment pipelines - that is owned by software engineers themselves. The approach at Uber seems like it would let engineers focus on building new things instead of debugging deployments.

    I think I have two groups of questions for you:
    - How does Uber handle debugging large-scale production issues? Say there is an unexpected anomaly with the hosts that requires service/domain knowledge. At Amazon, engineers would have set up those hosts and worked with them, so they are able to SSH and debug as necessary when required. How is that ownership split at Uber with these two teams operating on the service? Do your infra engineers have domain knowledge and can help debug these sorts of issues? Or do you just page the SDE and have them deep dive?

    - Another question I have is about connecting to these hosts as an SDE. While obviously it's best to never manually touch production, sometimes that is absolutely required, even if just to look at logs. How do your engineers connect with the hosts when they are spread across different platforms, such as GCP and AWS? I assume they have different keys and the interfaces are different, etc. Would be interesting to hear more about this.

    Anyways, thanks for sharing - it's always interesting to hear how other companies are operating!

    Ben

  • Re: Debugging

    by Mathias Schwarz,

    Hi Ben,

    Our engineers do not generally have host access in production, and our stateless service hosts are all set up the same way so that any service may run on any host: they use the same OS version, the same stack of Uber Infra host software, etc. We pull all debug information out of the containers so that it can be inspected externally. Specifically, this means that logs are transferred out via Kafka and indexed so that they can be queried, stats are propagated to a central store (M3) where they can be queried, etc.
    In the rare case where production access is required, engineers can SSH via a proxy directly to their service container to run commands within the container itself (some services are even restricted from this, though). To the engineer, this interface will look the same regardless of the cloud provider, so they will not need to access things in a different way if we move the service to a new zone. This will generally allow the service engineers to debug their service independently of infra engineers.
