Safe and Fast Deploys at Planet Scale

Summary

Mathias Schwarz discusses the software deployment and scalability practices used by Uber, and the need to have these tasks managed automatically by software.

Bio

Mathias Schwarz has been an infrastructure engineer at Uber for more than 5 years. He and his team are responsible for the deployment platform for stateless services used across all of Uber engineering. Mathias has a PhD in Computer Science from the Programming Languages group at Aarhus University.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Schwarz: My name is Mathias Schwarz. I'm a software engineer at Uber. I will be talking about safe and fast deploys at planet scale. Uber is a big business. We have several different products. They are, in most cases, deployed to dozens or hundreds of markets all over the world. The biggest of our products is the Uber Rides product, the product that will take you from somewhere in town to somewhere else with the click of a button. We do 18 million trips every day. Those are numbers from Q1, 2020. In addition to the trips that happen on the Uber Rides platform, we have Uber Eats for meal delivery, and a list of other products as well.

Scale - Infrastructure

To handle all these services, all these products, we have a large set of backend services. In our case, it's about 4,000 different microservices that are deployed both across our own data centers, on our own machines and servers, and into public clouds such as AWS and GCP. Those 4,000 services make up the backend of the Uber products.

Scale - Deployment

We do 58,000 builds per week, and we roll out 5,000 changes to production on a weekly basis. If you look at those numbers, it means that every one of our backend services is, on average, upgraded in production more than once every week. Since it takes a while to perform an upgrade of a service, it also means that there's never a point in time where the system isn't undergoing an upgrade of at least one of our backend services. Uber's backend is never in a stable state; we're always rolling stuff out to our production environment at any point in time.
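
As a rough back-of-the-envelope check of that claim (the per-service average is my own arithmetic on the numbers quoted in the talk):

```python
# Back-of-the-envelope check: 5,000 weekly rollouts spread over ~4,000 services.
services = 4_000           # stateless microservices in the Uber backend
deploys_per_week = 5_000   # changes rolled out to production per week

print(deploys_per_week / services)  # 1.25 -> each service deployed more than once a week, on average
```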

Regions and Zones

When we think of our infrastructure at Uber, we think of it in terms of these layers. At the lowest layer are the individual servers. The servers are the individual machines, the hardware that runs each of the processes. Servers are physically placed within some zone. A zone can either be something that we own ourselves, a data center that we own at Uber, or it can be a cloud zone where the machines are part of the GCP public cloud or AWS. A zone never spans multiple providers; it's always only one provider. A set of zones makes up a region. A region is essentially zones that are in close physical proximity to each other, so that there's low latency on calls between processes within these zones. This means that you can expect a request to have low latency from one zone to the other. Combined, these regions make up our global infrastructure. When you want to deploy a new build into production, it is basically the process of globally rolling that new build out to all the relevant servers in all zones of the Uber infrastructure.
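
A minimal sketch of that hierarchy as a data model (the type names, zone names, and fields are my own illustration, not Uber's internal API):

```python
from dataclasses import dataclass, field

@dataclass
class Server:
    hostname: str  # an individual machine that runs service processes

@dataclass
class Zone:
    name: str
    provider: str  # e.g. "uber-dc", "aws", or "gcp" -- a zone never spans providers
    servers: list[Server] = field(default_factory=list)

@dataclass
class Region:
    name: str  # e.g. "DCA" or "PHX"
    zones: list[Zone] = field(default_factory=list)  # zones in close physical proximity

# The global infrastructure is the set of all regions; deploying a build globally
# means rolling it out to the relevant servers in every zone of every region.
infrastructure: list[Region] = [
    Region("DCA", [Zone("dca1", "uber-dc"), Zone("dca-cloud1", "aws")]),
    Region("PHX", [Zone("phx1", "uber-dc"), Zone("phx-cloud1", "gcp")]),
]
```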

Early Days: Unstructured Deploys

When we started out on building our deploy strategies and our deploy systems at Uber, we did the same thing as most other companies do. Each of our service teams had a specific set of hosts where they would deploy their new builds. Whenever they wanted to release a change, they would go to these servers, either manually or with a Jenkins script, deploy the build to the servers, and make sure that they had upgraded all of their processes to roll out that new build. That had several drawbacks. One was that if a server failed, it would be a manual process for the team to clean that up. Even worse, if there was a bug in the change that was being rolled out, it would mean that the team would have to clean that up and get the system back to a good state after getting their bad change out of the production system.

Important Deploy System Features

About 2014, we took a step back and began thinking about what it would take to create a deploy system that would handle all these issues automatically and make it easier for our engineers to keep deploying at a high frequency, but at the same time make it safe. We came up with a list of requirements of things we wanted the system to be able to do. We wanted our builds to be consistent.

We wanted the builds to look the same regardless of what language was being used, regardless of what framework was being used, and regardless of what team was building the service. The builds should look the same to the deploy system to make it easier to manage them. We wanted all deploys to be zero downtime, meaning that when you, as an engineer, want to roll out your build, you want the system to automatically manage the order of the rollout to the servers. You also want the system to make sure not to stop more processes at a time than it can without interfering with the traffic that goes into the service. We also wanted to make outage prevention a first-class citizen of this system. Essentially, we wanted the system to be able to discover issues, if there were any, when we rolled out a new build to production. We also wanted the system to be able to get our backend back to a good state. Overall, the idea was that this would let our engineers simply push out new changes and trust the system to take care of the safety of those deploys.

Structured Deploys With uDeploy

Based on these requirements, based on these principles, we started building the Micro Deploy system. Micro Deploy went live in 2014. Over the course of that year, we moved all our backend services to that new platform. In Micro Deploy, we made all our builds be Docker images. We did that using a combination of a build system called Makisu, plus a build management system called Hugo that we built internally. Essentially, these two systems combined meant that all our Docker images looked the same and would behave the same to the deploy system, simplifying management of deploys quite significantly.

Deploy Into Cluster in Zones

We also changed the level of abstraction for our engineers. Instead of having to worry about the individual servers to deploy to, we told them to just tell us which zones and what capacity they wanted in each of those zones. This meant that instead of asking the engineer to find specific servers, we would find capacity in those zones and deploy into them. Whenever a server failed, we would replace it, and the service would be moved to new servers without any human involvement. We did that in uDeploy using a combination of the open source cluster management system called Mesos, plus a stateless workload scheduler called Peloton that we built internally at Uber and made open source. Today you can achieve something similar, do the same thing basically, using Kubernetes.
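
A minimal sketch of what that zone-level abstraction amounts to (the structure and the reconcile step are illustrative, not uDeploy's actual API):

```python
# Instead of naming specific hosts, a service owner declares desired capacity
# per zone; the platform picks the actual servers and replaces failed ones.
desired_capacity = {
    "service": "example-service",  # hypothetical service name
    "zones": {"dca1": 100, "phx1": 100},
}

def reconcile(desired: dict, running: dict) -> dict:
    """Return how many instances to start (+) or stop (-) in each zone."""
    return {zone: want - running.get(zone, 0)
            for zone, want in desired["zones"].items()}

# If a server dies and dca1 drops to 97 running instances, the scheduler simply
# starts 3 more elsewhere in the zone -- no human involvement.
print(reconcile(desired_capacity, {"dca1": 97, "phx1": 100}))  # {'dca1': 3, 'phx1': 0}
```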

Safety - Monitoring Metrics

We also decided to build safety mechanisms directly into the deploy platform to make our deploys as safe as possible. One thing that we built into the deploy platform is our monitoring system, uMonitor. All our services emit metrics that are ingested by uMonitor. uMonitor continuously monitors these metric time series and makes sure that the metrics do not go outside some predefined threshold. If we do see the metrics break these predefined thresholds, we will initiate a rollback to a safe state, and that will happen automatically in the Micro Deploy system. Micro Deploy caches the previous state of the service, and when the rollback is initiated, it automatically gets the service back to its old state.
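
A minimal sketch of the kind of check this implies (the thresholds, metric names, and rollback hook are illustrative; the talk does not describe uMonitor's actual interface):

```python
# Hypothetical deploy-time metric check: if a watched metric leaves its allowed
# band during a rollout, the deploy system rolls back to the cached previous state.
THRESHOLDS = {
    "error_rate": (0.0, 0.01),       # allowed fraction of failed requests
    "p99_latency_ms": (0.0, 800.0),  # allowed tail latency
}

def breached(metrics: dict[str, float]) -> list[str]:
    """Return the names of metrics outside their predefined thresholds."""
    return [name for name, (low, high) in THRESHOLDS.items()
            if name in metrics and not (low <= metrics[name] <= high)]

def monitor_rollout(current_metrics: dict[str, float], rollback) -> None:
    bad = breached(current_metrics)
    if bad:
        rollback(reason=f"threshold breach: {', '.join(bad)}")  # automatic rollback

# Example: a spike in error rate during the rollout triggers the rollback hook.
monitor_rollout({"error_rate": 0.05, "p99_latency_ms": 420.0},
                rollback=lambda reason: print("rolling back:", reason))
```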

Safety - Whitebox Integration

Also, for our most important services at Uber, we have whitebox integration testing. We use a system that we developed internally called Hailstorm, which, when you roll out the first instances to a new zone, will run load tests against these specific instances in production, and will run whitebox integration tests. Essentially, these are integration tests that hit the API endpoints of the service and make sure that the API still behaves as we expect it to. By doing this on the first few instances that roll out to a zone, we can discover issues in production before they hit more than a few of our hosts. We can also roll back to the previously known safe state for the service if some of these tests fail.
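
As an illustration of what such a whitebox check might look like (the endpoint, response shape, and gating function are hypothetical, not Hailstorm's actual interface):

```python
# Hypothetical whitebox integration test run against the first few upgraded
# instances in a zone, before the rollout is allowed to continue.
import json
import urllib.request

def check_health_endpoint(instance_url: str) -> bool:
    """Hit a known API endpoint and verify it still behaves as expected."""
    with urllib.request.urlopen(f"{instance_url}/health", timeout=5) as resp:
        return resp.status == 200 and json.load(resp).get("status") == "ok"

def gate_rollout(canary_instances: list[str]) -> bool:
    """Only let the rollout continue if every canary instance passes the test."""
    return all(check_health_endpoint(url) for url in canary_instances)

# If gate_rollout(...) returns False, the deploy system rolls the service back
# to the previously known safe state instead of continuing across the zone.
```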

Safety - Continuous Blackbox

Finally, we have built what we call blackbox testing. Blackbox testing is essentially virtual trips that are happening continuously in all the cities where the Uber products are live. Blackbox takes these virtual trips, and if we see that you cannot take a trip in one of those cities, then there'll be a page to an engineer. This engineer will then manually have to decide whether to roll back or whether to continue the rollout. They'll also have to decide which services could have caused trips on the platform to suddenly start failing. That is the last-resort rollback that we have here. Micro Deploy gave us that safety at scale. It also gave us availability of services, despite individual servers failing. What we discovered a couple of years ago was that we were spending an increasingly large amount of our engineering time managing services. Engineers still had to figure out in which zones to place a service. Would they want to place a service on AWS or in our own data centers? How many instances would they need, and so on? Service management was still a very manual task two years ago.
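
Going back to the blackbox trips, a minimal sketch of the idea (the probe, city list, and paging call are purely illustrative; the real system is not described in that detail here):

```python
# Hypothetical continuous blackbox probe: take a "virtual trip" in each city
# and page an engineer if trips stop completing end to end.
import time

CITIES = ["amsterdam", "san-francisco", "bangalore"]  # illustrative list

def take_virtual_trip(city: str) -> bool:
    """Drive a synthetic trip through the product flow; True if it completes."""
    ...  # request a trip, match a (virtual) driver, complete it, verify the receipt
    return True

def page_engineer(city: str) -> None:
    print(f"PAGE: virtual trips failing in {city}; decide rollback vs. continue")

def blackbox_loop() -> None:
    while True:
        for city in CITIES:
            if not take_virtual_trip(city):
                # A human decides whether to roll back and which recent rollout
                # is the likely cause -- this is the last-resort safety net.
                page_engineer(city)
        time.sleep(60)
```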

Efficiency at Scale

We took a step back again and thought, what would it take to build a system that could automate all of these daily tasks for our engineers and make sure that the platform could basically manage itself? We came up with three principles that we wanted to build into our system. We wanted it to be truly multi-cloud, meaning that we didn't want there to be any difference for our engineers whether the service ran on a host or server that we owned ourselves in one of our own data centers, or in one of the public clouds. It shouldn't matter. We should just be able to deploy anywhere without any hassle. We also wanted it to be fully managed, meaning that we wanted the engineers to only worry about making their changes, making sure that these changes work, and rolling them out to production. We no longer wanted them to handle placement in zones, or handle scaling of the services, or any of these other manual management tasks. Even while we were building that, we still wanted the system to be predictable. We still wanted the engineers to be able to understand and predict what would happen to their service. Even if we were deciding on changing the scaling or moving a service to a cloud zone, we wanted to tell our engineers what was going on and why that was happening.

Efficient Infrastructure with Up

Based on these three principles, we started building our current deploy platform at Uber which is called Up. In Up, we took another step up in terms of level of abstraction for our engineers to care about when they managed their services and rolled out their changes. Instead of asking them to care about individual zones, we just asked them about what their requirements are on physical placement in terms of regions. As an engineer using Up, I'd just ask for my service to be deployed into a specific region, and Up would then take care of the rest.

That looks something like this to our engineers today. This is the Up system. If we zoom in on the status bar here for the service in the recording, we can see that this service is deployed into a canary, and it's deployed into two different regions. These two regions in this case are called DCA and PHX. We're not telling our engineers whether the physical servers run in cloud zones or whether they run in our own data centers. We're just telling them that there are these two regions, this is where they are available, and this is how many instances they have in these two regions.

When the engineer does a rollout to production, or when the system decides on making changes to a service, he sees a plan like this. The plan consists of steps that have already been performed, so you can see what has happened so far for this service. What is currently happening? For example, which zone are we currently upgrading for this service, and how far are we in that upgrade? Then finally, there's a list of changes that will be applied later, after the current change has been applied. This means that the engineer can see what has happened, he can see what is going on right now, and he can see what the system will do later in this rollout. It's completely predictable what's going to happen over the course of this rollout.
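
A minimal sketch of what such a rollout plan amounts to as a data structure (the fields and step names are illustrative, not Up's actual schema):

```python
# Illustrative rollout plan: completed steps, the step in progress, and the
# steps the system will apply later -- exactly what the UI shows the engineer.
from dataclasses import dataclass

@dataclass
class Step:
    description: str       # e.g. "upgrade zone dca1 to the new build"
    status: str            # "done", "in_progress", or "pending"
    progress: float = 0.0  # fraction of instances upgraded in this step

plan = [
    Step("canary: upgrade a few instances in DCA", "done", 1.0),
    Step("upgrade zone dca1", "in_progress", 0.4),
    Step("upgrade zone phx1", "pending"),
    Step("upgrade remaining zones in PHX", "pending"),
]

for step in plan:
    print(f"[{step.status:^11}] {step.description}")
```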

Let's Add a New Zone

One thing that we wanted the Up system to solve for us was to basically make our backend engineers not care about the topology of the underlying infrastructure. One thing that we wanted to achieve here was to be able to, for example, add a new zone. If I have my regions here, and I'm adding a new zone to an existing region, it looks something like this. The infrastructure team will set up the physical servers, they will set up storage capacity, and they will connect the zone physically to the existing infrastructure. The challenge is then to get the 4,000 service owners and 4,000 services, or at least a fraction of them, to move to the new zone to actually use this new capacity that we have in that zone. Before Up, it would take 60 engineers roughly 2 months to complete a new zone stand-up. That was quite a lot of time. We wanted Up to automate that for us.

Declarative Configuration

Let's say we had a new zone here. The engineers only configure their capacity in terms of regions, so in terms of physical placement in the world. They will tell us that they want 250 instances in the DCA region and 250 instances in the PHX region, and they can tell the deploy system some basic stuff about their dependencies on other services and whether they want to use a canary for their service. It then becomes Up's responsibility to look at this configuration and to continuously evaluate whether the current placement and the current configuration of the service is the right one. Up will continuously compare the current topology of the infrastructure to these abstract, declarative service configurations, and figure out how to manage and place this service optimally.
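
A minimal sketch of such a declarative, region-level service configuration (the format and field names are my own illustration of the idea, not Up's real configuration language):

```python
# Hypothetical declarative config: the owner states region-level intent only;
# placement into concrete zones is left entirely to the platform.
service_config = {
    "name": "publican",  # service named in the talk
    "regions": {
        "DCA": {"instances": 250},
        "PHX": {"instances": 250},
    },
    "canary": True,  # roll a small canary before the full regional rollout
    "dependencies": ["some-upstream-service"],  # hypothetical dependency
}

# Up's job is to continuously compare this declared intent against the current
# topology (available zones and their capacity) and compute a concrete,
# per-zone placement that satisfies it.
```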

With this configuration, and with this continuous evaluation loop, what happens in our system when we add a new zone? The Up system will automatically discover that there is a better placement, that there is a new zone available with much more capacity than the existing zones have for some specific service, for example this one called publican. Then after evaluating these constraints, and deciding that there's a better placement, what we call the compute balancer will kick off a migration from one zone within that region to another zone. Those 60 engineers no longer have to spend their time manually moving these services, since our compute balancer bot does that for them automatically.
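
A minimal sketch of that evaluation step (purely illustrative; the real balancer weighs far more constraints than a single capacity comparison):

```python
# Hypothetical compute-balancer step: when a zone with more free capacity
# appears in a region, propose migrating the service's busiest placement there.

def propose_migration(placement: dict[str, int],
                      free_capacity: dict[str, int]) -> tuple[str, str] | None:
    """Return (from_zone, to_zone) if a clearly better zone exists, else None."""
    current_zone = max(placement, key=placement.get)       # zone with most instances
    best_zone = max(free_capacity, key=free_capacity.get)  # zone with most headroom
    if best_zone not in placement and free_capacity[best_zone] > placement[current_zone]:
        return current_zone, best_zone
    return None

# A new zone "dca3" comes online with plenty of free capacity; the balancer
# decides to move the service there -- no engineer involved.
placement = {"dca1": 150, "dca2": 100}            # instances per zone today
free_capacity = {"dca1": 10, "dca2": 5, "dca3": 500}
print(propose_migration(placement, free_capacity))  # ('dca1', 'dca3')
```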

Summary

That was our journey: from a small-scale company where engineers managed individual servers, to the zone-based abstraction of Micro Deploy, where we could automatically manage servers but service management overall was still a daily task for our engineers, and then finally to our new Up system, where essentially we have automated everything at the regional level. We can safely roll out 5,000 times to production per week, and we can manage a system at the scale the Uber backend has become. The key to getting this to work in practice is automation, and abstractions at a level that allow the system to perform the management tasks that the engineers would otherwise have had to perform manually. This means that we can leave service management completely up to machines, both in terms of placement, in terms of deciding on host providers, and in terms of deciding on scaling of the service.

 


 

Recorded at:

May 21, 2021
