User Simulation for Rapid Outage Mitigation

Summary

Carissa Blossom walks through the monitoring service that Uber developed to identify issues in production, and how they leveraged composable integration tests to cut the time to mitigation in half.

Bio

Carissa Blossom is a member of the Production Engineering team for Eats & Delivery at Uber. She is also an Incident Commander for Ring0, Uber Engineering’s primary task force for critical outage mitigation. In her four years at Uber, she has served as Production Engineer for Eats, Software Networking Edge and Marketplace.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Blossom: I'm going to talk about user simulation for rapid outage mitigation. My name is Carissa Blossom. I have been an infrastructure engineer at Uber for over four years, working primarily in the SRE and production reliability space. For more than two years, I have been a commander for Uber's elite outage mitigation volunteer squad. I'm currently working to scale Uber Eats reliably.

The Story of March 30, 2020

Let's start with a story. March 30, 2020, just before 4 p.m. Most of us in San Francisco were busy trying to wrap our heads around our new quarantine lifestyle. That radical shift that most of us thought wouldn't last more than a few months at most. Irina, this week's on-call engineer for Ring0, Uber's elite outage mitigation volunteer squad, had no time for such thoughts. Her phone had just lit up with a page, five cities down in the East Coast U.S. region. Within 10 minutes, all traffic was drained out of the region. Incident mitigated. It would take another 90 minutes for one of the hardware management teams to realize that a thoroughly tested upgrade had caused almost 4000 hosts in the region's data center to go down. The process to recover those hosts would take days. Uber's rides, restaurant, and freight business recovered without significant degradation, in a matter of minutes. Until my telling of this story, nobody outside Uber engineering was any the wiser, all because one woman got one single page and decisively took action to mitigate.

I'd like to tell you that this story is rare, that Uber's engineers are perfect, infallible, and that our systems never make mistakes. This is not the case. I'm not here to talk to you about failure, heroism, or even mitigation tooling. I'm here to talk to you about what comes before that. I'm here to talk to you about that page. To understand the observability solution that Uber developed to efficiently identify broad outages across our applications, you first need a little bit of background on our architecture. Behind Uber is an architecture composed of over 4000 microservices, as well as 4 monorepos that still involve microservice-like individual service deployments. The core flows for the eats and rides apps alone can involve several hundred services. The dependency graph between them is quite challenging to navigate.

Multiple Deployment Approaches

At this point in Uber's development, no single person can map out the entire architecture from memory. Adding to this complexity, there is no structure or process around who can deploy what service, or when, besides a strong encouragement that service owners deploy slowly and incrementally across zones, starting with a canary zone, and hopefully preceded by a staging rollout. In addition to deploy-based rollouts, Uber has three different processes for rolling out different features or product configurations. These changes are rolled out not just by engineers, but by operations team members in cities all around the world. They're rolling out these changes on the city, zone, region, and global levels. This system may seem chaotic, but it's central to our "let builders build" methods, which allow us to build fast and cater to the unique demands of cities and local governments. Necessary as it may be, it presents a challenge.

Blackbox: Uber's External Monitoring System

How do we identify problems across this broad, interconnected network of services, especially when the issue might be at one of these very points of interconnection? If we relied on standard business metrics to determine when something was wrong with some facet of one of Uber's applications, how would we make sense of that data when the implementation of each product in each city is so different? Even if we could take these business level metrics, and easily make them into alerts, how would they have any significance given all of these different markets? We needed an external monitoring system, one separate from Uber's complex architecture. One that could simulate the experience that users would have, riding, driving, preparing, or eating through an Uber app, in distinct cities all over the world. We'd still have our standard business metrics and service based metrics to back up our new monitoring system. They'd only have to validate and provide further color to problems we'd be much better suited to identify and tackle. Blackbox, as we named it, runs separately on an entirely independent infrastructure stack, leveraging multiple cloud providers to get us as close as possible to the real user's experience.

Test Accounts

How do we anticipate and simulate the pain of our users without actually causing them pain? The first piece of the puzzle is called test accounts. Test accounts are, with a few exceptions, identical to real production accounts. That's intentional. Certain services need to know that an account isn't real, so we don't count or bill users like production accounts or match them with a production user. With these cases handled, all other services essentially treat all accounts, production or test, exactly the same. Special headers, which we call tenancy headers, have been added to all accounts. They identify each account as production or test, so that services that need to be able to distinguish are able to do so. While almost identical to real production accounts, it is important to note that test accounts are in no way based on real user data. It's all randomly generated. Test accounts by themselves don't actually do anything, they just exist. We needed a system to coordinate the actions of different users as they traverse Uber systems. What we needed was a series of large, extensive integration tests, which could move the test accounts through a real user's experience. Uber's stack is far too complex for someone to be able to effectively write an integration test that covers it all. Besides, there's no real ROI for any one person to take on the task of writing the whole thing themselves, let alone maintaining it, even if they could.
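As an illustration of the tenancy idea (not code from the talk), here is a minimal Go sketch of a service that must distinguish test traffic, such as billing. The header name `X-Tenancy`, its values, and the helpers `HeadersFor`, `IsTestTenancy`, and `ShouldBill` are all assumptions made for the example; the talk does not name the real header or its format.

```go
package main

import (
	"fmt"
	"net/http"
)

// TenancyHeader is a hypothetical header name; the talk does not give the real one.
const TenancyHeader = "X-Tenancy"

// Tenancy values, also assumed for illustration.
const (
	TenancyProduction = "production"
	TenancyTest       = "test"
)

// HeadersFor builds request headers carrying the given tenancy.
// An empty tenancy leaves the header unset.
func HeadersFor(tenancy string) http.Header {
	h := http.Header{}
	if tenancy != "" {
		h.Set(TenancyHeader, tenancy)
	}
	return h
}

// IsTestTenancy reports whether a request is marked as coming from a test account.
// A missing or unknown value is treated as production, so a real user is never
// mistaken for a test account.
func IsTestTenancy(h http.Header) bool {
	return h.Get(TenancyHeader) == TenancyTest
}

// ShouldBill is an example of the small set of services that care about tenancy.
// Most services would never call IsTestTenancy and would treat all accounts alike.
func ShouldBill(h http.Header) bool {
	return !IsTestTenancy(h)
}

func main() {
	for _, tenancy := range []string{TenancyTest, TenancyProduction, ""} {
		fmt.Printf("tenancy=%q bill=%v\n", tenancy, ShouldBill(HeadersFor(tenancy)))
	}
}
```

Defaulting to production when the header is absent is a design choice of this sketch: it errs on the side of treating traffic as real.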

Composable Integration Tests

Bring in the composable testing framework. A given service owner may not know the whole stack, but they're experts on their own services. They know exactly how it fits into the bigger picture. They know what potential states users are in when they interact with the service, and what states they should be leaving in. We leverage this expertise to create a system that works a bit more like Legos. We empowered users to be able to write miniature tests with expected input states for execution that could function as nodes in a series of finite state machines.

Let's say that we own one of the services responsible for the driver's flow, and we want to make sure that drivers properly receive a trip flow offer. We expect the driver to already be online and available before it gets to our test case. We can determine by writing the test what state the user should be in when it finishes our test case. Does he choose to take the offer or not? We leverage the library of test modules available to test our trip offer in the context of a broader scope of the flow, and to even bring in other types of test accounts like rider accounts, to build a much broader test. In this way, service owners can take everything else for granted and still test their code out in the context of as much or as little of the stack as they want.

Let's try writing it ourselves. Here, we have a basic struct for a test case written in Golang. In this case, the DriverGetsRequest struct, which takes a data provider which helps us populate the test case with test accounts and whatever other data the test case may need. We define the run function next, which tells the test framework what to do, when it hits the specific test case in a given flow. We expect our driver first to be online before they could receive an offer, but our team doesn't do that. That's ok. We don't actually need to know how that part works. We can just import the driver go online module from the test case library, and then use it to create a bigger integration test, with multiple test cases. With this framework, writing integration tests becomes simple. It just requires each team to keep their Lego pieces or nodes, up to date. They're incentivized to do so. Otherwise, we don't have proper insight into potential outages in their part of the stack.
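The slide code is not part of this transcript, so the following is only a sketch of what such composable test cases might look like. `DriverGetsRequest`, the data provider, and the run function are named in the talk; every signature, the `State` type, the `TestCase` interface, `DriverGoOnline`, `RunFlow`, and `staticProvider` are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// State is the state a test account is in between test cases.
type State string

const (
	Offline  State = "offline"
	Online   State = "online"
	HasOffer State = "has_offer"
)

// DataProvider supplies test accounts and other data (hypothetical interface).
type DataProvider interface {
	DriverID() string
}

// TestCase is one node in the flow: it declares the state it expects on
// entry and returns the state it leaves the account in.
type TestCase interface {
	Name() string
	Expects() State
	Run(dp DataProvider) (State, error)
}

// DriverGoOnline stands in for a module owned by another team.
type DriverGoOnline struct{}

func (DriverGoOnline) Name() string   { return "DriverGoOnline" }
func (DriverGoOnline) Expects() State { return Offline }
func (DriverGoOnline) Run(dp DataProvider) (State, error) {
	if dp.DriverID() == "" {
		return Offline, errors.New("no driver test account available")
	}
	return Online, nil // a real module would call the go-online endpoint here
}

// DriverGetsRequest is our own test case: an online driver receives an offer.
type DriverGetsRequest struct{}

func (DriverGetsRequest) Name() string   { return "DriverGetsRequest" }
func (DriverGetsRequest) Expects() State { return Online }
func (DriverGetsRequest) Run(dp DataProvider) (State, error) {
	return HasOffer, nil // a real module would poll for the trip offer here
}

// RunFlow chains test cases, checking each one's expected input state.
func RunFlow(start State, dp DataProvider, cases ...TestCase) (State, error) {
	state := start
	for _, tc := range cases {
		if state != tc.Expects() {
			return state, fmt.Errorf("%s: expected state %q, got %q", tc.Name(), tc.Expects(), state)
		}
		next, err := tc.Run(dp)
		if err != nil {
			return state, fmt.Errorf("%s: %w", tc.Name(), err)
		}
		state = next
	}
	return state, nil
}

// staticProvider is a trivial DataProvider for the example.
type staticProvider struct{ id string }

func (p staticProvider) DriverID() string { return p.id }

func main() {
	dp := staticProvider{id: "test-driver-1"}
	final, err := RunFlow(Offline, dp, DriverGoOnline{}, DriverGetsRequest{})
	fmt.Println(final, err)
}
```

In this sketch, running `DriverGetsRequest` without `DriverGoOnline` in front of it fails the precondition check, which is the property that lets each team compose other teams' modules without knowing how they work.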

Composable Integration Tests Use Cases - Write Once, Run Everywhere

To make our testing framework just that little bit more awesome, we wrote it so that it would work everywhere, not just on our external monitoring system, blackbox. We can also run these tests, pre-commit, or pre-deploy via Jenkins. It's also how we simulate peak load. We estimate the number of users on top of current production traffic that we want to run a load test for. We spin up the additive delta of test accounts with whatever configurations we want, and we have them all execute the same integration test. This is how Uber learned to anticipate impending high load days, like New Year's Eve.

Back to Blackbox

We've got all these tests that we can run everywhere with all of their test accounts, and we have the ability to simulate business features. It's all built into blackbox. We configure blackbox to run these integration tests based on all of the unique products, features, and configurations for each city. That's a lot of information about a lot of cities. A key feature of blackbox that solves this is the ability to automatically provide a high-level assessment of all failures, broken down by different factors associated with the tests that are currently being viewed. It automatically bubbles up the most critical failure information for mitigation to the top. Let's see what this actually looks like.

This is a slightly scrubbed version of blackbox. Here, you can see the failure domains, which refers to the specific test that we are currently looking at and gathering all of the information around. In this case, you're looking at Uber's two default tests, trip flow, and eats. These are referring to the impact tests, which encompass the entirety of our core flows for our two major businesses. This is all of the critical features, as well as some additional ones, all in one test. Even as we're looking at these two all-encompassing tests, it might also be helpful to know what other more specialized tests from different teams are failing at this current time. They can provide color and insight into whatever outage we are currently trying to solve.

The next valuable bit of information is this timeline. This is the timeline of failing tests over a given period of time, in this case, the last 15 minutes. The way that this graph manifests can give us a lot of valuable information about what type of system might have rolled out the problematic code or change. For instance, something that was deployed via a configuration flag, which would essentially be an on-off switch, might manifest as a sudden massive spike in failures. Whereas a problematic change rolled out via a deployment system, because deployments are slowly and incrementally rolled out across hosts in a given zone and then region, will manifest more gradually on this graph here.
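The spike-versus-ramp reading described here is something an engineer does by eye. Purely as an illustration, it could be approximated with a heuristic like the one below. The function `GuessRollout`, the `stepFraction` threshold, and the labels are all invented for this sketch; nothing in the talk suggests blackbox classifies timelines this way.

```go
package main

import "fmt"

// RolloutGuess is a rough label for the shape of a failure timeline.
type RolloutGuess string

const (
	LikelyFlag   RolloutGuess = "config flag (step change)"
	LikelyDeploy RolloutGuess = "deployment (gradual ramp)"
	NoSignal     RolloutGuess = "no failures"
)

// GuessRollout looks at failing-test counts per time bucket (oldest first).
// If a single bucket-to-bucket jump accounts for at least stepFraction of the
// peak, the timeline is treated as a step change; otherwise as a ramp.
// The window is assumed to start from a healthy baseline of zero failures,
// so a window that opens mid-outage will look like a step.
func GuessRollout(failures []int, stepFraction float64) RolloutGuess {
	peak, maxJump, prev := 0, 0, 0
	for _, f := range failures {
		if f > peak {
			peak = f
		}
		if jump := f - prev; jump > maxJump {
			maxJump = jump
		}
		prev = f
	}
	if peak == 0 {
		return NoSignal
	}
	if float64(maxJump) >= stepFraction*float64(peak) {
		return LikelyFlag
	}
	return LikelyDeploy
}

func main() {
	spike := []int{0, 0, 0, 40, 42, 41}
	ramp := []int{0, 5, 10, 15, 20, 25, 30, 35, 40}
	fmt.Println(GuessRollout(spike, 0.5))
	fmt.Println(GuessRollout(ramp, 0.5))
}
```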

Next, you see the failure domain. These are the different attributes that we mentioned earlier, which have been ranked based on the criticality of the particular attribute. In this case, you'll notice that zone has been bubbled up higher than the failure cause. That's intentional, because at Uber we mitigate using zone and region drains for a single service or an entire zone or region. That's actually more important for mitigating an outage than the failure cause, which would refer to the endpoint and the status code, which is more important for actual root cause analysis down the line.
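To make the ranking concrete, the sketch below groups failing runs by attribute and orders the attributes so that the ones useful for mitigation come first. The attribute names, the `priority` table, and the `Summarize` function are assumptions for illustration; the talk only states that zone is ranked above failure cause.

```go
package main

import (
	"fmt"
	"sort"
)

// Failure maps attribute name to value for one failed test run,
// e.g. "zone" -> "zone-a". Attribute names are hypothetical.
type Failure map[string]string

// Breakdown is the dominant value of one attribute across failures.
type Breakdown struct {
	Attribute string
	TopValue  string
	Share     float64 // fraction of all failures carrying TopValue
}

// priority orders attributes by how directly they drive mitigation.
// Lower sorts first. Only "zone before failure cause" comes from the talk.
var priority = map[string]int{"zone": 0, "region": 1, "city": 2, "failure_cause": 3}

func rank(attr string) int {
	if p, ok := priority[attr]; ok {
		return p
	}
	return len(priority) // unknown attributes sort last
}

// Summarize returns one Breakdown per attribute, most critical first.
func Summarize(failures []Failure) []Breakdown {
	counts := map[string]map[string]int{}
	for _, f := range failures {
		for attr, val := range f {
			if counts[attr] == nil {
				counts[attr] = map[string]int{}
			}
			counts[attr][val]++
		}
	}
	out := make([]Breakdown, 0, len(counts))
	for attr, vals := range counts {
		top, n := "", 0
		for v, c := range vals {
			// Break count ties alphabetically so the result is deterministic.
			if c > n || (c == n && v < top) {
				top, n = v, c
			}
		}
		out = append(out, Breakdown{Attribute: attr, TopValue: top, Share: float64(n) / float64(len(failures))})
	}
	sort.Slice(out, func(i, j int) bool {
		ri, rj := rank(out[i].Attribute), rank(out[j].Attribute)
		if ri != rj {
			return ri < rj
		}
		return out[i].Attribute < out[j].Attribute
	})
	return out
}

func main() {
	failures := []Failure{
		{"zone": "zone-a", "failure_cause": "POST /offer 500"},
		{"zone": "zone-a", "failure_cause": "POST /offer 500"},
		{"zone": "zone-a", "failure_cause": "GET /status 503"},
		{"zone": "zone-b", "failure_cause": "POST /offer 500"},
	}
	for _, b := range Summarize(failures) {
		fmt.Printf("%-14s %-18s %.0f%%\n", b.Attribute, b.TopValue, b.Share*100)
	}
}
```

With zone listed first and one zone carrying most failures, the on-call engineer sees the drain candidate before the endpoint and status code, which matter more for later root cause analysis.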

Finally, we see a list of recent test runs shown as a graph. Each of these dots refers to an individual integration test run on what we call a prober. Probers are essentially Docker containers with an identical base image that exists on hosts on distinct cloud providers in regions all over the world. Whether the dot is red or green, it tells you whether or not that particular test run succeeded or failed. By clicking on any of these individual dots, blackbox takes you to a secondary screen, which gives you a significant amount of data about that particular test run, including the full integration test list of endpoints, as well as the error trace and further information.

Blackbox Probers

Let's dig a bit more into those probers. Blackbox was intended to simulate the experience of real users, as closely as possible. That means test accounts on blackbox, need to reflect the diversity of different operating systems running different versions on different carrier networks, just like real users. We refer to the Docker containers where these individual tests are executed, as probers. They run on hosts in different cloud providers in regions all over the world. They report the success or failure of the individual tests to a separate special set of hosts called aggregators.

Aggregators answer the question of, how do we get actionable signal? We don't want too many false alerts or to wake people up at every hiccup or individual prober failure. Instead, we developed probe aggregators, which are a ring of three to five nodes with a master node, which develop consensus so that we have confidence that there is an actual issue. They aggregate the results from the probers, and the master node determines when to raise the alert about a given failure. The addition of blackbox has allowed us best-in-class mitigation times. It has changed our approach to on-call management, from one based on solely identifying root cause, and fixing or undoing whatever the issue was, to one where we focus first on mitigating the outage. We then have the privilege to be able to dig into whatever the underlying root cause is, without any user impact. It empowered engineers, like the on-call engineer, Irina, in our story earlier, to mitigate a hardware outage in a matter of minutes instead of hours.
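The talk does not describe the consensus protocol the aggregators use. As a deliberately simplified stand-in, the sketch below has each aggregator node form a local opinion from the prober results it received, and raises an alert only when a strict majority of nodes agree. `ProbeResult`, `NodeSeesOutage`, `ShouldAlert`, and the failure-rate threshold are all hypothetical.

```go
package main

import "fmt"

// ProbeResult is one integration test run reported by a prober.
type ProbeResult struct {
	Prober string
	OK     bool
}

// NodeSeesOutage is a single aggregator node's local view: it votes "outage"
// when the share of failed runs it received reaches minFailureRate.
// A node with no results abstains by voting false.
func NodeSeesOutage(results []ProbeResult, minFailureRate float64) bool {
	if len(results) == 0 {
		return false
	}
	failed := 0
	for _, r := range results {
		if !r.OK {
			failed++
		}
	}
	return float64(failed)/float64(len(results)) >= minFailureRate
}

// ShouldAlert is what the leader node would evaluate: alert only when a
// strict majority of aggregator nodes voted "outage".
func ShouldAlert(votes []bool) bool {
	yes := 0
	for _, v := range votes {
		if v {
			yes++
		}
	}
	return yes*2 > len(votes)
}

func main() {
	// Three aggregator nodes, each with its own batch of prober reports.
	nodeA := []ProbeResult{{"p1", false}, {"p2", false}, {"p3", true}}
	nodeB := []ProbeResult{{"p4", false}, {"p5", false}, {"p6", false}}
	nodeC := []ProbeResult{{"p7", true}, {"p8", true}, {"p9", false}}

	votes := []bool{
		NodeSeesOutage(nodeA, 0.5),
		NodeSeesOutage(nodeB, 0.5),
		NodeSeesOutage(nodeC, 0.5),
	}
	fmt.Println("votes:", votes, "alert:", ShouldAlert(votes))
}
```

Requiring agreement across nodes means a single flaky prober, or a single aggregator with a skewed sample, does not page anyone on its own.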

Jaeger

What do you do when the problem isn't isolated in such a way that a mitigation strategy like a zone or region drain is possible? What do you do when all of your zones or regions are impacted at once? What do you do when there is no viable mitigation strategy? Now that we have significantly reduced our time to mitigation, our next challenge is this time to resolution, which still takes hours if not days. How do we get to failure attribution faster? It turns out that the answer was waiting for us, by pairing machine learning with one of Uber's most famous open source tools, Jaeger tracing. Jaeger is a distributed system that leverages distributed context propagation and transaction monitoring, to provide observability into microservice based systems.

Failure Attribution via Machine Learning with Jaeger

At Uber, we've already got Jaeger tracing across most of our critical services. By creating a new root span at the start of every integration test, by default, as part of the test framework itself, and ensuring that all services on our core flow continue and propagate with further spans, we have been able to complete tracing across our integration tests, starting from this root span. This creates several-thousand-span-long traces, encompassing an integration test's traversal through the entire synchronous flow for the eats or rides product. With tracing turned on for all of blackbox's impact tests, which are those two core tests for eats and rides, we have been able to leverage the frequency of tests run on blackbox to compile significant amounts of data on the success and failure paths for a given test in a given city. By feeding this data into a machine learning model, we are gaining the ability, with increasing accuracy, to predict which specific service is most likely responsible for an outage, given that blackbox already tells us which endpoint fails. With this system, we're closer than ever to being able to accurately predict the root cause of an outage right from blackbox.

KAIJU

Let me show you what it looks like. This is a simplified version of our Uber failure attribution system, which we call KAIJU. In this case, we have a simplified version of an integration test, which only has four services and, let's say, two or three test cases. KAIJU shows you a map of all of the services for a given endpoint that would have been hit in a success case. If the test run failed, services that did not get hit in that particular run are grayed out, while the service that KAIJU thinks is most likely responsible for the failure at that endpoint is bolded and marked with a red line. In this case, our first endpoint succeeded. You see a green dot next to the endpoint name, but our second one failed. If we click on that endpoint with a red dot, you'll see a map that looks like this.

One might assume, based on the fact that the failing endpoint is driver go online, that the issue would be with the driver service or the underlying database. Without KAIJU, that would be a very valid assumption to make. As KAIJU shows, that is very wrong. The database never even got touched. What was actually responsible was a very unsuspected service, city safety service. By helping us to efficiently identify the service at the root cause, KAIJU has saved us a significant amount of time that it would take first talking to the driver service team and then to the storage team, before one would even think of checking other parts of the stack. Just in case you prefer the classic Jaeger view, which has a more span-based focus, we have a button which will take you to the old-school Jaeger view directly at the top right of the screen.
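KAIJU's model is machine-learned and its details are not given in the talk. The sketch below is a much simpler, non-ML baseline that reproduces only the reasoning of this example: compare the services reached in a failed run against the path seen in successful runs, and point at the last service reached before the path breaks. The `Suspect` function and the `edge-gateway` and `driver-db` service names are assumptions for illustration.

```go
package main

import "fmt"

// Suspect compares a failed run against the service path observed in
// successful runs (in call order). It returns the last service reached
// before the first missing one, plus every expected service that was
// never reached (the ones a UI would gray out).
//
// If nothing was skipped, the last service on the path is returned, which
// is only a starting point and not real evidence of fault.
func Suspect(expectedPath []string, hit map[string]bool) (string, []string) {
	suspect := ""
	var skipped []string
	for _, svc := range expectedPath {
		if !hit[svc] {
			skipped = append(skipped, svc)
			continue
		}
		if len(skipped) == 0 {
			suspect = svc
		}
	}
	return suspect, skipped
}

func main() {
	// Success path for the "driver go online" endpoint in this toy example.
	expected := []string{"edge-gateway", "driver-service", "city-safety-service", "driver-db"}

	// Services that produced spans in the failed run: the database was never touched.
	hit := map[string]bool{
		"edge-gateway":        true,
		"driver-service":      true,
		"city-safety-service": true,
	}

	suspect, skipped := Suspect(expected, hit)
	fmt.Println("suspect:", suspect)
	fmt.Println("never reached:", skipped)
}
```

A linear path is a strong simplification: real traces form a tree with fan-out, which is part of why a learned model over many success and failure traces is more useful than a rule like this one.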

Recap of Introduced Tools

We've developed test accounts to closely simulate real users. We've created a composable integration testing framework to simulate real users' experience on the ground, and solve the specialization problem that comes from a stack too large for any one engineer to fully comprehend. We've created an external testing and monitoring system to run these tests with these accounts, configured for all the unique city and product specifications one might want to cover. We've created a failure attribution tool built on Jaeger that empowers us to narrow down complex outages to a single potential root cause. Blackbox has enabled teams to develop quickly, but safely. It encourages us to always keep the customer foremost in our consciousness. With blackbox, we are one step closer to the dream of automating the mitigation of outages and providing reliable predictive failure attribution at Uber. With this system in blackbox, we're closer than ever to achieving this goal.

If you'd like to learn more about our stack here at Uber, please check out our engineering blog.

Recorded at:

May 16, 2021

Carissa Blossom
