
Prod Lessons - Deployment Validation and Graceful Degradation


Summary

Anika Mukherji discusses lessons learned in production at Pinterest: a deployment validation framework and product-informed graceful degradation, which together have prevented hundreds of outages.

Bio

Anika Mukherji is a senior SRE at Pinterest's HQ in San Francisco. She is embedded in several teams, including the API platform team, the web platform team, the traffic team and the continuous delivery team. She focuses on making the core "Pinner" experience reliable and measurable, with a special emphasis on safe production changes.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Mukherji: My name is Anika Mukherji. I'm going to be presenting on lessons from production for SRE. I'm going to be talking about two particular projects or learnings from my time at Pinterest that dramatically reduced the number of incidents that we saw. The strategies and approaches that I'm going to talk about helped us prevent incidents in a very systematic way. I think other people will find it a useful way to think about solving problems.

I've been at Pinterest for almost three years now, and within SRE at Pinterest for almost two years. Within SRE, I am primarily focused on the Pinner experience. I'm the SRE embedded in our Core Pinner API team, our Core Pinner web team, and our traffic infrastructure team. When I think about incidents and solving problems, I'm really focused on the user experience and how the Pinners interact with our product.

Nature of Incidents

First, I want to talk a little bit about the nature of incidents in general. Why do we have incidents? The reason we have incidents is that changes are introduced into the system by people, and people make mistakes. That means that as the number of people at our company grows, the number of human errors is going to grow as well, meaning we're going to have more incidents. That is a natural consequence of growth, of product development, of the complexity of our systems. You're going to see an increase in incidents as the company grows. It's not going to be quite as linear as the dotted line here; it's probably going to be more of a curve, but it's going to trend upwards. When thinking about tracking incidents, we really want to be thinking about how we reduce the number of incidents per person, or per team, or however you want to set that denominator. How do we flatten this curve?

Key Insights

The two key insights correspond directly to the two initiatives that I'll talk about today. The first is making changes safely, in other words, safe deployment practices. How do we take the changes that engineers have made locally and safely deploy them to our end users in a way that helps us catch errors proactively and reduces toil for the engineers themselves? On the other side, we have make the right thing the easy thing. What I mean by this is that developers generally want to develop safely. Engineers want to do the right thing; we don't want to develop in an unsafe or unreliable way. As SREs, one big lever we have is to make the default experience for developers the safe option. How do we help them do the right thing by making it the easiest thing?

Making Deployments Safe

Onto the first section here, making deployments safe. At Pinterest, we recently adopted, and have been working on, something called ACA, or Automated Canary Analysis. The basic concept of canary analysis is that you have a canary fleet of a certain size that takes a small percentage of production traffic, and you have a control fleet that's serving the current production build, which is a verified healthy build. These two fleets receive an equal amount and an equal distribution of traffic. They're identical in all respects except for the build that they're serving. The canary is like the experimental treatment, and the control is the control. Because these two fleets have exactly the same traffic patterns and the exact same configuration, we are able to compare metrics in an apples-to-apples way between the two fleets. How this works is that you build a canary-control comparison into your deployment process. Before you roll something out to production, you run the canary-control comparison as a required step, and it has to pass that validation before you deploy to production. There are quite a few blog posts from Netflix about this, actually.
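To make the comparison concrete, here is a minimal sketch of the kind of apples-to-apples check a canary analysis might run. The fetch_metric helper, the metric name, and the 1% threshold are assumptions for illustration, not Pinterest's actual implementation.

```python
# Minimal canary-vs-control comparison sketch (illustrative only).
from statistics import mean

def fetch_metric(fleet: str, metric: str) -> list[float]:
    """Placeholder: return the last N minutes of per-minute values for a fleet."""
    raise NotImplementedError

def canary_regressed(metric: str, max_relative_drop: float = 0.01) -> bool:
    """True if the canary's metric is worse than control's by more than the margin."""
    canary = mean(fetch_metric("canary", metric))
    control = mean(fetch_metric("control", metric))
    # For a success-rate style metric, lower is worse.
    return canary < control * (1 - max_relative_drop)

# Example: block promotion if success rate regresses by more than 1%.
# if canary_regressed("success_rate"):
#     fail_validation()
```

Because canary and control see the same traffic, a simple relative comparison like this is much less noisy than comparing the canary against the whole production fleet.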

ACA at Pinterest

At Pinterest, we employ ACA by using Spinnaker, which is also a Netflix maintained open source tool. Spinnaker serves to orchestrate our deploys. It essentially allows us to create a deployment DAG, where we can deploy to certain stages in a certain order and run jobs or validation in between those deployments. Separately, we have an observability framework that allows anyone to set up dashboards and graphs to help them monitor the health of their system. This is pretty standard at a lot of companies. Maybe it's Datadog, maybe it's some other third party tool. We just have something in-house that contains all of our metrics, and allows engineers to set up graphs and set alerts on those graphs.

When we were thinking about canary analysis, we didn't want to have to build a new metrics framework. We wanted to be able to use the entire suite of metrics that we already have at our disposal. The way we went about this was setting up a configuration system where engineers could set up a config, as code, in YAML, with a list of dashboards. Those dashboards have sets of graphs that allow engineers to measure different things like success rate and CPU. The YAML also has other configuration, like duration and a fast fail threshold, different settings for the validation that we want to run. Committing this YAML essentially exposes an endpoint. This endpoint allows us to query whether any of the graphs on the dashboards specified in the YAML are alerting, that is, have hit some critical threshold, or in other words, are failing validation.

What we did in Spinnaker is have one stage in the pipeline that actually queries this API and polls it for the duration that has been specified. If at any point this API returns that any of the graphs are in critical mode, that a threshold has been hit, then we fail validation. If we fail validation, we page somebody, and then we enter a manual judgment stage. When we're in manual judgment, it means that our pipeline is paused and somebody has been paged and is actively looking into the problem. They can either decide that, yes, this is a true positive and a real problem, in which case, in the manual judgment stage, they will select some defensive action, something like rolling back or pausing the pipeline completely so we can look into it more. Or they'll say, this is a false positive, a noisy alert, we need to fix this asynchronously, and then they'll promote the build to production.
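As a rough illustration of that polling stage, the sketch below polls an alerting endpoint for the configured duration and fails fast if any configured graph goes critical. The endpoint URL, response shape, and function names are assumptions, not the actual internal API.

```python
# Illustrative polling loop for canary validation (assumed API shape).
import time
import requests

VALIDATION_ENDPOINT = "https://metrics.example.internal/api/canary-validation"

def run_canary_validation(pipeline_id: str, duration_s: int, poll_interval_s: int = 60) -> bool:
    """Return True if no configured graph alerts during the validation window."""
    deadline = time.time() + duration_s
    while time.time() < deadline:
        resp = requests.get(VALIDATION_ENDPOINT, params={"pipeline": pipeline_id})
        resp.raise_for_status()
        if resp.json().get("critical_graphs"):
            # Fail fast: the pipeline pages an engineer and enters manual judgment.
            return False
        time.sleep(poll_interval_s)
    return True
```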

As a result of this, we've been able to go from a system where we deploy from staging directly to production with a question mark, we have no idea whether that build is healthy, and then have to roll back all of production, causing bad user experiences, to a system where staging deploys to canary, we run metric validation, and if something looks wrong we take some defensive action, go back to staging, and make fixes before rolling out to production. That way, we have a lot more confidence that the build we're deploying to production is actually healthy. The other thing this helps us with is a much quicker time to recovery, MTTR, because production is very big. For example, our Core Pinner fleet is thousands of hosts; canary is only around 20 to 30. Rolling back canary, or deploying fixes to canary, is much quicker than rolling back all of production.

With a full production rollback, we also risk that the fix we've put in place might not even be the right fix. The other big benefit is that after we've detected a problem in canary, taken the defensive action, and put out a fix, whether that's a revert or a fix forward, we can use the canary metrics again to validate that the problem has been fixed as we'd expect. This not only improves the user experience dramatically, but also reduces the amount of engineering time spent mitigating production incidents and rolling back big fleets, which can take quite a while. At Pinterest, this system has saved us tons of hours of incident time. It has also reduced the number of production incidents we've seen by over 30 per half-year, at least the last time I checked. We've seen a really big return on investment from this initiative, from investing in deployment practices.

Systemic Graceful Degradation

Next, I'm going to talk a little bit about making the safe option the easiest option. I'm going to talk about a very specific problem that I was presented with when I first became an SRE for our Core Pinner API. Our Core Pinner API is a REST API. Essentially, the clients make a request to the API and the API returns some set of models: pin models, board models, user models. By default, those models have a very basic set of fields, a basic set of metadata, on them. The clients obviously are product rich, and they sometimes need extra data from the API, maybe some shopping metadata, or maybe some extra links, things like that. The way we do this is the client can specify in the query string the fields it wants, comma separated. Those fields are then hydrated through batch fetching functions that we call field dependency functions. They're very similar to Data Loader functions, if you're familiar with GraphQL. These Data Loader functions were essentially a huge fanout to all sorts of different services in our infrastructure, which all had varying availability guarantees, and that was the crux of the issue. Any outage, even a small one, on any of these upstream systems would propagate errors up to the API. We had no graceful degradation there. We had no error handling. We would return errors for basically all of Pinterest. We had tons of outages because of this specific piece of infrastructure.

Over time, we saw this pattern in the incidents that we were experiencing. We sat down and we thought about, how can we think about this in a smarter way? Because not all of the data that we were returning to the client was critical, in that the client might request some shopping metadata or comment information about a pin. That's not actually extremely critical to the Pinner experience, in that we could gracefully degrade that information if it's not available, and still give the user a reasonable Pinterest experience.

The way we thought about this was looking at systemic graceful degradation. One of the big realizations here was that most engineers copy and paste their code. If you have a common code structure, a lot of the time engineers, myself included, are just going to copy something that we know works, change the function signature, and then change the logic inside until it does what we want it to do. This was exactly what was happening here: there were tons of functions that followed the same exact pattern, but were all written in a very unsafe way. The approach we took was creating a standard decorator, written in Python, that did the error handling for us. We wrote this field dependency decorator, which you can see in this screenshot, that took in the data type that was the expected return type. Within this decorator, we did all sorts of error handling. If we saw any exception raised from the wrapped function, we would gracefully degrade it and return an empty data structure based on the provided parameter. We also made it possible to opt out of this: there's another argument you can pass to the field dependency decorator that lets you opt out. The fact that the function signature is so concise makes it really easy for people to copy and paste it.
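The screenshot isn't reproduced here, but the shape of the pattern is roughly as follows. This is a hedged sketch based on the description above: the decorator name, the opt-out flag, and the attribute used later for build-time checking are assumptions, not Pinterest's actual code.

```python
# Sketch of a field dependency decorator that degrades gracefully (illustrative).
import functools
import logging

log = logging.getLogger(__name__)

def field_dependency(data_type, gracefully_degrade=True):
    """Wrap a batch-fetching (Data Loader-style) function so upstream
    failures degrade to an empty value instead of failing the request."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                if not gracefully_degrade:
                    raise  # critical data: let the error propagate
                log.warning("degraded field dependency %s", func.__name__)
                return data_type()  # e.g. {} for dict, [] for list
        # Mark the function so a build-time check can verify coverage.
        wrapper.is_field_dependency = True
        return wrapper
    return decorator

@field_dependency(dict)
def get_shopping_metadata(pin_ids):
    # Fans out to an upstream service; degrades to {} if it is unavailable.
    ...
```

Because the decorated function's signature stays a one-liner, it copies and pastes cleanly, which is exactly what makes the safe version the default.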

Of course, we needed to backfill everything, so we went through all of the field dependencies that already existed, upwards of 100 of them. I worked with all the client engineers and figured out: what is critical data, what's necessary to the Pinterest product, and what is auxiliary, what can we gracefully degrade? We found that most things we could really gracefully degrade. There were just a few things that either broke the Pinterest product in a really bad way, like images, for example, or where the client would actually crash because of missing data. Client crashes are the worst user experience we can provide, so we try to avoid those at all costs.

After doing this exercise with all the client teams, we marked everything with this field dependency decorator. After that, we had no problems: basically, everybody will copy and paste something that already exists, change the arguments, and get all the error handling for free. As a bonus, we also instrumented a set of metrics within this decorator that gave us insight into things we had no idea about before: latency, success rate, QPS, which endpoints are actually using this data. That gives us tons of useful information about how these functions are actually being used, and allows us to monitor them in a much more sustainable way. We also took inspiration from this idea and added ownership to all these functions, because they all fetch some data. That let us understand which teams at Pinterest own this data and the corresponding parts of the product, so we can mark it accordingly, tag our metrics, and escalate to the correct team.

As a result of all this, we were able to prevent tons of incidents. We have metrics on all the errors that are gracefully degraded and all the errors that are actually raised, and we have prevented hundreds of incidents and thousands of days of user downtime as a result of this effort. Another thing I should mention is that we also added some checking to really enforce that engineers use this decorator. We made sure that every Data Loader function that was getting mapped to a field was wrapped in the field dependency decorator. We did that by setting an attribute on the function object itself, so that we could do some checking, and if someone did not wrap their function in the field dependency decorator, it would actually fail at build time. Any time you can take a common mistake that people make and check for it, so they can't make that mistake at the time they're actually writing the code and submitting the PR, that's where you gain tons of leverage over preventing incidents.
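As a rough sketch of what such a build-time check could look like, assuming the attribute set in the earlier decorator sketch and a hypothetical registry mapping field names to their loader functions:

```python
# Illustrative CI/build-time check: every field loader must be decorated.
def check_field_dependencies(field_to_loader: dict) -> None:
    unwrapped = [
        name for name, func in field_to_loader.items()
        if not getattr(func, "is_field_dependency", False)
    ]
    if unwrapped:
        # Failing the build surfaces the mistake at PR time, not in production.
        raise AssertionError(
            f"These field loaders must be wrapped in @field_dependency: {unwrapped}"
        )
```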

Key Takeaways

Key takeaways from these two initiatives. These two projects alone have prevented hundreds of incidents a year for Pinterest. In particular, investing in safe deployment practices really pays off. It might not seem like it at first, because writing good canary analysis, good metric analysis, is hard. Especially when it comes to low-QPS metrics and things like that, it can be very difficult to write the correct alert. There will be toil, there will be iteration in terms of noisy alerts, false positives, getting the framework set up correctly. Over time, the investment will 100% pay off. The way we identified these two projects was by looking at trends in incidents. We really looked at the incidents that Pinterest was seeing and saw that a lot of them were induced by deployments, and a lot of them were related to the Data Loader functions. For SREs, a big recommendation that I have is to really take the time to dig into the incident data and categorize it by root cause, by responsible component, by team, by all of the axes, really. You want to get cuts of the data that help you identify where the biggest return on investment is. You may go to an incident post-mortem for an incident that maybe only happens once a year, come away with all of these remediation items, and it really may not be worth the investment. If you look over time, you might find repeated problems that happen every couple of months, and that's where you can really get your time back in terms of the work that you're doing.

Then, lastly, I've said this a couple of times, but make the paved path the easiest path. Whether this is code in your framework itself, abstractions the framework provides to help people write their code, runbooks, or scripts that help them set up capacity and configurations: any time you can remove manual labor, it pays off, because any time things are being done manually, mistakes will be made. Any time you can script something, automate it, or build it into the framework and make it part of the paved, obvious path, that's where you're going to get the return on investment for the project, prevent incidents, and save engineering time. That's where you'll have the most success.

What Canary and Control Serve

Canary and control are serving legitimate production requests. It's basically just a separate piece of capacity, a separate ASG. We keep it static, so it doesn't autoscale like our production fleet does, and we keep it large enough that we get a strong enough signal. It's part of the regular production server set, so requests are routed there with some probability, depending on how big canary and control are. It serves read and write requests. The reason we allow it to serve read and write requests is that the build has already gone through preliminary testing, extensive unit tests and integration tests, and through the code review process, so we feel confident that it can serve production requests at that point. The reason we don't do shadow traffic, which I agree would be awesome, is that we don't have a great way of knowing what is a read request and what is a write request, as was alluded to in the channel. If we had canary or some other pre-production fleet serving shadow traffic, we wouldn't know exactly what the consequences of those double writes would be, because our tech stack is very complicated. We are working on a dev-prod separation effort that would let us serve those write requests safely. That is currently in flight.

Questions and Answers

Sombra: Please elaborate on the difference between the canary and the control?

Mukherji: Canary and control are identical in all respects. They should have the exact same configuration and run in the same container. The only difference is the build that they're serving. The canary has the new, "unverified" build; it's the canary in the coal mine. We're testing the new build, the new code, on this canary. The control has the verified production build. Canary and control are equal sizes, so we can do a pretty good apples-to-apples comparison between them in terms of finding anomalies in success rate, QPS, traffic patterns, anything that we would not expect to change when new code is being deployed. That way, we can detect if things do change and take appropriate action.

Sombra: I don't know anything about canary and control, I just like the analysis and everything, but it does signal to me that you need a certain level of organizational scaffolding to be able to have nice things. Can you tell me a little bit more about the size of the team that maintains this infrastructure? Also, what's your advice for folks that don't have that? My suspicion is that it's large, that it takes human effort and energy to be able to have more guarantees and more assurances. What's your opinion or your intuition about all of that?

Mukherji: Pinterest is a large org. We have over 1000 engineers working full time, and we have a team dedicated to CD, or continuous deploy. That team is just a handful of people at the moment. Each engineering team does take a non-trivial amount of responsibility for their own deployments, in that the CD team provides the scaffolding and the software itself. They maintain Teletraan, which is our internal capacity configuration site; it's actually open source, it's on GitHub. Spinnaker as well, which is maintained by Netflix. The CD team maintains those services, and then the teams are actually responsible for setting up their own configurations and using the tools, updating their metrics analysis, that kind of thing. My advice is, it doesn't need to be so fancy. You can definitely apply similar approaches without having the full canary-control paradigm set up. You can have a canary without a control, and look for anomalies in your canary metrics and compare them against prod, for example, or look for changes over time. Any way you can build in pre-production testing and have a test suite that's taking production traffic helps. The reason why production traffic is really important is that synthetic traffic is never going to mirror the full [inaudible 00:26:19]. Our integration tests definitely don't catch everything. We found that using production traffic is really the only way of verifying.

Sombra: Can we do a short recap of the base tools that you use in your deployment process. Zoom in on the ones that you would want the audience to walk away with, like you need one of this, one of this? Or, this is what we use?

Mukherji: I think you need capacity configuration software, like an internal site, some way of configuring ASGs. It depends on what cloud provider you're using; we use AWS, but you could do it directly through your cloud provider's console. You need some way of configuring ASGs of different sizes, basically. Then you need an orchestrator. I assume most people have some way of deploying, because otherwise, what are we doing? You need some way of orchestrating the deployments to the different ASGs. That can be something more fully fledged like Spinnaker, or it could just be a simple graph execution engine. Then you need some metrics analysis tools. After setting all this up, we've really worked to automate as much as possible, so that ACA fails automatically, we pause automatically, and so on. That doesn't need to happen right away to gain the benefit. You can do it manually as the first iteration, and then work to improve it from there.

Sombra: The next question is about detecting drift. Can you detect, or is there a process to detect, functions or teams that have opted out of the decorator?

Mukherji: We've actually set up a check at CI time such that we don't let you land your code if you're not using the decorator. That has probably single-handedly made the most difference. We did an ownership effort for our API. When we were adding ownership, at some point we wanted to make sure that all new endpoints had owners, so we wrote a script that did some parsing and basically applied that when you submit your PR. We don't let you land your PR unless you have all the things that the framework requires.

Sombra: How does your approach play with feature flags? Is it complementary?

Mukherji: At Pinterest, we call feature flags deciders. A decider can take a value between 0 and 100, which is read at runtime and basically gates that percentage of people into the feature. This is used in conjunction with the decorator, in that people will have feature flags within the function that the decorator is decorating to turn the feature on. The decorator is mostly just providing safety measures, error handling, metrics, and that kind of thing. It is possible that your decorator could have a feature flag built into it; we let teams decide what feature flag gating they want. Besides deciders, we also run experiments, which are a more fully fledged comparison of an experiment group and a control group. That's a more statistically significant comparison of with-feature versus without-feature, as opposed to deciders, which are just a quick way of turning something on.
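To illustrate how a percentage-based decider composes with the graceful degradation decorator, here is a small sketch. The get_decider_value lookup and the decider name are hypothetical stand-ins for the internal runtime configuration.

```python
# Illustrative decider-style flag (0-100) gating a percentage of requests.
import random

def get_decider_value(name: str) -> int:
    """Placeholder: return the decider's current value (0-100) from runtime config."""
    raise NotImplementedError

def decider_enabled(name: str) -> bool:
    # Gate roughly that percentage of requests into the feature.
    return random.uniform(0, 100) < get_decider_value(name)

def get_comment_metadata(pin_ids):
    # The flag decides whether to fetch the new data; the field dependency
    # decorator (not shown here) decides what happens if the fetch fails.
    if not decider_enabled("comment_metadata_v2"):
        return {}
    ...  # fetch from the comments service
```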

Sombra: How far out do you go when it comes to the concept of a canary? It's simple when you're dealing with your own application, but in complex deployments, my application can bring down a different system. How far out do you go in this evaluation, where it's just like we have the coal mine as well as the bird?

Mukherji: That is really tricky. The way we've gone about it is using our error budget framework in combination with a tiering system. We tier our services by, essentially, whether they're allowed to take down the site. Tier-0 and tier-1 services are allowed to take down the site, and tier-2 and tier-3 are not, more or less. Based on that, we can intelligently and gracefully degrade services at different levels, and build that in. If we see a bug and we know it's in a tier-3 service, we can have a very quick remediation where we catch the error, degrade, and save the experience. I care all about the Pinner experience, so any way that I can gracefully degrade the product and save Pinners from an error page, I'm going to go that way.

Sombra: The tiers are expressed based on the customer impact, or the customer's ability to notice?

Mukherji: That is one axis of what can constitute the categories of the tiers. Other things we keep in mind include auto healing behavior; tier-0 services are expected to be able to auto heal. What we were talking about, the impact on the product, really only applies to online serving, the Pinner-facing stuff. We have a whole offline data stack to which that doesn't apply, so those services have their own set of criteria for what constitutes tier-0 through tier-3.

Sombra: What about a feature that needs to be synchronized between more than one service, can you toggle those and how?

Mukherji: Yes. This goes back to deciders and experiments. Our decider and experiment "configuration map" is actually synced to all hosts at the same time, so that all hosts across Pinterest have, more or less, the same view of decider values and experiment allocations at the same time. If I made a change and said, I want to ramp this decider to 100% and release my feature to all of Pinterest, and I was using that decider in multiple services, then I could do that. I could just switch it to 100 and it would get updated to 100 in all the services. There is some nuance in that it can take longer for the config map to deploy to different services, so there can be periods of inconsistency. If your feature is really susceptible to problems with inconsistency, then I'd probably recommend using a different feature flag architecture, like a new parameter, some field that lets one service tell another service that the feature is on.

Sombra: Folks would like you to expand on what you mean by auto healing.

Mukherji: Auto healing can have a lot of different meanings, but one example is our GSLB and CDN. If a CDN goes down, we will automatically route traffic to our other CDNs. That's built into the algorithms and the configurations that we're using. That would be an example of auto healing behavior.

Sombra: It's like the system itself continues to make progress.

Mukherji: The idea is that the system will try to fix itself or take a remediating action without manual intervention.

Sombra: Do you have any information about how much time you spend in manual verifications versus how much time you save from incidents?

Mukherji: We actually have done a lot of work around this, because when I was working on it, I really wanted to focus on how I could show the value it brought. A lot of the time, as SREs, we bring these processes to teams, and everyone conceptually knows it's a good thing, but we don't have good numbers on it. For a false positive ACA alert, a noisy alert, we'll be able to promote the build within 5 minutes, basically; it's fairly obvious that it's a false positive. If it's a true positive, it can take 45 minutes or so to mitigate, which involves figuring out which commit broke the canary and reverting that diff, or making the remediating change, and then making sure that that build gets out. Forty-five minutes until we can resume our pipelines.

For incidents, if that were to go to production, say a success rate drop on some endpoint made it to production, it would take at least 30 to 45 minutes for us to roll back prod itself. Our API fleet is over 4000 hosts, so it's extremely large and takes a long time to roll back. It can take upwards of an hour for us to return to a healthy state, and at the same time we're going through the process of finding the problem and mitigating it. So we save quite a lot of time. We catch about 30 true problems in canary per half-year, the last time I did the numbers.

 


Recorded at:

Jul 01, 2022
