Transcript
Casey Bleifer: My name is Casey Bleifer. I'm an engineer at Netflix. Today I'm going to talk to you about our journey to automate code changes across the fleet, regardless of the characteristics of the software, and how we are doing that with confidence. I should also mention that we are still on this journey, so we're going to talk about some of our learnings along the way and where we are today. I'll start with a little story. It's release day, you're an engineer. Your team owns a library and you've just released a new version. You're super excited, you want to track the adoption. A month passes, two months pass, it's only at 75% adoption. Actually, you realize it plateaus at 85% adoption. A whole year later, even though many more versions have come out, you still can't get rid of it. It's still at about 15% adoption. Actually, the version before is still there, the version after, all the versions you released that year are still there. Now your team has to maintain a bunch of active versions.
This story was inspired by an internal library that, when I looked, had 73 versions, whether real releases or patches. This leads to a huge long tail of migrations. I'm sure we all have felt the burden, but it can take months, even years sometimes, to fully complete a migration. I'm sure some of you are familiar with the Log4j incident. That was an example where all hands on deck were needed to make it go even faster than that, because it was not good to have the vulnerability out there and we needed it to go much faster than years or months. This was a huge burden. It requires all hands on deck. If you can't deprecate something, you're left to maintain it. There are old versions of software running and there's a huge productivity loss, not only for the teams that need to drive these migrations, but also for the teams that need to do them and keep their software up to date. It really takes away from your day-to-day work, and from delivering new features and new innovation.
What If Migrations Took Days Instead of Months or Years?
We took this problem and we asked ourselves, what if instead of months or years, it took days to complete these migrations? We created a goal. What if we could automate all code changes across the fleet in one week or less, and for any critical vulnerabilities, what if we could do that in two days or less? Next Log4j, no problem. We had a few requirements to go along with these goals. One is minimal effort for all. There's two sides to these migrations that happen. First is the platform teams or the team that wants to drive this migration. Instead of them having to figure out, identify which software has the vulnerability or needs the update, figuring out the steps to do it. Asking teams, can you help us drive this to completion? What if all they had to do was just configure their migration and only review issues if they really needed to?
Then, on the other side of things, what if software owners, instead of doing all these migrations, constantly interrupting their daily work, performing the migrations, asking for help when they're stuck, what if they could just sit back like this person here and do nothing unless they really needed to put their input in or there was some information needed by the team running the platform migration? The second requirement is, this needs to handle the diversity of the fleet. These are just three metrics. There's a lot of different ways you can slice and dice a fleet, but there's a ton of different repo languages. Some of them are supported well, some of them are not. Some of them are tested well, some of them are not. There is different security requirements depending on the information that the software is dealing with.
I know at Netflix and many other places, we have a lot of different business units and sometimes those are handled differently. We have some services that have more quiet periods, some services that are in monorepos, some that are not. All of these different aspects are important to consider. There needs to be a way to automate while still respecting the constraints of each of these segments of the fleet, but do so in a way where everyone can receive the automation. The last requirement, and this is really big, is safety. We don't want our automation to break things. We actually want it to fix things and make it better. We wanted to build in safety checks along the way that would help make sure we can run our automation at scale.
That includes things like validating the changes prior to the rollout. Phasing rollouts by criticality so if we detect something in the lower criticality apps, then we can potentially pause the migration and prevent a higher blast radius. Working with teams across the company to add compliance checks in to make sure we're not automatically doing things when teams don't want us to or we really shouldn't for certain security reasons. Additionally, easy interventions. If you're running a migration and something is detected, there's a big red stop button at any point.
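The idea of phasing a rollout by criticality can be sketched in a few lines. This is only an illustration, not the platform's implementation: the `Target` class, the numeric criticality scale, the `apply_change` callback, and the 10% failure threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    criticality: int  # hypothetical scale: 0 = least critical


def phased_rollout(targets, apply_change, max_failure_rate=0.1):
    """Apply a change phase by phase, least critical first.

    apply_change(target) returns True on success. If the failure rate in a
    phase exceeds max_failure_rate, stop before touching more critical targets.
    """
    phases = {}
    for target in targets:
        phases.setdefault(target.criticality, []).append(target)

    completed = []
    for level in sorted(phases):
        failed = 0
        for target in phases[level]:
            if apply_change(target):
                completed.append(target.name)
            else:
                failed += 1
        if failed / len(phases[level]) > max_failure_rate:
            return {"status": "paused", "paused_at": level, "completed": completed}
    return {"status": "done", "paused_at": None, "completed": completed}
```

With a failure in the lowest tier, the function returns a "paused" result and the higher-criticality targets are never touched, which is the smaller blast radius described above.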
The Solution
We took these requirements, we took this goal, and what do we build? We build a fleet-wide automation platform, which I'm going to talk to you a little bit about today. Hopefully this is the densest slide, but I wanted to say some terms that would be helpful through the rest of this presentation. Platform teams that need to run a migration, they create these things called campaigns. All of the software that needs to undergo the migration are called targets. The targets run along a path, which you can just think of as a set of automated steps. I'll be using these terms a lot. Then the other nuance I want to talk about is the distinction between rollouts and deployments. Specifically in this presentation, a rollout is the orchestration and how the steps are progressed through, and a deployment is the actual delivering of the change to some infrastructure. At the heart of this, of the migration path, are these composable steps.
These steps are predefined units of automation, and they each have their own state. When we started building this platform, we worked with a lot of different teams within Netflix, and we realized that while there are common things that need to happen in the migration, there's also a lot of customization and flexibility that's required. Each migration is unique. Some might need manual input, some might not. Some might need things that we haven't thought of. We wanted to leave the door open for customizability. Much like these Lego steps, we built these composable steps that can be fit together like Legos. Some of them do have prerequisites, so you might find a couple fit together more commonly, but this really allows for the flexibility of path creation. No matter what the migration is, no matter what the team, they can pick and choose what they want. I'm not going to go too deep in this.
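As a rough sketch of the vocabulary so far (campaign, targets, path, composable steps with prerequisites), the model below shows one way such pieces could fit together. The class names, field names, and step names are hypothetical and chosen for illustration; the talk does not describe the real schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    name: str
    requires: tuple = ()  # names of steps that must come earlier in the path


def build_path(steps):
    """Return the ordered step names, rejecting a path with unmet prerequisites."""
    seen = set()
    for step in steps:
        missing = [r for r in step.requires if r not in seen]
        if missing:
            raise ValueError(f"{step.name} requires {missing} earlier in the path")
        seen.add(step.name)
    return [step.name for step in steps]


@dataclass
class Campaign:
    name: str
    path: list  # ordered step names that every target runs through
    targets: list = field(default_factory=list)  # software undergoing the migration


# Hypothetical step catalog; a campaign picks and orders the ones it needs.
CODE_TRANSFORM = Step("code_transform")
CREATE_PR = Step("create_pull_request", requires=("code_transform",))
MERGE = Step("merge", requires=("create_pull_request",))

campaign = Campaign(
    name="example-log-change",
    path=build_path([CODE_TRANSFORM, CREATE_PR, MERGE]),
    targets=["service-a", "service-b"],
)
```

Checking prerequisites when the path is built, rather than when it runs, means a misordered path fails before any target is touched.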
I don't want eyes to glaze over on a system diagram. Basically, what powers this all is our event-driven orchestration. Essentially, it's just a loop between our state machine and our event consumer. We wanted to decouple them because events ultimately can come from anywhere, whether they're coming from inside our systems or one of the systems that we integrate with. We wanted to leave that door open and have them decoupled. When an event comes through, it gets processed by the event listener and ultimately gets sent to the state machine. That is going to process the event on a deeper level. The event, coupled with what type of step it is, like, for example, creating a pull request, it takes that metadata and then figures out which step handler to launch. The step handler is basically just a child workflow that runs the specific piece of automation. Diving one step deeper into the state machine, it has a lot of responsibilities.
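The loop between the event consumer and the state machine might look something like the sketch below. It is a simplification under stated assumptions: the event fields (`target`, `step_type`), the handler registry, and the in-process function call standing in for a child workflow are all invented for the example.

```python
# Hypothetical registry mapping a step type to the handler that automates it.
HANDLERS = {}


def handler(step_type):
    """Decorator that registers a step handler for one step type."""
    def register(fn):
        HANDLERS[step_type] = fn
        return fn
    return register


@handler("create_pull_request")
def create_pull_request(target):
    # Stand-in for a child workflow; a real handler would call the SCM system.
    return f"draft PR opened for {target}"


class StateMachine:
    def __init__(self):
        self.launched = []

    def process(self, event):
        """Pick the step handler from the event metadata and launch it."""
        fn = HANDLERS.get(event["step_type"])
        if fn is None:
            raise KeyError(f"no handler for step type {event['step_type']!r}")
        result = fn(event["target"])
        self.launched.append(result)
        return result


def event_listener(raw_event, state_machine):
    """Accept an event from any source, keep the fields needed, and forward it."""
    event = {"target": raw_event["target"], "step_type": raw_event["step_type"]}
    return state_machine.process(event)
```

Keeping the listener separate from the state machine means a new event source only needs to produce the two fields the listener reads.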
As I said, it processes events on a deeper level. It determines what step comes next, so it helps these targets keep moving along the path. It keeps track of the step state, updates it, launches workflows. Then it handles our edge cases, such as pausing targets if there's an issue or a quiet period, resuming targets. Then there's also the failure category. We have terminal failures and retryable failures. It figures out the best path forward based on the type of failure that's happening. I know I mentioned that flexibility is important, but we also realized there's a big subset of migrations that are similar. Not to be confusing, but for things like one-time deprecations or one-time migrations, the customizability is a huge, important factor. Then we have things like dependency updates or reoccurring updates that tend to follow a similar path. Given the steps that we've provided, we also have our platform offer some provided paths. These are just two of our most common ones.
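The distinction between terminal and retryable failures can be illustrated with a small sketch. The exception classes, the state names, and the three-attempt limit are assumptions for the example, not details given in the talk.

```python
from enum import Enum


class StepState(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class RetryableFailure(Exception):
    """A failure that may go away on its own, such as a timeout."""


class TerminalFailure(Exception):
    """A failure that retrying cannot fix, such as a missing repository."""


def run_step(action, max_attempts=3):
    """Run one step, retrying retryable failures up to max_attempts.

    Returns (final state, number of attempts made).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return StepState.SUCCEEDED, attempt
        except RetryableFailure:
            continue  # try again until attempts run out
        except TerminalFailure:
            return StepState.FAILED, attempt  # stop immediately
    return StepState.FAILED, max_attempts


def next_step(path, current):
    """Return the step after `current` in the path, or None at the end."""
    index = path.index(current)
    return path[index + 1] if index + 1 < len(path) else None
```

A step that times out twice and then succeeds ends in `SUCCEEDED` after three attempts, while a terminal failure stops on the first attempt so a person can be brought in.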
As you can imagine, a code change path with or without validation that basically changes the code for you, creates a PR, merges it, and monitors it through deployment to the clusters. I'm going to go through our most common path for you because it's going to set up to talk about what we're doing today and how we're building confidence. This first step is the code transform step. This is another area where we built with flexibility in mind. Essentially what it's doing is it's launching a container and that container does the code transform. This gives flexibility for the platform teams to either provide us a custom script. We've even had some GenAI prompted containers provided to us by the campaign managers, or our platform also offers some pre-configured codemods that can do common things, for example, like a dep update or maybe updating some delivery configs. Once the code is transformed, we create a draft pull request.
This is another area where we're building in some safety. We don't just automatically create a pull request, activate it, and merge it. First, we need all the PR checks to pass. This is where the second part of the validation comes in. The pull request, while still in draft, will then move on to the validation step. This is where additional checks can be configured. Currently at Netflix, we've partnered with our resilience team, and the validate step can launch all the canaries they offer for the change. In order for the step to progress on, the canary needs to pass. If not, the rollout will stop there. We'll invite the platform team to take a look and see if their change is potentially causing something unwanted into the system. This is one area to prevent issues going forward. Another thing to keep in mind is it doesn't have to just be canaries.
This is extensible for other types of validations. If the team running the change has a custom test they want to run, they can introduce that in the validation stage. This is actually one area we want to expand on in the future. Once all of these validations pass, all the PR checks pass, now it's time we can activate the pull request and merge it. Wait, there's some checks we need to do. I listed them here, but basically this is yet another area where we're building in compliance. We don't want, and a lot of teams don't want, just some service to come in and create code changes, and merge and merge and merge, and then before you know it, they have no idea what's in their code. We worked across the company to identify these compliance checks that help us better figure out when it's appropriate to auto-merge or not.
An example of this is respecting repository permissions. If those permissions are there, they're probably there for a reason. We're not going to override them unless the owners reach out to us and maybe want to have a special rule for our system. Is the pull request in a mergeable state? Does it pass all the checks? Then the last thing I want to highlight, which I'll go into more in a little bit, is the confidence rating. Is the team confident with another system coming in and having all these changes? Let's just say everything is great, the PR auto-merges. This is not a huge step, but it's a final verification before deployment to make sure the build looks good on main. Last but not least is monitor deployment. The change has merged. It's going out to its clusters. I do want to point out, we do not actually do the deployments, but we're integrating with another team, the Spinnaker team.
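The checks mentioned here (repository permissions, mergeable state, passing checks, confidence rating) could be combined along these lines. The field names and the "high"/"low" labels are hypothetical; the actual list of compliance checks is not spelled out in the talk.

```python
from dataclasses import dataclass


@dataclass
class PullRequestStatus:
    automation_may_merge: bool  # repository permissions allow the platform to merge
    mergeable: bool             # no conflicts, not blocked
    checks_passed: bool         # all PR checks are green
    confidence: str             # hypothetical rating: "high" or "low"


def auto_merge_blockers(pr):
    """Return the reasons a PR cannot be auto-merged; empty means it can."""
    blockers = []
    if not pr.automation_may_merge:
        blockers.append("repository permissions")
    if not pr.mergeable:
        blockers.append("not mergeable")
    if not pr.checks_passed:
        blockers.append("checks failing")
    if pr.confidence != "high":
        blockers.append("low confidence")
    return blockers
```

Returning the list of blockers rather than a single boolean makes it possible to tell the owning team why their pull request is waiting for a human.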
We use their events to monitor the deployments. We also hook into their systems to observe quiet periods, so when services are not allowed to be deployed, we can respect those as well. Then, finally, when that's done, the rollout is successful.
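Respecting a quiet period amounts to a time-window check before a target is allowed to move on. The sketch below assumes quiet periods arrive as (start, end) pairs of datetimes; how the delivery system actually exposes them is not described in the talk.

```python
from datetime import datetime


def in_quiet_period(now, quiet_periods):
    """True if `now` falls inside any (start, end) window; end is exclusive."""
    return any(start <= now < end for start, end in quiet_periods)


def rollout_action(now, quiet_periods):
    """Pause the target during a quiet period, otherwise let it continue."""
    return "pause" if in_quiet_period(now, quiet_periods) else "continue"


# Hypothetical quiet period covering a holiday freeze.
freeze = [(datetime(2024, 12, 20), datetime(2025, 1, 2))]
```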
Building Confidence
We have this really cool platform. Everyone's going to come running? No, they won't, because of a lot of different things. One thing is because it's new. Going from 0 to 100 is just not going to work well. As I showed earlier, there's a lot of different software at Netflix with a lot of different characteristics. Some is brand new. Some is old and is not as prepared to be automated. Secondly, there is also an element of people not knowing what the software is and not trusting it. This is when we decided to switch to an approach to build confidence. As I'll discuss, confidence is a two-way street. The first thing is probably more obvious. Are we confident in the automation? Is this platform going to do the thing I want it to do, do it successfully, and do it at scale for everything we can think of? Most importantly, once it does it, is it safe?
Then the second part of this is all of the software owners have services, libraries, jobs. Are they automatable? Can they receive this automation? Are we just running this in the ether and it will never be deployed, or they don't have tests to maybe catch if there's an issue that just affects their service? Also, even beyond the technology, are software owners themselves just feeling comfortable with this? Do they want this to happen and happen at the scale we provide it? We took this, what I'm going to be calling recipient confidence, and in a broader program at Netflix, outside of this fleet-wide delivery platform, we used all of these things to create a confidence metric. This confidence metric is one of the things I talked about that we look at throughout our platform to help ensure the change is safe. If something is high confidence, we can merge the PR automatically.
If it's not, the team and the platform have come to an agreement that human intervention is still needed. We decided, with this all in mind, to take this huge problem and start small. Even if you just look at services alone, there's over 4,000, and that's not including libraries or things that do not have a deployable service at the end. We're going to build confidence on that small segment, both ways, not only in our platform, but build confidence with the teams we're delivering these changes to. Address any obstacles that come up. Then expand and do it all again. We started with our Java ecosystem. This is just to show how expansive that is. We homed in on the JVM ecosystem, specifically services across all of our business units. We designed an exercise. This is our confidence exercise. We took the code change path that I shared with you, and we did a very simple change, which was basically just a log change.
Confidence Chimp was born. If you're familiar with the rest of Netflix's Simian Army, Confidence Chimp is the newest member, and he's helping keep our fleet safe. Basically, the idea behind this is that if we're just introducing a log change, and we run the service through this automated path, anything that goes wrong is one of two things. It's either a deficiency in the platform that needs to be filled, or the service perhaps is not automatable, and there's a theme that we can distill from that, and work with the service owners to improve. What do we do? We ran our first Confidence Chimp exercise. We took 120 Java services, and we let it run. We're like, let's see what happens if we just let the automation run. It took 86 days to complete, and 44% of the targets required a manual intervention, either from our team or from the owning team.
We decided to take a look at the data and say, why is this happening? What can we take away from this? There are three categories, and I think you'll notice from the slide that most of them are external categories. There's of course the changes that we're in control of that we can use to improve the tooling and make it more resilient next time, but then there's also two other huge things. We integrate with a lot of different things at Netflix. We have all these partner team integrations that we need to make sure our contracts are stable, that we have high uptime, and that we can improve those relationships. Then the last part is the service owners. What if there actually was an issue detected? What if their PR was sitting open for three weeks and no one saw it? What if their service hasn't been deployed for six months? These are how we started breaking down the obstacles we found.
Now I'm going to go into three of the major themes that we saw. There's a lot more than this. On our very first exercise, as I said, 44% of targets needed someone to intervene. We realized that although we did have notifications running, there was a lot of times when they were going out into the ether. No one was receiving the notifications because there was no contact metadata associated with the service. Or, one step further, if we needed to manually reach out to someone, we would go out on Slack, reach out, only to find it's the wrong channel. It became the goose chase of hunting down who the software belongs to. This was not the sole reason, but this coupled with some other security incidents led to an effort at Netflix to company-wide make sure all of the contact data was populated and accurate. That's many things, including the on-call, if we need to escalate there, the Slack notification channel to make sure someone is seeing it, or even an email for the team as a last resort if all those other things are missing.
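The fallback between contact channels can be pictured as a simple ordered lookup. The metadata keys and the order used here (Slack channel first, then on-call, then team email as the last resort) are assumptions for illustration.

```python
# Hypothetical order of preference for reaching a software owner.
CONTACT_ORDER = ("slack_channel", "oncall", "team_email")


def resolve_contact(metadata):
    """Return (channel, value) for the first populated contact field, else None."""
    for channel in CONTACT_ORDER:
        value = metadata.get(channel)
        if value:
            return channel, value
    return None  # no contact metadata: the notification would go nowhere
```

A `None` result is the case described above, where a notification goes out into the ether because the service has no contact metadata at all.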
The next big area is pull requests. We noticed that when we weren't able to auto-merge a pull request, that it would sit open for a very long time. This could happen for a couple reasons. One is the compliance checks I talked about, if it wasn't passing those, or it had a flaky build, or for some reason we weren't allowed to auto-merge it. They would just sit there. This was an area where we're like, let's see what's going on here. I won't talk about everything, but this was a myriad of all those three problems I talked about. One thing was looking at our compliance checks and fine-tuning them. We don't want to be unsafe, but we do want to make sure we can merge as much as we can without human involvement. Then the second thing was the correction of these PRs. Any PRs that ended up in a bad state, someone on the team, out of band, at some point would go fix it. Then we added some checks to automate, detecting when something went from an unhappy to a happy state so that we could resume our workflows and not have to have someone go ahead and manually look and say, ok, this PR is good now, let's restart.
We started detecting manual merges better, rerunning merge checks, and we added even more notifications in terms of commenting on PRs that had been open for a while. Then the last major theme was deployments. As you can imagine, now the PR is merged, but the time between deployments for a lot of services was much longer than the seven days we were aiming for. We worked with our delivery team at Netflix and we identified a bunch of services that had manual gates in their deployments, where either the service was well tested or, statistically, the manual gate was not actually preventing any issues in prod. We worked with them to identify those services and we removed those manual gates, and it led to a 77% decrease in manual interventions from deployment approvals, which did speed things up. There's obviously a subset still that do need those manual gates and so those were left for safety reasons.
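Detecting that a pull request has moved from an unhappy to a happy state, so the workflow can resume without someone restarting it, could be sketched as a periodic reconciliation pass. The state names and the `fetch_pr_state` callback are hypothetical stand-ins for whatever the source control integration provides.

```python
def reconcile_paused_targets(targets, fetch_pr_state):
    """Re-check the PR of every paused target and resume those that recovered.

    targets: list of dicts with "name" and "state".
    fetch_pr_state(name) returns "merged", "mergeable", or "blocked".
    Returns the names of the targets that were resumed.
    """
    resumed = []
    for target in targets:
        if target["state"] != "paused":
            continue
        pr_state = fetch_pr_state(target["name"])
        if pr_state == "merged":
            # Someone merged manually: skip ahead to watching the deployment.
            target["state"] = "monitor_deployment"
        elif pr_state == "mergeable":
            # The PR was fixed out of band: rerun the merge checks.
            target["state"] = "merge"
        else:
            continue  # still unhappy, leave it paused
        resumed.append(target["name"])
    return resumed
```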
This is just three of the examples, but these are all things that basically we did. We fixed the issue. Then after we ran the experiment again, re-looked at the data, truly an all-hands-on-deck effort from our team, other platform teams, everyone at Netflix willing to participate in these exercises and make the improvements needed.
Today (Ongoing Journey)
That brings us to today. As I mentioned at the beginning of this, this is still a journey, we are still very much on it, but we've made a lot of great progress. As I said before, we started with 120 targets, they were Java services. Our most recent campaign ran on over 2,000 targets. The time to complete the campaign started at 86 days, and we've now reduced that down to 26 days. You'll notice that's still not at our goal, but it's huge progress. While we've been doing this, we've been expanding the amount of services that we've been rolling out to. We've been simultaneously expanding and bringing in new obstacles as we go while driving the time down, which is pretty exciting. Then the last metric, 44% of targets requiring a manual intervention is now down to 21% of targets in our last completed exercise. There's still a long way to go here, but we took those two biggest buckets from pull requests and deployment and we were able to drive those down and are still working on partnerships with our other teams to see if we can figure out areas to continue to drive that down.
Now the fleet expansion. I was a little vague when I said we went from 120 to 2,000, but we started at 3% of our entire fleet of services. There's about 4,100 services, and we started with just the Java ecosystem. In our most recent exercise, we ran it on 50% of services, including Java and some of what we're rating our high confidence Python, and that in this case means they're well tested enough to be part of the exercise. The most exciting thing, even though we haven't run the exercise on this, this is what our platform capabilities have. Our platform capabilities can run on 66% of services, including Java and Python. The reasons we haven't run the exercise yet is some of the external reasons, meaning we need to make sure the teams are comfortable and have confidence that their services can run through these. Now onto the real migration. At the same time as running all these confidence exercises, we've also been helping teams run real migration.
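For reference, the headline numbers work out as follows. The inputs are the rounded figures quoted in the talk ("about 4,100 services", "over 2,000 targets"), so the results are approximate.

```python
def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


def share_of_fleet(targets, fleet_size):
    """Targets as a percentage of the fleet, rounded to one decimal."""
    return round(targets / fleet_size * 100, 1)


FLEET_SIZE = 4100  # "about 4,100 services"

days_reduction = relative_reduction(86, 26)           # campaign duration in days
intervention_reduction = relative_reduction(44, 21)   # % of targets needing intervention
first_exercise_share = share_of_fleet(120, FLEET_SIZE)
latest_exercise_share = share_of_fleet(2000, FLEET_SIZE)
capable_services = round(0.66 * FLEET_SIZE)           # 66% platform capability
```

Campaign duration dropped by roughly 70%, the share of targets needing intervention roughly halved (a 23 percentage point drop), and coverage grew from about 3% to just under 50% of services, with platform capability at roughly 2,700 services.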
There's been a lot of cool things that we've been able to do. We've helped facilitate some JDK migrations. We've worked to keep delivery configs up to date and modern. We've done software deprecations, introductions of new software, a port migration. One exciting one that I think the company is excited about is we did a GenAI prompted migration that helped simultaneously move off of an old library onto a new one. That was super exciting. Now what's next? Like I said, we still have a long way to go. I think if you remember that graph from the beginning, the services were just a very small slice. We want to drive to completion on the service coverage, and this would help us unlock automating migrations for our JavaScript and Python ecosystems as well. Then the software type expansion. Libraries, jobs, repositories, these are a huge part of our software ecosystem. That's what we are hoping to tackle in this next year, making sure that libraries are validated before their release and that libraries are up to date so that teams are maintaining fewer versions.
The next thing we want to do is go back to our data and increase the provided paths that we have. One example of that for this year will be Python dependency updates, making sure we can automate those recurring updates by using a path that we provide. It's as simple as a very short configuration for the Python team to set up their dependency updates. Then last but not least is taking the example we had from the GenAI prompted migration and abstracting that in a way that it can be applied to any migration. We're going to take our learnings from that and hopefully make it so that the code transformation path is more flexible for GenAI prompts, and just configurable based on the prompt you enter and what resources you need for that prompt container.
Key Takeaways
This has been a really great effort. We're still on the way to automating everything, but we've made a lot of progress. I think everything we've done from the beginning just makes it easier and easier to build on top of. We have learned, confidence is a two-way street. Even if you have the cool technology and the cool automation, you need to work with everyone involved to make sure that they can be automated. You need to make sure that they trust in your platform and that your platform is also delivering that trust back. The second thing, I actually think this was in the keynote too, is don't boil the ocean. Break your problem into smaller problems. It will be difficult at first, but the more you do this, you'll build the foundation so that the later problems get easier to solve. Last but not least, teamwork makes the dream work. It's not about the software. It's about the people and the partnerships. There's a lot of integrations, a lot of things that come into play to make this orchestration run smoothly. The human element is just as important.
Questions and Answers
Participant 1: You talked about this platform. Do you guys use GitHub Actions? You would use GitHub Actions to run these campaigns?
Casey Bleifer: GitHub Actions are not involved in campaigns. Some teams might use them, but that's not the approach we went with because we have all these other safety checks in place that we'd like to keep. It's not off the table, but that's just not how we have it set up right now for some of the other internal concerns we have.
Participant 2: You talked about removing manual interventions, manual quality gates where possible. What challenges did you face there? What was your approach, overall? How do you approach that?
Casey Bleifer: Every type of manual intervention has a different solution. I think one of the overarching challenges is, what is the balance? Because sometimes you might have a couple teams complaining and say, why do I need to intervene here? Then if you step back and look at the bigger picture, there's actually like a valid safety reason. I think it's finding the balance between making people happy and operating safely.
Participant 3: How did you handle the problem of doing auto-merges? Like you're pushing the code, you're doing the pull request, but the product team could potentially have 10 other changes ready to go. How did you handle that coordination? Also, the fact that not all teams have canary deployments or really robust test cases?
Casey Bleifer: There's a few different ways I'll answer that, different concerns. For the PR part, specifically, we worked with our central security team to get special permissions, basically. We had to do a lot of different things to make our platform available to actually create the PRs and do the auto-merge on behalf of people. Sometimes we can't. If you're familiar with SOX, PCI compliance, for legal reasons, we can never, and someone needs to take a look at that. That was the PR part of things.
For the stack deploys, that is somewhere we chose to integrate with our delivery partner, Spinnaker. That depends on how the team handles things. We wanted to delegate to their delivery setups. That plays in a bit with the confidence rating. The confidence rating ties into, is the service canaryable? Do they have validation set up? What is their deploy situation like? That is basically captured in this confidence metric. Services that don't fall into that would be considered low confidence. They can still be part of these exercises and part of migrations, but it's with the understanding that you're going to need to be putting in some manual work here. If they don't want to do that, then they need to be willing to work with this broader program we have. We have a broader program aimed at increasing confidence, and they need to be willing to work with that program to get to a high confidence state, so that we can run these with them, without as many interventions from them.
Participant 4: I am still curious about like any further elaboration that you could give. It seems like even if you have a service and it has a canary and it has good tests, there are a lot of minor things that could create small correctness bugs and stuff later. I was curious if you have additional validations that go on top of like, did this affect the latency? Are there way more errors or something? Just anything you have on that front is interesting.
Casey Bleifer: I think there's multiple ways we approach this. Obviously, we do as much as we can, but there's a certain layer of we have to at some point rely on what these teams have in place for their own telemetry validations. In addition to what we have, we also have whatever they rely on, whatever they have in CI/CD. Then there's the other side of things where every migration is different. That's why when these campaigns are being set up, we encourage them to leverage validations, and the campaign owners or the platform teams that own them can set up validations that are tuned to pay attention to their change and have certain metrics go off. When they're running their migration on their first set of apps, if the metrics they're tracking are not looking good at a high level, then they can intervene there. Once you get to the individual service level, it would be like any incident where they would need to have their own monitoring and alerting in place to be able to detect issues.
Participant 5: What type of issues did you run into with the security department in gaining that trust in the automations, that they will do what they need to do and protect customers, and things like that?
Casey Bleifer: Since my team is newer, we actually got ahead of it at the inception of the team. We had a lengthy security review. We went into it with this huge checklist of things that we needed to account for right away. Then, anytime we pivot the platform or expand, we basically have a point of contact from our team and the security team that we take along the way with us. One thing right now is actually in the auto-merging space, like, are there certain types of changes that we feel comfortable just auto-merging, maybe like a dependency update or something that happens a lot? Is that fine, or should we be safer there? That's when we delegate to our security partners.
Participant 6: Are you using any frameworks for doing the transformations, like OpenRewrite or anything like that? Is there any kind of threshold for things that were deprecated that are dropping off that you wouldn't include as part of a campaign just because the code rewrite would be too large?
Casey Bleifer: Part of the flexibility is we allow people to choose what frameworks they want. We're simultaneously allowing these custom migrations and providing our own. We are framework agnostic because we wanted to stay with the flexibility piece. If a platform team has something that they like using, if it helps them get their code transformation container up and running quicker, we'll let them use that. A long-winded way of saying we don't have a firm opinion yet, but as our own team provides more of these pre-configured code transforms, we might bias towards something, but nothing yet.
Participant 7: Do you set a different set of metrics to have a confidence level determined or does it change campaign by campaign, and how do you increase them, keep evolving on that metrics?
Casey Bleifer: This is a constant set of metrics across Netflix and across all campaigns. There's a separate program that we have, where the goal is to keep people in the program and make sure they go from low to high confidence, or help make sure they're complying with the things that we've pointed out as making something high or low confidence. We are trying as we go. Right now, this metric is a bit more on the qualitative side. It's like we've gone through this program and now our team is rating it, on a scale from strongly disagree to strongly agree. Our team is rating it this way, but we're moving to make this metric more of a calculation based on other metadata and the software, so that it's more fluid over time and we can better detect if something drops out of high confidence.
Participant 8: How are you managing secrets on the platform? I assume you might have a horizontal secret management system for something common like SCM credentials, but you said teams could potentially run automation tests. Are they managing their own secrets at the campaign level?
Casey Bleifer: When a campaign is set up, the campaign team has their own permissions, but then running through our platform, for example, if we're launching a canary on a team's behalf, that assumes our permissions.