

Paying Technical Debt at Scale - Migrations @Stripe


Transcript

I am Will Larson. Thank you for the introduction. I'm here to talk a little bit about technical debt and how to actually deal with technical debt at scale. This morning I got a tweet from someone who was like, "I'm really sorry I'm missing your talk," and she was like, "Aaah." Felt great for a moment, and then they said, "So you've probably dealt with a ton of technical debt so far," which felt a little bit backhanded. So, not sure how I felt about that, but really, technical debt is kind of a human experience. Particularly as you get further in your career, managing technical debt becomes kind of a constant, and learning to manage it well, I think, is really important. It's kind of a deciding factor in whether a company is long-term successful or not. Let's get into it.

So I am working at Stripe. I've been there for a couple of years on the foundation SRE team. We do data, we do developer productivity, and we do infrastructure. And before that, I was at Uber for a couple of years or so. Started an engineering team there and did kind of a bunch of infrastructure-y things there too. I don't know how many of you were riding Uber four or five years ago. At that point whenever you took a trip, the driver would tell you, "Oh, this is my very first trip."

And that was actually a scheme from their driver forums: if you told the riders that every trip was your first trip, they would rate you more highly. So, this is my very first conference talk. Let's start with the goal of the talk. I'll just tell you the conclusion right upfront and then keep hitting it over and over and over, as I've been told to do. Technical debt is the core constraint on velocity at your company as you get larger.

And a migration is fully replacing a tool, a system, or a library. It's not partially replacing; it's not getting one and a half of them done, it's not getting three done; it's fully replacing something, making the old thing go away. And the way to do migrations successfully, the way to make things actually work, is to treat every single migration like a product.

What Is a Migration?

So first, what is a migration? And then we'll get into like, do they actually matter anyway? Like, can't you just like skip them? Like switch jobs every couple years and avoid them pretty easily? And then finally, how do you actually do them well? How do you actually succeed if you aren't trying to jump ship every time it gets hard?

A little bit like Jessica's [Tai] talk earlier, only one animal picture, but I did not realize flamingos migrated, and I just thought that was an exciting fact to share with all of you. So, a migration. Fully replacing a tool, system or library. What are some examples? A great example at Stripe from I think last year was we moved from a tool called Chronos, which is a second-tier scheduler on top of Mesos, to using Kubernetes cron jobs. We fully deprecated Chronos, got rid of it, moved entirely to Kubernetes. It was an adventure.

Another example, kind of the classic example of a migration: the Netflix move from their own data centers onto AWS. I interviewed someone from Netflix a couple years ago, and they had this amazing story about it, where the very last piece they moved was their data warehouse. They actually just gave their data warehouse back to the vendor that sold it to them, and then it became a platform as a service. So that's one way to fully finish: just redefine what success is. But it worked for them.

Another very controversial one was Uber's move from Postgres to MySQL. There were some angry people on Hacker News; maybe that's a success metric. Earlier this year, Dropbox moved from Python 2 to Python 3, which kind of seems like an absurd migration to think about, but Python 3 is actually a really huge, backwards-incompatible change to Python, and moving was a really important switch for them. Migration: fully replacing a system, library, or tool.

Do Migrations Matter?

So do they even matter? So we've kind of defined what they are but should you even care? Should you kind of try to walk out of this really crowded room somehow?

This is what happens to productivity over time. When you first create a new code base, you are a small little startup, everything is easy, and you can get things done really quickly. If you've ever been at home and seen someone tweet, "I could take your website," or, "I could take your product and recreate it in like two hours over the weekend," if you've ever had someone say that about your work, then, yeah, they can. When you're a really small company or a really small team with a brand new project, you can do almost anything in two hours.

But as the code base gets more complicated, feature by feature, the current code and what the code was originally designed to do drift further and further apart, and making any change gets harder and harder over time. Teams like to be productive, though. Teams don't like to be unproductive; Jessica's [Tai] talk earlier, about how painful deploys got at some point at Airbnb, is a great example of that. So teams will do whatever they can to get more productive.

They'll start rolling out things like code review, linting, whatever they can, they'll start trying to make their lives better bit by bit. And each time you roll out one of these, things get a little bit better. You do another one, do another one, but then there's kind of this like trough of sorrows that happens at the very end, where you just like run out of good ideas that you can actually implement. And so the question is what happens such that you kind of get here and you just stop improving things?

And really, what happens is that teams will self-authorize: teams will, from the bottom up, do whatever they can within their own scope to make things better. They will make changes to their own code base, to their own services, their own libraries, their own tools. But eventually, they'll run out of things they can self-authorize, and then they have to start coordinating with many other teams. And a lot of times, folks think that the challenge here is that they can't get alignment.

But really, I find that with most big changes, everyone agrees they want to do it. It's an obvious good: well, of course we would switch to a system with less technical debt, right? But what happens is, it's not just about agreeing on the solution. A lot of it is also getting the timelines to match. So maybe I really want to do this on the backend team, and then the frontend team is like, "Yep, let's do it next quarter."

You get this kind of perverse situation where literally everyone agrees they want to do something and it's urgent, but you just never get it done. And that's because there are just too many teams that have to align on approach, prioritization, and timing. And so you see early on, things get better. Also, my charts are really terrible. It's kind of what I do in presentations. Terrible charts. It's a specialty. But early on, things get better, and then you just run out of ideas. So what do you do after that? Migrations.

So migrations: taking something that's not working very well, a library, system, or tool, and fully replacing it with something that's more productive. And this is a bit of a simplification, right? Migrations aren't done magically overnight; actually, things maybe get worse more quickly for a little bit. But take the generalization for what it's worth. You finish something, you throw away all this technical debt. You pick another project, you work on it for a while, things get better. Pick another one and sequentially work through them.

But something interesting is happening behind you, which is that your organization isn't standing still. Early on you have one engineering team making these changes; it's really quick. A couple years later, you have three engineering teams. It's a little bit harder, but still doable. Then you have 15, 50, 500, 5,000, and all of a sudden these changes get really, really complicated. And instead of doing one migration at a time, you actually find that you're in this world where you're doing three, four, five, six migrations concurrently.

And you end up here. This is what happens when your product engineering teams, like News, API, Ads, are spending almost all of their time only doing migrations: you have almost no bandwidth to actually do work for the users. All of a sudden your business, your company, is subjugated to these infrastructure needs; you're not actually spending time creating value for your users, you're not creating stuff that's new and innovative. And you have this really perverse world where all of a sudden your product engineers are spending all of their time doing infrastructure-related work.

The infrastructure teams often feel kind of pretty good in this spot. They have fewer dependencies, so it's less obvious to them. So the feedback loop's a little bit weaker. So these infrastructure teams that are creating more work for others, don't often realize just how hard it can get for the product engineering teams or the teams on kind of the top of the stack.

Successful vs. Failed Migrations

So now you're busy, but what happens? It gets worse. And the reason it gets worse is that right now we're assuming that migrations actually succeed. A lot of migrations don't succeed. So you make this bet: you decide that you're going to replace your RPC layer, you're going to replace your orchestration tier, your frontend library, your design language. And it just doesn't work. All of a sudden you've spent all this time, gotten all these teams aligned and working on this, and then it fails. And this is pretty embarrassing. You didn't want it to fail. But also your productivity just keeps going down, and your lifeline, your time before you hit zero productivity, is a little bit shorter.

Hyperbahn

A really great example of this is the Hyperbahn project at Uber. Uber historically used to route requests using HAProxy. This was a really great, simple way to do routing, where each server had a configuration built off Clusto. Clusto is a little bit like Consul, but from Digg, and not adopted by other companies. And in Python, and it had some issues, whatever, it was pretty cool. So we would build these HAProxy configs and use the sidecar model for HAProxy, where we'd do routing locally, with no shared distributed state.

This was really good for many things, but as you add more and more servers, the lack of centralized configuration or centralized state for coordinating routing got quite toilsome. And so the Hyperbahn project was born. I think the most important thing about Hyperbahn was that it was significantly more advanced than the thing it was meant to replace. It had circuit breaking, very sophisticated rate limiting, preemptive retries, all sorts of great functionality. But the interface changed, and that made it very difficult to actually get services moved from one implementation to the other.

And eventually, after much time, this ended up getting enough services in that we had to maintain it. But we couldn't get all the way there, and it ended up getting backed out. And so this was kind of the worst state of a migration, where not only did we not get faster, we actually had to spend time moving and then moving back. And the result is that, years later, this is an abandoned project. I think the most important point here is not that Hyperbahn was bad; it was actually a significantly better piece of technology than what it was meant to replace. It's that the migration itself failed.

Teams Bet on Migrations

And this is what we're going to try to figure out how to do better. Migrations fail, but there's also a momentum to migrations. If one migration goes poorly, teams are like, "Well, I spent a lot of time working with you last time and then I had to revert back." And so they're like, "Definitely not going to work with you this time, but if you're almost done, maybe next time. I'll be the last one to move." So you get this momentum where, if you fail one migration, people don't want to work with you on the next.

Digg v4

Conversely, there's also the good story. If you do it really well, people will want to work with you next time, because they will know that you're going to save them time. They'll know that you're going to reduce their technical debt. This momentum of migration success and failure is quite an important aspect as well. And something that happens, as these red circles indicate, is you can actually fail enough sequential migrations that you have no productivity left. If you've never worked at a company that reaches this phase, you won't believe this is possible. It's like it can't be possible. We manipulate dream stuff, right? We can do literally anything. How could it be impossible to make forward progress on a code base? But I've been on multiple teams, at companies, that have reached that spot, my favorite of which is Digg v4. I was talking to Randy earlier; he didn't realize I had worked at Digg, and he mentioned this as an example of a terrible migration. So ironically, it's something where I'm very proud of the work we did, but it was also a complete, complete disaster. Basically, the previous code base, called LOLcat, because we were professionals, was quite hard to work with. It was just a PHP monolith, and we ended up replacing it. But we decided to not just replace it.

It was like five or six years old; what if we really pushed it to make it the best it could possibly be? You know, MySQL is kind of old, maybe doesn't even work that well, and there's this new Cassandra thing out of Facebook. I don't know if anyone's ever heard the joke that Cassandra is a Trojan horse released by Facebook to ruin a generation of startups. The Cassandra of 2018 is actually a phenomenal piece of software, right? The Cassandra of 2011 was a little bit early. And so we switched to Cassandra, we switched to Python for the backend, we switched to Thrift. We did all these switches. We took every single algorithm, deleted it, and started over, for no reason. New is better. It didn't work. People hated it. They stopped coming to our site. We ran out of money, we closed the company, everyone went home. So this is what happens if you really mess up enough migrations: you run out of productivity, you do this whole rewrite, which is really quite dangerous, and then you go home and you lose your job. So do migrations matter? And the answer is yes, because you don't want to go home and lose your job.

Interfaces

Quick aside: interfaces. If you don't like migrations, and really you should not in any way like migrations, the best way to prevent them is having strong interfaces. A strong interface, in this sense, means having as little overlap between teams as possible. This means teams can self-authorize longer when finding productivity improvements. And this means that you'll get to push off the dreaded decline of productivity a bit longer, by relying on smart, thoughtful people who can self-authorize within their own scope.

Effective Migrations

But conversely, with weak interfaces, lots of overlap, poorly-defined boundaries, a monolith, etc., teams can't do much on their own, and you end up doing migrations much, much sooner. Okay, so we know what a migration is, and we know that they matter. Now, how do they actually work? Every migration is a product. This is the mantra; I'm just going to keep saying it a few times. It's a three-step process: de-risk, enable, finish. Because every migration is so important, your company can only do a couple of them at once. There's a real constraint on the number you can do concurrently.

You have to really think about placing good bets. And this means a couple of things. The first is, "Is this worth doing?" A lot of migrations, I think, honestly don't pass that bar. A lot of migrations happen for bad reasons. So here are some strategies to actually make sure that what you want to do from a migration perspective is worth doing.

Finding a sponsor. Something that happens very frequently, and that I think is fascinating, is that you decide to do something, you decide that you want to rewrite to a new database, a new backend, a new Digg v4, and you just can't find another team that's actually willing to buy what you're selling. And this is not an executive sponsor. Executives are wonderful, but you can convince them of almost anything if you try hard enough. What you're looking for is an engineering team that's busy and is willing to prioritize your work over what they're currently doing.

If you can't find that team, you're probably selling something that isn't worth selling. It's so important to find another team that actually believes in what you want to do. And maybe two or three, in case you find one team that just really likes to please you.

Two, opportunity cost. The way to pick a migration is not that it's valuable. These are so constrained that you have to figure out whether this is the most valuable thing you can do. This is so important. You can only do a couple of these at once. These take so much time. These are strategic bets.

These are some of the most important strategic bets your company will ever take, and they determine your ability to actually ship new functionality, versus ending up stuck delivering nothing. And often we're just like, "Yeah, it'd be cool if we moved to this new thing." But really, the opportunity cost: is this the most valuable thing you could possibly be doing with this time? It's kind of a sacred trust you as senior folks are given, to pick this. It determines your company's future, and I think you have to be respectful of that trust.

Next, not invented here. I was talking to someone recently, and they were talking about the number of startups that are building their own strongly consistent distributed data stores. It turns out this is a really, really hard problem. And these weren't database companies, right? These were kind of random startups who were also trying to build the next Spanner, the next Cosmos DB. And this is almost an impossible problem to solve if you don't have little atomic clocks falling out of your pockets everywhere. So it's just not a good thing to do, but it's a really cool thing to do. And I think a lot of times folks get caught up with how interesting a problem can be, and they kind of skip whether it's actually worth doing. Don't do it.

Another one: a hammer looking for a nail. The first time I practiced this talk, I gave a bunch of examples of things I thought were bad choices, and then it turned out there's no way to pick anything that doesn't offend at least half the audience. So what I want you to do is just imagine something you've seen someone adopt that went quite, quite poorly, where you were like, "Oh, they just did that over the weekend," and inject that here. That's the thing I'm talking about. Why did they do it?

The Design Document

So if you know something's valuable, if you decide something is truly worth doing, the next problem is: will your solution actually work? The design document is the first step, and there are kind of three steps to a design document. The first is that you're trying to prove to yourself that this will work. Write out the examples. Actually convince yourself that your approach is viable.

The second is you want to go convince some of your customers that this will work, that this is actually a viable solution for them. Third, and people typically skip this step, but I think it's super important: go find the detractors and try to convince them that your solution will work. It's one thing to get people who are already predisposed to like your idea on board. But if you aren't going and finding the people who hate your idea, who think it's terrible, who maybe are suspicious of you in general, and getting their feedback, you're not done.

A good design document doesn't just have the reasons why you should do something. It has those, but then it has an equally long, or longer, section on why you shouldn't. If you can't articulate why something shouldn't be done, you're not done validating whether your solution is going to work.

Prototype

Prototype. A lot of the time, prototype means V0 to folks. But prototyping is not about building the first version or a quick version. It's not about practicing the right implementation. Prototyping is really about de-risking whether a solution is possible at all. So find something you can do in two hours. If you're trying to move from Puppet to Dockerfiles for your server configuration, just take one role in Puppet and rewrite it into a Dockerfile. It can be terrible, but your goal is to take a couple of hours and see if it's possible. You're not trying to build a good version, you're not trying to build an implementation that will last; you're just trying to make sure that if your approach is terrible, you only spend two hours on it and then stop.
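To make the two-hour-prototype idea concrete, here is a hedged sketch in Python of that Puppet-to-Dockerfile exercise. The manifest format is drastically simplified and every name here is invented for illustration; deliberately crude, throwaway code is the whole point of this stage.

```python
# Hypothetical two-hour prototype: turn the package list from a (very
# simplified) Puppet role into Dockerfile instructions. A real Puppet
# manifest is far richer than this; the crude parsing is intentional,
# since the only goal is to learn whether the approach is possible.

def puppet_packages(manifest: str) -> list[str]:
    """Pull package names out of lines like: package { 'nginx': ensure => present }"""
    pkgs = []
    for line in manifest.splitlines():
        line = line.strip()
        if line.startswith("package {"):
            pkgs.append(line.split("'")[1])  # crude, but fine for a throwaway
    return pkgs

def to_dockerfile(manifest: str, base: str = "ubuntu:22.04") -> str:
    """Emit a minimal Dockerfile installing the role's packages."""
    pkgs = puppet_packages(manifest)
    lines = [f"FROM {base}"]
    if pkgs:
        lines.append("RUN apt-get update && apt-get install -y " + " ".join(sorted(pkgs)))
    return "\n".join(lines)

manifest = """
package { 'nginx': ensure => present }
package { 'curl': ensure => present }
"""
print(to_dockerfile(manifest))
```

If even this crude translation turns out to be impossible for your real roles, say because they lean on runtime state a Dockerfile can't express, you've learned that in two hours instead of two quarters.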

Embedding with Early Adopters

Embedding with early adopters is a way to go really deep on the prototype and make sure that you can actually solve all of their needs. Again, a lot of this is engaging as quickly as possible with the user; again, just good product design. They'll see all these edges that you won't see. I think often when you operate a system, you only get an abstract sense of using it, and that's why you need to go actually embed: join the team that's going to be the early adopter, and try to get it working for them.

A great example, going back to the Kubernetes one. We had Chronos, this scheduler on top of Mesos. It turns out it was operationally a little bit abandoned: internally we weren't owning it super well, and in general the project wasn't getting a lot of use at the company that had first founded it, so there was a bit of decay there. What we did is we found the one team internally that used the vast majority, like 90%, of the Chronos cron jobs. We went in, worked with them, and migrated every single one off.

And this was actually amazing for us, because we got this kind of two-migrations-in-one phenomenon. Just by replacing Chronos with Kubernetes, we had completed, by my definition, which is the only definition in the room temporarily, a migration. All of a sudden Chronos was completely deprecated, deleted, turned off, thrown away, forgotten, cursed. And Kubernetes cron jobs were also cursed a little bit at that moment; there were some issues early on, and it's gotten a lot better. But we had successfully done that, and we were also in a position to consider a second migration, to move stateless services, etc., onto Kubernetes as well.

One Easy, One Hard

One easy, one hard. When you are trying to see if something works, first get something easy to work. That's really important. Doing the easy thing needs to be easy, otherwise folks won't adopt it. But a common mistake that folks make, and that I've made many times, is to then do a second easy one, a third easy one. People are looking at my metrics and I'm like, I'm making really great progress; I've done 17 of these really easy ones. And then you get to the really hard ones and realize you actually can't do them. And then what you do is reverse the migration for those 50 easy ones you got moved over.

The goal at this stage is not to get finished as quickly as possible. The goal is to make sure that finishing is possible. So you want to do an easy one, then a hard one. If the hard one doesn't work, that's actually something to celebrate, because now you only have to reverse one easy integration, instead of having to reverse all of the easy ones up to the first one that fails. You can save literally years of your life, and your users' lives too. And if you do this well, you don't build up that momentum of failure, where users stop trusting you because you failed a migration. Only one user will stop trusting you because of this migration. And, you know, in a large enough company, you'll have enough users that that's fine. Sometimes.

At Stripe, a good example of this is our MongoDB upgrades. Stripe was, for a long period of time, the world's foremost expert in an old version of MongoDB. I think we literally knew more about it than the people who had implemented it; we just had such immense skill with this one particular version. So when we decided to upgrade, we were a little bit stressed about it, right? This is all of our data. This is really important data. And it's enough data that everyone's like, "Well, didn't you have a backup?" Well, of course you have a backup, but restoring a backup when it's large enough, just moving it over the NIC, takes a huge amount of time. So it's not just "did you have a backup?" Are you running two copies? Do you really want to spend twice as much money? There are some tradeoffs, right?

So when we decided to do this migration, first we did the simplest thing we could possibly imagine. We found a marketing website that had a couple of rows, very little data. If it went down, it'd be bad, but we could restore it: a small data set, and our users wouldn't be too impacted if it was down for an hour or two. We got that working. Then we went for the very hardest, most sharded, most terrible, weirdest-access-pattern data set we had. We got that moved over. It was a little bit hairy; we rolled back and forth for a little bit, but we got it fully working. And at that point, we actually knew we could get everything else in between to come over.

If we had just worked up through easier and easier ones, we could have gotten to a point where it just failed. And we did have to roll back a number of times as we ran into different scalability issues with the new version. But this approach made it safe, cheap, and easy to do. That's de-risking. So now you know that what you're doing is, first, worth doing, and second, that your approach is viable, that it might even work. That's exciting. So the next thing is figuring out what you can do to actually make this migration easy.

User Testing

User testing. Literally the most important thing. It's just like when you see a small company and they're like, "Ah, we built this amazing thing but no one wants to use it." You've got to get to the users soon. You've got to actually test the migration. An important distinction at this point: you're not testing the product, you're testing the adoption of the product. You're trying to make that conversion as easy as possible for folks who want to cut over. Test your interfaces. Get folks to try to actually use your interfaces to solve their real problems. Watch them struggle with it. Watch them get angry with your interfaces. This is how you learn, right? And this gives you a rapid iteration loop. You've got to do it.

Documentation

Often, documentation is forgotten. Documentation, I think, was very important in Stripe's early and continued success: having great documentation that folks could easily use. In every migration, if you give people documentation that works, they can run the migration on their own timeline, right? They can do it when they want to, when they have a couple of hours, not when you happen to be available to partner with them on it. But the only way you know if your documentation works is to actually sit down and watch someone try to use it to do the migration.

Your emails are the same thing. A lot of times, email is the worst thing in the world, right? I can't imagine how much time I spend reading and writing email right now. It's a lot. But if you don't test your emails, then you're just wasting people's time. So actually A/B test your emails. Get a few people to read them. Do they know what the call to action is? Is this glamorous? Is this exciting? Not necessarily, but this is what makes it work. And sometimes I think that's the weird thing about getting further into your career: you're like, "Where's the really hard technical stuff?" And sometimes it's, "What's the stuff necessary to get it done successfully? Where's the quality in everything, the quality in the email, the quality in the docs, not just the quality in the code?"

Operations

Operations. I think this is a classic one as well: it's easy to cut over, but then, once you're cut over, people can't actually operate the damn thing. So make sure that people can actually run the system. I think chaos engineering is a great idea here: you can force people to get comfortable with operating it by injecting faults early on, so that they have confidence in it before they're depending on it, or at least before you pretend it's done.

Debugging

Debugging is the same. Inject faults to force people to get comfortable with it. So often you have the tooling to get someone over, but then they lose the ability to debug the system.

And this is bad, because all of a sudden people can't tell if there's a problem with their cutover, where they might have missed a flag or something when they migrated, or if your underlying system doesn't work properly. Then immediately, people stop trusting your new system. And that creates all this friction for other teams adopting, because they talk to each other, strangely enough, and they will learn from teams who already adopted your new system that it doesn't work. And maybe it actually does work, and it's just a misunderstanding. But you can never convince someone who's decided that your product isn't reliable or has errors. You have to make sure they have the information to understand that early on.
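One lightweight way to apply this fault-injection idea is to wrap the new system's client so that a configurable fraction of calls fail on purpose, letting adopting teams practice telling "my cutover is misconfigured" apart from "the platform is down" before they depend on it. This is only an illustrative sketch, not any particular chaos-engineering tool; all the names are invented.

```python
import random

class FaultInjectingClient:
    """Wraps a real client and fails a configurable fraction of calls on purpose."""

    def __init__(self, real_client, failure_rate=0.1, rng=None):
        self.real_client = real_client
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def call(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            # Fail loudly with a recognizable message, so teams learn what an
            # injected fault looks like versus a genuine cutover bug.
            raise RuntimeError("injected fault: simulated backend failure")
        return self.real_client.call(*args, **kwargs)

class EchoClient:
    """Stand-in for the real system's client."""
    def call(self, x):
        return x

# failure_rate=0.0 behaves exactly like the real client;
# failure_rate=1.0 always raises, which is useful for game-day exercises.
safe = FaultInjectingClient(EchoClient(), failure_rate=0.0)
print(safe.call("ok"))
```

The distinctive error message is the important design choice: debugging an injected fault should teach people the same diagnostic path they'll need for real failures, while staying obviously distinguishable from them.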

Slow Down to Go Fast

Kind of the theme here is slowing down to go fast. Again, think of the example of doing many easy things early to make it look like you have momentum. You're trying to figure out which tools you need to build to get to 100%. You're not trying to get as much velocity as possible early on; that's not what's important. What's important is figuring out how you can do the work, how you can build the tools and the documentation, how you can evangelize, to get to the place where you'll be able to get to 100% as quickly as possible.

Self-Service

Self-service is a great example of this. Oftentimes, if your documentation is not great, but even if your documentation is phenomenal, you, the team running the migration, will become the bottleneck on actual throughput. And this is bad, because you have an organization that maybe is really ready to just run and fully cut over, but you're stuck in this toilsome pit of disaster, helping individuals, answering tons of questions, in Slack all day responding, like, "Ah, it's in the docs." You just haven't done the pre-work to make it possible for folks to solve their own problems. Self-service: get out of the critical flow, let people solve their own problems, if you can.

Automating the Migration

Automating the migration. The very best migration is the one that no one does any work for, and this is not thought about enough. A great example of this is Sorbet. Stripe has over a million lines of Ruby code. Ruby is not typed, and Ruby is maybe a little bit magical sometimes. So what we've been doing is rolling out typing into Ruby, a gradual typing strategy, using a tool that we call Sorbet that we're in the process of open sourcing. Changing that million-plus lines of code by hand would be impossible. But honestly, getting our 300-ish engineers to do all this typing work is also impossible, because they're busy doing really important, valuable stuff.

So what we did here is that almost all of this migration has just been a series of scripts that rewrite the abstract syntax tree programmatically and commit that code. There's a great paper from Google, on a tool called ClangMR, about how they do this at a much, much larger scale. But the idea is: how can you get humans out of doing the work? Often you think, "Ah, if we just have 600 people grind through this, we can totally get there." But if you instead have one person work on a script that rewrites the code programmatically, you can often skip the entire migration. You also get to skip debugging 7,000 typos from people fat-fingering something.
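The talk doesn't show the Sorbet scripts themselves, so here's a minimal, hypothetical sketch of the AST-rewriting idea in Python (using the standard `ast` module rather than Stripe's Ruby tooling): a transformer that renames every reference to a deprecated function across a source file.

```python
import ast


class RenameCalls(ast.NodeTransformer):
    """Replace every reference to a deprecated name with its successor."""

    def __init__(self, old_name, new_name):
        self.old_name = old_name
        self.new_name = new_name

    def visit_Name(self, node):
        if node.id == self.old_name:
            # Build a replacement Name node, preserving source location.
            return ast.copy_location(
                ast.Name(id=self.new_name, ctx=node.ctx), node
            )
        return node


def migrate_source(source, old_name, new_name):
    """Parse one file's code, rewrite the tree, and re-emit it."""
    tree = RenameCalls(old_name, new_name).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+
```

Run over a whole repository, a transformer like this replaces thousands of hand edits with one reviewable script.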

This is so underrated. For any migration you're starting, if your company doesn't have these tools, Codemod is a great example of one. These tools can save literally hundreds of hours, or even hundreds of years, of engineering time if you use them well.

Incremental and Reversible Tools

Incremental and reversible. Another classic thing is that you have a deadline for the migration, and it's usually a Friday, because we just don't time things very well. So someone cuts over to the brand new system on Friday at four, or five, or six. And then there's an outage.

And if we don't give people reversible tools where they can just revert, all of a sudden you're going to spend your Friday night or your weekend debugging this. But if you give people the ability to reverse the migration, they can try it, and if it fails, they can pull themselves back. They start to trust your migration. You create this psychological safety in the migration, where they actually believe you're trying to support them instead of just forcing it down their throats.

Incremental is also valuable because it lets people chunk the work up a little bit at a time, doing something they feel comfortable doing or something they have time to do. A recurring theme of these migrations is that folks are just doing a lot of stuff; they're super busy. And it's rarely the case that your end user says, "What I really need you to do is migrate your Cassandra cluster." Your external users don't really care. So often these migrations happen in the boundaries, in the shadows, in the 20% time, the 120% time. Making sure that people can do it a little bit at a time is really valuable as well.
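As a sketch of what "a little bit at a time" can look like in tooling, here's a hypothetical resumable batch helper: each call migrates at most one small chunk and records progress, so a busy team can run it whenever they have a spare half hour. The `migrate_one` callable and the `done` set are stand-ins for whatever per-item cutover and durable progress store you actually have.

```python
def migrate_batch(items, migrate_one, done, batch_size=10):
    """Migrate at most batch_size not-yet-done items; safe to rerun."""
    migrated = 0
    for item in items:
        if migrated >= batch_size:
            break               # stop after one comfortable chunk
        if item in done:
            continue            # skip already-migrated items on rerun
        migrate_one(item)       # the actual per-item cutover
        done.add(item)          # record progress durably in real life
        migrated += 1
    return migrated
```

Because progress is recorded, reruns are idempotent: the team can stop any time and pick up exactly where they left off.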

Dark Launch

A great example of a reversible rollout is the dark launch. This is how you often launch user-facing products, but it also works for infrastructure, libraries, and migrations. Having the ability to instantaneously switch a feature flag or a config setting somewhere to go between the new and the old makes it really safe for folks to de-risk what they're doing. Again, you're building confidence, and not just confidence for people to do this migration: you're building confidence for them to come with you on the next ride you need them to take a year from now.
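A minimal sketch of the dark-launch pattern, with hypothetical `old` and `new` callables: serve the proven path, shadow-call the new one to compare results, and keep the actual cutover behind a single, instantly reversible flag.

```python
class DarkLaunch:
    """Serve the proven old path while shadow-calling the new one,
    logging mismatches; flip serve_new only once you trust it."""

    def __init__(self, old, new, log=print):
        self.old = old
        self.new = new
        self.log = log
        self.shadow = True      # exercise the new path without serving it
        self.serve_new = False  # the cutover flag, instantly reversible

    def call(self, request):
        if self.serve_new:
            return self.new(request)
        result = self.old(request)
        if self.shadow:
            try:
                if self.new(request) != result:
                    self.log(f"dark-launch mismatch for {request!r}")
            except Exception as exc:
                self.log(f"dark-launch error for {request!r}: {exc}")
        return result
```

If anything goes wrong after cutover, setting `serve_new = False` puts users back on the old path immediately, which is exactly the reversibility that builds trust.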

And finally, interfaces. I think interfaces are a little bit magical, because they're really not obvious. There are lots of rules about what makes a good interface, and lots of rules about what makes a terrible interface. But great interfaces that correctly encapsulate the problem domain make these migrations easy, and bad interfaces make them very, very hard. A really good example that we've struggled with and thought about a lot at Stripe is that Mongo has some weird atomicity properties, and users have to build things like write-ahead logs around it for durability, depending on the different write consistencies they want.

And it's pretty subtle, but there are actually ways to lose data: if the primary becomes unavailable before the secondary replicates, you can actually lose data from Mongo, depending on your consistency levels. It's super toilsome for folks to have to wrap write-ahead logs all around that. And that's just one example of how an interface that is just a little bit wrong, in this case just a little bit leaky, means that folks have to do a tremendous amount of work to get the behavior they actually want. A little bit of love, or a lot of love, and a lot of user testing: this is how you get to an interface that lets people go all the way to 100% and not get stuck in the weird weeds of these migrations.
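To illustrate the encapsulation point (this is a toy sketch, not Stripe's actual Mongo wrapper, and `FakeBackend` is purely hypothetical), here's a storage interface that hides the durability handshake behind one `save()` call, so callers never deal with ack counts or write-concern settings themselves:

```python
class WriteNotDurable(Exception):
    """Raised when a write never reached enough replicas."""


class DurableStore:
    """Encapsulate durability: callers just call save() and either it
    succeeds durably or raises, with no leaked consistency knobs."""

    def __init__(self, backend, acks_required=2, retries=3):
        self.backend = backend
        self.acks_required = acks_required
        self.retries = retries

    def save(self, key, value):
        for _ in range(self.retries):
            # backend.write is assumed to return how many replicas acked
            if self.backend.write(key, value) >= self.acks_required:
                return
        raise WriteNotDurable(
            f"{key!r} got fewer than {self.acks_required} acks"
        )


class FakeBackend:
    """Stand-in backend that always reports a fixed ack count."""

    def __init__(self, acks):
        self.acks = acks

    def write(self, key, value):
        return self.acks
```

The point is the shape of the boundary: with an interface like this, swapping the backend during a migration touches one class instead of every caller's hand-rolled durability logic.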

Finish Migrations

So we know that what we're doing is worth doing and that our solution is viable. We've thought ahead a little about tooling to accelerate. Now we just have to finish. And finish means 100%. It's not 99%, it's not 99.9995%, it's not 17 nines or 7,000 nines. It's not "some other team is maintaining the old system, so it doesn't really matter anymore." It's 100%. How do you really get rid of the old systems? How do you fully replace the system, library, or tool that you're trying to get rid of, and get the full wins and the technical debt reduction?

µContainer

The most important thing I've found for many, many migrations is stopping the bleeding. The best example I can think of is the µContainer migration that we did at Uber. At Uber five years ago, 2013-ish, the way stateless services were provisioned was a significant number of Puppet changes, plus a bunch of changes in Clusto (a high-quality Digg technology ported forward to Uber), plus running some command lines. It would take between 6 and 20 hours to add a single service, because there'd be errors, there'd be miscommunications; folks would want one thing, and you'd end up doing the wrong thing.

And, you know, 6 to 20 hours of grinding work with the SRE team doing it, that's fine. But it turns out people wanted more. There were maybe three of us, and people wanted three to four of these per week. All of a sudden we were spending most of our time provisioning services, which is literally no value to the company: us grinding through broken Puppet configs, deploying them, breaking something. Not really that great. But the really scary thing about working at Uber at that time was that the company was 4x-ing headcount year over year. We were like, "This is really bad," and looking forward, it was going to get way worse very, very quickly. So what we did there was a rollout, basically moving from Puppet over to using Docker for all of that configuration.

And we did it in a way that was fully self-service, where every new service required zero interactions with us, the SRE team, to provision. This was pretty amazing, because first, we weren't spending all of our time doing this. Second, we got to spend all of that newly freed-up time doing the migration for the existing backlog. And third, our metrics looked amazing, because at certain points there would be 20 new services getting provisioned each day that we had no involvement in.

So it looked like we were doing a really amazing job of migrating everyone forward, but actually we weren't doing anything; new services were just drowning out the existing ones. That was pretty cool, but slightly deceptive, so as a pro tip: use the absolute values instead of the percentages. Percentages lie. I've been told this is where the talk gets boring, so now you just need to imagine there are seven or eight cat memes injected here. Picture the cat in your head: you're excited, you're energized, you're not bored anymore. Awesome.
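The way percentages lie here is easy to see with a toy calculation (the week-by-week numbers below are hypothetical): the absolute backlog stays flat while the percentage looks great, because new services land directly on the new system.

```python
def migration_stats(migrated, unmigrated):
    """Return (absolute backlog, fraction complete). Track the first."""
    total = migrated + unmigrated
    return unmigrated, (migrated / total if total else 0.0)
```

With 50 legacy services and none migrated, you're at 0%. Ten weeks later, after 400 new services auto-provision onto the new system and zero legacy services move, the same 50-service backlog reads as nearly 90% "complete."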

Tracking

Tracking. This is, again, in the category of things that's not very interesting, but you have to do it, because if you want to actually get the value, if you actually want to reduce the technical debt, you have to finish the migration. And a big part of finishing is building metadata about what is or isn't done. That means filing JIRA tickets, but not by hand: build the tool that does this for you. Don't do it by hand; your life is too short. Your project managers' lives are too short. Everyone's life is too short. Build a tool.
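A sketch of what "build the tool" might look like, with a hypothetical ticket shape; in practice you'd feed these dicts to your tracker's create-issue API instead of returning them.

```python
def tickets_for_unmigrated(owners, migrated):
    """owners maps service -> owning team; migrated is the set of
    finished services. Returns one ticket dict per remaining service."""
    return [
        {
            "summary": f"Migrate {service} off the old system",
            "team": team,
        }
        for service, team in sorted(owners.items())
        if service not in migrated
    ]
```

Rerun on a schedule against your service inventory, this keeps the tracking queue in sync with reality with no human filing anything.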

But then you have this really amazing metadata that you can use for everything else, for example, reports. When we were running the µContainer migration, there were initially about 50 services; by the time we fully finished, there were over 2,000. So it was a pretty chaotic time to try to track all of this. That was a year, by the way, 50 to 2,000 services in one year. It was a really interesting year. You need reports to figure out where it's going well. Because you have this metadata on a per-service basis, you can figure out where the migration is failing and build cohorts, just like you would for product development: which teams are struggling, which types of services are struggling. You can do the analytics to understand what you might need to improve in your interfaces, where the interface mismatches the actual needs of the users, and where the users are just too busy to spend time with you. Reports: not that exciting. But honestly, if you want to be moving your infrastructure forward, if you want your company to be the most productive company it can possibly be, and particularly not a company with zero productivity that is rewriting everything from scratch, and you're going to go home without a job [inaudible], tracking and reports are surprisingly important.
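Building cohorts out of that per-service metadata can be as simple as a group-by. Here's a hypothetical sketch, assuming your tracking tool can export (service, team, done) triples:

```python
from collections import defaultdict


def cohort_report(statuses):
    """statuses: iterable of (service, team, done) triples.
    Returns team -> fraction of that team's services migrated."""
    counts = defaultdict(lambda: [0, 0])  # team -> [done, total]
    for _service, team, done in statuses:
        counts[team][1] += 1
        if done:
            counts[team][0] += 1
    return {team: done / total for team, (done, total) in counts.items()}
```

Cohorts like these are what tell you whether a stall is one struggling team, one class of service, or a gap in your interface.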

Nudges

And then you get to nudges. A lot of times, when folks do a migration, they do it very top-down: "You have six weeks to do this, and you're a terrible person, and our CTO is going to yell at you if you don't finish on time." It turns out that works a lot of the time, but you can also not alienate all of your peers forever, and one way to do that is the idea of nudges. How do you give people enough information that they'll be motivated to do what you want them to do, without telling them they absolutely have to? For example, a different way we use nudges at Stripe is for cost accounting around AWS bills. We say, "Hey, you're spending a bunch of money. It's cool, but you're the biggest spender, and your peer team spent half as much." It's just a little context, and all of a sudden people realize, "Maybe I'm a little bit mis-calibrated." You didn't tell them they were doing anything wrong. You're not gatekeeping. You're just giving them a little bit of information.

Migrations are the same. "Hey, you're the last one. It's cool, we get it, you're doing some important stuff. But it's pretty late; everyone else finished 16 weeks ago." Just a little nudge, a little information. And the most important thing here: when teams aren't doing a migration with you, most of the time it's not that they hate you. It's that their leadership has prioritized something else. By giving them these nudges, with rich context on how it's going more broadly, they can take that to their leadership, talk about it, and get it reprioritized.
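Given per-team completion ratios (however you compute them), drafting a context-rich nudge can be mechanical too. A hypothetical sketch, with the message wording purely illustrative:

```python
def nudge_messages(completion_by_team, threshold=1.0):
    """Draft a gentle, context-rich note for each team that isn't done."""
    finished = sum(1 for r in completion_by_team.values() if r >= threshold)
    return {
        team: (
            f"Hey {team}, you're at {ratio:.0%} on the migration. "
            f"No pressure, but {finished} other teams have already finished. "
            "Happy to help if your leads want to reprioritize."
        )
        for team, ratio in completion_by_team.items()
        if ratio < threshold
    }
```

Note what the message does and doesn't say: it gives comparative context the team can take to their leadership, without telling anyone they did something wrong.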

Finish the Migration Yourself

The last step of a migration is finishing it yourself. Oftentimes, people don't want to do this. They'll say, "Ah, it's not my code. The teams need to take responsibility for the migration and finish their own work." But, you know, just jump in and do it yourself. The best example of this that I've seen in my entire career was a vendor migration we did at Stripe. We changed observability vendors, so we had to do a full rip-and-replace of all of our metrics and all of our dashboards.

We also had an expiring vendor contract, and it was going to be super expensive to renew it and run both vendors concurrently. Also, we would have looked a little bit clownish if we couldn't have gotten it done, and rule one is: try not to look clownish. Maybe rule four, I don't know. What we did is we got most of the teams to move themselves, but when we got up to the deadline, the observability team jumped in and just did all the remaining migrations themselves. That turned out to be a pretty great experience for us. We saw a bunch of common mistakes in the dashboards. We saw a lot of teams that didn't know about new features that were available, and we got them upgraded and adopting more of the platform. But we also got done, and that's the most important thing. We got to 100%.

Celebrate When It’s Over

The last piece of finishing is celebrating when it's over. There are two different types of celebrations for a migration. The first is when you start, and that one is for your users, because you're saying, "Hey, we're actually going to do it." You're trying to convince them to get on board, and that you're not going to bungle this one like the last one. So that's for them, but it's not for you. You only get to celebrate when you finish, and this is so important; it's a cultural touchstone that you just have to set as a company.

If you don't, people will start migrations and get all this credit, "Ah, I did this huge upgrade," but then they switch teams, and all of a sudden they're on another team starting a new migration. You create this perverse incentive where your best engineers are spending all of their time generating technical debt at scale. You have to set the cultural expectation, or you'll just end up in a huge hole. So many companies don't get this quite right, but it's incredibly valuable if you do.

So we're done. What was the point? Technical debt is the most important constraint on your velocity. Migrations are the only way to manage technical debt at scale. And the only solution, the easiest solution, the obvious solution: treat every single migration like a product. Thank you.

Recorded at:

Dec 12, 2018
