
Strangler Things: How to De-risk Legacy Code Migrations



Shawna Martell discusses a case study in which they disentangled systems with no customer impact and zero downtime, how they prioritize feature migration, tooling, and backwards compatibility.


Shawna Martell is a Senior Staff Engineer at Carta, Inc. Her previous experience includes Director of Software Engineering for Yahoo's Big Data Platform, and she was one of the original engineers on Wolfram|Alpha. She holds an MS in Computer Science from Syracuse University and an MBA from the University of Illinois.

About the conference

InfoQ Dev Summit Boston software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.


Martell: Most of us deal with some amount of legacy code, because we work at a company that's existed for more than a week. If by chance you don't, the cool thing is the code we're writing today, that's the legacy code of tomorrow. Don't worry, if you don't quite yet, you will. It's really easy to hate on legacy code, because you stumble through some old piece of something and you can't believe it. It uses some framework that came over on the Mayflower and you're like, what were these people thinking? The other interesting thing about that code is that it's the code that got us and our companies where we are today. Oftentimes, it just works, and we haven't touched it, because we haven't needed to. That's not all bad. Inevitably, though, we find ourselves at this place where the code that got us where we are can't get us where we're going. Then comes everybody's favorite task: we get to update some piece of legacy code that's running in production that our customers expect to work. Who here has pushed a change that caused an outage? It's incredibly stressful when you take down something that people are using. That's the risk I want to talk about. Because when we, as engineers, are working on something and have this intrinsic fear, maybe it's just me, of taking down production, that doesn't help us do our best work. When we can mitigate that risk, and we understand the tools at our disposal to help us in the inevitable chance that we break it, we're going to do better work.

I've seen a fair bit of legacy code in my career. Before I was a senior staff engineer at Carta, I was at Yahoo. Also, I was at Verizon. Think about Verizon: that is a company that has functionally existed since the invention of the telephone. They had a fair bit of legacy code. They've been around for a while. Carta is not nearly as old as Verizon. We're only a little bit more than a dozen years old, but we have our own share of legacy code too. When it comes time to do these upgrades, we have a few different choices. We can rewrite the code in place. We can do the one that gives me the most anxiety, which is stand up the new thing, and then one morning, switch over all the traffic and pray. Or we can gradually replace that code over time using something like the Strangler Fig pattern. What I want to walk you through is a real-life example of a legacy code update that we did at Carta using the Strangler Fig pattern. I want to take you through that with this lens of risk mitigation. If you're familiar with this pattern, you're like, isn't Strangler Fig from the Late Jurassic? This is not exactly the bleeding edge of technology. You're not wrong. Martin Fowler's blog about this came out in 2004. If you do remember 2004, maybe you remember that's the year that Facebook came out as "the Facebook". Lord of the Rings: The Return of the King won Best Picture. We got the Nintendo DS and The Incredibles in the United States. This wasn't exactly last week. This pattern actually gets more useful with time, because we keep writing code, and that means we keep making legacy code. Tools in our toolbox that help us mitigate risk when we have to do these inevitable upgrades, those are really important.

The Strangler Fig Pattern (High-Level Principles)

Often, you hear Strangler Fig in relationship to decomposition. I'm going to talk about decomposition. That's not actually the most interesting part of the Strangler Fig pattern. It's good for that. No question. The thing I love about the Strangler Fig pattern is that it gives our engineers a safety net. When we have that feeling of safety, we write better code and we serve our customers better. The Strangler Fig pattern is named after a real plant called the Strangler Fig. It's this viny thing. There's a tree underneath there. What happens with these plants is they start at the tops of the trees and they slowly grow down into the soil. Eventually, they actually kill the underlying tree. They're a parasite. The Strangler Fig now replaces the tree that used to be there. The Strangler Fig pattern is trying to do basically exactly the same thing. We want to replace our legacy system by incrementally increasing the functionality of our new system until one day, our old system doesn't exist anymore, and it's been entirely replaced by the new one. Let me walk you through the really high-level principles that come along with this pattern. Then we really are going to look at a real use case. When you're using this pattern, the first thing you typically do is write what the pattern calls a facade, I'll often call it a proxy. It's basically a piece of software that lets you decide, for this given piece of traffic, should it go to the legacy system or the new system? When you first stand up your facade, it's going to be super boring, because your new system doesn't do anything. All of your traffic is going to go to your legacy system. The next thing you're going to do is tease out these individual modules that you can independently migrate in your legacy system. This part, in my opinion, typically is much more of an art than a science. You have to figure out what are the individual pieces that I can reliably move independently without destroying the world. 
Then you start moving them.

Module 1 and module 2 in this hash mark situation over here, they still exist in the legacy system. We're not actually touching that implementation at all. Now our new system also supports the behavior of module 1 and module 2. Our facade, when it gets a piece of traffic, has been informed: is this traffic using module 1? Cool, send that to the new system. Is it using module n? No, send that to the legacy system, because the new system doesn't know how to handle that yet. Eventually, after you move these things one at a time, you don't have anything in your legacy system anymore. You can decommission it, delete your facade, and probably have cake. You've done the full legacy migration to your new system. The facade is the thing about the Strangler Fig pattern that I am absolutely obsessed with, because that is the thing that gives us this safety net. When we inevitably introduce some bug into our new system, and we will, we don't have to fix it immediately in order to mitigate it for our customers. We can switch them back to the legacy system, because we haven't touched that. When you can do that, and you know exactly what you need to do when you break it, that gives you this incredible freedom to actually innovate in your new system.
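The routing decision the facade makes can be sketched in a few lines. This is a hypothetical illustration, not code from the talk; the module names and handler functions are invented for the example:

```python
# Hypothetical sketch of a Strangler Fig facade. The two "systems" are
# stand-in functions; in reality they would be network calls.

def new_system_handle(module, payload):
    # Stand-in for the new system's implementation of a migrated module.
    return f"new:{module}"

def legacy_system_handle(module, payload):
    # Stand-in for the untouched legacy implementation.
    return f"legacy:{module}"

# Modules the new system supports so far; everything else stays legacy.
MIGRATED_MODULES = {"module_1", "module_2"}

def facade(module, payload):
    """Route a request to whichever system can handle this module."""
    if module in MIGRATED_MODULES:
        return new_system_handle(module, payload)
    return legacy_system_handle(module, payload)
```

As modules migrate, the only change is adding entries to `MIGRATED_MODULES`; when the set covers everything, the legacy branch is dead code and can be deleted along with the facade.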

Case Study - HR Integrations

It's fun to talk about things that don't exist. I want to walk you through a real time that we actually changed the underlying implementation of an active running system at Carta. I'm going to talk about it in relationship to HR systems. You're like, why in the world does Carta give a hoot about HR systems? I'm going to explain it very briefly. If you're not familiar with Carta, one of our business lines is that we manage cap tables for private companies. When you're a private company, and you hire new people, or you terminate people, that impacts your cap table. You could come into Carta and manually manage that through a stakeholder management module, but you've already informed your HR system of those changes. We integrate with your HR system to make that stakeholder management more automated than it would be otherwise. We've done this for a really long time. It was running in our monolith, and it was running pretty well for many years. Like I said at the top, sometimes the code that got you where you are can't get you where you're going. In mid-2021, we found that was exactly where we were: that HR integration support couldn't get us into this new business line that we wanted to support around a compensation product. A compensation product cares about who your employees are. Unfortunately, the existing implementation couldn't be extended to this new business line. I'm going to explain this more later. This code had been around, untouched, long enough that we were generally lacking expertise in how it worked anymore. The team of us that was responsible for fixing it had literally never worked in that code base at all.

Let me show you how it started. We're going to see how it works. This is the stakeholder management module I was talking about. When you look at this picture, it's not wildly unreasonable. It seems kind of ok. This integration management UI I point out only because it exists, not because it's very interesting in this use case. We want to introduce this new compensation product. That seems fine, except for the part where that product was only going to work for some of our HR providers. This is a real thing that happened. You ended up in a conversation with a customer who was interested in this compensation product we were thinking about rolling out, and we had to say, yes, we support your HR provider, just not like that. When you say that out loud, it sounds bananas. That's because it was. We had to do something about this. I have this nice box around the HR integrations, like it was some unified module that worked together. It had grown organically over many years. Really, it was more like this. I show three; there's actually way more than this. Don't get hung up on these names, because I didn't bother to go look and see what was actually in the monolith at the time. I was just like, those are definitely HR provider names that are on our website. We have these independent things that are consuming HR data in such a way that we can support stakeholder management, but this guy is out to lunch except on alternating Tuesdays. What are we going to do? We weren't going to touch that, because it was way too scary. We had a very small amount of confidence that we could really rewrite this in place, not destroy the whole world, and have it do the right thing.

I'm not going to walk you through all of this. It's different. You can tell that it's different. There's a message bus in the middle. The two big differences were that we wanted to move to something event driven, hence the message bus. We wanted to move out of the monolith because we wanted to be able to move more quickly. This was a very exciting idea, but it's pretty far away from where we started. The other thing that was really important, if we were going to do this decomposition, we had to make sure that we fixed that consistent data contract problem, because we did not just want to recreate our existing issue in a new system. You don't make anything better when you do that. If this is where you want to go, and you saw where we started, what did we do first? I'm going to talk this whole time about how we were really focused on risk mitigation, because we were really focused on risk mitigation. The first thing we did was we stood up this new service that was connected to literally nothing and had no functionality at all. We did that because you can't break anything in a system that nobody knows exists. We had not done a lot of work in deploying new services in our infrastructure, so we decided we were going to do that first and make sure that we knew what we were doing. The one piece we did put in there was a layer of provider abstraction, because that was going to be the thing that ensured we maintained a consistent data contract. This alone does literally nothing interesting. I said we use Strangler Fig. I also said the first thing you do is your facade, so I only lied to you a little bit, because the second thing we did was our facade. We didn't build something new. It's very common that you do have to introduce a new layer, when you need this facade. We decided to repurpose our existing APIs and make them the facade or proxy. When you think about traffic routing and stuff, it sounds complicated when you say I need this thing that routes the traffic.

I want to show you how this actually played out in our code in practice. These are the only two slides that I use ChatGPT for in this talk, but it did a relatively good job. I asked it for a plausible function because I wasn't going to actually put our source code up here, because it's actually too complicated. It wasn't that interesting. Let's pretend that this is a pre-facade implementation of our code that fetches a customer's HR integrations. Not only do we need to consume your data, but you have to be able to manage like, today, I talked to Rippling, but I just moved to Workday or something. Let's pretend this is the code that tells us what your existing HR integrations are. It's pretty simple. Get the customer by the ID that came in. Do a little bit of error handling. Then call some function that knows how to package this data, and return it back on the response. Now we want this layer to be our facade. How much does it have to change? Not very much at all, is how much it has to change. What do we do here? The beginning part outside of the green box is literally exactly the same. Please believe me, that's true. In the green box, the first thing we did is add a check into the database for this new flag that we created that's like, is this customer configured to use the new service? If it is, call some new function that knows how to interface with the new service, package that data, and return it. Are you not configured to use the new service? Great, just call the function you called all the time. The totality of our facade changes were like three lines of code. This is a tiny bit of oversimplification, but I think in reality, it might have been like six lines of code. It was not complicated. It's actually better if it's not complicated.
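Since the slides themselves aren't reproduced here, a plausible reconstruction of that before/after change might look like the following. The customer table, the flag name, and the fetch functions are all assumptions for illustration, not Carta's source:

```python
# Sketch of the facade change described in the talk: the function body is
# unchanged except for a small per-customer check (hypothetical names).

CUSTOMERS = {
    42: {"name": "Acme", "use_new_hr_service": True},
    43: {"name": "Initech", "use_new_hr_service": False},
}

def fetch_from_new_service(customer):
    # Stand-in for the call that interfaces with the new HR service.
    return {"source": "new", "customer": customer["name"]}

def fetch_from_legacy(customer):
    # Stand-in for the function the API always called before.
    return {"source": "legacy", "customer": customer["name"]}

def get_hr_integrations(customer_id):
    # Unchanged pre-facade part: look up the customer, handle errors.
    customer = CUSTOMERS.get(customer_id)
    if customer is None:
        raise LookupError(f"unknown customer {customer_id}")
    # The ~three new lines: a per-customer flag picks the backing system.
    if customer["use_new_hr_service"]:
        return fetch_from_new_service(customer)
    return fetch_from_legacy(customer)
```

The point of the sketch is how little the API layer had to change to become the facade: one flag check and one extra branch.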

Now we have a facade that can direct traffic to something that doesn't do anything and literally nobody is configured to use. Now we need to find those independent modules. Sometimes finding these independent modules is really hard. The nice thing for us was that it was really easy. Those HR providers, in the implementation that existed, were basically already independent modules, so we just picked one. Because we were so focused on risk mitigation, picking your first thing to move doesn't necessarily have a single right answer. We chose the provider that had the fewest customers, because the way to upset the fewest people is to start with the thing that has the smallest footprint. We didn't actually change, again, anything about the legacy implementation of this provider; we just introduced the functionality in our new service. Now, once this was done, we could have a customer running on either the new service or the old service using this first provider that we implemented. We did a bunch of testing in our non-prod environment. Then I was like, let's turn it on. Let's see what happens. I'm going to talk about tooling to make this easier. We had none of that to start; we just wanted to fail really fast. We sent some cookies to our DBAs and said, would you please run these queries in production? They were like, are you sure? We're like, yes, we think so. I waited for something to explode when we moved our first customer. No one was as surprised as I was that it just worked. We actually never had to move that initial customer back to the legacy system. Did we pick a customer that had the fewest employees? We sure did. The least interesting data? Absolutely, yes. It made the team feel really great that we got one customer moved. Yes, we only had a zillion to go, but one was infinitely more than anyone had ever accomplished ever.


You're probably like, you said that risk mitigation thing, and then you were also running manual database queries in production, because those don't sound the same. That is correct. Manual database queries in production are, I think the technical term is, fraught with peril. So we built ourselves some tooling. In my experience, this kind of customer migration tooling can be a little bit controversial when you're talking to folks about needing to do these kinds of migrations. Because they're like, isn't this just throwaway work? Once all of these customers are migrated, you're just going to delete this tool. They're not wrong. But when you are doing this type of gradual migration from some old service to a new one, you need some tooling to manage how your facade routes data. It doesn't have to be a nice UI, but it's pretty great if it is, because you can just look at the tool and understand exactly what's going on. You can also use feature flags or system settings; there's no one right answer to how to build that sort of tooling. It basically has to exist. The other reason I want to call out this tooling in particular is that the granularity of your facade routing is a really important part of building this safety net and understanding how you're going to mitigate risk. We talked about, at the beginning of this whole migration, maybe just moving all of the customers on a single provider at the same time. We'll have the provider stood up, and we'll move everybody. Then I played this out in my head, where 90% of them break, and I crash Slack because of all of the notifications that some things are broken, and I can't sort out the errors from the successes. It devolved in my brain very quickly.

I'm an eternal pessimist, but I was probably not entirely wrong. We instead decided to move individual customers. That let us move somebody and wait and see if anything blew up before we had to move somebody else.
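That per-customer approach can be sketched as a tiny routing table. This is an illustrative sketch, not Carta's actual tooling; the property it's meant to show is that, because the legacy system is never touched, both "migrate" and "roll back" are just flag flips on the facade's routing state:

```python
# Hypothetical per-customer migration tooling for the facade's routing.

routing_flags = {}  # customer_id -> True means the new service serves them

def migrate_customer(customer_id):
    """Move one customer onto the new service."""
    routing_flags[customer_id] = True

def rollback_customer(customer_id):
    # Rollback is cheap precisely because legacy still works unchanged.
    routing_flags[customer_id] = False

def uses_new_service(customer_id):
    # Customers default to the legacy system until explicitly migrated.
    return routing_flags.get(customer_id, False)
```

Moving one customer, watching for errors, and only then moving the next is just repeated calls to `migrate_customer`, with `rollback_customer` as the one-line escape hatch.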

We also built another tool. It's one thing to migrate the customers who exist today, but this is a live, active, running system. I've got new people coming in right now hooking up their HR provider. How do I decide if they move into the new system or not? This was where we decided that we were going to make that call at the individual provider level. I wanted to call this out in particular, because when you're thinking about risk mitigation, you need to understand the risk profiles of your different use cases to make sure that you're routing that traffic correctly. With new customers, who had no expectations of what this tool was probably going to do, we were more willing to accept risk than with our customers who use these tools every day, and who, if we broke them, would probably call someone who would then come be unhappy with us. There's no one right answer to this, but it's an important consideration when you're thinking about how you're going to route your traffic in your facade. Does that mean I'm actively creating data in a thing I'm trying to delete? A little bit, yes. Almost certainly you're going to be in that situation when you're doing these sorts of incremental migrations. This was another thing that was very controversial. In engineering, we were like, we have the tooling, so eventually we will be able to move everybody. We're ok with this. We're ok with some amount of customers functionally creating their own tech debt. Our product partners were like, but this new system is faster and shinier, I want more customers on that more quickly. It was a conversation. It was a discussion of tradeoffs. We did not feel in engineering that we could effectively move all new integrations to the new service at one time and manage the inevitable errors that would arise. Sometimes that is a tradeoff worth making. It is something that you often have to have a conversation with your other teams about.
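The two routing granularities described here, per-customer for existing integrations and per-provider for brand-new ones, might be sketched like this. All names are invented for the example; the point is that different risk profiles get different routing rules:

```python
# Hypothetical two-granularity routing: new integrations route by provider,
# existing customers only move when individually migrated.

NEW_SERVICE_PROVIDERS = {"provider_a"}  # new integrations for these go new
migrated_customers = {101}              # existing customers moved one by one

def route_integration(customer_id, provider, is_new_integration):
    """Return which system ("new" or "legacy") should serve this traffic."""
    if is_new_integration:
        # New customers have no existing expectations: accept more risk.
        return "new" if provider in NEW_SERVICE_PROVIDERS else "legacy"
    # Existing customers only move when explicitly migrated.
    return "new" if customer_id in migrated_customers else "legacy"
```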

Backwards Compatibility

Now we can move customers into the new service. We can implement new providers. We can have folks create integrations in the new service. It must just be like, turn the crank, and we migrated everyone and everyone clapped. Not exactly. This was huge progress. I've talked about like getting new integrations into the service and migrating existing customers. Did you notice I didn't talk at all about getting data out of the thing? There's also some complexity there. Backwards compatibility is really important. I need these tools to keep working. I also mentioned that we used our API layer as our facade. This red circle is intended to denote the lack of an arrow. Our stakeholder management tool didn't actually use the APIs that powered other projects, for hashtag reasons. It was making direct database calls into the monolith database, which didn't have this new data in it, because we hadn't given it any way to do so. We had to make some decisions. What were we going to do? The thing I really wanted to do, I wanted to make that arrow exist. I wanted to go into the stakeholder management tool, and encapsulate the functionality that fetched the HR data. Right now, it calls the database directly. I'm going to be able to change that once it's encapsulated to instead call the APIs. It's going to work great. Until we actually started looking at the details. Also, the folks on this team, we had literally no idea how the stakeholder management tool worked. As we dug into the details, in the interest of risk mitigation, we were like, rewriting this thing is probably a mistake. If that's not the thing we're going to do, how do we get data to this system? This arrow is glossing over probably 10,000 not super interesting details. The new service doesn't actually just require a database connection to the monolith because something about that feels wrong to me, but the data flow is effectively correct. 
We decided instead of making the stakeholder management tool use the APIs, we were going to have our new service get its data into the database where the stakeholder management tool expected it as if it had never changed, so that it didn't have to know the difference. We did a presto chango on the stakeholder management tool. We never had to change its implementation of anything. The data from our new service was available for it to read. Backwards compatibility when you're doing these sorts of changes, sometimes your facade alone is sufficient. Sometimes your facade alone is not sufficient. When it's not, sometimes you do things like these. Are they a little bit weird? I think they're a little bit weird. They solve the problem without us having to touch lots of downstream products. Sometimes a weird solution that works is actually a really great solution.
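That "presto chango" can be sketched as a small sync step: the new service translates its data into the legacy row shape and writes it where the old reader already looks. Everything here, the in-memory table stand-in, the field names, and the event shape, is invented for illustration; the talk doesn't show this code:

```python
# Illustrative backwards-compatibility write-through: the new service puts
# its data in the place (and shape) the legacy stakeholder management tool
# already reads, so that tool never has to change.

legacy_hr_rows = []  # stands in for the monolith table the old tool reads

def legacy_row_from_event(event):
    # Translate the new service's event shape into the legacy row shape.
    return {"employee_id": event["id"], "status": event["employment_status"]}

def sync_to_monolith(events):
    """Write new-service data where the legacy reader expects it."""
    for event in events:
        legacy_hr_rows.append(legacy_row_from_event(event))
```

The design choice being illustrated: rather than rewrite a downstream reader nobody understood, keep its data contract frozen and make the new writer conform to it.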

Multi-Year Migrations

Now we must be done. We can get data in. We can get data out, kind of, actually. There's one other piece of this that I want to touch on. I said we started this in 2021. I'm going to tell you the truth. We're almost done, but we're not completely done yet. These things can take years. When you're doing a migration over many years, usually someplace in the middle, somebody comes along with some requirement that you're like, I wasn't ready for that right now. That actually happened to us, right at the very beginning. You can see, this is back to the picture where there's only that provider abstraction in the HR service. At the time, we had at least managed to get the new service stood up, but it didn't do anything yet. Our partnerships team came to us and they're like, "We're having this conversation with this HR provider. It's going really well. As part of this relationship, we want to build a direct API integration to their stuff. Can it be done yesterday?" Given our lack of time machine, that was not a timeline that we felt we could make, but the business need was, as soon as humanly possible. We had to decide what we were going to do. Were we going to have this be the very first thing we ever built in this new service that had done literally nothing ever in the history of the world, when there was this important business need that was being handed down to us? Or were we going to go touch the legacy code that nobody in the whole world knew how it worked anymore, when we had this pressing business need that was being brought down to us? There were tradeoffs to both. In the end, we ended up implementing the new provider in the monolith. I built a not small amount of tech debt, but it was the best we could do at the time. I don't know anything about partnerships or lawyers, but it's my understanding that there's lots of paperwork, and really important people have to make decisions, and in the end, the deadline slipped.
We had enough time that we actually stood up the provider in the new service too. We never turned it on in the monolith ever. It sat there. I think we used it in sandbox once to convince ourselves that it worked. It never actually got turned on.

Why am I even telling you this story? Other than to admit that sometimes I write tech debt, that's a fantastic question. When you're doing these multi-year migrations, and I did a very scientific study where I asked a handful of people I went to college with, and by that I mean like three, if they had used a Strangler Fig and how long it took? Everyone was like, years. I'm like, at least it's not just me. When that happens, you're probably going to get these unexpected requirements. Then you're going to have a decision to make, are you going to create some tech debt, or are you going to try to stand it up in this new system? Where the only correct answer is almost certainly going to be, it depends. Sometimes you're going to create some tech debt. That's going to be ok. Sometimes you're going to create some tech debt, that in the end, you actually never turn on for, again, hashtag reasons. Those things are all fine, and they happen. As long as we're making really conscious decisions when we're going to embrace tech debt, I think it's super reasonable.

Eventually, we did get to the point where it was just, run the assembly line, move the provider, and do some testing. We did, in fact, have thousands of customers. We eventually augmented our staff tooling so that we could do bulk migrations as our confidence in our new system increased. We must be done? It probably looks like this now? Not exactly. It looks like this, approximately, now. We don't have this message bus, total events thing completely figured out yet. We still have some APIs that power our compensation product. We've made huge progress. Standing up new providers in the new service now is days' worth of work, as opposed to question-mark worth of work in our previous implementation. We've implemented a bunch of new HR providers over the last couple years. I know for a fact we've been able to do it much more quickly in our new service than we would have been able to in our old one. It's not exactly like the picture. Sometimes, though, in the immortal words of Mick Jagger, you don't get what you want, you get what you need. This isn't the events-driven thing that I had dreamed of in its totality, but it is a lot better than what we used to have. We solved our customer confusion problem. It's no longer like, depending on the phase of the moon and which provider you have, what parts of your tools we support. It just works. We do still have literally one integration running in the monolith. Actually, the team has been talking about trying to get it done. Even though we aren't 100% done, or we might be, I think that this entire process has been a really huge success. Because we did all of this work, our existing customers that we had to migrate never experienced any interruptions to their workflows. They were able to use our product the whole time. They actually never had to know that we changed the entire underpinning of everything that they relied on, like a giant magic trick. It was transparent to them.
I think that that is victory when we can upgrade our customers and they don't have to know.

Lessons Learned

What did we learn? What went well? I want to go back to this derisking part, because it's the part that I think is the most interesting. You could imagine maybe that the thing we did instead was stand up the new system, implement literally all of the providers, and one morning move all of the traffic and hope for the best. I just can't imagine a universe where that actually worked. Something definitely would have broken. My guess is so many things would have broken, it would have been almost unmanageable to try to deal with the fallout. This is why we wanted Strangler Fig. We wanted Strangler Fig because it allowed us this incremental change. It was easy to reverse. Nobody wants to push code that's obviously broken. We all break things. That's just part of doing our jobs. We wanted our engineers, and by our engineers I mean I, as part of this team, also wanted to know that when I inevitably broke something for some customer in the new system, I could put them back on the old system and give myself the breathing room to fix it. That also made me feel like I could be more experimental, and maybe take different types of risks in our new service than I would have been able to if I didn't have the safety net. The last thing we did was we instrumented this thing to the hilt, because if it was going to break, and when it was going to break, we wanted to know it, and we wanted to know it really fast. We had Slack notifications and Sentry errors so that we could respond quickly when there was an issue. We actually ended up expanding that migration tooling to also give us visibility into the health of all of our customers' integrations, which was something our old tools never provided. In the end, that throwaway work is actually going to live on, because we've expanded its utility beyond just that migration work.
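A minimal sketch of that instrumentation idea, with a plain list standing in for the Slack and Sentry integrations mentioned in the talk, and all names invented for illustration:

```python
# Hypothetical instrumentation wrapper: every migrated-path call reports
# failures immediately so a broken customer is noticed fast.

alerts = []

def notify(message):
    # Stand-in for a Slack notification or Sentry event.
    alerts.append(message)

def instrumented(fn, customer_id, *args):
    """Run a migrated-path call, reporting any failure before re-raising."""
    try:
        return fn(*args)
    except Exception as exc:
        notify(f"migration issue for customer {customer_id}: {exc!r}")
        raise
```

Paired with the per-customer rollback described earlier, an alert like this is what turns "we broke it" into "switch them back and fix it calmly."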

What Went Well?

I really love that we moved thousands of customers, and none of them had to know the difference, and their stuff just worked. Even though it took a while, I think that this is a huge win, and a really important reason to consider Strangler Fig, because you have this optionality to keep functionality working for your customers. Though there is a tiny bit of legacy stuff still in the monolith, we've been able to delete the vast majority of it. The other part about this implementation was, when we got started, we actually had a conversation, the team had a conversation at the beginning of, some day, somebody, and it could be one of us or somebody else, is going to be upgrading the code that we're getting ready to write right now. How do we keep those future people in mind? Because the life you saved may be your own. How do you make sure that this thing has the sorts of modularity you need in order to upgrade it in the future? The last piece of this that I think is really interesting is, if you follow our CEO on LinkedIn or anything, you know that Carta loves a reorg. He's super into reorgs. The manager on this team has changed, I don't even know how many times, but we have continued to make progress. The work has actually never been abandoned. I think that's because you can do this migration incrementally. The team doesn't have to have the full totality of the history of the universe in order to understand how to continue this migration. That means that we're able to make forward progress, even when the team changes.

What Could Have Gone Better?

It wasn't perfect. The long tail has been very long, like years long. When we put this together, I was like, we'll be done in 18 months. She was a person that said things out loud. That was kind of a lousy estimate. We're getting there. I think that we're pretty close to some definition of complete. This last thing is purely a Shawna thing. I wish we were more event-driven. What's more event-driven than, we hired a new person, or they got a raise, or whatever else you get through an HR provider? I wish we were more event-driven with this than we are right now. We're actually making progress. I think that we'll be much closer by the end of this year.

When Not to Use Strangler Fig

Maybe you're wondering, this thing that I'm considering doing, should I use Strangler Fig? That actually makes me think of a different question that a friend of mine asked me when I told him I was going to do this talk. He's like, when wouldn't you use Strangler Fig? I thought about it. I was like, I probably wouldn't use it if I were making a sandwich. It's probably not exactly the right tool for that job. I admit to you, I live in a world where Strangler Fig is my hammer and the whole world looks like a nail. This is the thing that I go to more than just about any other pattern. There are legitimate times not to leverage a Strangler Fig. If you have a super simple implementation, just don't bother. Just rewrite it in place, or do a one-time switchover. It's not worth the additional overhead for a sufficiently simple system. You might also be in a situation where your current implementation is so tightly coupled that when someone says, identify independent modules, you're like, I don't even know where to begin. Those are harder. The correct answer in those cases is also, it depends. Sometimes the right thing to do is to go in and actually introduce the seams in that legacy code, so that you can then take it apart and leverage Strangler Fig. Sometimes that's going to be the right thing to do. Sometimes the right thing to do will be to rewrite as much of it as you need in place and pray for the next person that comes along to the code. It just depends on what your individual situation is. I keep coming back to this pattern. I use it all the time, not just for decomposition. We're actually working on a project right now where we're recomposing some stuff into the monolith, which is a different fascinating story. We're leveraging Strangler Fig for that work as well, as we move things not out into a new system, but back into our monolith.

Strangler Fig Pushback

If you want to use this pattern, you may come up against a few pieces of pushback. Some of those came up in our conversation. This migration tool, it's just throwaway work. It can be hard to convince leadership that that's something they should invest in. When I am forced to convince leadership of something, I usually try to take a customer-first lens, because they often are super motivated by the idea of having lots of happy customers and very few unhappy customers. The incremental migration gives you a way to ensure a working workflow for your existing customers, even in the midst of a migration. This can be a good way to convince leadership to invest in this "throwaway work." Isn't it hard to reason about a system where part of it is in one place, and part of it is in another? Yes, it is hard to reason about. This is where observability tooling is so important. Our staff tools did a lot of this for us. It doesn't have to be a tool that you build from the ground up, though. Sometimes, depending on your system, your existing observability tools will give you sufficient visibility into what's happening here. As long as you have some way to know what's going on, you can overcome this challenge. It's going to take too long. They're not wrong there. Though, I think, at least it's getting done. I honestly think if we had tried to do this as one big bang rewrite, we would have given up at some point because the number of failures would have just become so overwhelming that the work would have been abandoned. The only thing longer than "I said 18 months and now it's been almost 3 years" is if it got done literally never. I have a good friend of mine, who I'm sure stole this from somewhere else, who says, slow is smooth and smooth is fast. This is how it plays out in my experience: yes, it's slower, but in the end, you actually get the work done.
The last thing that I've sometimes had pushback about is like, introducing this facade, isn't that going to make my stuff slower? Now I've got this extra layer. It's true. It's hard to add some additional logic to some existing functionality, like literally for free. That's not typically how computers work. You can make your facade sufficiently simple that it is overhead that you can absorb. This is where you really want to make sure you have some amount of observability. You want to understand, how much am I impacting my workflow with this introduction of the facade?
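One way to keep the facade's overhead both small and visible is to make it a thin dispatcher and time every call it forwards. This is a hypothetical sketch, not anything from the talk; the names (`timed`, `Facade`) and the `emit` callback standing in for a metrics client are assumptions.

```python
import time


def timed(metric_name, emit):
    """Wrap a function so every call reports its latency via emit().

    emit is a stand-in for whatever metrics client you already have
    (StatsD, Prometheus, Datadog, ...).
    """
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                emit(metric_name, time.perf_counter() - start)
        return inner
    return wrap


class Facade:
    """A facade that only dispatches: the per-call cost is one dict
    lookup plus the timing wrapper, so the overhead stays absorbable
    and, crucially, measurable."""

    def __init__(self, emit):
        self._routes = {}
        self._emit = emit

    def register(self, name, handler):
        self._routes[name] = timed(f"facade.{name}", self._emit)(handler)

    def call(self, name, *args, **kwargs):
        return self._routes[name](*args, **kwargs)
```

Because every forwarded call emits a latency metric, you can answer the pushback directly: you know exactly how much the facade is costing your workflow rather than guessing.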


I want to close with reiterating one more time how important I think Strangler Fig is in this idea of security for engineers as they're doing their work. An engineer who feels that psychological safety of, if I break this, I know immediately how to remediate so that then I can go debug it. They will write better code. They will move faster. They will make more innovations. I feel pretty strongly that, in the end, you will get a better product than you would have otherwise. That is probably my favorite reason for using this pattern.

Questions and Answers

Participant 1: Can you talk a little bit more about the simplified arrow you had between your new module and the monolith database, and the backwards compatibility piece too?

Martell: We had some choices about how we were going to ensure that we did not have to fundamentally rewrite the tools that needed this data, because that was off the table. The high-level idea is that we introduced a new endpoint into the monolith that was able to accept this data from the HR service and put it into the database exactly the way the stakeholder was expecting it. From this module's perspective, everything was the same. The way we made that happen was by giving ourselves a path to get the data into the monolith database via the existing monolith, but not existing monolith code. We did have to write some new code in the monolith. It was fairly lightweight.
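The backwards-compatibility path described in that answer could look something like this. This is a hypothetical sketch, not Carta's code; the record shape, field names, and the `write_row` callback standing in for the monolith's persistence layer are all assumptions.

```python
# Hypothetical sketch of the backwards-compatibility endpoint: the new
# HR service sends normalized records to the monolith, and a small
# translation layer writes rows in the exact shape the legacy tables
# (and every tool reading them) already expect.


def to_legacy_row(record):
    """Map the new service's schema onto the legacy table's columns."""
    return {
        "emp_name": record["employee"]["name"],
        "emp_salary_cents": int(record["compensation"]["annual"] * 100),
        "provider": record["source"],
    }


def ingest_from_hr_service(records, write_row):
    """Entry point for the new monolith endpoint.

    write_row stands in for the monolith's existing persistence layer;
    downstream readers see rows identical to what the old sync wrote,
    so nothing on their side has to change.
    """
    for record in records:
        write_row(to_legacy_row(record))
```

Keeping the translation at the write path is what lets the legacy module stay untouched: from its perspective the data simply appears in the database the way it always has.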

Participant 2: I'm curious how you were able to convince leadership, given that this approach is generally a long-term commitment to doing the migration, rather than just working on the legacy system, including the fact that you started with that new provider. What was the key point that let you use this?

Martell: We had a few different angles that we took, but all of them were customer focused. We had a high degree of confidence that we were going to be able to design a system that in the future allowed us to more quickly integrate new providers, which had recently become a thing that our customers were asking for more. That helped a lot. Because the existing code had just grown so organically, we did not have a nice way to just plug in a new thing. We also were quite confident we could make our data syncs faster. That was another common customer complaint: why is my data old? It was because we didn't really know how that code worked. We all have this problem. It gets slower over time. Nothing gets accidentally faster over time. The customer-first part was what we really leaned into. The other bonus piece of this was, at the time, we had a different CTO who was like, if you can build outside the monolith, build outside the monolith. When I showed him a picture with the decomposed service, he was like, "Yes, go build that," which I also leaned into. That would probably not be what would happen at Carta now, but it was what happened at Carta then.




Recorded at:

Jul 04, 2024