Moving Mountains: Migrating Legacy Code in Weeks Instead of Years

Summary

David Stein shares how to rethink large-scale architectural migrations using AI. He discusses ServiceTitan's "assembly line" pattern, explaining how decomposing legacy codebase refactoring into standardized tasks can achieve massive parallelization. He highlights the critical role of programmatically rigid validation loops to eliminate LLM hallucinations and accelerate engineering agility.

Bio

David Stein is a principal AI engineer at ServiceTitan, leading work on agent evaluations — predicting performance before release and monitoring success in production to bring engineering rigor to AI deployments. He also works on upgrading ServiceTitan's data platform for use by both humans and AI agents.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale these workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Transcript

David Stein: I'm David Stein. I am a Principal AI Engineer at ServiceTitan. I'm going to talk to you today about everyone's favorite topic in software engineering, which is migrations. Migrations, I'm assuming most of you are engineers, you know what I mean by this. These are things that everyone knows that they need to happen to move a bunch of legacy code from one old system onto some new system that's going to be way better. They usually plan out to take months, quarters, years. We all have something like this, yes, migrations. I'm going to talk a bit about how at ServiceTitan we're using AI to be able to do these things in a totally different way and about what that means for us to be able to do migrations rapidly like we never could before.

ServiceTitan - The OS for Trades

I'll give you some context about what ServiceTitan is since some of you may not know already. ServiceTitan is the technology platform for contractors and the trades industry. This is residential, commercial, building services, construction, electrician, plumbing, HVAC, roofing, garage door, this large set of critically important functions that everybody depends on. ServiceTitan has this end-to-end technology platform to support this industry. There are a lot of really interesting AI use cases in these industries, which some of my friends who I worked with before don't necessarily know about at first blush. There are some really compelling things that we build, just to give you some background on what we do before I get into the details. The trades industry is really ripe for optimization and automation. We call it a historically underserved industry. There's a lot of pencil and paper and Excel spreadsheet type work that needs to be migrated into a much more streamlined, digitized way of doing things.

ServiceTitan is leading the way in that space. Everybody has some AI use cases right now. In the industry broadly, we see a lot of use cases that are borderline gimmicky in terms of what various people can do with AI. It's really great to work on some of these really down to earth, really essential services level use cases for AI. This includes things like job value prediction, to be able to schedule contractors and knowing which people to put into which trucks, send them to which homes and which buildings on which days, in order to be able to optimize for efficiency and revenue. This is a really important thing that our customers really love about our product. There are other things, like being able to listen to the audio of a call or being able to listen to a voicemail. If you run a trades business and you miss an important call that could be an important lead that could bring in a lot of revenue, you want the AI to be pinging your phone saying, call that person back, call that person back.

It's a bit of an oversimplification, but that's effectively what second chance leads is. Then there are other products that we build and sell, such as the voice agent. We have an AI answering service that can help our customers book jobs for their customers on a 24/7 basis. These are just some of the use cases that we're building at ServiceTitan and that I'm working on.

Large-Scale Migrations are Mountains

What I'm going to talk about today isn't actually about any of these specific things. It's about this more general challenge of what do you do with legacy code that is on an older architecture that you know you need to do something about. You want to adopt some newer platforms, some newer technology, and you know that it's going to be really hard to make that happen. Any of you who are working at companies that have existed for more than a couple of years, you all know what legacy code is and how burdensome it can be, where the more code you have that's been in existence for more than a few years, it's almost like the slower you can go to add new things and to change it. Everyone in the industry has been involved in some kind of migration project at some point. You also know, especially if you're in leadership or if you help plan these things, these projects are risky.

If you're going to set out on an initiative to move a whole bunch of legacy code into a new platform, you don't want to get like 25% of the way there or 50% of the way there. In many of these cases, it's really essential to get all the way there. That fact creates risk. You really don't want to get partway down one of these projects and then find out basically that you are stuck. Just to give a little bit more clarity here about what kind of migrations I'm talking about here, there are some other kinds of migrations, if you use this word generally, that have been easy to automate for a really long time. This is like migrating a database from one format into another format where you can write a script to do a push button conversion or translation. That's an easy case. That's a solved problem.

It's not what I'm talking about here. I'm talking about the cases where you have hundreds of thousands of lines of legacy code, some of which was written 5 to 10 years ago by people that don't work for your company anymore, and it may or may not be very thoroughly documented, very thoroughly unit tested. These are the kinds of things many of us have seen. The Herculean task of moving large chunks of that into a new abstraction is really what I'm talking about.

Case Study: Migrating ServiceTitan's Reporting Metrics

To make this concrete, I'm going to talk about a case study at ServiceTitan. ServiceTitan, a big part of our application is in the area of reporting. We make it possible for our customers to download and view and do a bunch of sophisticated analytics with all sorts of operational, financial, and other metrics about their business. There are things about jobs, invoices, technicians, customers, and the like. The machinery to support all of that stuff has been part of the application for a long time. Some of the pieces of that technology have been plugged into production databases in ways that are different from how we would have built those things from scratch if we were going to build them today. Once again, a hugely common problem, but the issue when you have such scale in terms of the number of metrics in our case is that each one of those has a chunk of legacy code that's powering it that may have a bunch of dependencies.

There may be hidden dependencies on how schemas are structured in the DB and what the data looks like. Working through all of that stuff takes time. In the old way of doing this, you have a massive project. You have hundreds of Jira tickets, you have hundreds of engineer weeks, and that is quarters to years.

The Hazard: Climbing the False Summit

Another key idea, like the boogeyman about planning or running migrations in the old world before AI is that migrations don't always work out. I've seen in my career, especially at ServiceTitan but also before ServiceTitan, sometimes they're very successful, other times they're less successful. In this case around reporting at ServiceTitan, there had been more than one attempt to refactor some of these legacy pieces and put them into a better architecture. This case study emerged out of a situation where the team or a chunk of the team had embarked over a period of time to try to unpack some of that legacy code, move it into a data lakehouse based new setup, ran into a set of challenges doing that, timelines were a struggle, and so on. You can imagine where this story goes. Then, what always really sucks is when you get many months or even years into a project like that and realize that the way it's adding up isn't what you expected when you set out on that path.

You start to ask like, was this even the right call? That's a bad position to be in. That keeps some of us up at night, I think, when we're planning projects like this. In our case, earlier, we were looking at one of these struggling cases where one of our teams that had been working on digging through this legacy code had been struggling. We started looking at the code that they were working on using the cutting-edge coding LLMs to see if there were going to be ways that we could do this in a different way, or do it in a better way in order to get better traction in terms of putting it into a new architecture. That's some of the context behind what we're doing here. The default solution to every problem in 2025 is, can you just ask AI? You have hundreds of metrics, you want to migrate them, you just put that in Cursor.

You can't just ask Cursor or Claude Code to migrate your entire legacy codebase into some new architecture that you describe. You can ask it to do this. I have actually tried to ask it to do stuff like this, and what I have found is that it doesn't work very well. Maybe some of you have found similar things. Why doesn't this work? The reason it doesn't work is similar to why an engineer can't do something like that. Some of you are very good engineers, but you probably can't just go fix hundreds of thousands of lines of code and move it into a new state-of-the-art architecture, because that task is too big. You might make some initial progress, but then you are going to be looking up a mountain and going to be struggling to understand all of the context surrounding all of that legacy code.

You might not have the context that you need. There are issues like teammates having left, and so on, that make it really hard to do a lot of digging to make incremental progress on a project like that. Then if you are an LLM writing the code, you have hallucinations. You have the bot failing to understand the assignment, inventing new metrics instead of migrating old ones, doing five and then saying here's how I would do the rest and stopping, and then the five that it did aren't right. You see a lot of things like this if you just try to take a very naive approach, even with the best state-of-the-art models. The insight is, decompose the problem into small verifiable steps that you can actually do. That is not a new idea. That's something that we all know how to do very well. If you are a staff engineer or if you are an architect, you know how to decompose a problem into steps because we do that all the time.

Principle of Acceleration

What's different now and what we want to achieve now and what I hope I'll convince you is possible is to invert or rotate the timeline. The way that I would have had to have planned a large migration a few years ago would have been a heavy-duty investigation into some POCs. What are the available solutions? How do they map to our existing stack? Are they going to meet all of our requirements? Hope so. Make a decision, buy the software or pick an open-source solution. Get a team ready to go, and then embark on this bulk migration process that is going to take a really long time. Then full value realized being at the very end. These are just inherently very hard to schedule. What we can do now is do some extra work at the beginning, get really rigid about how you do validation, especially, so that you can know that each incremental step is doing exactly what needs to be done.

It's almost like parallelization. You can compress the toil that's required for those hundreds of steps down into a much shorter timeframe in order to get full value realized much sooner, and also to get signal around architecture agility to be able to know in a few weeks instead of in a few quarters if the new plan is not working. Breaking down what this looks like for migrating a metric, if you're an engineer on my team and you're migrating a particular metric from the legacy system into some new metric store system such as in our case dbt MetricFlow and Semantic Layer. That's your objective. There's some context that you need in order to do that task. You need to know where the legacy code is. You need to know what DB tables it needs. You need to know what the data really looks like. You need to know maybe distributional information about the data.

You need to then know what the target pattern is, the details around how the architect has decided that we're going to apply that new platform in our context. You need to then plan out which files you're going to add and change. You need to write the code. Then you need to hopefully validate that that stuff works well. You write tests. Then you release the output of your work. If you zoom out to the overall task, it doesn't matter if it's 5 steps or 500. You have this expanded scope here for an architecture migration where you have the architecture objective that fans out into these subtasks, subobjectives. It's still pretty simple. This is not new in the AI era. It used to be that you would brainstorm something like this and then you would put it into Jira. If you like Jira, or other tools, if you don't like Jira. This is the old way.

What you can do now, if you get really crisp about standardized context acquisition and standardized prompt and context for all these tasks, and really strong validation where you can really know for sure that each step is actually complete and get that with sufficient certainty, is you can assign all of these columns to bots and you can go way faster. I named this pattern the assembly line, which is not a new name, of course. I think that the key is that we've already been able to do step one and two for a long time, except we would do them a slightly different way. What we can do now that we have 2025 level coding LLMs is we can actually automate this step three. This is the key idea. You're going to build an assembly line. We're not going to be assigning each of these tasks out to engineers over a really long span of time. We're going to be automating them. Decompose, standardize, automate.
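A minimal sketch of what the "standardize" step could look like: every task is the same kind of record, rendered through the same prompt template. The field names, the example metric, the paths, and the validation command are all hypothetical, chosen only for illustration; they are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationTask:
    """One 'pebble': a single metric to move to the new platform.

    All field names are hypothetical and exist only for this sketch.
    """
    metric_name: str
    legacy_code_path: str            # where the legacy implementation lives
    source_tables: tuple[str, ...]   # DB tables the legacy code reads
    target_pattern_doc: str          # doc describing the chosen target pattern
    validate_command: str            # command that yields pass/fail for this task


PROMPT_TEMPLATE = """\
Task: migrate the metric '{metric_name}' to the new metrics platform.
Legacy code: {legacy_code_path}
Source tables: {tables}
Target pattern: see {target_pattern_doc}
When done, run: {validate_command}
Only this task is in scope. Do not start other tasks.
"""


def render_prompt(task: MigrationTask) -> str:
    """Render the same prompt shape for every task so tasks stay comparable."""
    return PROMPT_TEMPLATE.format(
        metric_name=task.metric_name,
        legacy_code_path=task.legacy_code_path,
        tables=", ".join(task.source_tables),
        target_pattern_doc=task.target_pattern_doc,
        validate_command=task.validate_command,
    )


# Hypothetical example task.
example_task = MigrationTask(
    metric_name="completed_jobs",
    legacy_code_path="legacy/reports/completed_jobs",
    source_tables=("jobs", "invoices"),
    target_pattern_doc="docs/target_pattern.md",
    validate_command="python validate.py completed_jobs",
)
example_prompt = render_prompt(example_task)
```

Because every task goes through one template, a change to the instructions is made once and applies to all remaining tasks.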

I'll talk a little bit about each step here. As I said before, the thing you don't want to do is just say, how do you move the whole mountain? Because that's too big. What you want to do is you want to know with extra detail, how can you migrate one pebble out of the mountain? You want to break the work down to a level where you can show, with some experimentation, that an agent can actually successfully complete a task. It's possible that maybe the pebble is not the right idea. You can go too small. Similarly to how if you're breaking down the project for a bunch of interns, we break it down into a bunch of individual steps, steps that can be fully completed and validated independently. Then the most important piece, step two, is to have a script or an environment or a platform or whatever you need for your particular use case, that is really good at getting a pass or fail answer about an attempt by a bot to handle one of those tasks.

You could argue, it's important to do that stuff anyway. A lot of these ideas are things that in hindsight, I'm thinking, why didn't I always think about having us do these sorts of rigid things when we were assigning these tasks out to senior software engineers, and maybe that would have been better? When Joe or Alice puts out their PR, we would have a really clear, programmatic way to know for sure that what they're doing is exactly what was expected of them. I didn't usually see that done in the past, but I'm finding that when using agents to make these massive projects happen really fast, it's really critical to do, especially when you want to be able to run huge stages of the migration in a really short period of time. You need to know that each step is working well, or you'll get off track really fast. In our example, we built a little simulator that would generate the same kind of reports that our legacy backend for the reporting would do, but it would build those reports using the new metrics platform.

It sounds maybe like it might be easy to do, but there's actually a lot of details in there. You want to make sure the formatting is right. You want to make sure the data is directly comparable. You want to make sure that it's really easy to get a pass, fail. Like, yes, it's right. The data is correct. You can scan over the whole distribution. The numbers are right. The formatting is right. Having this physics engine where you can try out what the coding agent gives you is the critical piece. Then the other piece is around basically forcing all those tasks to be similar to each other. That means you will know that there are certain pieces of information that are going to be necessary for many or most of those tasks, such as access to the real data in the staging Snowflake cluster or what have you. A lot of people talk about MCP, and actually when we did this migration, we did not use MCP at all. We were working at the level of using CLI, like looking at this from the standpoint of how an engineer gets the context that they need in order to know how to modify the code.
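The pass/fail comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the validator from the talk: it assumes both reports have already been loaded as lists of dicts, and the function names, the `tech` and `revenue` columns, and the tolerance are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ValidationResult:
    passed: bool
    mismatches: list[str]


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _index(rows, key, label, mismatches):
    """Index rows by key, recording duplicate keys as failures."""
    indexed = {}
    for row in rows:
        if row[key] in indexed:
            mismatches.append(f"{label} report: duplicate row {row[key]!r}")
        indexed[row[key]] = row
    return indexed


def compare_reports(legacy_rows, new_rows, key, rel_tol=1e-9):
    """Compare two reports row by row and return a strict pass/fail."""
    mismatches = []
    legacy = _index(legacy_rows, key, "legacy", mismatches)
    new = _index(new_rows, key, "new", mismatches)

    # Rows present on only one side are failures, not warnings.
    for k in sorted(legacy.keys() - new.keys(), key=str):
        mismatches.append(f"row {k!r}: missing from new report")
    for k in sorted(new.keys() - legacy.keys(), key=str):
        mismatches.append(f"row {k!r}: unexpected in new report")

    for k in sorted(legacy.keys() & new.keys(), key=str):
        old_row, new_row = legacy[k], new[k]
        if old_row.keys() != new_row.keys():
            mismatches.append(f"row {k!r}: columns differ")
            continue
        for col, old_val in old_row.items():
            new_val = new_row[col]
            if _is_number(old_val) and _is_number(new_val):
                same = math.isclose(old_val, new_val, rel_tol=rel_tol)
            else:
                same = old_val == new_val  # formatting must match exactly
            if not same:
                mismatches.append(
                    f"row {k!r}, column {col!r}: legacy={old_val!r} new={new_val!r}"
                )

    return ValidationResult(passed=not mismatches, mismatches=mismatches)


# Hypothetical example data.
legacy = [{"tech": "A", "revenue": 1200.0}, {"tech": "B", "revenue": 800.0}]
new = [{"tech": "A", "revenue": 1200.0}, {"tech": "B", "revenue": 850.0}]
result = compare_reports(legacy, new, key="tech")
```

Here `result.passed` is `False`, and `result.mismatches` names the exact row and column that differ, which is the kind of detail an agent needs in order to try again.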

Engineers, we use a CLI. A CLI is pretty good at showing you what your test data is like in Snowflake. We equip the agents with just those CLI tools. Maybe this is obvious, at least in our experience, but you can't just describe the formula of the metric that you want the AI to rewrite for you in the new platform. If you don't have the detailed context about what the data in the table actually looks like, it's not going to write the right piece of code, same as how it would be as a human. The other thing that we have done, which some of you may find familiar, if you use recent versions of things like Cursor and Claude Code, is that we would get the overall goal state of the project really crisply defined in a text file, to basically define what success means for an individual task, what success means for the overall migration, so that the bot has all the right context, and critically how to use the validation tools to test its own work. Empowering each of those bots on that assembly line to have what it needs to be able to make progress.

A Self-Healing Loop

You set up a self-healing loop using the validator where for an individual task, the agent acquires the context that it needs using the tools that you provide it. The agent writes the code. The agent runs the validator to know whether or not the code that it wrote satisfies the task. If it doesn't satisfy the task, the agent tries again, looking at the details of the mismatch, and repeat. A problem that we ran into the first several tries trying to kick off the flywheel to migrate many metrics is just how quickly things go sideways if you don't have really good validation. We can say that all the different pieces of this automation are really important, but if you don't have sufficient context for the bot to be able to solve a task, the outcome is that the bot gets stuck on that task. If you are using an LLM that is not smart enough to be able to understand your giant piece of code that was written a long time ago with incomplete documentation, the outcome is going to be that the agent is going to struggle with that particular task.
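The loop described above can be sketched in a few lines. The agent and the validator are passed in as plain callables; `fake_agent` and `fake_validator` are hypothetical stand-ins so the sketch runs on its own, and the attempt limit is an assumption rather than something stated in the talk.

```python
from typing import Callable, Optional


def self_healing_loop(
    attempt: Callable[[Optional[str]], str],
    validate: Callable[[str], tuple[bool, str]],
    max_attempts: int = 5,
) -> tuple[bool, int]:
    """Retry one task until the validator passes or attempts run out.

    attempt(feedback) returns a candidate change. feedback is None on the
    first try and the validator's mismatch details on later tries.
    validate(candidate) returns (passed, details).
    Returns (passed, attempts_used).
    """
    feedback = None
    for attempt_number in range(1, max_attempts + 1):
        candidate = attempt(feedback)
        passed, details = validate(candidate)
        if passed:
            return True, attempt_number
        feedback = details  # the next attempt sees exactly what mismatched
    return False, max_attempts  # give up; an engineer needs to look at this task


def fake_agent(feedback):
    # Stand-in for a coding agent: wrong at first, right once it sees feedback.
    return "SUM(amount)" if feedback else "COUNT(amount)"


def fake_validator(candidate):
    ok = candidate == "SUM(amount)"
    return ok, "" if ok else "expected total 2000.0, got 2"


outcome = self_healing_loop(fake_agent, fake_validator)
```

With these stand-ins the loop passes on the second attempt. The bound on attempts matters: a task that never passes should surface as stuck rather than burn compute indefinitely.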

If your validator is not working exactly right, you will let the agent run for a long period of time. It'll run task after task. You will have something that looks plausible, but is basically slop. You cannot make progress without having a really good validator, or you will burn compute all day or all week and then realize that you're not where you need to be. We found that actually, yes, we would even improve the validator multiple times while going over the span of the few hundred metrics that we moved as part of this, but it's critical to have a validator work exactly right. It's not trivial, actually. Many legacy systems aren't known for being super easily observable, just like how not all code is really easy to unit test. Ideally, it's easy to see everything that's happening inside and to be able to have access to data at the right points in time. You have timing issues and things like this.

It's critical to do all of those things to get enough observability into the reference system for comparison. Really good validation might have been considered a nice-to-have in the world before, and it is critical now. There's a question of, how smart does the LLM need to be? Can they be like Minions? Can you do this with an LLM from last year or one that has a short context window or something? The LLM doesn't need to be as smart as a human engineer in order for this process to work, is the point. When you have this self-healing workflow, the worst that happens is you can't make progress because the LLM doesn't understand. Then you need to work on how the context is being presented. There's a few steps that you can follow that I'll get into.

A claim you sometimes hear from people is that AI is going to hallucinate. It doesn't always know how to do code the right way. It makes silly mistakes. It's going to write code for you sometimes that you're not going to want to put in production. You hear these kinds of concerns, at least from some people, although from increasingly fewer people. The models don't need to be that smart. Then, because I like analogies and visuals, just to give you a sense of what success looks like, if you're running a project like this. I don't know how many of you grew up in the '90s like I did, but there was this game, Lemmings, they're not very smart, but you put little roadblocks in their place in order to make sure that they all get to where they're supposed to go. This is basically the assembly line architecture. For me, it's helpful to visualize things like this. Hopefully, some of you know what this thing is. Lemmings is what this game was called. It's a fun game.

The Code

Show me the code. What did we actually do? What does it look like? We had this, I previously called it goalstate.txt. We had a file called migration_goals.txt. This is a summarized snippet from this thing. The Goal State defines the definition of done. The overall migration will be considered successful when? Then you give it the definition of that, formulated in terms of the tools that you've equipped it with, to be able to detect the overall state, as well as the status of its individual task in the migration. You give it, in just the right level of detail, the tools that are available to it, that you expect it to use, to attempt and validate each of the tasks in the migration. In my example, we heavily use SnowSQL, which is, you run SQL from the CLI and it hits Snowflake. Where we went in and we equipped that thing with the right credential to have access to staging data, so that the LLM can just run that command and see what the data looks like in the tables that underlie the source data for these metrics.
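As an illustration of equipping an agent with a narrow, read-only data tool, the sketch below builds (but does not run) a SnowSQL command that peeks at a few rows of a table. The helper name, the connection name `staging`, and the table `analytics.jobs` are hypothetical. It assumes a named SnowSQL connection configured with staging-only credentials, and uses only the `-c` (connection), `-q` (query), and `-o` (option) parameters.

```python
import re

# Plain identifiers with up to two dots (database.schema.table), nothing else.
_TABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}")


def build_sample_query_command(connection: str, table: str, limit: int = 20) -> list[str]:
    """Build the argument list for a small, read-only sample query."""
    # Reject anything that is not a plain identifier, so the table name
    # cannot smuggle extra SQL into the query.
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    query = f"SELECT * FROM {table} LIMIT {limit}"
    return ["snowsql", "-c", connection, "-q", query, "-o", "output_format=csv"]


# Hypothetical connection and table names.
command = build_sample_query_command("staging", "analytics.jobs", limit=5)
```

The resulting list could be handed to `subprocess.run(command, capture_output=True, text=True)`. Keeping the query shape fixed limits what this particular tool can be used for.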

We also equip it with information about how to find the existing migrated artifacts, and to be able to look for opportunities to reuse. We then tell it how to check its work. We do it in a way that we can also do, as engineers, to validate and check its work. Then we instruct it to the scope of the validation that it is supposed to do, that it's only supposed to check its work. It's not supposed to do future tasks. It's only supposed to do its task, and so on. Then we also have a file that we maintain just with the breakdown of the tasks. We have a long list of these things. We break them into phases to give our engineers easy stopping points to just double check and make sure that things are going properly while the migration is happening. We generate this list. When we did this, the first 10, 20, 30, we would have to redo those many times until we honed the validation script, and the context, and the tools.

After that, we let it go, and it was able to knock out almost all of the remaining ones in an extremely compressed span of time, just a couple weeks. The LLM would mark the metrics that it just completed as complete as it went along. We had this file that we could observe through each commit that showed the migration finishing. We ran this in Cursor. Instead of telling Cursor like what I showed earlier, we would tell it this. We would say, familiarize yourself with migration_goals and migration_tasks, and then implement phase 2.3. Cursor, of course, is going to do its own to-do list and things like that, especially in the recent versions. It uses some of the same principles that we settled on, increasingly, in just the way that it operates when you use it as a tool to do any task. It might make a to-do list of several pieces in order to just migrate one or two of those metrics. It's about getting it at the right granularity. This is basically what we would tell Cursor after setting up that machinery to run the migration. You give it a very simple thing to do, and have the complexity, as much as possible, hidden in the validation step and behind the tools that you can give it to make sure that it has the context that it needs to solve a problem.
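A task file that is broken into phases and marked complete as work proceeds could be handled along these lines. The talk does not show the file's layout, so the format here (a `## Phase` heading followed by `- [ ]` and `- [x]` lines) and the metric names are assumptions made for the sketch.

```python
def tasks_in_phase(text: str, phase: str) -> list[str]:
    """Return the not-yet-complete tasks listed under the given phase."""
    pending, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## Phase "):
            current = stripped[len("## Phase "):]
        elif current == phase and stripped.startswith("- [ ] "):
            pending.append(stripped[len("- [ ] "):])
    return pending


def mark_complete(text: str, task: str) -> str:
    """Return the file text with one pending task flipped to complete."""
    target = f"- [ ] {task}"
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == target:
            lines[i] = line.replace("- [ ]", "- [x]", 1)
            return "\n".join(lines) + "\n"
    raise ValueError(f"no pending task named {task!r}")


# Hypothetical file contents.
TASKS = """\
## Phase 2.2
- [x] total_revenue
## Phase 2.3
- [ ] completed_jobs
- [ ] average_ticket
"""

pending = tasks_in_phase(TASKS, "2.3")
updated = mark_complete(TASKS, "completed_jobs")
```

Since each completed task changes exactly one line, progress is easy to follow commit by commit.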

The Results

Comparison results, what we observed. We actually ran this for a large set of our metrics from our reporting platform. I like this because it's empowering for me as an engineer. There have been many cases in my career where it's hard to justify doing the right thing, like using a better architecture or using a newer tool, just because of how hard it is. Just the number of things that can be tried now using this principle is really freeing. There are all kinds of things that we can do now because of the fact that you can boil the long segment of the timeline down into a really compressed window of time now that we can automate that third step, the understanding of the code. This is what this enables.

Challenges Faced

While we were setting this up, we did this like starting from a general analysis phase where we were just trying to figure out what was possible, all the way into actually running this out on our real metrics into our real new metrics platform. Along the way, we ran into a bunch of interesting situations. I was trying to summarize what are the kinds of things that can go wrong when you are running a migration like this. The first problem that will really mess you up that I mentioned before, is if the agent believes that it has succeeded when it has actually failed on a particular task. What you need to do there is make your validation tool really good. Most of the engineering logic in our project that we ran was in making the simulation validation script work really well. Another thing you will sometimes see is the agent will get stuck or not be able to make progress on a particular task.

Common reasons for that are the agent can't find the right kind of test data to cover a case where there's some nuance to the logic behind a metric that is hard to parse out of the example data that it can find, or where there aren't sufficient example queries that it can use for validation that cover the nuances of the metric that it is trying to migrate. This boils down to not having enough information in order to be able to do the task. Frequently, we would go back and find golden lists of IDs that have good example data, time ranges where the data is known to exist at a certain density that would be useful for bots when they're working on migrating metrics that use those tables to basically know where the good test data is. This is like the stuff you would otherwise ask from your friend on the other side of the cubicle, like, what are some test IDs that you use to test that stuff? What are some example queries?

The tribal knowledge stuff that people on the team might know. Getting that written into those migration_goals context files was really important for making it so bots could get through at a high rate of success. Then there was a small portion of ones that were so complex that they really needed an engineer to jump in. That was about, I'd say, 15% of them. That's pretty good. If you can cover 85% of your toil with automation, that matters a lot when you have hundreds of things that need to be migrated. The other interesting thing is around just identifying when you're partway through the migration that you could have architected it differently. Then being able to start over and rerun it, which is a very cool thing to be able to do that used to be a very difficult thing to do when you didn't have bots automating all those tasks. In the world where we would run these projects as Jira boards, if you have to rewrite the rules and start over, it's not just going to cost time, it's going to be really frustrating for the team, but the Minions or the Lemmings don't care. You can just do it again.
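One way to capture the golden IDs and known-dense time ranges mentioned above is to keep them as structured entries and render them into the context file as plain text. This is only a sketch: the record fields, the table name, the IDs, and the dates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenData:
    table: str
    example_ids: tuple[int, ...]
    date_range: tuple[str, str]  # inclusive ISO dates where data is known to be dense
    note: str = ""


def render_golden_section(entries: list[GoldenData]) -> str:
    """Render known-good test data as plain text for a context file."""
    lines = ["Known-good test data:"]
    for entry in sorted(entries, key=lambda e: e.table):
        ids = ", ".join(str(i) for i in entry.example_ids)
        start, end = entry.date_range
        lines.append(f"- {entry.table}: ids {ids}; dates {start} to {end}")
        if entry.note:
            lines.append(f"  note: {entry.note}")
    return "\n".join(lines)


# Hypothetical entry.
entries = [
    GoldenData("jobs", (101, 102), ("2024-01-01", "2024-03-31"), "includes cancelled jobs"),
]
context_section = render_golden_section(entries)
```

Writing this down once means each agent run starts with the same pointers a teammate would otherwise give verbally.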

Migrations Aren't Mountains Anymore (The Paradigm Shift)

The main thing that hopefully I've convinced you of is that there are projects that we used to think about doing that we wouldn't do because they take too long or because it's too hard to get over the mountain and move the mountain. We have this new level of agility now. We have to use similar principles that we would have used in the past to get a group of junior engineers or interns to help with a project, except we need to make the tasks standardized, and we need really great validation scripts and clear context and tools. Once we have those things, we can set up that assembly line and just let all of the tasks flow through, like those Lemmings. It enables us to do new things that were really hard before. In my career in the past, I saw some very big migrations. I worked at LinkedIn before working at ServiceTitan, from 2012 to 2023, and there were several giant migrations that would involve dozens, or in some cases maybe hundreds, of engineers working for multiple quarters on some really complex goal.

Enabling safe active-active across different geolocations and the kinds of architectural changes that you need to make in order to make that possible, or the deconstruction of giant monolithic codebases and microserviceification, and the amount of toil and complexity that went into those projects. I think it's just a really interesting thing that I think about all the time, which is like, how might all of those projects have gone differently if we'd had the kinds of Minions that can understand code that we have today, if we'd had them back at that time? It's, I think, a fascinating thing. If any of you have seen similar things, I'd be really interested to hear what you've noticed about this. I think that this is basically one of the huge things that are enabled by having agents that can code. It's not just about making individual tasks go faster. It's about being able to do these giant impossible tasks really quickly, so that you can try new things, and you can do stuff that you wouldn't have done otherwise. The question to you is, what have you been thinking about doing that you can't do because it takes too many Jira tickets and too many quarters of work, and so you defer it forever? Whether you can actually do those things now by using ideas like these.

Questions and Answers

Participant 1: If you want to do this again now, what's the minimal path to success? Are there any tools that are already in production now that can really speed this up? For example, just by leveraging what a Copilot can do, what a Cursor can do, can this workflow be minimized as much as possible?

David Stein: Is there any way the workflow can be minimized based on what we can do now? The stuff that we did in order to move our few hundred metrics, these ideas around having goal state text files, increasingly, tools like Cursor and Claude Code guide people towards using approaches similar to this, not maybe formalized in the same way that I have with the validation and the physics engine type stuff. I think that someday, I assume that you will be able to ask Cursor this. Someday it will be able to figure out for you how to build the right kind of validator, the right kind of simulation engine that understands all the details of your legacy code, because it can crunch through the whole thing. I expect that someday you will have to do less of this structuring. There's one slide that summarizes the different pieces. Until then, I don't think that there's a simpler way. I think you need to actually find a good way to break it into a task that Claude 4 Opus or 4.5 Opus, or one of the really good models, can do, and break it into a good task for this, set up a really good validator, and just kick it off the way that I've described.

Participant 2: I think we should all have the same problem, how to automate ourselves around this grungy work. I tell my team this all the time. I do have a challenge. I don't know if you touched this during your experience. How do you guarantee security and privacy of your agents when they are working? Especially because you mentioned shell script and CLI, some of my creative agents actually try to communicate with AWS or run something that will format the machine, this kind of stuff. How do you make sure that your agents don't perform these risky tasks during the coding bit?

David Stein: What we actually do based on a concern about that, is we watch it carefully, and we equip it with staging data, and we don't give it access to keys where it could go do something bad. You don't want to give the agents the secrets to your production database or the levers that it could pull to delete something in production. That is dangerous. I think, though, even the tools that we use, like Cursor, in order to do these things, are still maturing in their ability to be able to have safe isolation. We have to be careful just not to give it keys that it shouldn't have, and to also watch what it's doing while it's doing its work. It's going to, I think, be an area that there are going to be more talks about at some point in the future, is how to actually build more of that safety into this. Because if you had an extremely sensitive type of application, you would need to be very careful about what tools you equip those agents with and how you're running that. You could scale it up to the level of rigor necessary for your application.
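One small piece of "don't give it keys that it shouldn't have" can be sketched as an allowlist on the environment that an agent's tools run in. The variable names and values below are hypothetical, and filtering environment variables is not full isolation; it only keeps secrets that are not on the list from being inherited.

```python
def restricted_env(base_env: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Keep only allowlisted variables so unlisted secrets never reach the agent."""
    return {name: value for name, value in base_env.items() if name in allowed}


# Hypothetical environment of the machine that launches the agent.
base = {
    "PATH": "/usr/bin",
    "STAGING_DB_PASSWORD": "staging-only",
    "PROD_DB_PASSWORD": "never-share",
    "AWS_SECRET_ACCESS_KEY": "never-share",
}
agent_env = restricted_env(base, allowed={"PATH", "STAGING_DB_PASSWORD"})
```

The filtered mapping can be passed as the `env` argument of `subprocess.run` when launching a tool. An allowlist is used rather than a blocklist so that a newly added secret is excluded by default.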

Recorded at:

Jun 12, 2026

David Stein