Using AI as a Thinking Partner for Large-Scale Engineering Systems

Summary

Julie Qiu explains how AI serves as a "thinking partner" for engineering leaders. She discusses five distinct roles - Archaeologist, Experimenter, Critic, Author, and Reviewer - that help manage the cognitive load of 400+ repositories. She shares how AI provides the "RAM" needed to synthesize legacy context, pressure-test designs, and accelerate high-level architectural decisions.

Bio

Julie Qiu is the Uber Tech Lead for the Cloud Software Development Kit (SDK) at Google, where she builds client libraries and command line tools across different language ecosystems to interact with Google Cloud products. Previously, Julie was a tech lead on the Go Security team, where she spearheaded Go's support for vulnerability management and Go's package discovery site, pkg.go.dev.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale AI workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Transcript

Julie Qiu: My name is Julie. I am a Senior Staff Engineer at Google. I'm currently the Uber Tech Lead for the Google Cloud CLI and SDK. Before I joined Google Cloud, I worked on the Go programming language team for several years, and I led the Go security team. Today, I want to talk to you about how I've been using AI as a thinking partner as I've been navigating large-scale engineering systems. My team builds the tools that developers use to interact with Google Cloud. That's the gcloud CLI and client libraries that are currently offered in nine different languages. These are the tools that Google Cloud developers use to authenticate, to call APIs, to manage resources, and to automate their deployments. In many ways, they define the developer experience that you have when you're using Google Cloud. How many people here use Google Cloud? You're probably familiar with a lot of the tools that we develop. If you haven't, here's what they look like. Here is what a standard gcloud command looks like for creating a new storage bucket. Here's that same operation in Python. Then here's what it looks like in Go. The Go and the Python code that you see is what we call client libraries. They're the language-specific wrappers that we create around the API, and that's what makes it natural to use in the language of your choice. On paper, if you look at it, this is actually a really simple process. We have some teams, and their job is to define the API. We call these the service teams. This is like storage, Pub/Sub, BigQuery. Every service team is the one that actually owns the service. They're the ones that define the API and the surface that it expresses, such as the parameters it accepts or the methods it exposes, and all the data that is being returned. Those teams describe their API using a shared specification format. Those specs are what my team then uses as raw material for our generators. From there, what my team does is run generators for all of these nine languages. We take the same underlying API descriptions, and then we layer on things like how to authenticate, or we'll put in veneers to make it nicer to use. We'll add cloud features, product features, language features, and so on. Then we release these to users by publishing them to their respective package managers, so PyPI, npm, Maven, and so on.
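
The slides aren't reproduced in this transcript, but as a rough stand-in, here is the shape of that bucket-creation operation on the command line and through the Go client library (the bucket and project names are placeholders):

```go
// Creating a storage bucket, first with the gcloud CLI:
//
//   gcloud storage buckets create gs://my-bucket
//
// and then the same operation through the Go client library.
package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// The client library wraps the underlying API call in an
	// idiomatic, language-specific veneer.
	if err := client.Bucket("my-bucket").Create(ctx, "my-project", nil); err != nil {
		log.Fatal(err)
	}
}
```

The Python client library exposes the same operation in an equally idiomatic form for that ecosystem.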

The thing is, underneath this whole system, it's actually really complicated. It's one of those things that happens when you have a multi-language system, and it evolves over multiple decades, and there are multiple teams working on it, and all of these inconsistencies just start to arise pretty organically over time. All of these differences were probably pretty small at the time they were made, and they were probably also super logical for the context in which they were made. Over time, this whole system has become incredibly hard to see. The problem isn't that any one area is hard. It's that it's just too big. It's too big to fit into a single document. It's too big for me to draw on a whiteboard. It's too big for me to even pack into a single human brain. When I first joined the team two years ago, I figured all I needed to do was just gather evidence. I started this intake process in my team. What I said was, all I want you to do is file bugs, and file all of your friction, and file all of your ideas, and we'll just take all of that data together, and we'll be able to figure out what's going on. I figured if I could get all of the problems and all of this historical context in one place, then the solution would just naturally emerge. This intake process did work. What it really made me see was just how much my team was dealing with on a day-to-day basis. What I couldn't see, though, behind all the volume and all the edge cases and all the long tails, was the shape of the system that all of this fit into. I was really overwhelmed by all of these details. Every time I tried to zoom out, I felt like I would just run into a different problem. I could sense I had a blind spot, but I couldn't really get the picture into focus. The reason was that my design ideas just didn't fully connect. I knew what pieces mattered, I just couldn't put them together.

The Five Distinct Roles of AI

Today I want to talk a little bit about how AI has helped me reason about the system, and how it's also helping me redesign it for the future. Usually when we talk about AI in software, what we talk about is developer productivity. We talk about moving faster. We talk about code assistants. We talk about autocomplete. In my role as TL in an organization of over 70 people, the bottleneck for me hasn't been typing. My bottleneck has actually been around holding context. I don't really need something to write code for me, but what I do need is more RAM. I need a way to hold the state of everything that's happening across our org and across over 400 repositories, and for all the relevant pieces and all the information to just be surfaced to me at the time that I need it. I needed a way to not have to hold it all in my head at the same time. Over the past two years, as I've been using AI to help me with this problem, I've noticed that five distinct roles have emerged for me. The first is AI as an archaeologist. I use AI to help me look through the system and piece together what's really going on. The second has been AI as an experimenter. AI lets me simulate the ideas that I have and then figure out whether or not an idea will hold up before I commit engineers to months of work. The third is AI as a critic. I'll often talk to AI and say, tell me what's wrong with my design, and it'll happily tell me what's going on. The fourth is AI as a co-author. I actually use AI to help me write production-quality code. On the flip side of that is using AI as a code reviewer. It helps me elevate the quality of my code by catching issues and clarifying logic before I have to send it off to a human reviewer.

Hypothesis - Simplification

Before I started working with AI on this project, I had a hypothesis. My hypothesis was just that we could make all of this simpler. We didn't have to have such a cognitively taxing process for everyone. We could have a way to work together that just made more sense across the board. Instead of having every single language team go and implement their own build, test, and release pipeline, I thought we could just collapse this into one single production pipeline: one CLI tool, one release process, one technology stack, one configuration system. What I wanted was a system that felt elegant enough that I could use it to unify all of our workflows. Simple enough that my language teams could actually just focus on the language-specific work that was delivering product value. Then also flexible enough that when we really did need to differentiate by those product and language-specific features, the system would let us do so. On paper, this seems possible. After trying to tackle the pieces of this with different teams over several months, we just kept running into different blockers. Every single attempt we made surfaced different challenges. In the first attempt, we figured we would just design the entire system end-to-end for one language. It's what some people might call a steel thread. At the time, that felt like a really reasonable approach. You pick one language. You go deep. You make it work. Then, very quickly, over a few months, we realized we were just taking on too much. The system also started to shape itself towards the one language that we had picked. Over time, we realized that even though the core idea was still promising, it would take us years to migrate to this new system. What I also didn't fully appreciate at the time, when we started this project, was just how tied the technical design was to the philosophies of the engineers behind it. When you're an engineer and you're working on a specific language system, you care a lot about that language ecosystem, and you care a lot about the details behind it. Sometimes those details don't actually matter when you're looking at the architecture. Giving the engineering teams that flexibility, giving them buy-in, and actually making them feel heard was a critical component of making this project work. I understand that, because as an engineer, I might be willing to do something because someone told me so, but I'm a lot more likely to get on board if I actually feel like you understand the challenges that I'm dealing with.

We restarted and we did a second attempt. In the second attempt, we narrowed the scope. Instead of building everything all at once, we just took a thin slice of the system and we picked two languages. We figured we would just tackle these two languages simultaneously. Very quickly, the details became overwhelming again. In trying to generalize very early, what we ended up with was all these abstractions that were so generic, it made the state completely unmanageable outside of the language-specific areas. We also ran into staffing constraints. In choosing to build this pipeline, we chose Go as the language to build the entire system in. The team that we had put together for this was coming from all different language ecosystems, and teaching them Go and getting them on board with learning a new language became yet another bottleneck. We had to restart the project again. After this, I felt like I had read hundreds of design docs. I had read so much code and reviewed so many PRs. I just felt like I needed to reset entirely and start from first principles. I went back and I traced through how we had gotten here. Then I just opened my terminal, made an empty experimental directory, and created a Markdown file. Inside this README.md, I wrote down the simplest possible version of the system that I had in my mind. At its core, the CLI had three responsibilities. The first is that it had to read state, manage that state, and then be able to write it out when necessary. The second was that it had to take the specs and the configs and then create a client library from them. Then third is that it had to release it and publish it to a package manager. I took this design and I went to Gemini CLI, and I said, read this file and build it for me as I described it. With one prompt, it did. It was amazing. It gave me the CLI framework. It gave me the flag parsing. It gave me the command structure. That was obviously the very boring part. Without a migration strategy, this tool is just a toy. I couldn't apply my shiny new tool to the ecosystem that I wanted to without understanding that ecosystem first.
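
The transcript doesn't show the generated code, but a minimal sketch of a CLI skeleton with those three responsibilities might look something like this (the command names are hypothetical; the real tool's interface isn't shown in the talk):

```go
// A minimal sketch of a CLI with the three responsibilities described
// above, using only the standard library.
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: cli <state|generate|release> [args]")
		os.Exit(2)
	}
	switch os.Args[1] {
	case "state":
		// Read and manage pipeline state, writing it out when necessary.
	case "generate":
		// Turn the API specs and configs into a client library.
	case "release":
		// Release the library and publish it to its package manager.
	default:
		fmt.Fprintf(os.Stderr, "unknown command %q\n", os.Args[1])
		os.Exit(2)
	}
}
```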

AI as an Archaeologist

That brings me to AI in its first role as the archaeologist. When you're working in a system that's been built over many years, the documentation is often stale or incomplete. The only thing that you can really trust is the code that's actually running. If you take over 400 repositories and you have to go read all of them, that's a very slow, very manual, and very mentally exhausting process. I figured I would just start small. I would pick one repository. I knew that in our Python repository, all of the configs for the majority of our stuff lived in one place. I cloned it locally. Then, again, I opened up Gemini CLI and I asked, how does the Python generator actually work? What does it take as input? What does it produce as output? Again, with a single prompt, AI distilled thousands of lines of code down to their essential behavior. It identified the inputs, the process, the outputs. It actually told me about all these files I didn't even know existed. It also mapped the generation work end-to-end. This really helped me, because normally this is something that would take me weeks to do. Then I was like, what happens if you take all the files you just found and consolidate them down to a single YAML file? Tell me what that's going to look like. Again, it did it. It gave me an answer. It's not a very pretty answer. I think this file was probably over 10,000 lines long. It did give me a starting point and immediately surfaced more questions for me. Like, what are all those regexes doing here? Can some of this metadata actually be derived from other fields? Do we really have to hardcode all of these strings? Which part of this API configuration is actually specific to the Python team? How much of it is actually being duplicated across nine different languages when it should really just live in a language-neutral layer? I didn't know, so I kept asking AI. I spent hours just working through it with the AI, figuring out how the Python generation flow worked end-to-end, piece by piece. The answers it gave me weren't perfect. Sometimes I would have to stop, because I'm looking at the repo, saying, are you really sure this folder that you're talking about actually exists? Of course, I was absolutely right, it would tell me. Of course, it doesn't exist. Here's what actually does exist. Other times I would have to give it more context. I would be like, look at this GitHub issue or look at this file. Does this change your analysis?

Sometimes it just got things wrong. It would omit files. It would omit parameters. It would invent new files, invent new parameters. It would be very confident every time. I wouldn't notice some of these things, actually, until I started prototyping and just found that things didn't work. After a few hours of back and forth, I did learn a lot. I knew a lot more about the Python ecosystem than when I had started. I wasn't really sure what to do with all this information yet, so I just asked it to write our conversation down into a python.md file. Then I did the same for Rust. I used Gemini to dig through how the generator worked, what configuration it used, all the inputs and outputs. I found that Rust had a completely different philosophy. For example, Python likes to bundle its multiple major versions into a single package. The Rust team liked to ship one crate for every major version. The Python team had one primary config file that was over 4,000 lines long. The Rust team would have one config file per folder, and usually it was only about 10 lines of YAML. The Python team had lots of automation. The Rust team liked to run things from their laptops. What I realized as I was repeating this exercise for every single language is that the pattern was really clear. Everyone's actually just trying to solve the same high-level problem, but the implementations had diverged pretty dramatically over time. AI spared me the weeks of repetitive labor of installing every generator, navigating the codebases, and reading a lot of old docs, and I was actually able to reconstruct a lot of the logic. By the end, I had this folder of Markdown files for the different languages that summarized how every pipeline actually worked. As I read through them, I could see what was the same and what was different about the shape of the systems that existed. Something emerged for me that I didn't have before, and that was this map. For the first time, it became very clear to me that the system is actually divided into three very concrete pieces. There were components that very clearly should have been owned by the service teams, because they were specific to those product teams and the products that they were trying to create. There was a component that should have been owned by the platform team, because it was infrastructure that was language-agnostic, and it didn't really matter what language it was written in. Then there was this component where the actual language-specific knowledge is really important to create that idiomatic developer experience. The language teams should really only be focusing on those aspects of the system, because that's what requires their expertise. What we had done instead was have the language teams own everything. The org structure that we had was really what was causing these differences. As I compared patterns across languages, I could finally see what was genuinely different versus what only looked different because of history.

AI as an Experimenter

Once I saw that, my design started to click. This one CLI and this one release system and this one pipeline really felt like something much more feasible than it had in the attempts that we had made. I finally understood the system well enough to imagine what could exist. That's when the second role of AI emerged, and that was AI as an experimenter. Once I understood the system well enough to see the shape, I could start asking a very different question. Up until this point, AI had just been digging up the past for me. It helped me understand what exists. But design doesn't happen in the past. Design happens in the things that you imagine, and then in making the really hard tradeoffs about what you actually want it to be. Redesigning a system like ours is extremely expensive. If I had a question about Python and I wanted a prototype, I would have to pull someone off the Python team for a week from whatever they were doing. If I wanted to see the same idea in JavaScript, I'd have to go to the manager of the JavaScript team and ask them for the same staffing. If I'm going to do that across nine different teams and then also keep track of what all these engineers are doing, that's a lot of coordination just for me to be able to answer a single question: is this idea even viable? You can't ask a team of engineers to go and read hundreds of thousands of lines of code just to see if there's a pattern. You can ask AI to do that. That's when AI, for me, became this low-cost simulation engine for understanding architecture. I could take this mental model I had, this thought of what I thought was feasible, with the buckets and the shared concepts and the differences that I saw, and I could just start running experiments without leaning on my team. By this point, I had already spent days studying our generation and release workflows, had all these hypotheses about what could work, and I started using AI as a prototyping partner so that I could design the system and understand it without committing to building the full thing. I set up my development workflow to match that mindset. On my main branch, I talked to Gemini purely about design. I said, don't write any code, just focus on the design. All I want you to do is talk to me. Then I used git worktree to set up three different directories and multiple branches on the repository. One for Python, one for Go, and one for Rust, since those were the languages I had decided to choose, because I knew they had real constraints that were very different from each other. Because I had these separate branches, I could freely experiment on every language without worrying about them bumping into each other.

Then I wrote this prompt, and I gave AI a role, and I told it that it was my design partner. That way, every time I opened up a new session or was opening things up in a different tab, I could just have it read this prompt, which would tell it where all of my artifacts live, some of the alternatives I had already considered, and the to-do list I was working through. On each of the language-specific branches, so, for example, on the Rust branch, I would say, go read this prompt.md and get context on what I'm trying to do, then implement it in that language. What mattered here with AI as an experimenter is not that the code was perfect, because it definitely wasn't. What mattered was that the implementation exposed exactly where the design had been underspecified, where it was awkward, or where it had just hallucinated ideas for me. Once I had this set up, I could actually start testing all these hypotheses I had. One hypothesis was that maybe our configuration is actually pretty redundant. For example, I noticed when you look at the config file, the global file writes down all of the versions, but then so do each of the language-specific files. For Python, it's in this version.py, for Go, it's in this version.go file, and then Rust has this Cargo.toml file. That seemed unnecessary, so I talked to Gemini, and I said, should we just remove the version field from the YAML config file? It already exists in version.go, version.py, and Cargo.toml. Of course, I was absolutely right. Single source of truth is better. Remove the duplication. I went ahead and experimented, and I implemented it, and I very quickly found out that actually some of the files I had mentioned were generated using the global config file. It just wasn't the case across the board, which is why I didn't spot it right away. The experiment made it very easy not to miss this detail. Then it started raising other questions for me, like maybe all of the version files should be generated. Why are some of them constructed manually, and why are some of them generated?
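
As a minimal sketch of that last idea - deriving every version file from one global value - something like this would work (the config shape and output format here are hypothetical, not the team's actual generator):

```go
// A sketch of generating version.go from a single global version value,
// so the version lives in exactly one place.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	version := "1.42.0" // in practice, read from the global YAML config

	contents := fmt.Sprintf(
		"package internal\n\n// Version is generated from the global config; do not edit.\nconst Version = %q\n",
		version)

	if err := os.WriteFile("version.go", []byte(contents), 0o644); err != nil {
		log.Fatal(err)
	}
}
```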

There's another pattern I jumped at around naming. If you look at this file, there seems to be a strong relationship between the name of the library, the folder it gets output to, and the API path. I asked Gemini, in how many cases can I actually infer the output and the API path just by looking at the name? What it found was really interesting. It told me that the name, the API path, the output directory, and this YAML configuration file that all the libraries were using followed a really common pattern that held across all languages. For example, if the name was grafeas, the API path would be grafeas/v1. It would live in a folder called packages/grafeas. Then the config file would be grafeas_v1.yaml. We had this repository of all the API specs that the service teams were writing, so what I could actually do is just write a script to traverse that directory and make a lot of inferences based off of the data in that repository, rather than manually maintaining this list (there's a sketch of this after this paragraph). This would actually eliminate about 100 entries from every single language's config file. Another experiment actually overturned a really strong assumption I had. It's this question of, is it better to have one large global file, or should I have many small files? Because if you think about it on paper, having one large file that's over 4,000 lines long seems not very scalable. After all, Google Cloud is going to grow, and we're going to add more APIs. What I realized from the experiments is that because so many of those fields were redundant, after you moved all the redundancy out, it was actually much better to have this one single file. It made it really obvious when there were patterns, duplication became really visible, and doing stuff like search and replace just became totally trivial. Before I did this experiment, it actually felt really counterintuitive to me. I thought that having one file per crate was actually much better, and that Rust had a cleaner design. This experiment helped change an important assumption I had.
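
Going back to the naming pattern, here is a minimal sketch of the kind of inference such a script could do, using the grafeas example from the talk (the real rules surely have exceptions; this is a simplification):

```go
// A sketch of inferring configuration from the library name alone,
// following the grafeas pattern described above.
package main

import "fmt"

func infer(name, version string) (apiPath, outputDir, configFile string) {
	apiPath = name + "/" + version                        // grafeas/v1
	outputDir = "packages/" + name                        // packages/grafeas
	configFile = fmt.Sprintf("%s_%s.yaml", name, version) // grafeas_v1.yaml
	return apiPath, outputDir, configFile
}

func main() {
	fmt.Println(infer("grafeas", "v1"))
}
```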

I also used AI to help me reason about the CLI interface itself. One question I kept coming back to was, should I have positional arguments or should I have flags? Originally, generate took this name flag, but when I realized a name was required for every library you generate, the flag wasn't necessary (there's a sketch of this below). I could then update my README.md, and then after a change in design, I would tell Gemini on each of my language-specific branches to rebase it, and then have AI update everything and rerun all the experiments again. This let me keep my core design in sync as I was working through the details in each language. More importantly, it actually helped me flush out subtle inconsistencies, places where things sounded really good because I was absolutely right every single time, and where I was absolutely not right. Finally, I also used AI to confirm that I could actually migrate all of the legacy configs. Thanks to all this pattern matching, it could help me confirm that the legacy files really would map onto the file that I was trying to create. Another interesting thing about all of these experiments was all the things they helped me surface that I would never have been able to see before. For example, every single one of these language generators produced client libraries, but they actually produced a bunch of other files as well. Some of them produced READMEs, some of them produced changelogs, and then the contents of these READMEs and changelogs would be slightly different because they lived in the language-specific generators. Once I saw that, I thought, we should probably standardize this. This seems like a thing we should do that would improve the user experience and be easier for us to maintain. On the flip side of that, I also saw the same thing with the inputs. Some of these language teams were having each of the service teams maintain this YAML file of language-specific hints. Some of them had these Bazel targets. Then some of them just had this glue that they had come up with. Probably every choice did make a lot of sense in the context in which it was made, but when you look at the picture as a whole, it's pretty clear that you can actually just eliminate all these files or at least condense them down to a single source of truth. One unexpected benefit of AI helping me with all these experiments is that it didn't just help me with the design, it also helped me have much more interesting conversations with my team much earlier than I normally could. AI helped me get the ideas to a place where I could actually share them and then check with my team to see what their ideas were. That human feedback was incredibly valuable. It showed me where my mental model lined up with reality and where it didn't. It even exposed places where AI had hallucinated but had convinced me that I was absolutely right. Then when I showed it to my team, they were understandably very confused by some of the things that I was proposing. AI had helped me sketch just enough to answer a few simple questions. Like, does this workflow really work, and am I missing something obvious? That wasn't something that I could have come up with so quickly on my own, but actually having my team involved earlier in the process was what helped us move the design forward.
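
On the positional-versus-flag question above, here's a minimal sketch of the two shapes (the command layout is hypothetical; the real CLI isn't shown in the talk):

```go
// A sketch of the flag-versus-positional tradeoff. Since a name is
// required on every invocation, the positional form carries the same
// information with less ceremony:
//
//   cli generate --name=grafeas   (flag form)
//   cli generate grafeas          (positional form)
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 3 || os.Args[1] != "generate" {
		fmt.Fprintln(os.Stderr, "usage: cli generate <name>")
		os.Exit(2)
	}
	name := os.Args[2] // positional: no flag needed for a required value
	fmt.Println("generating library:", name)
}
```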

AI as a Critic

At this point, I have a user guide, I have a design doc, and I have a lot of principles. People can react to my big ideas, but they're not going to go sift through every single design decision that Gemini came up with at 2:00 in the morning. I needed a different kind of feedback to help me with those. That's when AI stopped being my lab partner and started playing a very different role, and that was AI as a critic. I started asking AI to poke holes in my designs. Instead of saying, what could this look like, I asked, where am I wrong? Where could this go wrong? One of the most insightful things it helped me with is when I would say, what parts of this design look overengineered, and what parts just seem unnecessary? For example, I was looking at the Go config, and I noticed that it had this remove_regex field on almost every single line. I was like, can we remove this? Is it always the same? It went and looked at 185 libraries and found that in 179 of them, it followed a completely predictable formula. There were only six libraries where we actually had to have this Go-specific config, and that let me delete 3,600 lines of configuration, which was incredible. It didn't always get it right. Another time, I asked it to help me simplify my release workflow. I said, should we just have one release command? It can commit, it can tag, it can push, it can publish, and we'll have all these defaults, and we'll just use flags to skip things. Of course, I was absolutely right. This is exactly what cargo release does. Why wouldn't we do it? It's a tried-and-true path. Then, as soon as I brought this to the engineers on my team, they said, wait, so does it publish by default, or does the execute flag also create a tag? If the publish flag fails, then how do I retry it without retagging? Also, why did you turn publishing off by default again? Isn't that the whole point of this command? AI was offering me these textbook solutions that sounded really simple on paper, but this simple interface, when you brought it into the context of a team working together on a release workflow, actually made things a lot more complicated (there's a sketch of this below). You now have to understand how the defaults work instead of just running separate commands to do everything one by one. What AI did do for me as a critic is that it consistently forced me to question my decisions and to justify what I was doing. For me, just having that scrutiny was the real value. By the end of this phase, I didn't just have a prototype. I had a design that had been interrogated from multiple angles. It had been simplified through critique. It had been refined to the point where I felt like I could stand behind it. AI gave me a way to pressure test my thinking before committing dozens of teams and months of engineering work to this project. Finally, I felt like I had the confidence to go and implement it in production for real.
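
Here's a sketch of the kind of all-in-one release interface under discussion (the flag names and defaults are hypothetical; the point is that every default becomes something the team has to understand):

```go
// A sketch of the all-in-one release command that was being proposed.
package main

import (
	"flag"
	"fmt"
)

func main() {
	skipCommit := flag.Bool("skip-commit", false, "don't create the release commit")
	skipTag := flag.Bool("skip-tag", false, "don't create the git tag")
	skipPush := flag.Bool("skip-push", false, "don't push to the remote")
	publish := flag.Bool("publish", false, "publish to the package manager")
	flag.Parse()

	// Does a failed publish retry without retagging? Why is publishing
	// off by default? These are exactly the questions the team asked.
	fmt.Println(*skipCommit, *skipTag, *skipPush, *publish)
}
```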

AI as an Author

That's when AI became my co-author. My experience with AI is that you can't just say, write the whole thing. At least not if you want something that you can actually show to someone and feel like it's going to be maintainable long-term. Just like anyone you're handing work to, you have to be specific about what you're asking it to do. You have to delegate the task down. You have to expect that you'll have to review the end result. When I was writing Go code with AI, here are some quirks that I consistently found. It over-commented on everything. I don't mean that it was very helpful documentation, I mean that it would literally narrate what was happening line by line. For example, I wanted to write some code to read and write YAML in a package. What it did is it told me that it was defining the input and output file names, it was reading the YAML file, and then it was unmarshaling the data. None of those comments told me anything I didn't know just by looking at the one line of code right below it. The comments restate the syntax and describe what the code is doing, and the irony is that having comments like this actually makes it harder to understand the code, because I now have to filter out the comments and filter out the noise that it's generating just to see the structure (there's a sketch of this style below). Here's another thing. It loved whitespace. I don't mean, again, whitespace for readability, I just mean it loved adding whitespace, because for some reason whitespace makes everything better. In this case, I have a switch statement, and for every single case, it's not doing anything fancy, it's just running the generator for that language. For some reason, when I used Gemini to generate this, it put a newline in between all the case statements, and what that did was break up the visual grouping that I'm normally used to when I'm writing Go, even though these cases are conceptually the same. The whitespace really just made them feel unrelated for no reason. Here's another one. It loves checking for nil, because we all love safety and we all hate panics. In this example, I have this createLibrary function, and it says, if the variable you're passing in is nil, then return nil. Then it calls a helper function, which does the same check again. While the AI is trying to be safe, it's actually made the API contract blurry, because should I expect that the function would be willing to accept a nil value to begin with? AI is trying to be safe, but I think it doesn't actually understand where safety belongs, so instead of establishing a clear invariant one time, it just checks for nil everywhere. It also, which I thought was really interesting, consistently forgot to run Go formatting tools, like gofmt or goimports, and I would get code where the imports weren't grouped together or the spacing was off, and I would have to remind it to run all these commands at the end. For anyone who's written Go before, you're probably familiar with the fact that these are pretty well-defined specs. It's not some style that I came up with. There's public tooling. It's standardized. It's interesting because it felt like the AI was generating something that looked like Go and kind of was like Go, but it didn't really understand the structure that Go developers use to communicate with each other.
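
Here's a reconstruction of that over-commented style (the Config type and file name are hypothetical stand-ins; the real snippet isn't reproduced in the transcript):

```go
// A reconstruction of the over-commented style described above. Every
// comment narrates exactly what the line below it already says.
package generator

import (
	"os"

	"gopkg.in/yaml.v3"
)

// Config is a hypothetical stand-in for the generator's config type.
type Config struct {
	Version string `yaml:"version"`
}

func loadConfig() (*Config, error) {
	// Define the input file name.
	inputFile := "config.yaml"

	// Read the YAML file.
	data, err := os.ReadFile(inputFile)
	if err != nil {
		return nil, err
	}

	// Unmarshal the data into the Config struct.
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}

	// Return the config.
	return &cfg, nil
}
```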

There were things, though, that it did remarkably well, and I very much became reliant on it for those. One of them was alphabetizing. I find the alphabet challenging, but I love being able to pass a struct to AI and say, just alphabetize these fields for me, and it would reorganize everything with ease. Another thing that it did really well was refactoring imports. When I was moving code around, if I had to move a package from one place to another and I had references throughout the repository, it was incredibly useful, because AI would just go ahead and rename all of those imports for me, where in the past, I would have had to figure out what sed command to run to make this happen. Then another useful thing I loved was using it to fix compile errors and linter errors. A very common thing that I might do when I'm refactoring code is say, I have a function and I want to pass in another argument, and now I have to trace through the stack to figure out all the places where that argument needs to be passed. What I would do before is run go test, get my list of compile errors, and just go through every file and edit it. Instead, I could just tell Gemini to fix it, hand it all the errors, and it would do exactly that for me, and that was amazing. It would update the call sites. It would propagate the parameter. It would just get everything compiling again. That's the pattern I kept consistently seeing when I was using AI: if you give it something concrete and repetitive, where the correctness criteria are very explicit, it is an amazing accelerator. I also found that if I gave it a style guide, the output would get dramatically better, and a lot closer to what I wanted. For example, I passed in documentation from the Go website, things like Effective Go, the Go comment conventions, the Go formatting guide, and the structured logging blog post, and I would say, "When you write code, make sure you follow the conventions that are described here." It would do that, and it improved the quality of the code by a lot. All of this was the kind of work that I felt like I would normally just do by myself while watching television or listening to a podcast, but now AI does it in seconds, and all I have to do is review the results. I still had to write the important logic myself, but for those small repetitive tasks, AI let me get to the same quality in a fraction of the time.

AI as a Reviewer

This leads me to the last role that AI played for me, which was as a code reviewer. One of the biggest complaints that people have is AI slop, the fact that you're generating this code and then dumping it on the reviewer. I wanted to be very cognizant of that. I had the gemini-code-assist plugin running as part of my workflow, and what was great is that when I would send a PR, it would just check for all these things. Sometimes it would find that there was a missing error check. In this case, it caught that a file that shouldn't exist could be present but unreadable, which would cause the test to pass silently when it should fail (there's a sketch of this below). Here's one where it commented on the structure of the code and told me to use a table-driven test. Here's one where it caught that the other AI didn't run the Go formatting tool I wanted it to. Here's another one where I should have been using a constant instead of hardcoding the number three in a bunch of different places, and so it let me keep things in sync. Then sometimes it caught critical errors in code that compiled. Here's some stuff that I was writing while half paying attention to a meeting, and it definitely looked like it would make sense, and then AI immediately flagged two bugs that I was able to fix, luckily before someone on my team had gotten around to reviewing it and been a little bit confused in the process.
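
Here's a sketch of the kind of missing check being described (the test name and path are hypothetical; the actual PR isn't shown):

```go
// A sketch of the missing error check. Treating any error as "the file
// is gone" would let a present-but-unreadable file pass silently;
// checking specifically for fs.ErrNotExist does not.
package generator

import (
	"errors"
	"io/fs"
	"os"
	"testing"
)

func TestGeneratedFileRemoved(t *testing.T) {
	_, err := os.Stat("out/generated.yaml")
	if err == nil {
		t.Fatal("out/generated.yaml should not exist")
	}
	if !errors.Is(err, fs.ErrNotExist) {
		t.Fatalf("unexpected error checking file: %v", err)
	}
}
```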

The Limits of AI

Just like all the other roles we talked about, AI had its limits. For example, here's a PR that I was creating for said YAML package that I was talking about, and it compiled, tests passed, and all of this stuff. Someone on my team immediately jumped on it and said, did you notice that there's this change in there, and why are you packaging it with this PR? It seems completely unrelated. Here's another example. I had flattened out our YAML structure, and AI had suggested a bunch of useful mechanical feedback, but someone on my team immediately said, having that duplicate function makes me really sad. Can you fix that? It's funny, because the word sad, when someone says that, that's the intuition talking. Both versions worked perfectly fine, but that feeling comes from a sense of knowing what good looks like. He also asked me really useful questions like, where is this project going? Because you have this field called language, but you told me we might be using this for gcloud also, so in that case, is gcloud a language? This is because when engineers are reviewing your PR, they're not just thinking about the PR they're looking at, they're thinking about the roadmap, they're thinking about where to go, and things like that. AI couldn't help me answer questions about things outside the context that it had been given. Here's one last example, which was a discussion around tradeoffs. Someone on the team had said, we have a bunch of sources in this struct, why don't we consolidate them into a map, because then we don't have to type as many things? Then someone else jumped in and said, actually, I think we should keep it the way it is for now, because we're still in the design phase of this process, so having the fields be explicit can help us rethink the structure later on. That's the kind of judgment you can only provide if you know the context of the project, what we were still trying to do, and what kinds of mistakes we wanted to make easier to catch.

Summary

The project I've been talking about is still ongoing. When I started this work, I genuinely thought that the hardest part would be figuring out how the system worked across all of the dimensions that I talked about: all the languages, all the components, all the workflows, the products, and just the decades of drift. That was really hard. When I brought AI into the loop, it helped me see a lot more clearly. Then I realized that the hard part actually came after, which is, what do I do with all this information? What do I keep? What do I change? What do I throw away? AI helped me shine a light into the corners. It helped me map what was wrong, what was redundant, and what was there because we've always done it that way. What it couldn't do, and what it still can't do, is replace the messy and hard-to-articulate work of understanding, synthesizing, and deciding. As an archaeologist, AI can reconstruct things for you. What it can't tell you is why we made a decision. For that, I needed the engineer who had been around at the time the decision was made. As an experimenter, AI could bootstrap a bunch of ideas for me very quickly. It couldn't really validate that I had gotten it right until I went to a real team. As a critic, AI could tell me the differences and the inconsistencies, but it could not replace the judgment of my teammates. As an author, AI could write me code, but it leaned on my years of experience to actually know whether that code was good. As a reviewer, AI caught tons of mechanical bugs. It was the humans who caught the things that existed outside of the code, the things that required the context and the roadmap and the direction we're trying to go to really push the project forward.

Lessons Learned with AI as a Thinking Partner

Here's what I've learned in my time using AI as a thinking partner. I thought at the beginning that the problem was, I just need more of me. I just need to be able to clone myself, and then I'll have more hours and more hands and more time. What I realized is that cloning myself wouldn't have solved the problem, because the bottleneck was never in my hands or how quickly I could type. It has always been in my judgment, my understanding, the synthesis, and the messy work that it really takes to decide what actually matters. AI can't give you that. It can't replace your judgment or taste. It can't replace the intuition that you've earned from spending years in a domain. There's a reason for that. We've talked about how AI is trained on what already exists, on the average of the past. The most interesting engineering problems, I think, don't actually live there. What AI can do is help you think. It can challenge your decisions. It can ask you sharp questions. It can push back on you just like a curious teammate. It can take on the work that you wish you could clone yourself to do. The only reason I think it can do that in the first place is because you have the expertise to know what to do and to know where you're trying to go. What AI gave me was something that I have always wanted. It gave me that extra set of hands to do the work that was too rote to be interesting but too context-heavy to delegate. It was work that I knew how to do. It was just stuff I didn't want to spend my limited time and energy on. AI didn't make me ten times faster. What it did do was make me ten times more present in the work that I was trying to do. It gave me leverage over the parts that had become mechanical. It allowed me to reinvest that energy into the things that were still growing. That's the real value that I have seen from AI, at least for me: freeing yourself to spend your time on the things that truly excite you and that only you can do.

 


 

Recorded at: May 15, 2026
