From MCP and Vibe Coding to Harness Engineering: How Did AI Native Engineering Evolve in One Year

Birgitta Böckeler, Distinguished Engineer at Thoughtworks, returns to discuss the rapid evolution of AI in software delivery. She touches on the evolution from vibe coding, the changing tools landscape and the more autonomous agents that, besides higher velocity, introduce higher risk.

Key Takeaways

  • AI Development moved from verbose vibe coding to more autonomous context engineering. In this shift, monolithic files and MCP servers were replaced with lazy-loaded 'skills,' CLIs, and scripts, which conserve context window space.
  • The discussion on user interfaces for agents is split: terminal-based tools like Claude Code are useful for headless execution, while IDE-based tools like Cursor provide a preferred visual view for debugging and complex tasks.
  • 'Harness engineering' is a new concept aimed at building confidence in autonomous agents. It involves creating "a harness" with 'feed forward' tools (conventions, architecture context) and 'feedback' mechanisms (static code analysis, test suite results) to allow the agent to self-correct and reduce human supervision.
  • Supervision should be governed by a risk assessment that considers the probability of AI success, the impact of failure (criticality), and the detectability of errors.
  • The AI Native space is still in the "forming" stage, with developers still finding the right balance among risk, agent autonomy, and the appropriate level of human supervision to ensure code quality and safety.

Transcript

Olimpiu Pop: Hello everybody. I'm Olimpiu Pop, an InfoQ editor, and I have in front of me Birgitta Böckeler. You probably remember that we had a conversation last year, but as things are changing so fast in the AI landscape, we invited Birgitta again to see what has changed in the 365 days since she last looked into that, at least for QCon. Birgitta, if you want to introduce yourself, has anything changed for you in terms of position at Thoughtworks?

Birgitta Böckeler: Yes. Hi, Olimpiu. Thanks for having me again. I'm Birgitta. I'm a distinguished engineer at Thoughtworks, and my distinguished topic, so to say, is how to best use AI on software delivery teams, and I've been doing that full-time for two and a half years now.

The State of the AI Engineering Space [01:21]

Olimpiu Pop: That's a lifetime in AI. Two and a half years. It's like the Middle Ages, all the way back to the future. Last year, we discussed vibe coding, because it felt like that was the topic at that point. And obviously there were still ideas, people were just having bits and pieces that they generated and then they just stitched together. It felt like maybe a fancier Stack Overflow, copy pasting stuff, because you didn't have to move away from the IDE. And at that point, it felt like you had a leader and that the leader was usually among those that were based on the Microsoft framework. Cursor was the one that was doing a lot of innovation and then everybody else was copying. How do you feel about things now?

Birgitta Böckeler: So last year at QCon, I gave a talk, the title was From Auto Complete to Agents, because of what I still called at the time agentic modes. So it didn't quite feel like a full agent yet. Those agentic modes were just gaining traction last year in March, April, and the vibe coding term was about two months old at the time. And like you said, Cursor was one of the, if not the most popular coding agent or coding assistant. And Claude Code wasn't even around yet or at least not big in the public discourse. I think there might've been an early release already or something like that.

So about a year later, I would say that Claude Code is arguably the most popular coding assistant, coding agent, even though Cursor is still very much relevant, but now it seems like the Cursor team is more and more copying from the Claude Code team because they're kind of setting the pace in terms of ideas and stuff like that. And another thing we talked about a year ago was MCP servers, which had just started happening and were super popular. And there are now also alternatives for how you can do stuff, not with MCP servers, because we found some of their downsides. So definitely a lot has happened over the past year.

Olimpiu Pop: Yes, and in the company that I'm working, we do use Cursor, but mostly our designers are using it because it's easier for them to integrate Figma through Cursor and then generate. We use Codex rather than Claude. Don't ask me about the reasoning, but that's what we are doing.

Birgitta Böckeler: Yes, I mean, if you want to stay on that for a bit, like Claude Code versus Cursor, or let's say terminal-based coding agents versus IDE-based coding agents, I think because Claude Code has become so popular and it was the first terminal-based agent that became very popular, even though there were other open source ones before, people often equate it with like, "Oh, it's so popular because of the terminal experience". But I actually don't personally think that's the case. I think Claude Code works really well under the hood and you can have the same experience in a more, let's say, graphical user interface, I think. So I think it's still a preference. I mean, of course, these terminal-based agents, they have the big advantage that you can run them in a headless mode, so you can also run them in your pipelines or run them more in the background. But now all of the other big coding agent, coding assistant products have started copying this as well. So you now have a Cursor CLI, you have a GitHub Copilot CLI, all of them have CLIs now as well.

So now it's more like a preference or also decision which situation do you use the terminal, which situation do you use the IDE? And like you said, designers are maybe more comfortable using the IDE because it's still not quite all the way down there in the terminal, which they're maybe not as used to. But I have to say personally I use both. So they are my top two coding agents that I use at the moment, Claude Code and Cursor. And I sometimes just prefer the more visual view that I have in Cursor. And it also is very strong under the hood. But I sometimes prefer in the modes where I want to watch what the agent is doing. It's easier for me to process what's going on.

And it also has some features that you actually don't have in the terminal, which is like rolling back to an earlier state in the conversation and stuff like that, it's just easier to do in a more graphical environment. So I don't think it's as easy to say Claude Code is successful because of the terminal and that's the future. I think there's still a lot of benefit and maybe it's a personal preference, but I think there's still a lot of benefit for the graphical projection of what's going on as well.

Context Engineering, Skills, and the Decline of MCP Servers [05:39]

Olimpiu Pop: For me, where I think Anthropic, the company, manages to do a lot of great stuff is that they do a lot of research. And I enjoyed reading about their red teaming, about trying to see how much can they push the agents and, well, the models even before that. So that's where I think they do a lot of great work. The other day I started watching a talk, I think it was at Black Hat, where they were just pointing out how they use it in terms of finding vulnerabilities to just see how much can they turn the model. And obviously you see the tool.

Now in terms of the experience, I see a lot of my colleagues and people that are around me that are coming from the product or designer space that do it, so at least that's how we look at it. We don't care about the code itself. We just generate something, it works and we have a test suite that we are aligning, and then the developers come and they just start improving those things, creating components, if we don't have them already, and then we are just sandboxing it around that. So okay, use those components and then everything is just glue code that is easy to just clean up and make things better.

And I think both tools are important because, for instance, it's just throwaway code and I just care about how it actually goes. And I spoke earlier this week with [inaudible 00:06:56], and I think for me it's the first, let's say, public maybe production project that somebody worked with vibe coding, well, whatever it's now. And he was talking about his implementation of Hardwood, which is a Parquet implementation in terms of data storage. And that was quite interesting the way he was mentioning and making sure that the code is nice and that it's about optimization. And I think there are moments when we do have to use it.

Birgitta Böckeler: Yes, and also maybe to go back to the designer use case, I think even before designers are using, let's say, Cursor or maybe like a rapid app generation tool like Lovable or Replit or something like that, right? As developers, we can prepare the environment in which they're creating the prototypes to make it more probable that we have to do less polishing later. For example, when I create a prototype with something like Replit and then I download the code, it's usually the same tech stack. I know that they have a lot of prompt orchestration and tooling and stuff like that under the hood that always creates the same tech stack in a reliable way because they're focusing on front ends. And in a larger organization where you have a design system and so on, you can also prepare an environment like that to make the prototypes that they generate closer to what I already need.

I mean, it's still not a given that then I can just take a prototype and put it in production, but it can just make it closer to that. I also hear stories from teams in Thoughtworks about using a design system as a source of knowledge for a coding agent, or even one step before that, maybe almost like reverse engineering or retconning a design system based on the existing code base so that you then have a nice new abstraction that a coding agent can use and so on.

Olimpiu Pop: That's a good idea. And because you mentioned Figma, I have to ask about MCP. We do use MCP because that was the only way how to interact with Figma, at least a couple of weeks back when I last looked into that. What do you think, is MCP still a story or not?

Birgitta Böckeler: Yes, maybe we can open a broader conversation about context engineering while we talk about that. Because the term wasn't even around, or at least it wasn't as talked about last year when I spoke at QCon, again, if we want to compare to what was around about a year ago, I think it started gaining traction around June or so last year. So context engineering, meaning you try to tune what the model will ultimately see so that you get better results, what it will see, what it will not see. So that's a simple definition. And of course it means different things, like when somebody's actually building an agent or something, it has very different concerns than when somebody's using a coding agent. And when we're using coding agents, we basically want to put in our coding conventions, maybe something about our architecture, our general business context.

So that's part of the context that we feed it before it even generates stuff. But then we also give it these context interfaces. So we tell the model, "Here's a bunch of tools that you can use", that's the most generic term for it, and the interfaces of how you can call them, and one of those things is MCP servers. By the way, dear large language model, you can access things that are in Figma, and here's the description of how you can do that. And then the models are trained to be good at creating the structures to call those MCP servers. Other context interfaces are, for example, the tools that are already built into the coding agent, like how to edit a file, how to do code search and so on.

But then the third one that has come up over the past only a few months, I think, maybe end of last year is skills. So skills is also kind of context interface because it's basically initially when you start a session, the large language model just sees a list of skills and descriptions. So for example, the description could be how to get Figma designs from our Figma system or whatever. I don't know if, maybe for Figma, MCP servers are still the best way to do that. I don't know how exactly the integration with Figma works in an MCP server. It's probably calling the API. But with skills now coming around, a lot of people have moved their MCP use cases to skills and to CLIs or scripts, because now in a skill you can actually package up multiple files, not just a markdown file, but you can also provide documentation or, like I said, a script or something. So you can say in this skill is also a script that can call the Figma API and you can call the script in the following ways.

So that would be another way to create that Figma integration. So the reason why I think people have shifted towards that and are not defaulting to MCP anymore is one, that context interface takes up less space in the context windows. So you just have the description of the skill, and the agent only loads everything that's in that skill folder when the LLM thinks that this is relevant for the task at hand. When the LLM is like, "Oh, looks like I might need Figma designs, let's load this whole folder of resources of how to access Figma". So it's kind of a progressive disclosure or lazy loading of context. So that's one reason it's more efficient on the context window. And then the second reason why people are moving to it is I guess a little bit of this when you already have a CLI, like let's say you have the AWS CLI installed on your machine, why would you run another local process like an MCP server that can access the AWS cloud when you already have the CLI on your machine and the agent could also call that CLI.

There's of course, with an MCP server, you still have almost like a bulwark towards the other system. You can put more permissioning in there and maybe you don't want to give the agent access to everything that a CLI can do that is installed on your machine, so there's additional considerations. But that's basically what that shift is about from MCP to maybe more skills and CLIs and scripts and stuff like that.
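The progressive disclosure of skills described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular coding agent implements skills: the names `Skill` and `SkillRegistry`, and the idea of concatenating every file in the skill folder, are hypothetical. The point it shows is that only the short descriptions are placed in the context at session start, while the folder contents are read the first time the model asks for them.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Skill:
    name: str
    description: str  # the only text placed in the context up front
    folder: Path      # docs and scripts, read only on demand


class SkillRegistry:
    """Hypothetical registry: cheap index first, full content later."""

    def __init__(self, skills):
        self._skills = {s.name: s for s in skills}
        self._loaded = {}

    def index(self) -> str:
        """Short listing that the model sees at session start."""
        return "\n".join(
            f"- {s.name}: {s.description}" for s in self._skills.values()
        )

    def load(self, name: str) -> str:
        """Read every file in the skill folder the first time it is requested."""
        if name not in self._loaded:
            folder = self._skills[name].folder
            parts = [
                f"## {p.name}\n{p.read_text(encoding='utf-8')}"
                for p in sorted(folder.iterdir())
                if p.is_file()
            ]
            self._loaded[name] = "\n\n".join(parts)
        return self._loaded[name]
```

With ten skills registered, the up-front cost is ten description lines; a skill's documentation and scripts only consume context window space once `load` is called for it.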

Olimpiu Pop: Yes. So to just summarize a bit, we moved from having a monolithic, let's call it, CLAUDE.md or AGENTS.md, to smaller pieces, like smaller modules like skills in which you can define it and then you just have links to those in the upper initial index, whatever. And MCP is still useful, but at points where we do interact more easily with other applications. So another use case that I was looking into was in terms of hardware design. And then you have 3D models and stuff like that that is a lot harder to achieve at this point, but it's good for translation. So I think those are the cases where you have established tools like Figma or, I don't know, 3D Studio Max or whatever, and then it's a lot easier if you need those kind of things to interact with them.

But now for me, it feels like this is a very meta-agile development because MCP was a stepping stone. As you used to say some time ago, you had a thought process about protocols and then that was going through multiple phases. Now it seems that things are popping up as solutions, then they are not useful anymore and they are discarded, and everything is moving so fast.

Birgitta Böckeler: Yes, and I would say there are still use cases for MCP. It's just that for a while, everybody defaulted to it for a lot of cases and now we have more tools in the toolbox and we can decide what we want to use.

The Resurgence of TUI for Agents [14:16]

Olimpiu Pop: I'm pretty sure about it because on top of that, there are the other agent protocols. I think it's whatever, agent to agent, A2A protocol. And talking about tooling, I spoke a couple of weeks back with Will McGugan, and he is one of the people that is very established in the TUI space, and he said, "I like AI coding, but I don't like how user interfaces look in the terminal". So he put together what he calls Toad. And what he said is, "I took all the richness that I have from Textualize and other very intensive textual user interfaces and tried to build something new that runs with everything". Obviously, Claude then creates some kind of gimmick that you have to use their tool. And it seems that this push is bringing together a new Renaissance, if you want, of terminal applications.

Birgitta Böckeler: Yes. So at Thoughtworks we have this publication twice a year, the Thoughtworks Technology Radar, and we just had our discussions a few weeks ago about the next edition, and we had one or two frameworks come up, I think, to build terminal UIs. And at the moment, the most popular open source coding agent, arguably, is OpenCode, which for a while I never had time to try out. I just heard everybody talking about it, how great it is. Before, the open source ones that were really popular were like Cline and Roo Code and Continue, but now OpenCode is the one that has settled as the leader in open source and terminal-based UIs. And there are quite a few things that are different about it that actually make it feel a bit nicer in terms of the experience than Claude Code, to be honest.

Olimpiu Pop: Yes, I think there are plenty out there. And back to your point, I think context engineering is something very important at this point. It's how you can create it. My feeling is that Claude is able to create that kind of ecosystem where you have a lot of stuff. I was talking to somebody the other week and he was saying that they managed to integrate Claude in their company at the enterprise level, and then he's just going to Claude and says, "Okay, what are my tasks for the week?" And then having access to the inbox, the Jira, and so on and so forth. And it seems that people are now getting highly reliant on these tools because, like some guys said, "Okay, I'm on the plane. How can I code now given that I don't have access to the internet or my tokens run out?"

The Struggle of Local Models with Agentic Experience [16:41]

Birgitta Böckeler: Yes. And that is also something that hasn't really changed or where we haven't seen a big step change over the last year is in the use of local models, at least not at that kind of cutting edge agentic experience. So there are now ways to connect even Claude Code to local models, like model environments like Ollama that you can run on your laptop have started supporting the API that Claude uses to call the model. However, it's just like even if you give them a higher context window than Ollama provides by default, even if you have reasonable ... Not reasonable, but high amount of RAM on your machines or in general, like a strong machine, it's a lot slower. And also the smaller models usually still struggle with the tool calling. So it will be stuff like they don't really understand what the tools are that they can call. So all those context interfaces that I talked about before, like the editing tools, the MCP servers, the skills and so on.

So either they just don't even discover them or they know that they're there, but the models call them with the wrong format or all of those things. So a model has to be really strong at that to support a good agentic experience, and a lot of the big models are actually fine-tuned, as far as I know, for the tool calling as well. And then in general, the coding agents also have relatively complex prompts, really big system prompts and stuff like that, and that's also what some of the smaller models then struggle with. And they go into loops and just hang and stuff like that. So unfortunately, we still don't get that cutting edge experience with a locally running model.

Olimpiu Pop: Yes. Well, it'll probably be coming because obviously now, with the whole sovereignty movement all over Europe, you see people that, okay, it was a bit funny for me, it was like, "Okay, we don't want to be reliant on Claude or Codex or whatever. So we just decided to use Qwen". Well, interesting approach for privacy, dropping Claude or Codex in favor of a Chinese model, but that's also something interesting. Anyway, that's a whole different conversation. Another point that you had was harness engineering, that probably, moving along, we'll not discuss just using something or something else, but we'll just look for those programming languages, frameworks, whatever, that have the best harness. And given that harness was used in multiple spaces, let's try to coin that and make sure that we know exactly what it means.

Harness Engineering: Increasing Confidence and Reducing Supervision [19:12]

Birgitta Böckeler: Yes, so that's a relatively new term that is gaining traction, at least in the coding agent space. So with the more powerful context engineering, and obviously also we've had step changes in how good the models are at coding as well. So with those two things combined, it has just continued the push to give the agents more autonomy and to reduce human supervision. But if we want to do that, if we want to reduce our supervision, then we have to increase our confidence in what the agent is creating. So this term harness engineering has started popping up. I actually just wrote an article about it as well. And the first thing I did was I was trying to understand what does that even mean? In which context, in which bounded context is this term harness being used?

It seems like originally it was used in the context of talking about evaluation harnesses for large language models, and then in the past few months it's been used a lot to say, okay, a harness is everything except the model. So an agent is actually a model plus a harness. So that's like a really wide definition. So people would, for example, call an agentic framework like Pydantic AI a harness, or they would even call Claude Code a harness. So anything that orchestrates everything to create that agentic experience. But then if we take it to the bounded context of using a coding agent, so like us actually working with a coding agent, then we can actually also build a harness around all the harnesses that we're using under the hood in the sense of constraining the agent with different tools and guidance to get better results.

So what I'm talking about is here, one, like what we feed forward, what I talked about before, like we give it coding conventions, we give it context about our architecture. So everything we give it to kind of ... We anticipate that the agent might not know this and therefore it might do something wrong. So we're trying to increase its chances of success in its first generation. And then secondly, once the agent generates something, we want to immediately give it feedback without even being involved as humans. We want to immediately give it as much feedback as possible so it can self-correct. So those would be things like giving it access to static code analysis, or of course it should be able to monitor if the test suite is still green or stuff like if we have a strongly typed language that it can immediately see when there's like a typing issue, when there's a compiler error, stuff like that. So we want to give it as fast of feedback as possible.

And all of those things is kind of like a mix of both of course, again, AI based things. So to give it feedback, we can also start another AI agent that reviews the code. But we can also just give it access to the tools that we also use or maybe underused in the past as humans. I said like static code analysis, maybe even more advanced static code analysis, custom linters, stuff like that, which is now easier to build as well. And also in the feed forward, when we try to make it more successful in the first place, we can also use, let's say, CPU-based computational things. We can give it access to a language server, for example, so that it can maybe do IDE style refactorings by renaming a symbol or something instead of doing it with a text diff.

So a harness is all about strategically thinking about what of these feed forward and feedback things you put in place so that you feel more and more confident to reduce your supervision or to direct it to the places where it's most important that humans look at it and you don't have to fix all those small things that maybe an agent can just as well fix itself as long as it has access to feedback.
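The feedback half of such a harness can be sketched as a small Python helper that runs a set of checks and hands the agent a compact summary to self-correct against. This is a minimal sketch, not a description of any specific tool: the names `CheckResult`, `run_checks`, and `feedback_for_agent` are hypothetical, and treating a zero exit code as a pass is an assumption that holds for most linters and test runners.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run each command; a zero exit code counts as a pass (assumption)."""
    results = []
    for name, cmd in checks.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        output = (proc.stdout + proc.stderr).strip()
        results.append(CheckResult(name, proc.returncode == 0, output))
    return results


def feedback_for_agent(results: list[CheckResult], max_chars: int = 500) -> str:
    """Summarise failures only, truncated to save context window space."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return "All checks passed."
    lines = [f"{len(failures)} of {len(results)} checks failed:"]
    for r in failures:
        # Keep the tail of the output, where most tools print the verdict.
        lines.append(f"[{r.name}] {r.output[-max_chars:]}")
    return "\n".join(lines)
```

In a real project the `checks` mapping would name whatever the team already trusts, for example a linter, a type checker, and the test suite; the agent would be told to call this after each change and keep going until the summary reads "All checks passed."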

Olimpiu Pop: Okay, so actually harness is a box that until not long ago you felt that is the know how or the different thing that you have, like best practices tooling and so on and so forth.

Birgitta Böckeler: So far the developer was the harness.

Olimpiu Pop: So now we are moving away from that and the harness is actually those kind of things that normally you'll know, and that's the next step, as you said, giving the model more autonomy where you can focus more on the higher level stuff and just on the result.

Addressing the Behavior Problem in Agents [23:21]

Birgitta Böckeler: And there's been a lot of blog posts and writing from teams who are trying to push into this direction recently, but most of that writing is all about a harness that looks at maintainability, internal code quality, stuff like that. And all the examples I just mentioned around that as well, like static code analysis and stuff like that. But I think then over time we can push that into different areas like architecture fitness. We can give it access to architecture fitness functions. And maybe it's not all while the agent is initially working, but it can also be later in the pipelines. There was a really good article by a team at OpenAI who've been working on a code base for five or six months in this manner, and the main thing that they work on is this harness. Every time something goes wrong, they try to improve the harness first before new code gets written.

But there's not that much in all of that yet about, let's say, a harness for the behavior. So I think there's a lot of ideas where I can totally see a path, oh yes, we can push this, there's quite a lot of buffer, we can do more custom static code analysis, we can integrate it with fitness functions, all of these possibilities. But with the behavior, it's still kind of like a little bit lackluster. So what I see most people do like as feed forward, they write a spec, whatever that spec might be, it can be just like a prompt or it can be like an elaborate set of 50 markdown files. And then for the feedback, they have the agent generate tests, and when the test suite is green, that's enough. And then you do some manual testing and that's it.

But I think that's very unsatisfying to me because the agent also generated the tests. So maybe then I have to think about feedback for the test. So that's maybe test coverage. There's a resurgence of mutation testing as well, which is kind of like a test quality feedback thing, but you maybe only want to run that in the pipeline and not early on because it takes quite a lot of time. I think there's still a lot of questions about behavior in particular also because as a human, I often, when I write the spec, I sometimes don't even know yet what I need. I might think that I know what I want, but then only when I see it, historically we know we haven't been good at writing specs. So how are we going to do this validation thing? And all of these things, of course, for a lot of people, the ultimate goal is to have an agent go off by itself and you don't even look at it anymore and you hardly need any humans anymore to build anything. They all write the specs.

I think even if you find that extreme or if you think that's maybe a far way out, all of these techniques are also super useful for like a supervised session. So for example, I've been experimenting with just having a little companion running next to my coding agent that constantly shows me the result of static code analysis, test suite, test coverage and so on. And the agent can also access it and gets like an agent optimized summary of that state. And this gives me an extra level of trust into what's going on. I don't have to look at every single thing or wait for the pipeline to run, but I can always see my definition of good quality as far as a computer can evaluate it. I can always see at which state that is and has it gotten better or has it gotten worse, and the agent can also see that.

Olimpiu Pop: Okay, so that's some kind of a gauge on your car screen, whatever, to just see how it feels.

Birgitta Böckeler: Yes, like a heads up display almost. So for example, when I have an agent running, doing something for a while, maybe in the meantime I've done something else, I can go back and at least I get a first glance of, is it even worth it yet for me to look at this or are 50% of the tests failing or has the test coverage dropped 10% even though the agent wrote a lot more code or all of these kind of ... Almost like a triage thing, like the low level things that a computational CPU-based tool can also check for me, are those all green? And we had all of those tools in the past, but we often underuse them or maybe it was sometimes just not worth it always to use them because we were the harness and we were going in small steps. But I think we can revisit a lot of them to see where it's worth investing in them because it's worth replacing some of our supervision with them.
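The triage idea above, checking whether a long-running agent's output is even worth a human look yet, can be sketched as a comparison of two quality snapshots. This is a hypothetical illustration: the `Snapshot` fields, the `triage` function, and the two-point coverage threshold are all assumptions chosen for the example, not values from the conversation.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    tests_total: int
    tests_failed: int
    coverage_pct: float
    lint_errors: int


def triage(baseline: Snapshot, current: Snapshot,
           max_coverage_drop: float = 2.0) -> list[str]:
    """Return reasons why the work is not yet worth a human review.

    An empty list means the low-level, computer-checkable signals are green.
    """
    reasons = []
    if current.tests_failed:
        reasons.append(
            f"{current.tests_failed}/{current.tests_total} tests failing"
        )
    drop = baseline.coverage_pct - current.coverage_pct
    if drop > max_coverage_drop:
        reasons.append(f"coverage dropped {drop:.1f} points")
    if current.lint_errors > baseline.lint_errors:
        reasons.append(
            f"lint errors rose from {baseline.lint_errors} to {current.lint_errors}"
        )
    return reasons
```

The same list of reasons could be shown to the human as a dashboard and passed to the agent as text, so that both see the same definition of "good enough to review".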

Risk Assessment for AI-Generated Code [27:30]

Olimpiu Pop: I strongly agree with that, and that's what I'm pushing for almost three years since more coding with the agent started ... Well, in the initial phases just using ChatGPT or simple things like that, now it has a word and that's a harness. You have a safety net that assures you that all the things are going there. Where it's a bit divergent to the ideas and the thought process I had at that point is that currently you're using only one model because at that point I was doing something like, okay, I'm writing the tests as a human or I'm writing the code and then you had two different points. But now it feels that the agent is doing everything. Obviously it has multiple hats and you're just saying do now adversarial code review or stuff like that. But still for me, it's a bit, I'm not that confident, not that comfortable into going all in only one tool, but it is just a matter of your idea.

Birgitta Böckeler: Yes, and it depends on the situation. I also was talking at QCon about my mental model for how I decide what workflow I use, like a simple one or an elaborate one, or how long I let the agent go without supervision, or how much code review I do. So my mental model for it is always that of a risk assessment, and risk is always probability plus impact plus detectability. So I think about what's the probability that AI will get it right or wrong. So for that, I have to know about the level of my context engineering, how confident am I in my specification and stuff like that. What is my experience with the coding agent I'm using? And then for impact, I think about the criticality of the use case. Is it a proof of concept or a spike? That's at the one end of the extreme, or is it at the other end of the extreme, a critical business workflow that I'm on call for at the weekend, so I might get called at 2:00 AM. So that's a very different impact case.

And then third is detectability. So how easy is it for me to detect when or if AI has done something wrong? So those are always the three components I think about when I decide how much review I put into something. But I have noticed that the more I use coding agents, the better I get at this quick assessment, like so many things that we more intuitively do after learning about other things in coding after a while. And especially the probability part, like the probability that AI gets it right or wrong, is the one where we have to build up this experience about what AI is good at and what it's not good at. And we have to understand at least to a certain level how a coding agent works under the hood and what tools we have at our disposal and so on. So that's the part that we have to build up with using coding agents over a period of time and then it becomes an experience that we have. So it's not something we can learn from a book, unfortunately, because of the weird nature of this technology.
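The three-part risk assessment described above can be written down as a toy scoring function. This is a sketch under stated assumptions: the three-level scale, the additive score (following the "probability plus impact plus detectability" phrasing), the thresholds, and the names `Level` and `supervision` are all hypothetical. In practice, as the conversation notes, this judgment is built from experience rather than computed.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def supervision(p_failure: Level, impact: Level, hard_to_detect: Level) -> str:
    """Map the three risk components to a suggested level of human review.

    p_failure:      how likely the agent is to get it wrong
    impact:         how critical the use case is (spike vs. on-call workflow)
    hard_to_detect: how difficult it is to notice a mistake
    """
    score = p_failure + impact + hard_to_detect  # ranges from 3 to 9
    if score <= 4:
        return "let the agent run; skim the result"
    if score <= 6:
        return "review the diff before merging"
    return "small steps with close human review"
```

A throwaway proof of concept with good test feedback would land in the first bucket, while a critical business workflow where errors surface only in production would land in the last.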

Olimpiu Pop: Well, actually it's quite interesting. And looking also at the other presentations and content that we had at QCon London this year, I see a lot of things coming from academia. I was surprised to see things like formal validation methods have a resurgence. Then there was DST, deterministic simulation testing, which is also coming to the fore because until now it felt that the testing, the validation, was always falling behind, but now things are moving a lot faster and you need more tools that provide this kind of ability to understand where it actually comes from. And I'm happy to see that other people from academia are trying to bridge the gap. I'm still curious whether it'll come to any kind of mechanisms that are actually pragmatic because, as you said, there is not a simple book to follow and it feels that it's more empirical, it's trial and error.

But I think there are things that currently become a reality because of this speed. But there is one other concern that I have. We are moving quite fast. And until not long ago, you had the buy versus build debate, but now you also have the part where it's vibe coded, and there are a lot of people who, rather than doing the research on the internet and finding the best tool for that, say, well, actually I'm going to build it. And that's quite interesting and also scary from my point of view. But the biggest question that I have is this: now we are using SaaS, software as a service, and we are using mostly models that are in the cloud. We all know that they are expensive and getting more expensive. There are all the issues on the security side. But what happens with the code that we build and we want the model to know? How are we feeding that back in?

Birgitta Böckeler: You mean the code base as our context?

Olimpiu Pop: Yes. We are just changing it, but then in the end, we have a model that has training data that is from the other day.

Managing Context: Code Search and Privacy [32:14]

Birgitta Böckeler: Part of the built-in harness is the code search tool that is available in every coding agent. So every coding agent provides the LLM with different tools so that it can search the code base and analyze the code base and so on. And it seems like the default has become what seems relatively crude at first sight: providing the model with grep, glob, and find, Bash-based search tools like that, so text-based search tools. You might remember that in the beginning, when coding assistants were just starting out, there were a lot of them that by default were also indexing the code base. Cursor still offers that as well, for example, and it was there from almost the beginning. So that's where you turn the code base into embeddings and have a local search over the code base as well. Then I also mentioned giving coding agents access to language servers, which might help them navigate a call stack much more easily than if they just do a text search or something like that.

And then there are products like AMP by Sourcegraph, and GitHub Copilot has an integration like that as well. They integrate the agent with code search across multiple code bases. So they actually have sophisticated code search built up that's like a previous product or something that we previously had. GitHub's code search is quite powerful, for example. And then you give an agent that. So there's a bunch of these different ways to do this. There's also what we see and what we also do in the legacy modernization space a lot, which is taking a whole code base and actually loading it into a big graph so that you can start doing analysis. And that loading it into a graph also includes tools that take the abstract syntax tree of the code base and then enrich it with all kinds of things. Or tools like Unblocked, for example, load the code base into a graph, but they also load your tickets, your Slack history, your Wiki and so on.

So they're trying to create connections between these things so that the agent has access to more nuanced and semantic information about, oh, we'll have to work on this piece of the code. That means there's all of this other context that belongs to it and that can potentially help it make a better plan. So I think even though the default has turned into like, let's do Bash tool-based, text-based search because it's actually quite powerful apparently, there's still, I think a lot of situations, especially when you have a large code base, where you might want some of these other approaches to have more powerful search that actually is aware of how the code is structured and how to navigate it and not just based on text.
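The difference between text-based search and search that is aware of code structure can be shown with a small sketch. It uses only Python's standard `re` and `ast` modules on an in-memory snippet; the sample source and the `text_search` and `call_sites` helpers are hypothetical and do not represent how any particular coding agent implements its search tools.

```python
import ast
import re

# Hypothetical sample code base, kept in memory for the example.
SOURCE = '''
def charge_card(order):
    return gateway.submit(order)

def checkout(order):
    # charge_card is retried by the caller
    log("about to charge_card")
    return charge_card(order)
'''


def text_search(source: str, name: str) -> list[int]:
    """grep-style search: every line where the name appears as a word."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def call_sites(source: str, name: str) -> list[int]:
    """Structure-aware search: only lines where the function is called."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
    )
```

Here `text_search(SOURCE, "charge_card")` matches the definition, the comment, the string literal, and the call, while `call_sites(SOURCE, "charge_card")` returns only the line with the actual call. The text search is cheap and language-agnostic, which may be part of why it became the default; the syntax-tree version needs a parser per language but filters out the noise.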

Olimpiu Pop: And now to flip it. The other day, GitHub Copilot sent an email where they explained that your code base will be used for the training purposes of the model, and everybody was publishing about it on social media. How can you protect your privacy in this world where we're using coding agents?

Birgitta Böckeler: I was a bit confused by that. I don't know yet exactly if I understand what's going on, to be honest, because that setting that everybody freaked out about is in the Copilot settings. And I thought that the setting relates to when you're using a coding agent or Copilot, you're sending it parts of your code because it gets that context. So I didn't understand it as by default they're just training models on all of the code that is on GitHub. But as the backlash was so big, that setting was there before. So I'm not quite sure yet. I haven't looked at it in detail what's actually going on, but I was a bit confused.

Olimpiu Pop: Yes, it's odd because this thing happened last year as well, and Copilot was more in focus, but it seems that this is a recurring topic and-

Birgitta Böckeler: Yes, and in the Copilot for Business license, usually that setting is even switched off by default or at least used to be, and on an organization level, people can control that. So usually all of these big coding assistants, they give you options to switch this on or off. But of course, sometimes when it's defaulting to on, that's always quite annoying if it's an opt-out and not an opt-in. So maybe that's what the backlash was about, that it was the other way around before. I don't know.

Olimpiu Pop: Yes. Well, it was a little like that and it changed, but yes, let's see what happens. Any predictions for the upcoming 365 days? I know. I know. It's hard.

Predictions for the Next Year [36:43]

Birgitta Böckeler: Not predictions. I mean, I see my role in this whole ecosystem more as the person who looks back a bit and looks at the current patterns, and then I'm often stuck in what's happening right now. So I see a lot of this stuff happening with the harnesses, which is a very generic term, but I expect that we'll fill out a lot more of those pockets going forward. Something that is very hard for me to predict, because there are so many confusing, contradicting signals, is costs, like what's actually going to happen to cost, because it's pretty obvious that at the moment it's still very subsidized. Even when you have a quote-unquote "flat rate" that has rate limits and stuff like that, I think it's pretty clear that people often use a lot more than would actually be part of the flat rate.

But even when you use their APIs directly and you pay for tokens, like as many tokens as you use, it seems like that might also still be subsidized. And there's all of this bubble stuff going on with all these companies giving each other money and building data centers that might only be done in a few years. So it's all like ... On the other hand, also with technical optimizations, it would also make sense that this gets technically more efficient and then it might also get cheaper or it might not have to be subsidized anymore, but there's so many signals in all directions that I find that very hard to predict.

Olimpiu Pop: Yes, yes, it's hard, but I said I'll just shoot in the dark and see what comes from your side. But I think your point is that currently both OpenAI and Anthropic are private companies. And the only way you understand what actually happens under the hood is when they have a financing round and people are just all the rage. But if you look at the way the stock exchange markets are going, it's not exactly like that. And also the geopolitical space, it's odd these days. The latest thing that I saw in the market is the fact that Anthropic is pushing for having an IPO this year. At that point probably we'll understand better how much they actually subsidize the cost and what's the real cost of everything and all these points.

Birgitta Böckeler: Something I hope to see more, which when I use the word hope, it might seem like a positive, but I hope to see more public stories about fails of AI coding in terms of maintainability. I mean, we're seeing there's a regular stream of incidents and also security incidents coming, dripping in. And the reason why I want to see more of those stories is not because I'm like, "Haha, you use AI and now this is happening", but because it will help us learn overall how far we can push this.

And I'm always happy to see when there are companies who are in domains where they can take more risk trying to push this more so we can learn from it, because there are a lot of domains where you cannot take as much risk, just so that we keep the period of time in which we step into it and create a lot of cost and pain and worse products maybe, so that we keep that period of time as short as possible and find out what is our right amount of speed and what is our right amount of harnessing and human supervision so that we can keep quality up and build good products for users.

Olimpiu Pop: And to add to the points that you had, what I think will happen is that we'll have more of these points becoming more elaborate. As you mentioned, we are starting to have names and contexts and concepts for that. The other thing that I expect to happen is coding tools gaining more abilities, like APIs designed specifically for agents to call, and more empowered documentation that allows agents to call them and also creates more of an ecosystem. So these are the things that I would expect to happen. You already start seeing those. Other than that, I don't know, it's game on as things will happen, but my expectation is that things will decrease in speed because currently things are changing very fast, but now they kind of-

Birgitta Böckeler: Yes, I don't know. I think it will still be fast. It's been fast for a while, but it seems like there's still so much to do and to do wrong and right and ...

Olimpiu Pop: Excellent. We have a date for next spring to see where we are. Okay, thank you a lot, Birgitta for your time and for all the insights.

Birgitta Böckeler: Thanks for having me, Olimpiu.


More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
