Transcript
Paulo Arruda: My name is Paulo Arruda. I'm a staff engineer at Shopify. I'm currently working on the revenue data team; I started that about a couple of months ago, but before that I was working on developer productivity in the augmented engineering team. I want to talk about the journey I've been through, investigating and trying to build agents that can talk to each other, basically. This is going to be more like a story. There will be some technical tidbits in it, but most of what I want to pass on to you is the insights you can't read on the internet, the experience that I had personally.
AI Adoption at Shopify
First, we start with the AI adoption at Shopify. Prior to 2024, once GPT-3.5 came out, we made a contract with OpenAI, so we had access to their models and some internal chat tools. Fast forward to 2024, and we had it available to everyone. We had contracts with all the major providers. We had LibreChat, which is an open-source chat interface where you can set up agents, talk to them, set the system prompt, and all the stuff you're used to. We had VSCode with Copilot. We had Cursor, which was probably a year old at that point. Still, there was a significant portion of engineers not using AI day-to-day, and that's a lot of people: folks are busy and can't find time to try it, or they tried it once with GPT-3.5 and had a bad experience, so skepticism and all those things came into play.
Background - Shopify's Hacker Culture
A little background here. What I really like about Shopify is that we truly have a hacker culture. That comes from our CEO Tobi. Curiosity is very encouraged and rewarded, and they give us all the tools to play. The way Tobi describes it is that he likes Shopify to be a crafter's paradise. That's a really nice thing about it.

My journey starts with test generation. I had heard a few talks about testing and AI. As we saw AI adoption increase in the company, we knew that folks would start using it to generate code, and those problems would happen. You get PRs with 1,500 lines, and at the beginning, you have folks who make a real effort to review them. Over time, if AI works well enough times, things start slipping through the cracks. Knowing this, we thought: how do we solve this problem, assuming that developers will eventually be pushing AI slop? We thought about testing, about test generation. If we could improve our test suite, we would be able to catch those problems as they arrive in PRs. That was something we started thinking about around October 2024. Then I tried an experiment. We have this thing called hackdays: three days where we can pick whatever we want to work on, build teams, and basically hack on stuff. My project was to generate tests well in bigger existing codebases. Shopify is a giant monolith built in Rails. I figured that to generate tests for things that don't have tests yet, you have to understand what's in the code to write the right tests for the behavior. That's what I was thinking back then. I did a quick experiment, just mapping relationships between files. That was one hypothesis out of the million things I could have tried; I chose to go in that direction. Cursor, for example, was indexing the codebase and doing search on those indexes. It was good, but it would still miss a lot of the semantic relationships between things. I don't know if you're familiar with Rails, but there's a lot of implied behavior in Ruby as well; there are things that are not so explicit. What I did was build a dependency graph between files. It was a very expensive experiment. I generated GPT summaries of every single source file and put those as nodes in a graph. The edges were the connections between them: the relationships between constants declared and constants used, methods defined and methods used. Then I started generating summaries of those relationships, because I wanted to find the hidden relationships that vector search over pure code wasn't finding. That was the experiment. It went fine, but it was really costly, and you can't keep it updated. We have hundreds and hundreds of PRs a day; keeping that index hot would be pretty difficult, which made it useless.
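To make the shape of that experiment concrete, here is a minimal sketch of the graph-building step, assuming a Rails-style app/ directory; the regexes and variable names are illustrative only, and the real experiment also attached GPT-generated summaries to every node and edge, which is exactly what made it so expensive.

```ruby
# Hypothetical sketch of the file-level dependency graph: constants defined in
# one file and referenced in another become edges. The real experiment also
# generated LLM summaries for every node and edge, which is omitted here.
files = Dir.glob("app/**/*.rb")

defined_in = Hash.new { |h, k| h[k] = [] }  # "Shop::Billing" => files defining it
references = {}                             # file => constants it mentions

files.each do |path|
  source = File.read(path)
  source.scan(/^\s*(?:class|module)\s+([A-Z][\w:]*)/) { defined_in[$1] << path }
  references[path] = source.scan(/\b[A-Z]\w*(?:::[A-Z]\w*)*/).uniq
end

# Edge: file A depends on file B when A references a constant that B defines.
edges = references.flat_map do |from, constants|
  constants.flat_map { |const| (defined_in[const] - [from]).map { |to| [from, to] } }
end.uniq

puts "#{files.size} nodes, #{edges.size} edges"
```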
The Spark (Claude Code Research Preview)
Then a little more background. At the end of February, Claude Code was released as a research preview. That changed everything, because those guys came out and said, no, we actually tried indexing the codebase, but we found that agentic search performs much better. I could see that with my own eyes; I was using Claude Code. How many of you use Claude Code? You know how it does Grep, Read, those kinds of things, and it finds the code pretty well. It's just as good as the indexing that Cursor would do, but without all the overhead, so you can just operate in any codebase. That was awesome. Then, at the beginning of April, Tobi sent an email to the whole company, and it ended up leaking on social media, basically saying, "Everybody, we've got to get on this AI train." He said things like, before we hire somebody, we have to be able to tell that AI can't do that job. That didn't mean we weren't hiring people; that's not what he meant, but a lot of the articles that came out about it distorted it a little bit. My point is that that email actually fueled the tinkering with AI across Shopify. We have about 6,000 employees at Shopify, and folks were really diving into it. It's really cool to see, because today even non-R&D folks, non-engineers, everybody vibe codes prototypes at Shopify. It's really awesome to see what that single email did.
May 2025: Hackday During Summit
In May this year, I was at another hackday during the company summit that we do every year. I had a pet project I wanted to try for code understanding. Agentic search works pretty well, but what would happen if I gave Claude Code tools to freely navigate the AST? I was wondering how that would perform. Could I get even better results than pure agentic search? I set out to do that. What I wanted to build was a Ruby gem that was an MCP server. I would connect it to Claude Code, and it would provide tools that used Prism, the Ruby parser. It would put an adapter layer on top to make them behave like Read, Grep, Glob, the tools Claude Code is used to, and make those tool calls navigate the AST instead of the files. I wanted to vibe code it, because I only had basically three days to do it, and you're right there with all the Shopify folks; you want to hang out and network a little bit, not be coding all the time. I wanted to do it quickly, so I vibe coded it as a prototype. First attempt: Claude didn't know much about Prism, the Ruby parser. The documentation lacked examples. It did a really bad job; it couldn't really do it. Then I thought, I'm going to go old school. I'm going to read the Prism codebase myself, since the documentation wasn't very good, learn how to use Prism, and then just tell Claude Code to build my library based on what I read. That felt very last year. So I cloned the Prism repository and started a Claude Code inside of it. You know how Claude Code does: you ask a question like, how do I do this with Prism? It scans the codebase, learns how to do it, and gives you an answer. I was like, awesome. I asked the Prism Claude Code how to do something, and then I just copied and pasted the answer into the Claude Code that was building my library. Amazing. I made some progress, but it was very laborious.
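As a rough illustration of what those AST-navigation tools looked like conceptually (this is not the actual hackday gem, and the file path is made up), here is a Grep-like lookup answered from Prism's parse tree instead of raw text:

```ruby
require "prism"

# Sketch of a Grep-style tool call backed by the AST: find method definitions
# whose names match a pattern and report file:line, the kind of result the
# MCP adapter layer would hand back to Claude Code.
def grep_method_definitions(path, pattern)
  root = Prism.parse(File.read(path)).value   # the ProgramNode at the top of the tree
  matches = []
  stack = [root]
  until stack.empty?
    node = stack.pop
    if node.is_a?(Prism::DefNode) && node.name.to_s.match?(pattern)
      matches << "#{path}:#{node.location.start_line}: def #{node.name}"
    end
    stack.concat(node.child_nodes.compact)    # child_nodes may contain nils
  end
  matches
end

puts grep_method_definitions("app/models/shop.rb", /\Afind_/)
```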
Then the big idea came: what if I could automate this? What I thought was, I would run one Claude Code in non-interactive mode, expose it through MCP, and connect it to my main instance in the repository where I was building the library, and that would just work. It did work. I started from scratch, and it one-shotted what I wanted to do. What I wanted wasn't very complicated, it was simple, but it couldn't do it by itself before. This time it did. The experiment itself failed: it was slow, it was no better than file search, the hypothesis failed, and the idea died. But wait a second. Claude Code didn't know how to do it. I didn't know how to do it. But two Claude Codes knew how to do it, and that was awesome, and that blew my mind. I was like, there's something here. So I decided to wrap that in a Ruby gem and make it available for other folks at hackdays to quickly try and see what they could do with it. Because now, suddenly, there was this really cool pattern: if I'm working on a codebase, I can have Claude Code instances in the libraries that I cloned, each one knows how to use its library, and that makes the main Claude Code much smarter. I wrapped it in a Ruby gem. You could set up those Claude Code instances using YAML. They would operate in a tree-like structure, like you see in the picture there, and you could build as many levels as you wanted. At that point, it was, let's see what this does; I had no idea whether it would work or not. You could put the instances in multiple directories or in the same directory. We have a massive codebase, and one Claude Code instance can never understand the whole thing; if you build little specialists for each part of the codebase, it works better that way. I added a feature, Vibe: true!, which would run Claude Code with the dangerously-skip-permissions flag. Have you tried that feature? It's awesome. Just let it go wild. That's all I do now.
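To give a feel for the specialist pattern itself (the real gem wired the instances together over MCP; this simplified sketch just shells out to the claude CLI's print mode, and the paths and question are made up):

```ruby
require "open3"

# Minimal sketch of the "specialist" idea, not the Swarm gem itself: run a
# non-interactive Claude Code session inside a cloned library's checkout and
# return its answer, so the main session working on your own codebase can ask
# it questions about that library.
def ask_specialist(repo_dir, question)
  stdout, status = Open3.capture2(
    "claude", "-p", question,            # -p: print the answer and exit
    chdir: File.expand_path(repo_dir)    # run inside the library's repository
  )
  raise "claude exited with status #{status.exitstatus}" unless status.success?
  stdout
end

puts ask_specialist("~/src/prism", "How do I find every method definition with Prism?")
```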
Adoption
Then folks on the engineering side found out, and they started to adopt it. I still use it today for my everyday development. Working on augmented engineering at that point, we had a project to do test generation at scale, and we were using it for that. We would use one Claude Code instance to generate the tests, then we had Gemini 2.5 Pro criticize those tests, and o3-pro criticize them too. That way, we got better results. Folks were also using it to answer questions about large codebases. My solution was actually pretty hacky. It was hard to figure out the logs; I had to go steal the logs from the directory that Claude Code creates inside the home folder. It just wasn't good. Around early June, I reached out to the Claude Code folks at Anthropic and said, "Do you want to build this into your tool? Because we would like to use it, and I think you can do a much better job." I showed it to them. By early July, we had some big initiatives using it, and a bunch of other teams doing other important things with it. I had to add multi-provider support for the test generation, like I said; we were using Gemini and o3 to verify the work that Claude was doing. Then, on July 24, Anthropic released sub-agents. To me, that was validation that the model works. Then came August, and the other side of the company found out, the non-R&D folks. Lots of use cases started to emerge. One pattern I observed: once those use cases showed up and people figured out that I had built a tool, they were DMing me on Slack and asking for help, and I was helping a bunch of different teams do those multi-agent things with the tool, which was called Claude Swarm. I noticed this pattern: what they were moving away from was one LLM on LibreChat with massive prompts. You have too many unrelated tokens, too many instructions, the LLM gets lost, and the results were poor. So I started helping them build those use cases with multiple Claude Code instances.
Success Stories
The first big success I found in operations was with theme reviews. When you submit a theme to the Shopify theme store, we have a team that reviews it. To review it, they have a massive compliance checklist that they have to verify against. First it was just humans doing it, and then with AI, they started with one massive prompt with all the rules, trying to get it to figure everything out. It would take them halfway there, but it still took 22 hours to complete the process. Once we broke each of those review criteria down into separate agents using Claude Swarm, we were able to reduce that time to between 7 and 20 minutes. That was a big time saving. Among many others, we have another example: candidate role assessment for internal moves. That was consuming a lot of time from the folks responsible for the task, and helping them split it into multiple agents reduced the process to under an hour. Another big time saving. Then higher-up folks started using it to answer questions that require deep research on internal systems. For example, they built a swarm with 15 Claude Code instances that would do research across our internal documentation systems to figure out, what did we ship in Q2? Each instance was specific to a business function. Again, they had been using massive prompts to try to figure that out; splitting it into separate, narrow-focused agents really helped them get the answers they wanted. Then we had more success stories. We had folks using it for product design, tracking language translations, building work breakdown structures, and doing multi-dimensional research, researching many different systems to figure out the answer to something. They started using it for evaluating vendors. This was an interesting case, because vendors make claims and have to send you documentation to back those claims, but sometimes that's a lot of PDFs and a lot of slides. You use a bunch of agents to go through those documents and actually validate the answers they gave to your questions. Obviously, none of what I'm saying is fully automated. The humans still have to verify everything.
It was a big unlock, but it still had a lot of limitations. It was too hard for non-developers: set up a YAML file, go into a Claude Code instance, type commands. It was just not good. Sometimes it was even hard for developers; they didn't really want to be doing YAML programming, and some of them weren't happy about that. The multi-provider implementation was a huge hack. Basically, I used another library and injected it inside the framework to be able to access all the providers outside of Claude Code, and then the logging was different. It was just a nightmare. The other problem was that I started noticing that folks don't just want agents delegating to agents; they want workflows as well. A lot of times you have deterministic steps that you want to run, or you want to run things in stages: you have a plan stage and an execution stage. You don't necessarily want to write that workflow in a prompt and tell the AI to follow it, because, as you know, at scale it will fail to follow instructions.
Job Change - Patterns Observed
Then came September, and my job completely changed. This had spread across the company, and everybody was coming to me asking, can you help my team? I was being shipped all over the place, traveling, spending all my days in meetings, trying to understand what folks were doing. The pattern was basically an AI SWAT team: "That's the AI guy. Just go talk to Paulo." That's an antipattern; it can't scale. Every team has its linchpins, the AI enthusiasts. You have to empower them, because they know their space. I don't know how to build an agent for finance, but they do. We have to empower those folks. Two things I observed. First, lots of duplication. Developer to developer here: we like to build our own thing. All over the company, I saw folks building the same systems. People building the same sort of pipelines to run AI in CI, everybody building their own workflow system, everybody building their own multi-agent system. Second, very fragmented AI adoption. The teams that weren't building their own systems were picking frameworks: this team was using LangChain, that other team was using something else, you name it. There are tons of frameworks out there. One insight from this was to encourage the experimentation, but use those learnings to build the right thing. To solve the duplication problem, I said, I'm going to get those folks on the same call and see if we can start collaborating and tell each other what we're working on. I reached out to the folks doing sponsored work that involves building AI tooling, and got them all on a call every third Thursday of the month, which is still going. We talk about what we're building and see, ok, you're building a pipeline, I don't need to build one, I'm just going to use that, and reduce duplication that way.
Vision for the Future
A quick pause here before I address the other problem, which is the fragmentation. If things keep going the way they are going, you can imagine that at some point the teams inside a company will build their own agents, and those agents will sit in front of their internal products, their workflows, and their processes. What follows is that at some point we're going to say, what if we could compose those agents to answer even higher-order, more complex questions? Then the other realization came, which is what I'm calling the agent microservices architecture. If we have very fragmented AI adoption, everybody building their own thing with their own frameworks, and in the future we want to compose them, we're going to end up exactly there: we'll have to get those agents to talk to each other over the network with something like A2A or MCP, and we'll bring back the same problems that exist in microservices, dealing with network retries, observability challenges, and tracing. So, stop right there. Let's backtrack and try to solve this problem before it gets to that point.

What I did is, I knew I was going to build my own thing, and I continued with this idea of building a unified orchestration strategy. The idea was: if everybody in the company uses the same orchestration system, if everybody is defining agents the same way, then we don't actually need MCP or anything like it for them to talk to each other. I can just import the definitions and run everything in a single process, and that works really nicely with Ruby. If you're familiar with Ruby, Ruby does really well with fibers. Most of the LLM work is I/O bound: it's a network request to the LLM provider, you get the answer, then you run a tool, which is generally an MCP tool, then you do another network request. You can run a lot of those in the same process just using fibers and fiber scheduling.

So I built SwarmSDK. It's not related to OpenAI's Swarm; I call it SwarmSDK because we had already called the first tool Swarm. SwarmSDK makes it easy for everybody in the company. Basically, it's a Ruby gem, multi-provider. You can build the agent definitions with YAML, or with a Ruby DSL to give more power to developers. You can build workflows with it, in addition to the tree-formation, org-chart-like agent composition. You have event hooks just like Claude Code: for every single event that happens, you can hook in code or call scripts. It has observability and cost tracking built in. Everything runs in the same process, and everything that happens emits an event, so it's really easy to track what's going on. It has a plugin system, so you can extend it. Using the plugin system, I built a memory plugin, so it comes with memory as well. I'll talk a little bit about memory later in this presentation.
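As a sketch of the fiber scheduling point above (using the async gem rather than SwarmSDK's own API, with a placeholder HTTP call standing in for an LLM request), many I/O-bound agent turns can interleave in one Ruby process:

```ruby
require "async"
require "net/http"

# Each "agent turn" here is just an HTTP request standing in for an LLM call.
# Because the work is I/O bound, the fiber scheduler lets many of them run
# concurrently in a single process, which is the property SwarmSDK relies on.
prompts = ["summarize shop A", "summarize shop B", "summarize shop C"]

Async do |task|
  responses = prompts.map do |prompt|
    task.async do
      # Placeholder endpoint; a real agent would call its LLM provider here.
      Net::HTTP.get(URI("https://httpbin.org/get?prompt=" + URI.encode_www_form_component(prompt)))
    end
  end.map(&:wait)

  puts "Completed #{responses.size} agent turns in one process"
end
```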
Lessons Learned
The lessons learned. First, the best solution started with my own pain; I was trying to solve my own problem. Second, treat agents like lean, narrow-focused tools: experts, not generalists. I saw a lot of folks building personas and bios and that kind of stuff. To me, that's just a waste of tokens. Unless you want to use that to control the answer the AI is giving you, for those little tasks in multi-agent orchestration you don't really need any of that. The fewer tokens you use, the better. Third, don't build an AI SWAT team; build the tools to empower everyone.
Looking Forward
I gave you three things to remember, but I want to talk about something else. There's one more thing. I heard that 2025 was the year of agents, and it was. I saw everybody building agents everywhere: agent, agent, agent. People are debating, what is an agent? I think by the end of 2025, a lot of folks will already have those orchestration systems working in their companies. They will have a bunch of agents, but then, what do we do with them? That's still going to be a little bit shady. I think 2026 will be the year we actually make them useful. Some of them are useful now, but I mean useful at scale, through context engineering. The question in my mind is: how do we expose data to agents in a way that maximizes precision and recall? To do that, I have to address the big elephant in the room, which is MCP. The way MCP is used today, it just adds a bunch of tools. You add a bunch of MCP servers to your client, it loads a bunch of tools, and those tools come with their own descriptions and parameters. A lot of the time, those descriptions and parameters are not very relevant to the task you're doing. Say you have a Gmail MCP server: if all you're doing is trying to read an email, you don't need the tokens in your context about how to send an email. That's just waste. The ideal agent is one where every single token in its context window is relevant and steering the result in the direction you want it to go. There's this context bloat caused by MCP, and we're seeing a bunch of workarounds. We're seeing Anthropic working on two things, skills, which you've probably heard of, and tool-search tools. I feel like that is just moving the problem to another layer.
I don't know the answer, but I have a hypothesis I want to share with you. That's what I want to leave you with, for you to take back to your companies and think about. I'm not sure if this will work, but it could be something. Those frontier models are heavily trained on coding tasks, because, frankly, if you solve coding, you solve most problems, so there's a huge interest in training the models for that. When you think about models being good at coding, there are two things. One, they have to know how to write code; for that, they need to have seen a lot of examples, so training. Two, they need to be very good at discovering code, the agentic search we've been talking about. My hypothesis is that you can create an adapter layer; I'm calling it llm-fuse. Is anybody familiar with FUSE? Basically, FUSE allows you to expose things as a filesystem. You don't really need to do that, though; you don't need to actually mount things as a filesystem and get the LLMs to read it. If you control the tools, Read, Grep, Glob, Search (where Search is a vector search plus a keyword search with some nice ranking), Write, Edit, Delete, and Defrag, which I'll get to soon, you can create an adapter layer between those tools and the storage where the information lives. With that, you may be able to translate those calls to whatever data source you have. I have a prototype of that working internally. I have data inside a Postgres database, and I created an adapter layer; the agents think they are reading files, but they're actually accessing the database. It's like the movie "The Matrix", that scene where Neo picks up the phone and says, "Hey, I need to learn how to fly this helicopter." We can inject knowledge into those agents through this system. All you have to do is build the adapter layers for each data source system you have.

The last tool there was Defrag, because those agents can also write those memories. They can learn from interacting with you, and you can send them on a task: give them a web search tool and say, "Go learn everything about X." It goes away and learns about it. It can also use internal systems and store everything in memory. Over time, the memory gets fragmented. Then I thought, remember defrag? I created a Defrag tool that goes through all the memory, optimizes it, merges memories that are similar, and keeps it tidy to optimize for reads. Each memory has metadata: a title, tags, a source (whether it came from a user or not), hit counts, and other related memories. When you prompt an agent, before sending the request to the LLM, the system finds memories that could be related, based on keyword search and a combination of techniques. Then, before sending the prompt to the LLM, it adds something like this to the bottom of the prompt: "System reminder. These memories may or may not be related to this query; here's the file path and here's the title of each memory." The LLM decides whether it wants to read that memory or not. Every time the LLM reads a memory, it also gets a reminder saying, you read this memory, but these others are marked as related to it; do you want to read those or not? The LLM decides whether to read them or not. I've been getting very good early results with this technique, but again, it's experimental. I want to leave you with this idea.
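Here is a minimal sketch of that adapter idea, assuming the pg gem and a made-up memories table with path, title, and body columns; it is not the internal prototype, just the shape of it. The agent's Read and Glob tool calls resolve to database rows rather than files:

```ruby
require "pg"

# Minimal sketch of the llm-fuse idea: the agent's filesystem-style tools are
# answered from Postgres. Table and column names (memories: path, title, body)
# are assumptions made for this illustration.
class MemoryFS
  def initialize(conn)
    @conn = conn
  end

  # "Glob" tool: list pseudo-paths matching a shell-style pattern.
  def glob(pattern)
    like = pattern.tr("*", "%")
    @conn.exec_params("SELECT path FROM memories WHERE path LIKE $1 ORDER BY path", [like])
         .map { |row| row["path"] }
  end

  # "Read" tool: the agent thinks it is reading a file; we fetch the row behind it.
  def read(path)
    row = @conn.exec_params("SELECT title, body FROM memories WHERE path = $1", [path]).first
    row ? "# #{row['title']}\n\n#{row['body']}" : "(no memory at #{path})"
  end
end

fs = MemoryFS.new(PG.connect(dbname: "agent_memory"))
puts fs.glob("knowledge/q2/*")
puts fs.read("knowledge/q2/shipped-features.md")
```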