InfoQ Homepage Presentations Powering the Future: Building Your GenAI Infrastructure Stack

Powering the Future: Building Your GenAI Infrastructure Stack

View Presentation

Speed:

50:39

Summary

Merrin Kurian shares the architectural blueprints and organizational processes behind Intuit’s AI transformation. She explains the "fixed, flexible, free" framework used to scale GenOS across 8,000 developers, enabling 3,500+ production experiments. She discusses critical agent failure modes, the "LLM-as-a-judge" evaluation strategy, and how to build "tool-ready" APIs for the future.

Bio

Merrin Kurian is a Distinguished Engineer at Intuit, leading AI Foundation capabilities powering Classic AI, Generative AI and Agentic AI experiences in Intuit's portfolio of products. Her current role plays neatly into her two areas of interest: AI and Platform engineering. She has previously led Platform Engineering for QuickBooks.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Merrin Kurian: What we are looking at is the AI agents Intuit has put in the product experiences. We are going to look at key aspects of AI agent development. Then we will have a deep dive into GenOS, short for Generative AI Operating System, which is our platform that helps scale and accelerate all these AI-powered experiences across our products. I'll also cover the other two pillars, the people and processes that helped make this platform successful. I'll also cover a few aspects of even if you do nothing about AI agents today, what should you be thinking about for the future where AI agents are going to be mainstream.

Let me ask you, how many of you are already putting agents in production? How many of you want to do that? I hope I'm not preaching to the choir. Hopefully you will learn a few things and we can have a conversation at the end. Before I go into the AI agents at Intuit, I want to introduce Intuit. Intuit's mission is to power prosperity around the world. Our strategy is to be an AI-driven expert platform, for our 100 million consumers, small business, and mid-market customers. We help them make more money, save time eliminating their work, and help make financial decisions with complete confidence. Here are a few numbers to help illustrate the scale of our platform. I'm just highlighting the ML side of things. We have a platform generating 60 billion machine learning predictions per day. We have about 625,000 attributes per small business, and close to 70,000 attributes per consumer, which we have the permission to leverage in order to personalize the product experiences for our customers.

On the business side, our platform processes close to $2 trillion worth of invoices. 18 million U.S. workers get paid through our payroll platform. About $100 billion of tax refunds are processed through our platform. One of the big bets at Intuit, this is called done-for-you experiences. That is where all our AI and agents are powering product experiences. Again, a few metrics to highlight how far we have come. We have 80% repeat engagement across our QuickBooks family of AI agents. Going back to the saving time metric, we have our accounting agent saving 12 hours per month for customers. The tax products are saving on a yearly basis 1.7 million hours, saving data entry time. One of our agents, payment, it is efficient. It's helping our small businesses get invoice paid faster, five days on an average.

Finally, our self-help has improved, answering 110 million questions per year, which is an order of magnitude bigger than previous years. This is a view of our Intuit Enterprise suite of products. On the top, what you're looking at are the agents provided for the QuickBooks Enterprise customers. This is the payments agent I talked about. It scans the emails of our business with its customer. It understands the conversation. It even looks at the attachment to help create an invoice. All the business owner needs to do is review and send. That's how it helps the businesses get paid faster.

Another agent, this is a finance agent. I can ask the agent to look at my data, compare it with my peers in the industry. In this case, I'm in the retail business. What is my net profit margin and how does it compare with my peers? It can tell you that your margins are lower, as illustrated in this case, looking at your data, and the main two reasons, material costs and labor costs. You, being a diligent business owner, found some supplier list. Then you ask the agent, given this list of agents, what should I do next? It can recommend the next action based on the supplier and the pricing and the factors it has identified in making margins. In this specific case, again, it's saying, consider these suppliers to reduce your material cost. We also believe that AI has to be augmented with human intelligence. An expert is always at hand for you to further consult.

Agent Development: Pilot to Production

Now let's go to agent development fundamentals. Workflows. Workflows are predefined set of code paths. You have a predefined sequence in which steps get executed. These are good for offering predictability for complex tasks and consistency for well-defined tasks. Predictability, consistency is key. Agents are where automated decisions are made. Model-driven decision-making is playing at scale. It's very flexible. It is useful when you do not have a predefined set of steps. You do not even know the number of steps. Agents operate with the context, the information it has at the moment, and the tools at its disposal make decisions. It is extremely useful for open-ended problems.

At Intuit, we have seen that agents work best with unstructured data. This is text, documents, images, audio, speech, and so on. Again, the tradeoff is that for accurate performance of agents, you might want to ask the agent to consider multiple options, ask it to think better, harder. All of that is going to consume more tokens, cost you more, and also it will be latent. If you want super snappy response from an agent and for it to think and reason, that's going to be the tradeoff you need to make. There are problems, like I said, that you cannot hand code, so that's when agents are super helpful. Again, I didn't come up with these definitions. It is from Anthropic, a very good resource for you to go and learn more about agents and workflows.

Again, another definition, not from me, Google. These are the four key components of an agent. Agents obviously use large language models to reason over goals, determine the execution plan, and finally generate the response. They have access to a host of tools. These tools could be APIs, data, services. They can perform actions or fetch data. Then there's the component of orchestration, which is the brain. It manages state. It has memory. It does the whole orchestration and do the stitching work between the models and the tools in order to accomplish the task. Finally, these are hosted on a runtime, which gets invoked when, in this example, a user makes a request or through other triggers.

At Intuit, we have always worked with agents. Our first generation agents is what I call conversational assistants. These were chat assistants. You could ask questions. They'll do some data fetching and get the answers for you. Our current generation of agents, we are calling them done-for-you experiences. This is where the agents can actually take actions on your behalf. What changed? Before I go into what changed, early days, I'm talking about 2023, so about two-and-a-half years ago is when we started on this journey. It was super hard to get anything done. GPU shortage, LLM capacity constraints, and we wanted to be super ambitious, have this agent that will answer all kinds of questions from our customers. The vision was bold. We had huge ambitions, but the technology landscape wasn't just quite there. We built a whole framework around supporting multi-turn chat with a centralized planner.

The model context window was 4,000 tokens at most, so we exclusively depended on RAG to make any progress. We built our own frameworks, like I said. It has to support multi-turn conversations with a subagent asking follow-up questions. All of these had to be orchestrated with memory, and the troubleshooting that is needed to make all of this happen and test continuously and evaluate. It was a struggle, but good news, over the last two years, things have improved drastically. LLMs are a lot more capable, so fundamental shift happened when LLMs started supporting function calling. Now they can do query decomposition, help planning, and you can use the LLM right away to split your incoming request into as many subagents as possible without doing a lot of work.

People have told me again and again that a lot more code can be deleted now because these functions are folding into the LLM APIs themselves. That's one example. They can do structured output, which is how now you are able to chain multiple LLM actions together. They became multimodal, so now it's not just text. You can have reasoning over image, documents, audio, video content. There was one model size that fit all back in the day. Now there's a model family. You have a choice of picking the superior reasoning model for your planning and then a workhorse model to do all the tool selection, and finally a lightweight, fast, cheap model to do all the natural language tasks for summarization, for example. Frameworks have definitely evolved.

Back in the day, I keep telling this, we had LangChain version 0030-something. I had never seen something put in production with a version number 0.0.something, but that's all we had. Things have improved a lot. There's a lot of choices. You have a variety of frameworks all the way from if you want to go completely declarative to if you want complete control over the graph execution, all kinds of frameworks. These help in state management, checkpointing, multi-agent systems, all the challenges I described earlier. These are easily solvable with today's frameworks. Agent communication protocols, they're getting standardized, or at least there's some consensus about agent-to-tool, agent-to-agent. These were, again, extremely hard problems to solve. We had to get everybody together, define standards for every single interaction within our company.

Again, the biggest challenge was, yes, you built all this, how do you continuously troubleshoot? How do you continuously evaluate? How are you collecting all the traces, and be compliant with all the security standards? That was a super hard problem. Again, tooling has improved a lot. If you are starting on this journey today, good news, you will not have to struggle as much as we struggled in the past.

Our strategy has always been to adopt standards as they mature. We do not believe that we will be able to solve all the problems and stick to that forever. For example, the first standard that we adopted was the OpenAI chat completion API. We saw that every agent framework was supporting that standard. If we make our APIs compatible to that standard, then any agent framework was naturally available for our agent developers to experiment with. Until standards evolved, that took a journey of two years and three versions of the API that we supported. Until these kinds of standards evolve, we will continue to experiment. Sometimes I think if you wait six months, maybe things will get sorted out. You don't have to try this hard.

Then we want to stay on the cutting edge and continue to learn as things evolve. We always provide experimental solutions for us and our internal customers to learn. This is something we follow as a practice. There's a fixed, flexible, free framework we adopt for any technology choice. There are concerns that are standardized and fixed for any Intuit engineer. These are platform concerns that people should not have to repeat over and over again for developing their agents. The flexible options we provide, they are flexible for a reason. They are not unlimited. They are flexible, so we have an opinion on which are the options. You will have options. These will be compatible with the fixed set of technologies that we choose.

A word about, now I'm calling them traditional applications, our applications which do not use LLMs versus applications which use LLMs. One of my teammates was able to generate this cartoon using AI. I'm so grateful for him. I could not get this creative. Our traditional applications have deterministic outcomes. Why? Because somebody had handwritten all the code. It is easy to define what is success and what is failure. You can have well-defined tests and well-defined testing criteria and acceptance criteria for traditional applications. They are easy to debug, relatively speaking. Of course, there are race conditions and all the other issues. When compared to a simple LLM-based application, they feel like cakewalk.

On the other side is now our LLM-based application. You write very little code. A lot is happening inside the LLM, which you don't know about, you didn't write about, you didn't influence. Pass-fail criteria become hard and very subjective. The fact that it takes natural language input and natural language output, in other words, English in, English out, makes it extremely subjective. Yes, I asked the LLM or the agent a few things. It worked. Doesn't mean that it's going to work everywhere. The testing criteria is now becoming extremely ambiguous. You need the domain expert to have all the coverage across all the scenarios that the agent should address. It is continuously evolving, because when you start working with the LLM, you do not know what the LLM is going to do. They are general purpose models. They are now, therefore, extremely hard to debug, simply because we do not know where the lines of code are, where the breakpoint is going to hit.

This is now well-researched and documented. In a multi-agent system, there are so many failure modes. Typically, you wouldn't do failure mode analysis for a small application, but that's what it has come down to. If you build an agent, you have to know that the agents may obey or disobey the tasks. They may forget what their role is. They may continue to repeat the steps, even after it has processed the task. They may forget what happened in the past. They may continue the conversation, and they may not be able to stop, again, not recognizing. They may fail to ask clarifying questions. They may drift from the original objective. They may withhold info from a subagent, for example, or they may ignore input from a supervisor. All kinds of things can happen. They may stop early. They may not stop. They may validate. They may validate incorrectly. Again, well-defined, well-documented. It's up to us to now understand these failure modes and address them.

What could go wrong? I wanted to book a trip to San Francisco. The agent booked me a trip to San Diego. If you are only testing the final output, all you are going to know is it failed. You do not know what you can do as next step. In this case, the first step was to find out which tool had to be called. One failure mode or failure point, it called the wrong tool. Next one, it had to call a search API. It called with the wrong arguments or the wrong values. It had to use RAG, but either maybe it did not use RAG correctly or it provided the wrong context.

Finally, when it responded, maybe the tone was not appropriate. Finally, the overall correctness itself failed because the trip is now booked to San Diego. Unhappy customer, lots of tokens spent, wasted resources. This is what agent is going to look like, if you do not know what you're going to do. What should we do? All is not lost. It's just that now software engineers are trying to be AI developers. This is something all the AI science folks have always done, evaluations. You need to have systematic structured evaluations. There's an easy technique. Let's use an LLM as a judge. Now, it all depends on the judge's quality. Does the judge know as a human to make the right judgment? It doesn't stop.

The number of things you need to do after writing the agent code is what actually makes your agent successful or a failure. Like I said, evaluating the final response is not sufficient. Now you have a trajectory. You need to capture all the traces in which the LLMs made all the decisions. You need to evaluate each decision point with the right ground truth, with the right judge, if you want to scale this beyond human. That becomes key. Like I mentioned, objectivity of this ground truth is going to be a huge challenge if you are not the domain expert. Again, none of these are new, but this gets amplified because the developer is not in control of a lot. Evaluations are not fixed in time.

Maybe the first few times you ran the agent, this behavior was not observed, but then you deployed to production, there are new emergent behaviors that you didn't anticipate. Or the customer needs have changed. Now you have to keep updating your eval datasets to match the expectation. You need to have this regression test suite, that has to be continuously updated. It's not like, I wrote this unit test, it passed, therefore it's going to pass forever. That's the biggest challenge working with LLMs.

GenOS: Enabling Acceleration and Scale

How are we solving some of these challenges at Intuit? We built a platform called GenOS, again short for Generative AI Operating System. That's how we are helping acceleration and scale in developing AI agents at Intuit. The key components are, we call it an AI Workbench, that's our development environment. We have GenRuntime, GenUX, that's the user experience aspect. Of course, we have large language models. Whatever we cannot solve with technology, we augment with the right processes. We have a responsibility and governance process to oversee the whole thing. It should be clear why we built GenOS, but I'll say it again. We want teams to move fast and build things. They do not want to be bogged down by all this compliance, data handling, security that we wanted to abstract into the platform. We also didn't want them to go off and try to solve it in 100 different ways. That helps in velocity.

Back when we started in 2023, and even today, there is no enterprise-grade end-to-end AI platform that can solve for the kind of businesses Intuit is in. It's always a constant evaluation of, is this thing now available so we can mix and match something or replace some homegrown component from outside? We always evaluate that, but there is no end-to-end system that solves all the problems. We continue to invest in GenOS. We didn't start from scratch. We have been investing in platform data and AI at Intuit for a really long time. Our goal is to unify what already works and enhance existing capabilities, for example, authorization. How do you make it now work for agents? That's an enhancement the identity system does. We wanted to also help our organization keep up with the rapidly evolving technology landscape without everybody at Intuit trying to do it.

As a central platform, we do it, and we provide all these capabilities so they can rely on us to help them be on the latest and the greatest. Having a very streamlined end-to-end application development paved road really helps us to serve a wide variety of customers. We solve for responsible AI development, secure private access to LLMs, out-of-the-box guardrails for security, safety, privacy, compliance, and enable rapid experimentation at scale through both handling data correctly and also end-to-end observability analytics via instrumentation. Again, we are not going to be able to solve everything, so we need to make the platform extensible. We provide plug-in mechanisms for the business or the product teams to plug in their domain capabilities, their knowledge systems, into GenRuntime in a way that the platform continues to be extensible and gets enriched as more and more use cases come and plug in their capabilities. Also, guardrails, evaluation metrics, and all these components can be continued to be extended based on these use case teams.

We have 15-plus models across 70-plus versions that we currently host. We also do fine-tuning for very task-specific use cases. The reason we do fine-tuning is to help manage accuracy, cost, and latency. One example is QuickBooks transaction categorization. We have personalized model for every small business to categorize their transactions. Why? Because the transactions are not going to be categorized for a landscape business the same way it is for a real estate business. Or even if you are in the same landscape business, your accountant may have a different way of doing it than somebody else's accountant. There is a huge amount of variability.

We used to take all the historical transactions and develop personalized models in the past. With the advent of LLMs, we got the opportunity to fine-tune the specific context because LLMs already know accounting, they already know all the businesses, they know all the types of industries. We have now operational simplicity because millions of personalized models can be replaced by a handful of fine-tuned models. Also, previously we had to source millions of data points for training these personalized models on a periodic basis. Now we only need much less, thousands of samples. Here is where it comes to play inside QuickBooks. The transactions can be categorized based on the specific business they are in.

A high-level overview of the GenOS architecture. We will go into specific sections. On the top right, you have the GenUX component which is powering all the user experience elements AI agents need. Then below that, you can see GenRuntime, which has its own set of components. Then we have, on the left side, the developer tooling, AI Workbench and Agent Starter Kit. The color coding is that everything green is something we built on our own, but it is built on top of the underlying gray color. It indicates that we had a lot before, we didn't start everything from scratch. The blue color is where the product development teams are developing their own agents or they are developing capabilities which they can plug in. Let's start with the developer experience in AI Workbench. We have prompt management, evaluation, optimization as one category of experiences. These are self-serve. We provide all the pipelines. Whatever work they do in the developer experience in AI Workbench are preserved in the registry.

For RAG, again, we have a pipeline that works behind the scenes. All you have to do is bring in your content. We support chunking, embedding, and indexing of all the content, and we give you the APIs to retrieve. We have labeling. We have eval frameworks, eval tracking, end-to-end tracing, and guardrail testing. Like I said, GenUX consists of the UX capabilities that's embedded in the product experiences. We have all the widgets. We also have the interaction management, which captures all the interaction data, which can then be used to evaluate the performance, analytics, KPI management, all of that. We have multi-agent systems, single-agent systems. Multi-agent systems, for relative purposes, can say that they are working with A2A, and these agents are sourced from an agent registry. Same way, the tools can be sourced from a tool registry, activated through MCP. This is a GenRuntime. I talked about the registries.

All the work you do in the developer experience is then stored in the registry, activated at runtime. We have use case registry, which is providing the end-to-end instrumentation and automation. That's how we are able to provide the end-to-end observability and analytics. We have prompt registry, as I mentioned earlier, agent registry, and tool registry. Domain capability teams provide their APIs as tools, which are registered with our tool registry. Likewise, their data sources are registered into our datastores. Their knowledge systems are registered into our RAG system. We also help with some amount of context engineering for all these agents. We have LLM service with multiple modalities, multiple interaction patterns, and this is where all our controls and guardrails are baked in, so they are in line with our LLM APIs, so you don't get bypassed.

Finally, we also, of course, provide evaluation, monitoring, tracing, logging support. All our existing policies apply, and we continue to add based on the need. Agent Starter Kit is what we launched the latest. This offers a CI/CD process for developing agents. Yes, we had all those individual capabilities. We realized that it is hard for people to make sense of all these building blocks and then also read from the internet about various agent frameworks and put it all together. We decided to package it all in one with all the default configurations that points to all the existing capabilities. We also give starter code, various patterns, reference implementations. It was really easy for everybody to just get started. In our first, so this is company-wide hackathon, we have a one-week hackathon every six months, more than 900 downloads of the Agent Starter Kit were done, and more than 100 demos were done by end of the week.

That's how fast we help people get up to speed. It also comes with the debugging, tracing, offline evaluation, registration, integrations. That's how everything was baked in. They didn't have to go read documentation, talk to people to figure out what they should do. This is our AI Workbench developer experience. Again, you can see all the tooling you need are right there. In the middle, you have guided steps. On the right side, there's some metadata. The fact that we put all the end-to-end developer tooling in one place with self-serve onboarding and guided workflows is what accelerated the experimentation at scale.

I do want to talk about three main experiences we power. In Intuit, we treat prompts as first-class entities with their own life cycle. In a lot of cases, we have seen that there's a cross-functional team who may or may not be working in Git. There's a marketing person who writes a first version of a prompt, non-tech person. Then there is a scientist who wants to optimize the prompt. Then there is an engineer who integrates it with an application. For those use cases, there is collaboration needed across teams. You need to externalize prompts. We also want to govern and manage these prompts, also provide the right observability. That's the purpose of our prompt management solution. It also helps people engineer, evaluate, and also optimize. Again, prompt portability is still a challenge. Moving from one LLM to another, you still have to rewrite all your prompts. You still have to re-optimize all your previous optimization.

Some of that work happens in our prompt management solution. Of course, it also provides versioning, templating, and other support. The other out-of-the-box capability we offer is RAG, one pipeline for indexing, where you bring in your own content. We offer chunking strategies you can choose from. We offer the embedding models, and we index them in our vector store. During retrieval, again, the embedding models do their embedding, retrieves the right chunk based on various algorithms. Then you can give your custom prompt to finally generate the response you need. I mention this because this is an out-of-the-box self-serve capability. We don't expect every team to do this by hand all the time.

Finally, evaluations. We have pipelines which help people define metrics, collect data, compute metrics, and generate a report so that they know when is a good time to launch. Again, in a developer-friendly or leader-friendly way, not like an AI science-friendly way. That's something we had to do a lot at Intuit because this is the basis for, again, launching to production. We want to do continuous monitoring in production as well for the performance of your GenAI applications or agents.

Beyond Technology: People and Processes

I talked quite a bit about the technology piece. Again, every time I present GenOS, people ask me, how many people did it take? I do not know. Hundreds of people is my easier answer because this was powered by a lot of people and process outside of just technology. We had leadership support right from the CTO. The CTO decided that there is only one GenOS for all of Intuit, which is a novelty except for certain capabilities. Typically, the business units develop their own platform capability. That level of commitment is what helped us achieve the scale that we achieved. They also established very clear decision-making, escalation processes, and forums to discuss and review. These all helped us move very fast, given the landscape was also moving very fast. We also always adopt that fixed, flexible, free framework I talked about. Having opinionated paved roads for fixed is another way we were able to move very fast.

Like I said, multiple vice presidents' organizations came together in a unified mission. That's how we were able to focus and get things done. There is always a core team. There are other teams always joining the mission, accomplishing their individual tasks, leaving the mission. That keeps on changing. The core team makes sure that we are staying as nimble as possible, pivoting as quickly as possible, because the technology keeps changing, our customer needs keep evolving. Like I said, we unified a lot of capabilities, but also our processes are aligned with those capabilities. The processes, as well as the technology, the framework, all work in unison. We speak the same language. It's a partnership across legal, security, privacy, compliance, and engineering, and possibly others. We have been fortunate to get contributions. For example, there are programming languages in which we don't have client libraries. Those have been contributed from the teams.

Everything did not work as you imagined. We started with very stringent rules, very air-gapped solutions. Of course, we got feedback. We are an internal platform, we get instant feedback. We had to open up ways for experimentation outside of what we offer as a platform. That actually worked out well. Because as a platform, we cannot invest in every possible direction. These early adopters who find something fascinating, go explore, and then they gave the learning back. Then we can make an informed decision on which direction to go. We don't just look inwards, we also look outwards. We bring in, from time to time, industry leaders or vendor partners to give us tech talks and workshops. Wherever possible, we will also try to influence their roadmap so they can provide us the solutions we need.

Like I mentioned, there is a company-wide hackathon every six months. We make the best use of it by targeting releases for those and creating workshops exclusively for those. That's when we get the company's attention. We have sample apps, reference implementations, best practices, all shared in workshop-style format before the week of the hackathon. That's how most people want to leverage all the capabilities we build and demo during the demo day at the end of the hackathon. Naturally, we have to constantly communicate what we are doing all the time. We make the best use of all possible forms. We do podcasts. We do bite-sized videos, like TikTok-style videos. We have tried multiple things.

We didn't always get it right. Like I said, we started with very inflexible APIs, which didn't help because people wanted to explore the latest and greatest all the time. We found the standards as they evolved and adopted. We can help our community try everything open source as much as possible. Our air-gapped solution for experimentation didn't work. We had to evolve our thinking, define certain guardrails, enable rapid experimentation based on certain parameters. Our review process was extremely comprehensive and rigid. People went through all of that multiple weeks, finally to see that the customer didn't like the idea. We pulled back. We defined review process to meet the life cycle in which the experiment or the solution is developed.

If it's an early experiment, no need to go through all the review. If you are scaling out, yes, absolutely, all the reviews. Like I said, this is not something Intuit does all the time. There was always a skepticism about a centralized platform. The common notion is that it's going to slow everybody down. Over time, we delivered and earned the credibility by offering all the flexibility people asked for. We were able to convert most of our highly opinionated customers to customers who actually look forward to our guidance now. That's a win, in my opinion, because now they are not going off on a tangent, escalating left and right. They know that we got this. This is what we achieved. We have about 8,000 developers at Intuit, 1,300 of them are building on top of the platform. We have launched so far 3,500 experiments in production. We get 450,000 requests per day. In August alone, 4 trillion-plus tokens were consumed.

How to Prepare for the Future of Agents

I will only say three things. If you are brand new to the area of AI, it's about continuous experimentation. I had asked somebody, what's the hallmark of a true AI company? I was expecting the complexity of the models, the deep learning, the transformer. No, he said, it's how fast can you experiment. It's the number of experiments you're always running in production. How do you enable that? This may be new to people who are new to AI because everything depends on data pipelines, well-connected, continuously moving data back and forth from your product to the evaluation, to all the offline analysis, to the systems that you need to retrain, monitor. Again, it's something everybody who worked in AI knows, but people, software engineers who are new may not appreciate it. That is where everybody gets stuck and struggle. Just want to call out that the more instrumentation you have on your agents and the interaction with the customers and the more evaluations you have, it will be easier to iterate. Trick question, what makes a great AI engineer? Have you been listening? There is no one answer, but that is this slide about.

Previously, we used to say, "It works in my machine". The equivalent I'm seeing is it works for my questions. It works for my data. Lots of experiments were launched. The customer asked a different question and then the whole thing fell apart. I also think that this is where we should partner closely with our product managers, help them, and ask them to step up to define clear acceptance criteria. Don't just write PRDs. Create the evaluation metric. Tell us what you really care about. Then give us the data to do evals on. That is how I think this space is going to evolve. That's where I've seen most success because the engineer may not be able to think about possible scenarios.

Like I said, the domain expert or the product manager proxy can bring in that coverage for what is not covered or thought about by the engineering team. We take evals very seriously, which is why I talked about all the platform capabilities we offer. We also provide training, tutorial, workshop, multiple repeat sessions, so the message sticks. We also escalate to the leadership to get help when we think people are not doing it enough. We get all that support. We also provide, like I said, starter code for various phases of the life cycle of the agent development. The right kind of eval that they must do. It's a journey, but we are helping. Hopefully, we will do more of these as more agents are produced. All of these, again, we provide the infrastructure support as well through continuous evaluation and monitoring pipelines.

Even if you do not do anything about agents today, the foundations won't change. The only thing I would ask is to keep investing in the fundamentals because that's what is going to help the agents of the future be successful. In any enterprise, if you look at the APIs, GraphQL, you will have a complex JSON, multiple level nesting. These were all built for humans to integrate by hand. In the current state, LLMs are not really good at processing this. We need to rethink about how we define our APIs so that they are tool ready for the agents to operate on, not for humans to integrate on.

Similarly for data, we have data lakehouses with a ton of data. How much metadata do we have about this data so that we can make sense of the data? The agents can make sense of the data and actually inject the right context into the agent so that it can operate autonomously. User experience, at Intuit I used to think we are a financial services company, all our data is numbers. What are we going to do with large language models? They are language models.

Then I realized, we ask people to fill a lot of forms, and all this form filling is to collect information. What if they can just talk? What if they can just upload? What if they can just take a screenshot? Multimodal native user experience is going to be extremely useful. All of this needs robust infrastructure. If you're not a social media company, maybe you only have very simple request-response format APIs, which all processes in 200 milliseconds maybe. LLMs changes all of that. Now you have these small models which provides reasonably fast responses. The reasoning models which may take any number of minutes, and there are in between. How does your infrastructure handle? What is your failed customer interaction in that new world? How do you define what an incident is? All of these need to be rethought in terms of infrastructure management. Also, if you have now bidirectional WebSockets for communication, because suddenly you want to enable voice everywhere. What does that look like? Like I said, if agents are going to be mainstream in the future, even if you do nothing to do with agents, these are things you can do to be prepared for that future.

Conclusion

Our journey has been a transformation. I've been at Intuit for 17 years. I've never seen anything like this in my first 15 years. I have led many transformations to AWS, public cloud, event-driven microservices, but nothing that came as fast and as intense like this. It's not just our transformation. It's because Intuit as a company decided to transform that we were able to pull this off. Close collaboration, partnership, providing both a unified platform as well as the processes around it. We build all the time what is not available off the shelf because we have to meet the regulatory standards. We also look outside and see if they meet the regulatory standards. It's always a constant evaluation of what meets our bar.

Identifying fixed, flexible, free, and have opinions, not just opinions, actual solutions that work for fixed. You can then provide options for flexible. This is something we had to do. We had to build in public internally. The sponsorship of leaders didn't mean we could go off for a quarter, work in our silo, then come back and present. No. Every week we present our designs, our decisions for the whole company to review. It's like exposing yourself. That had to be done because that's how fast things are moving. That's how we were able to take feedback quickly and adapt quickly. This is happening everywhere, transformation of software engineers to AI engineers. Not just the platform, we are also helping with all the training workshops that I talked about. Our hope is that through our developer experience, we really lower the barrier to entry for all our folks.

Questions and Answers

Participant 1: Can you comment on the percentage of truly agentic flows versus workflows? In the sense that the truly agentic flows would be more autonomous, like equivalent to Waymo's and self-driving cars.

Merrin Kurian: Yes, we are serving the spectrum from things are purely workflow with LLMs injected, to somewhat agentic. I don't want us to say we have true agent experiences with no human intervention. We are not there yet. We always have a human in the loop step today. Yes, the agents can take a lot of actions after a human approves, these are the next steps it can take. There's an intervention in between. Agent decides something, the human can always cancel. That's how it is today.

Participant 2: I had a question, not directly on this topic, but on the infrastructure that you need to support this flow. While going through this journey, did you encounter any impedance mismatch between how applications are traditionally used versus when you expose APIs to agents who are more chatty in nature and thus might require different levels of caching or SLOs from your underlying layers?

Merrin Kurian: Caching comes into picture when you have scale. A lot of them don't achieve it. We built semantic caching, thinking, this is something everybody needs. It took a long while for use cases to get there. There were teams who were coming at me asking, this is how we define failed customer interaction, how do you define it for LLM? I'm like, I have no idea. These vendors change things up all the time. Things which are slow today may get faster later. There is no point trying to define SLOs based on simply request-response latency, was one of the first things we learned. How do you incorporate into your existing set of monitoring observability systems? That was somewhat of a challenge. Again, it's a partnership, we had to work with those folks. Sometimes your dashboard is always red because you are taking longer than all the others.

Participant 2: I was more referring to say, when you don't have agents, your APIs get maybe 10 requests per second. Now with agents, you suddenly have to handle 100 requests per second.

Merrin Kurian: That's a different problem.

Participant 2: Yes, those things of scaling up your underlying infrastructure that existed before.

Merrin Kurian: It's not like these agents are suddenly scaling and going haywire without any planning. No, these are all planned rollouts. You achieve 5% rollout today. You know all the dependencies. You work with all those teams to scale up. That said, we had blind spots. The guardrails monitors, they depend on a lot of ML models. Those had to be scaled up, not just all the other things. Eventually, we worked our kinks through. It's not like all of a sudden, the agent is going out of hand. That's never the case. We deploy all our agents on our existing service runtime. All of those guardrails and protections apply.

Participant 3: You talked a bit at the beginning and at the end about AI governance. Maybe you could share two or three lessons learned about that and what you have defined and what worked and what did not work.

Merrin Kurian: Are you asking about AI governance?

Participant 3: Yes.

Merrin Kurian: What specifically are you looking for?

Participant 3: In general, what did you apply? There are some international standards out there which are not that distributed?

Merrin Kurian: Our teams are actively participating in NIST, for example. A lot of these standards bodies are where they are also driving requirements from. They also actively contribute. I don't know if you'll believe, we are actively collaborating with security all the time. Our security researchers are the ones who are actually building the guardrails. They're not saying, no, don't do this. They actually build the guardrails to protect the system as well. They all take great pride in being industry leaders. Our security researchers, our governance folks. It is a partnership. Sometimes they go way eager. Then we have to address the business need. It's a balance that we need to figure out.

See more presentations with transcripts

Recorded at:

May 19, 2026

Merrin Kurian

InfoQ Software Architects' Newsletter

Powering the Future: Building Your GenAI Infrastructure Stack

Summary

Bio

About the conference

Transcript

Agent Development: Pilot to Production

GenOS: Enabling Acceleration and Scale

Beyond Technology: People and Processes

How to Prepare for the Future of Agents

Conclusion

Questions and Answers

Related Sponsors

This content is in the DevOps topic

Related Topics:

Related Editorial

Popular across InfoQ