InfoQ Homepage Presentations Beyond Prompting: Context Engineering and Memory Management for AI Systems at Scale

Beyond Prompting: Context Engineering and Memory Management for AI Systems at Scale

View Presentation

Speed:

52:35

Summary

Adi Polak discusses the architecture required to transition from stateless prompts to state-aware, context-rich AI agents. Drawing on 15 years in distributed systems, she shares how engineering leaders can leverage Apache Kafka and Flink for real-time stream processing, dynamic memory tiering, and tool orchestration via MCP to solve token limits, cost spikes, and latency bottlenecks.

Bio

Adi Polak is an experienced Software Engineer and people manager. She has worked with data and machine learning for operations and analytics for over a decade. As a manager, Adi builds high-performance teams focused on trust, excellence, and ownership.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale these workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Transcript

Adi Polak: Today we're going to talk about beyond prompting. We're going to break some distributed systems down into parts and see how we can leverage that for memory management, for distributed AI systems that we're building. First of all, we want to start with something that is a little bit out of the ordinary. Has anyone here watched the movie, "Men in Black?" Do you remember the interview scene when Agent J, aka Will Smith, is interviewing for the job? Let's watch it. Zed: Edwards, what the hell happened? Edwards: Hesitated. Zed: May I ask why you felt little Tiffany deserved to die? Edwards: She was the only one that actually seemed dangerous at the time, sir. Zed: How'd you come to that conclusion? Edwards: First, I was going to pop this guy hanging from the street light.

Then I realized he's just working out. How would I feel if somebody come running in the gym, bust me in my ass while I'm on a treadmill? Then I saw this snarling beast guy and I noticed he had a tissue in his hand. I realized he's not snarling, he's sneezing. There is no real threat there. Then I saw little Tiffany. I'm thinking, 8-year-old white girl, middle of the ghetto, bunch of monsters, this time of night with quantum physics books. She about to start some shit, Zed. She's about 8 years old. Those books are way too advanced for her. If you ask me, I'd say she's up to something. To be honest, I'd appreciate it if you eased up off my back about it.

Adi Polak: That's Will Smith actually interviewing and succeeding in the interview based on context.

Because at the end of the day, what we have is like the best military minds. They're recruiting from everywhere from the best of the tops of the world. The tops of the world, what they did, they did pattern matching. They looked at all these aliens, or all these weird creatures, and decided, let's shoot them. Will took a different approach. He said, there's context to these things. Like, there's a napkin in his hand. The little girl holding a quantum physics books. It's a little bit off. In our world, we can think about the soldiers, the regular soldiers, it's like the LLM. They can hallucinate and give us a generic answer, because they're just predicting the next likely token based on a sort of level input that we give them, like a prompt.

Versus Will, which is like the agent with our context engine there, that actually understands the environment, looks at the image, process the metadata, and understands the intent overall. Quick debrief, we have the standard issue units, the one that follows the patterns, the LLM, the prompt world, and then on the other side, we have the spatial agents, the context aware system, like the top secret. A true effectiveness in the AI world, if we'll ground it back to what we do, takes more than just the compute. It takes more than just the power. We know today that we need that context, and more and more systems are starting to build those contexts in. What is the paradigm shift that we're seeing now in the industry? We're slowly in the industry migrating from us interacting with the model through prompts, to us providing lots of rich content through different systems and different tools that we want to give in order for the agent to make the right decision.

If you think about it, if you take a step back, you'll think like we're moving from a world of stateless application, the application after it dies it kind of forgets things, to a world where we have a state-aware, a memory of different levels that we need to engineer and think about how we're keeping those in place when we need to use them.

Background

Adi Polak: I'm Adi Polak. I wrote two technical books for O'Reilly on, "Scaling Machine Learning with Spark", and the second edition of, "High Performance Spark". I worked in distributed systems for more than 15 years, some of it in hands-on, some of that is leadership and management. I started my career in the AI winter, the previous one. Hopefully, no winter is coming soon, but the previous AI winter, when the things completely went down, and I moved to the big data field, analytics and data infrastructure. Two years ago, I decided to jump into the streaming world, because I believe streaming world is really helping us with the AI tooling from an infrastructure point of view. I work for a company named Confluent. There are some rumors we're to be acquired by IBM soon, but essentially by the original creators of Apache Kafka, which is one of the loved open sources in the world.

Prompt Engineering

Let's start from the beginning. Let's start with prompt engineering. In prompt engineering, when we're starting to prompt a system, usually we have four different categories that we need to think about. One is role assignment. For example, I'm telling someone that he needs to behave like a senior security agent and so on, and so I'm expecting that system to act in that way. Second thing is few-shot examples. Few-shot examples is when I'm giving the system a couple of examples and results, but we'll start with the patterns. See an alien, shoot the alien. Few-shot's an example. The system needs to complete the next patterns that it's going to see. That's few-shot examples. The third one is chain of thoughts. Chain of thoughts, essentially, it's already built in today to a lot of the chat experiences that we have, like Claude, or Gemini, or ChatGPT. Essentially, there's a system that thinks behind the scenes and double-check itself, build a whole mechanism around planning and reasoning and executing and so on, and build some chain of thoughts and start to execute on them one by one before it actually gives us an answer.

If you interact with these systems today, you already know it wasn't what it was two or three years ago, where you give it a prompt, you get back a response. Actually, behind the scenes, it takes multiple steps and it has multiple layers of microservices. The last one is constraints settings. In the real world, we want to make sure that we are establishing the rule of engagements for these different tools. What data am I allowed to access? What can I say exactly? Is there some keyword that I'm not allowed to use? Is there an allow list, and so on. Constraints settings as prompt, when we give it to the model, give us the structure that we need and rough guardrails. If you work in the cybersecurity world, you know how easy it is to bypass that. The reason for that is because model has limit context. They cannot carry everything. Once we pass that, we're losing those constraints settings.

Again, we want to use it always as part of the prompt engineering work that we do, but we do need to be aware that there are limitations to these approaches. Essentially, at a high level, you can think about prompt engineering as better questions with more information that we need to give the model.

Context Engineering

What is context engineering? Context engineering actually holds within itself prompt engineering. Context engineering is an umbrella of everything that we're giving these models, everything that we are injecting, and we're deciding how to construct their tokens in order to get the best performance out of that model. We have memory management. Memory management has two layers. One is short memory, like the things that I need to summarize right now, very quick log conversation. We don't want the agent to forget about those short memories. The second one is long memory. For example, vector databases. We all know RAG and solutions like that. Vector databases that we're writing to. We're not pulling the information, but we're writing them to, so we consider the long memory is going to stay there forever until we decide to do something about that. Then we have state management. State management helps us decide what are the steps that we are in right now.

Today we know that our applications, our agents are multi-steps. It's not just one, constructing the context, sending to the LLM and done. We actually have a whole feedback loop with a judge or other mechanism for learning where we collect data back and we're deciding if the answer is good or not. We have that stateful application, and we need to continuously manage that state. In a distributed system world, it's hard to manage state. State can explode, and we're going to talk about it later on, how we can build those to make it more sustainable. RAG is also an important part. RAG is for retrieving the data. It's a retrieval mechanism. The data that we need in order to provide more context about our environment. You can imagine if I'm building an agent that helps me book travels at work, I need that agent to be able to pull in the latest policy of travel, and so it can help me book that travel that is actually in policy, because otherwise my manager is not going to love it.

It's definitely an important mechanism that we need to have there. Lastly is the tool access and execution. If you heard about MCP, MCP is one of the tools we have there. There's a lot of other tools that we need to be aware of, but that's part of every agentic system that we have, is the ability to access tools outside of the tools that I currently have. Perhaps I need to go to some SQL database and pull some more information. Perhaps I need to construct a SQL query that I need to build. There's a lot of things that we need to do, and then we can also execute on that.

Application Point of View

Where does it bring us when we are developing these applications? Essentially, we're moving from a world that we considered as stateless. If you think about pure prompt engineering, how it was just a while ago, it's us talking to the machine. Once we close the machine, the machine forgets. This is a pure stateless application in our world. There's no persistent data. Very simple architecture. Very easy to scale. If you're looking to scale an application endlessly, if you can land on a stateless application, the sky is really the limit. You're all about constraint of resources, you don't need to coordinate anything related to data. It makes it very easy to scale. The reality of things is that we are moving from these stateless applications to a stateless world that we need the context for the agentic loop that happens for us behind the scenes.

Essentially, we have an agent that gets a prompt. As we know, it's going to be sending it to an LLM. It gives us some response. It does some reasoning around that. It takes some action. Then it stores it back into the memory, giving what came out from the action. Throughout this action, we also get a grade, we also get thumbs up, thumbs down, or some score around how well that actual agent did, that later on is going to be fed back into the agents in order to improve its activities. There's also a culprit in that loop, because we can accumulate forever data. When we want to accumulate forever data, that means that every time I'm running that LLM, I need to take all this context, reconstruct it, and send it to the LLM, which can be endless. That can also be confusing, and that can also be expensive, and sometimes not feasible, because we have a limited amount of token that we can send to an LLM.

This is an important part when we're thinking about context. At the end of the day, we are constructing all of those good things, and we need to pass it to the LLMs that has the limitations on tokens. How we're constructing them, the context that we're bringing each time is going to be critical decision-making that we'll have to do as well. How does it work? We have the very high level. We have the right context, like the long memories that we have in there. We have the scratchpad, things that we need right now for a specific session, and we can categorize them. We have the state within specific sessions with the user. From that state, we can decide what needs to be saved for long-term and what's not.

For example, if I'm building an agent to help improve software, for example, in my organization, there are some things related to long-term memories, like which programming language we're using, where is our testing system, what is our testing philosophy, and so on, that needs to be saved within the long-term memories. I'm not necessarily expecting the agent to know it right away, but as long as I'm working with it more and more, and my engineering team is working with it more and more, this data is going to be saved. I'm expecting the model to know that this is important long-term data that is going to be saved for context. Later on in the retrieving phase, we started looking into what we actually need to pull. Do we need to pull from long memory when we just started a conversation? Do we need to pull from a scratchpad, and so on.

Another layer is compressing. How do we compress this specific information in order to have a fast retrieval and also have less tokens for that context? We have to run embedding in order to send anything to a model and save things in a RAG database. We all know that by now. Also, there are things, we can take a long text and actually shorten it, so we're going to look at that as well. Of course, we can completely decide to remove things that are relevant. Lately, we can isolate the context based on specific customers, specific users, specific groups, specific policy, and so on. Truly, if you think about it from a distributed system world, it's kind of like the multi-tenant when we need to serve multiple customers on the same cluster.

Key Challenges and Solutions

Why don't we just use Gemini, Claude, give it all the tokens, and dump everything in? We talked about it for a little bit, but let's look at the statistics. This is according to 2024, and our system always get upgraded, there are new models. Some of the models here became sunset, for example, Gemini 1.5. Essentially, it tells the same story. We have a finite amount of tokens, and we cannot exceed those. Where are the challenges when we're building a system? First of all, when we're interacting with the system, one of the challenges is latency and cost spikes. We build a system, everything starts to work, but then suddenly there are more people that want to use it, we need to address the issue of cost. We need to address the issue of latency. Latency matters when we're interacting with this system. Another problem is the lost in the middle problem. We have too much information. The last one is context collision and hallucination. Essentially, the model tells me, there's too many things here, I don't know what matters. It's like a human problem that we face as well.

Let's break it down to key challenges and solutions. First, state management. We know 100% we need to do it, we need to be explicit about that, and we need to be able to process it and having the process flow. Again, state management is the world where we know inside the session, we know exactly the state that we are at. Second thing, the short-term context. Short cliff, lost in the middle, all these problems, especially when we need a high-latency environment or we need to have a low cost, for example, means we need to compress and build hierarchies for these summarizations, even within our short-term context. Lastly, there's the long-term memory, which we can solve with a hybrid retrieval. For example, maybe we need semantics, maybe we need other solutions, but essentially, we can solve it with this hybrid retrieval that helps us make better decisions for the model. Retrieval is a big challenge, but also there's great solutions already in place. First one is hybrid search. Hybrid search, essentially, I'm asking the agent, think what is going to be the best way for you to find and retrieve that information.

Maybe you want to bring it from one system, from the other system, and so on. Inside the world of RAG, when we are building this vector database, there's a matter of weights when we're retrieving information. There's an algorithm that we can decide and change and re-rank how these answers are retrieved back from the model. We're actually taking another step back into fine-tuning and optimizing this world. Summarizing text, we talked about it. Contextual chunk retrieval, let me get just one chunk, create a whole feedback loop with a judge, tell me if this chunk actually was useful. If so, let's bring another chunk and so on until we realize that enough is enough, we reached the peak of information and insight that we can get out of that data. Prompt refinement can help me a little bit if using it correctly. For example, what was the original prompt? How do we want to change it? Do we cut it? Because we also need to take the prompt and construct it as part of the tokens. Domain-specific tuning. It's a very simple string manipulation. I can replace specific keywords or add specific keywords. It's going to help me improve my retrieval, and other solutions that can come to mind.

Let's bring it to our list of challenges and solutions in that world. Essentially what we have, we have the retrieval phase that we want to solve, so we need a hybrid retrieval. We need some query expansion and we need some scalable algorithm that can help us retrieve that information. Later on, we want to be able to augment it back into the model, so we need this hierarchy of summaries that we need in there. We need some ranking and filtering for information overload. We don't want to capture all of the data all the time, so we need to be smart about it. We need to have some dynamic context compression that helps us stay within limitations of tokens. There's another area where context can impact us, and it's not only context, it's the context in how the model was built.

When researchers train LLM models, at the end of the day, it's up to date to a specific time, and all the new information that it gets, it's basically adding those data back in. That could be that there's bias amplification, depends on what the model was trained on. Of course, hallucinations on data that is not relevant anymore. Overall, we have the systematic challenges that we need to solve, make sure our data is up to date, regular updates. Latency bottlenecks, so low latency algorithms and caching, how do we go about that? Lack of transparency of our model, how do our models actually operate? When I'm getting back an answer, it sounds like something I can trust. Can I actually trust it? I need to be an expert in a specific domain in order to trust what I'm getting back. How do I make sure I'm building that judge system that it's, again, another LLM or an algorithm that they have in there that needs to tell me if that was correct or not?

The Architecture

We talked a lot about challenges, let's talk about our architecture. The architecture that we build at Confluent specifically is around data streaming because this is our expertise. As I said before, we are the main contributors or the original contributors to Apache Kafka, and we also contribute to Apache Flink. Both of them are solutions for data streaming. Apache Kafka specifically is like a pub/sub mechanism for event-driven applications. Apache Flink is for real-time processing. There's also batch processing, but this is less used. What it's mostly known for is for real-time processing with very low latency capabilities. We took Flink and we turned Flink, we built like the developer API for a Flink streaming agent. This is an open-source library that you can take a look and use into how we build it. The reason that we use Flink specifically is because it has the milliseconds latency, so scale streams in real-time. It gives us the capability of managing state. We'll dive into what this managing state means in Flink.

It has exactly once consistency. In the event processing world, exactly once means that I'm processing at only one time, essentially. There are three levels that I can choose, exactly once, at most once, or at least once, so I can control and I can make sure that no messages are being lost in my distributed system. In distributed systems, although we can be the smartest engineers in the planet, things always happen, and so being able to have that mechanism of exactly once is really critical to when we're building these agentic systems as well. We don't want to lose messages, we don't want to lose context, and so on. We can leverage well-known AI agents on top of those Flink APIs, so that gives us another capability that is easy. Underneath the hood, there are different tools for observability, and I'll show you how these connects when we are bringing in Kafka, for example. If I'll take it from the top to the bottom, we have the Flink streaming agents. We have an optimization engine.

Flink has an optimization engine for all the transformation that we're doing there. Also, access to storage. Then we can break it down into an orchestration, because at the end of the day, it orchestrates a distributed system, and then makes some transformation on top of that data, so actually these transformations are going to become like an API call that we want to use. It has built-in memory through checkpoints and other memory state management that it has. Flink enables us to build a completely stateful application. There's also tool calling that we're bringing. When I'm pairing it together with Kafka, essentially what I have is I have all my events coming in in real time, and then I have my Flink taking it from Kafka and start to process that data in real time as well. Kafka also has something called Kafka Connect. Kafka Connect enables me to connect to different systems in the world.

You can think about BigQuery, Salesforce, SAP, whatever systems that you're working with, and bring that data in in real time through Kafka Connect, so that data can help me enrich the context for my agents in real time and becomes part of my larger architecture. Kafka also has an aspect of memory. Essentially, the original way of running Kafka is when you have clients and you have the brokers. Kafka has something called topics, and the data of the topics itself is being saved on the Kafka broker and partition, and of course in a smart way. It's being saved on the Kafka broker, and then that broker actually can become another layer of memory that we can use. This is what we did. We took all these two great distributed systems, broke it down into pieces, and reconstructed them to what we need in order to build those agentic AI orchestrations. How does one translate to the other? How do these two systems that we know from real-time streaming translate to what we have today?

I need to take a step back and talk about Kafka and talk about a little bit of what we did to Kafka as a company. We have millions of topics. We have the largest Kafka cluster in the world that we're managing on behalf of our customers. In order to be able to manage it with multi-tenancy and security and everything related to that, we built an engine called Kora. Kora is a cell architecture-based engine. What is cell architecture-based? Cell architecture is how AWS builds S3, for example. It's how the largest vendors in the world that provide cloud solutions are able to scale. Essentially what that means, I'm breaking my architecture into a small cell that is self-served for multi-tenants, for multiple customers. If I'm looking at my customers, at the top, I have a layer of networking that helps me understand where the customer requests should go.

This is something that we're very proud of, our layer of networking, what we call as part of the control plane, is probably state-of-the-art. We also won a prize for that for VLDB. There's a great article about that. I'll give you the link. Essentially throughout the latest outage on AWS, we were still up, everyone was still up, our customers were very happy, and that's because of this architecture, that specific control plane. We have the customers. We have the network. Underneath that, we have compute. We took Kafka, we took the brokers, so it essentially has compute and storage on one machine, and we split it. We have storage where storage is, like an object storage. Then we have compute where compute is. We also know that when we're getting VMs from some of the cloud providers, we're also getting storage back. We want to make sure to use that. Some of them have SSD, for example.

If you think about it for a second, this SSD actually enables us to give another layer of storage, another layer of state management that we need in that. This is another Spark mechanism that not only helps us deliver in a very low latency, it's like one digit millisecond latency for retrieving information from the topic, but also, in the world of AI systems and AI agents, actually becomes a very strong mechanism for us for retrieving information and building that context engine the way we want to build it. As part of that, there's always observability metrics and other things that we need, governance, access control, security, processing, Kafka Connect in order to bring our data in. This is at a high level. If I'll zoom in, only on one cell of Kora, what we'll have in there, you will see, there's a Kafka cluster, you'll see the small Kafka logo within. You'll see object storage. You'll see SSD volume.

You see a controller that communicates with the rest of the system to answer the health check calls. We have different, of course, availability zones for each one of them. Behind the scenes, what we have, we collect a lot of statistics on usage of different availability zones and different cells in order to understand if we need to scale up or down. Before even entering that phase of a customer needs more topics or needs more access, we already know that we need to scale up or scale down based on the statistics that we're collecting. Internally, we call them agents, because these agents are collecting data of usage, of metrics, and they're being adjusted all the time. If you think about an actual AI agentic system for a second, we built on the side like a mechanism, a feedback loop of, maybe we should take a look and update those statistics here because these now have changed.

We noticed some new trends, so we are slowly incorporating it, gradually, doing it very smartly and very responsibly, but we're incorporating slowly more AI capabilities into it, more of the recommendation areas until we feel secure and safe. Inside those, there's a lot of statistics that we're collecting in order to make decisions based on those and how to scale that system. We talked about Kafka, and we said Kafka, essentially, is compute and storage in one machine, but on Kora, we broke it down into memory that is separate or memory that is on object storage. Our object storage can also be, later on, shifted into a specific format, like Iceberg or Delta. This is through a different mechanism. Those are important when we need to retrieve information that is historical information for the agents. Agents, sometimes, will need historical information about the world, and so this information, we also need to be able to retrieve that when we need it.

If you look at the top, there is the SSD, so SSD enables fast caching, fast serving for that cache. Whatever I'm saving in there, I can retrieve it super-fast. That's one of the key parts that we need to make sure we have when we're building those. Let's talk about the tiered storage and try to map things a little bit. If I have my compute, my compute can have an SSD or other machine disk, depending on the VMs that I have in there. Some VMs come out of the box with that. For example, in Azure, Azure VMs come with a machine disk for you. You cannot just take compute, so you better use it. It's great for margins, if you're tracking your margins, for example. This can help us in the AI world for short-term. Why short-term? It's fast. It's there. I can start using it right away.

If my AI agent is having a longer session or needs to summarize things, I can automatically save it into a Kafka topic. In my Kafka topic, I configure that specific Kafka topic to make sure this is highly available data, which means this is going to be saved on an SSD or on a machine disk, so I can benefit from the millisecond retrieval that I know this system gives me. This is the short-term. Longer-term is when I'm taking the data and I can configure that specific topic to take that data or above a specific threshold and send it to an object storage, S3, Azure Blob, GCP, and so on. This is where the long-term get into place. It's what I need to remember about the overall session or the overall user. Maybe I have some policy, maybe I have some more information that I need to save there, and so I can configure this Kafka topic to say this is my long-term memory management for my specific Kafka topics, and so I'm taking it and saving it over there.

It can also be through Kafka Connect. Like I said before, taking the data, dumping it into a VectorDB or anything else, and then retrieving it when I need to retrieve it. Let's put it all together. Our streaming agent system that we built, essentially, we took the distributed system and we reconstructed it to what we need it to be. We didn't change a lot of things, we just added a couple of more layers in order to serve more capabilities on top of the things we have, and of course optimizations for some specific areas. Let's talk about state management. This is an area that we didn't touch a lot of. Flink can have a stateful computation, which means it's saved within a context window. It has its own window, not related to context engineering, but it has its own window that we can save that specific state.

That state can either go into a RocksDB, which we know how to configure to be super-fast, or it can go to other checkpointing mechanism. Checkpointing mechanism has different layers of storage there as well, and we can decide where we want it to go. Do we want it to go to opposite storage or do we want it to go somewhere else? That gives us our state management reconstructed specifically through Flink. Memory tiers, we talked about it a lot. Let's jump into execution and orchestration. Underneath the hood, every distributed system has orchestration. No matter if you're using Flink, you're using Spark, using other distributed systems, at the end of the day, there is an orchestrator that helps you move the task from one to the other, and becomes the brain of how actions are going to happen in your cluster. Flink, because it has this orchestration, enables us to connect to different tools and execute based on different tools.

Flink has transformation. Transformation can be a user-defined function. Throughout a user-defined function, we actually can connect to a model and run different actions on top of these models: send request, get response, and continue to process the data as we go. Essentially, when we have those tools in access, we can actually call LLMs, because my LLM has a REST API. I can access it and start running queries or start questioning against it based on the data, based on the prompt that I got from the user after I enriched it with some context. It helps me manage that inference request to the large language model, all in one engine. It orchestrated. Because it's orchestrated, because it's orchestrated with Kafka as well, we're benefiting from the infinite log that Kafka is. Kafka is an infinite immutable log that we can always save data. That means for every interaction, if I'm moving my events and my context management through Kafka and through Flink, I can later on replay some of these messages if I wish to.

I can have observability on top of that, out of the box, because I can take Kafka and connect it to OpenSearch, or Elasticsearch, or Datadog, or your favorite observability. The fact that you're already saving all those and there's already really good integration with observability tools, really simplify that path to observability and also the path to governance and other things that we usually don't like to think about, but we need to think about when we're building those systems.

Apache Flink Agents Documentations

Let's take a look for a second on Flink Agents Documentations. Like I mentioned, Flink agent part, the API part is part of an open source that we're contributing to. If it's something that you want to play with, you want to try out, you're most welcome. Specifically, let's look at the ReAct agent. ReAct agent is an agent that's supposed to think, take actions, observe, and repeat. You can see at the top there on the left side, we have to define, we're going to use an agent to start with. We're creating the descriptor. We're going to give it whatever model that we want to use. Maybe we want to use ollama for setups, the connection, the models, and the tools that it's specifically going to use. Then we want to give it a prompt because we always need to have that prompt. We also want to give it an output schema because everything that's being sent through Flink has a schema, everything that is retrieved from Flink has a schema, and it's ok for my schema to have plain text in it, but I do need to make sure I am giving it a schema in order to start executing with that.

Later on, when I have this model, I have this specific prompt from the model, I can start using it and integrate that with the rest of what I have in Flink. If I have my product review stream of data that continuously comes in that I'm monitoring and looking at, I can start running some Lambda applications, product review models to validate that model that I'm getting in. At the end of the day, what I have is a simple SQL query or a simple Python or Java code that I can also use that gives me a response. Where underneath the hood, I have my distributed systems that manages all the states and everything that I need behind the scenes for me. For example, at the bottom, we ended up printing it. We probably won't want to print it. Probably, at the end of the result, we want to inject it back into another dashboard or application or monitoring capabilities in order to understand if that specific output is what we're looking for. This is how we're putting it together. If I'm like a product engineer, what I'm doing is I'm building the logic. This is how it will look like for me if this is what I've been doing.

We can also connect it to an MCP, for example. A lot of the work that we do today with tooling is actually through MCP interaction. With MCP, we can create a connection to an MCP server. Once we create the connection through SQL, we can embed it into our SQL query. You can see on the right side, we're creating some connection to DeepWiki. We're giving it some API keys in order to actually execute on that. Behind the scenes, we have a secret store to manage those API keys. Then after we created that MCP, we can create the model that we want to work with, for example, OpenAI in that case. Then we can start creating the table and invoking calling to these models through MCP in my SELECT statement. You can see in my SELECT statement on the bottom, you'll see AI_TOOL_INVOKE. This is a secret keyword or a safe keyword that we're using in order to invoke the specific model that are the tools that we created to start running those and sending those to the MCP server that is connected.

This is how we're connecting the whole thing. We have the MCP, like the claims_mcp_server. We have the model, the tool invoker. Then we have, at the end of the day, the query itself. This is what we're experiencing when we want to build the logic.

Use-Case: Real-Time Trading Volume Anomaly Detection with Kafka and Flink

Let's talk about a real use case, someone that I work with very closely. This is from an E*TRADE world. They're building one of the largest E*TRADE platforms on the planet. One of the things they really care about is anomaly detection around volumes. We want to know when you're trading some specific either stock or other assets or multi-assets, so it doesn't have to be stocks. If you're trading stocks or other assets, you want to know the volume. If there's anomaly specific to that volume, it could be that people are more interested or less interested in that specific asset in the moment. What do we have? We need to pull information from market data. We have the trading engine. We're bringing that information in and streaming into my Kafka topic, like a raw trading volume. Later on, we have the Apache Flink cluster that wants to process that data and actually start making decisions and add agents to run these anomalies and give us the feedback loop.

My Flink cluster is going to consume from my Kafka topic, and it's going to start ordering things, KeyBy, tumbling windows, aggregation, and lastly, anomaly detections that is a built-in functionality that we have in Flink. Anomaly detections, there is a notion of threshold. There's a notion of configuring how the anomaly detection is going to run. We have to be able to give it previous data. We have to be able to give it some configurations in order for it to act well. We cannot just use any model and hope for the best. We actually need to understand how these things work. This is where my agent comes in, because in my agent here, I can actually take that data and start playing with the thresholds and start being smart about thresholds, given the agent domain expertise. If I have, for example, a small language model that has been trained specifically for trading, and it noticed that there's some variability in the data of the trading volume, it knows how to help me adjust, and also, it knows how the anomaly detection algorithm works in the world.

It can give me a recommendation on how to adjust this threshold in my anomaly detection mechanism in order to get better results and make sure we're not losing the market. In the beginning, we start with a suggestion. Let's make it a suggestion, create an alert. My agent is now creating an alert about this system that tells me, you need to adjust the threshold or look into the data, and slowly incorporating through an A/B mechanism, more actual action-making and decision-making for that specific agent. This is how we constructed a real system, and this is how, under the hood, we have Kafka and Flink supporting us through that.

Recap

I hope now I convinced you that the future of agents is stateful, and we need to be able to solve hard problems, even harder than the ones we had today. Stream processing, we always need that. I always say we don't want to cross the road based on yesterday's snapshot of traffic, and so I need to know right now what's happening with traffic, so stream processing in real time, bringing the data in real time can really help me manage that state. The context pipeline or the context engine is the new model architecture that we need to think about out loud. Lastly, I want to leave you with that. There's a lot of things that can get very complicated in the world of building those agentic systems. If you already have a system in-house, think about how you can take it and reconstruct it for what you need, and just do the simple thing that works. Everyone says that, focus on the value, and keep it simple.

Questions and Answers

Participant 1: For the Flink AI tool call, is it only available on Confluent Cloud, or did you make it available for the open-source Flink so anybody can potentially use it?

Adi Polak: Yes. Some of that is available on open-source. You can take it and use it. You will need to manage Flink under the hood, so something to take in mind, how do you connect it? There are some tools that are still not part of the open-source, but working on it. Just because the complexity of putting it out there without the management part, we still haven't figured it out, so we still need to understand how to do that.

Participant 1: That specific example of the AI tool call that you had, I don't remember where.

Adi Polak: The MCP one?

Participant 1: Yes, this one.

Adi Polak: Both of them have AI. MCP specifically, it's still not in the open, but we're working on that.

Participant 1: The tool, you're working on at least in open-source?

Adi Polak: Hopefully. Once we figure out how to divide the systems.

Participant 2: A few slides ago, in the one about the Flink and Kafka architecture, there's a tidbit about how to use Kafka topics to manage the memory. Or maybe I misunderstood that. There's a bit when you talk about the memory blocker, you're saying using Kafka topic to handle the memory portion. I was wondering if you can clarify that a little bit on what that means.

Adi Polak: Kafka, the base core, if you look at the open source, we have the clients, and then we have the brokers. The brokers are both compute and storage. When you deploy it as-is, when you don't change the architecture, you have compute and storage there. You can decide that the storage, for example, is going to be a machine, you buy a machine from Azure, from AWS, and it's going to be a machine with SSD. Now, all the Kafka topics storage is saved on SSD, and you earn that fast caching, so you can bring information very fast from Kafka brokers. Because you have that mechanism, you can say, ok, as people that build a system, everything related to this fast memory that I now need is going to be under a specific Kafka topic that I know every time I need to scale this Kafka topic, it's only SSD-based machines. Now you have this fast management, short-term memory management, usually, to retrieve faster data.

Participant 2: It's using Kafka topic as a way of caching, which then by way of using as a way to store and categorize short-term memories.

Adi Polak: Exactly.

Participant 2: How long do those memories typically last? What's the typical life cycle, then, of the short-term memory?

Adi Polak: It really depends on the application that you're building. Usually, short-term would be for the session. You can define what a session means. For example, for the anomaly detection that I gave, as an example here in the end, a session closed once the threshold optimizer agent either makes a decision, or fires an alert, or writes to log that nothing happened. That's a full session.

Participant 3: I had a question about this example. In your real-time trading volume anomaly detection example, when you have this thread optimization agent with AI, ML, since you mentioned you customized this model. Did you have to play with the context length of that model itself, or you externalized it with all this SSD and all that so that you could still use the existing context length of that model?

Adi Polak: The context length, we have two limitations. One is the number of tokens I can send the model, and the second thing is how fast this model can execute on the limit of tokens we're sending in. It's like, you have to build a metric. When you're using this platform, you have to understand what is the accepted latency that you're getting from your LLM provider giving those capabilities. Here, specifically, we're not providing LLM inference. We're using existing LLM inference like Claude, Gemini, and so on. It really depends on that. That's a little bit of mathematics and calculations. You can also offload it to a potential agent that is good in mathematics to give recommendations for that. You have to create a chart, look at it, and make conscious decisions about what makes sense, and perhaps even adjust it. Some of our use cases is you have like dynamic adjustment based on the importance of the context that we just retrieved. We have the layers of data that comes in. Most important, must have that, so you need to figure out a way to keep that context versus nice to have context that is not always necessary.

Participant 4: I have a quick question on the streaming agent system that you guys used for Flink. What's the major difference between other agents versus the streaming agents?

Adi Polak: We already have a streaming platform. This is what we build. We've been doing it for 11 years now. We took our streaming platform that we have, and we reimagined what would that mean now for a user that wants to build an AI system knowing that agents need access to real-time data and access to real-time compute. Because historical data is good to some extent, and we always want to leverage that historical data, but we want to pair it with real-time. This is how we took that, and we built those APIs on top of what we have in order to expose a streaming agent system. The reason that I'm sharing it is because many times you already have systems in-house that we can leverage. Maybe we need to add a vector database and things that are more relevant to the AI world that we have today, like MCP and so on, and maybe specific algorithms to summarize context and compress context. From an infrastructure point of view, many times, if we need a distributed system, we already have those in place. This is just a mapping, essentially.

See more presentations with transcripts

Recorded at:

Jun 10, 2026

Adi Polak

InfoQ Software Architects' Newsletter