InfoQ Homepage Presentations The AI Gateway: Scaling Centralized Inference Across Decentralized Teams

The AI Gateway: Scaling Centralized Inference Across Decentralized Teams

View Presentation

Speed:

Download

46:51

Summary

Meryem Arik discusses why modern engineering teams face "inference chaos" and how AI model gateways provide a critical control layer. She explains the balance between empowering decentralized teams to choose the best models and maintaining centralized oversight for security, RBAC, and cost control. Explore open-source solutions like LiteLLM and Doubleword to streamline your AI infra.

Bio

Meryem Arik is the Co-founder and CEO of Doubleword (previously TitanML). She frequently speaks at leading conferences, including TEDx and QCon, sharing insights on inference technology and enterprise AI. Meryem has been recognized as a Forbes 30 Under 30 honoree for her contributions to the AI field.

About the conference

QCon AI is a practitioner-led event focused entirely on the engineering discipline required to scale these workloads safely. It provides direct access to the architectural playbooks and failure metrics that peer organizations use in production.

Transcript

Meryem Arik: I'm Meryem. I'm the CEO of Doubleword. I used to be a physicist at Oxford. I'm a big rugby fan.

I'm going to talk about AI model gateways, which is actually a pretty dry subject. It's not really what my company does. You might ask, why am I interested in this as a topic? We started Doubleword about four years ago, always focused on the problem of inference. The inference process being the process of actually running the models. What we kept seeing with our clients is we were providing inference and inference services to them, but we typically weren't the only inference provider they were using. They were probably using OpenAI and maybe Mistral and maybe some self-hosted fine-tuned models that they built themselves. We saw them getting into a bit of a chaos situation with all of these different providers. We ended up having to try and fix that for them.

AI model gateways are a really easy way to bring order to a chaotic environment where you have a lot of different model providers. Because we kept seeing this problem over and over again, we actually built an open-source AI model gateway. We have experience building these things from the ground up. We don't sell it. It's not a commercial project of ours, but we think that AI model gateways are very important. We think everyone should be using them.

Do you currently have an AI model gateway in your organization? Something like LiteLLM, OpenRouter. I'm going to try and convince you that every single person should have an AI model gateway, even if you're deploying it quite small scale. That is my goal of this session.

Outline

This talk is actually inspired by this blog that our CTO wrote about control layers and model gateways and why they're valuable. It's fun bedtime reading and it's not super long. What we're going to cover, three things. What are the inference demands that your teams will have? When I say decentralized teams, I mean the teams actually building your use cases. Why does it make sense to have your inference centralized in some kind of way with a central inference platform? How AI model gateways can be used to solve the tension between decentralized solutions and centralized infra. Then hopefully by the end of it, you'll be like, we should all put AI model gateways in our businesses.

Inference Demands

Inference demands. Every single application will have a different requirement for inference. There's no one model that rules them all. This was a Nano Banana, and I think it's pretty good. You, for example, sometimes need different dogs or different models for different use cases. To take a very English hunting analogy. If you're going hunting, which is a very posh thing to do, you might have a pointer, which is the thing on the left. They essentially point and are used to locate game and used to locate what you're shooting at. You will also need your spaniels, which are used to flush out game. They run into the forest and they make all the pigeons fly up that you can then go shoot. Then you want your retrievers at the end that will go and get the pigeon and bring it back to you.

If you actually want to execute a successful hunt or a successful use case, you need a lot of different models working together seamlessly. There is essentially three dimensions in which you need to consider to pick the right model for the job. You have the application quality. How good is this model actually solving the problem that I have? Does it get all the answers right or does it get them all wrong? A good model for the job will get them right. I have a lot of non-performance reasons. For a lot of companies that we work with, they will have things like everything needs to be in AWS because that's who we have credits from, or everything needs to be in the EU or in Mexico, for example, because the data needs to be resident there.

Then that will narrow down the models you can use. Then you also have inference performance tradeoff considerations as well. I might have a model that is exceptional at getting the quality that I need, but it costs me 25 bucks to run every single time and so I'm not going to use it. I'll talk a little bit more about those aspects individually. Something that your use case teams need to be able to figure out is how good the model actually is at performing the task. There are dimensions like modality. Is this an embedding task, an image task, a voice task? Each one of those is different models that you care about. What is the performance or level of intelligence, which you can figure out using various benchmarks and Elo scores? How well does it perform in my specific domain? We frequently see this with health care clients where they want models that understand their particular domains. Maybe I actually want to use models that are very good at my specific task. For example, labeling cancer screens is a very specific task.

If I'm empowering my use case teams, I want to give them access to all of the models that they need to be able to build a good application and pick the right model that performs best. The non-performance reasons. These are often the most boring reasons, but are very important. Pre-commits with certain vendors. For example, if you're an AWS shop, you probably are going to be using OpenAI because everything you want needs to be within AWS. Data residency requirements, so in a certain region for GDPR or maybe with a certain compute provider. We need to consider what are the models that will satisfy these non-performance reasons, and we need to give our teams access to those.

The final thing that your use case teams will need to consider when they're picking the right model for the job is the inference performance. Is the cost acceptable for the use case? For example, let's say I have one use case, which is actually an online girlfriend chatbot use case. I'm probably charging my end subscribers like $5 a month to use this service, and so I need my inference to be really cheap. Let's say actually my use case is about cancer screening. I'm charging my end user a bunch of money to get access to that, and so I'm happy to have a higher cost. That cost needs to be acceptable for the use case. Latency is a really important one. Do I need very low latency, in which case I might use something like Cerebras? Or can I also do my inference async? What's the throughput I need? What are rate limits that I need?

Different inference providers will have different tradeoffs in this space. We are really interested in high-volume workloads. People like Cerebras and Grok are very good at low-latency workloads. As a central inference team, you need to provide your decentralized teams with access to the right tools. I can give you a couple different examples. I think this is a place where Nano Banana really shined. Different use cases are going to pick different models. I can walk you through a couple of use cases and what the tradeoffs might be. I can start with a coding assistant. Hands up if you use some kind of coding assistant. Claude Code? For this kind of use case, because you're actually interacting with it constantly, you need it to be super low latency. I don't want to start writing the line of code, and then it takes five minutes for it to autocomplete the rest of the phrase. I need it to be super low latency, and I need it to be really high quality.

This is an example where anything other than completely accurate is useless to me. My inference setup, the tools that I need to provide my team might be called Opus, for example, which is a great model for coding. I need super low latency real-time inference. I have another use case of data labeling. Let's say my team is building data labeling as part of their pipeline. Maybe we are taking a bunch of images, and I need to label when a certain thing comes up. Because I'm doing this at high scale, likely, I need a very low inference cost, because the value to me isn't that high. I maybe want it to be domain-specific, so I might use a very low-cost provider of a fine-tuned model.

Then the guardrails, which exist in almost every single application. A lot of people route their requests through guardrails before they send it off to the models. I need this to be super low latency, because this guardrail application is hitting every single request that I send, and I don't want it to be a big drain. I maybe also care that this is resident in a location that I care about, because I don't want to accidentally send PII somewhere I shouldn't. The takeaway that I want to portray here is that use case teams really need the freedom to pick the right tools for their use case. A given use case might need multiple different tools and multiple different providers.

Centralized Inference Capabilities

I'm now going to try and make the argument that the inference tools, so we've discussed that the use cases should be decentralized and teams need to be empowered, but the inference tools should be centralized. If I take all of this to its natural conclusion, I'm letting every single team use exactly the tools that they care about for the use case they care about. What I'll end up with is some horrible spaghetti where every single team is going to be calling to a bunch of different models, and that I have API keys running around everyone. It's just a big nightmare. This was actually fine about 18 months ago, maybe 12 months ago, where I had one team experimenting and everyone was using OpenAI, because that was the best model at the time. That's just not true now. Now I might have 12 use case teams. I might have eight models that I'm using. Maybe I have OpenAI and Fireworks and AWS. Maybe I have 37 different API keys because we keep generating new ones for reasons that we don't understand. A bunch of products nobody approved.

An intern that accidentally spent $4,000 on a weekend. I've actually seen this in real life, and it was much more than $4,000. This is the problem that you have when you have decentralized teams who are just using the stuff that they want to, and it just becomes complete chaos. We need to be able to empower the teams, but do it in a way that we are still controlling everything that's going on. There's a lot of reasons why we should be doing this. I'm going to walk you through a couple. Here are the reasons you should be centralizing inference if you're self-hosting your language models. Can you put your hands up if you're self-hosting at least some of your models? You're deploying models on VMs or maybe in data centers that you own. The reason why you really need to centralize if you're self-hosting is probably more important than anyone else, is that you want maximum GPU utilization. GPUs are incredibly expensive.

If I'm not centralizing and I'm self-hosting, I might have five different teams all deploying a Qwen model or all deploying a Jet model. That just becomes really wasteful. We also want to be able to smooth load between use cases. That will help us get better utilization as well. Another reason why if you're self-hosting you really need to be centralizing your inference, is to monitor reliability and uptime. When you're self-hosting, you're dealing with a much more fragile ecosystem as well. If you are not self-hosting, so you're using a hosted provider. Hands up if you're using some hosted provider, so you're paying essentially per token? Also, very good reasons to centralize for boring reasons. You can negotiate bulk discounts. You can sign specific data retention policies. You can do things like negotiate higher rate limits, are really good reasons to centrally negotiate that.

A quick side note. My favorite thing in this entire presentation is that Nano Banana faked a signature for this particular image, which I thought was very funny. For all inference demands, I really need to be centralizing for a bunch of reasons. I might have different access policies. Let's say I have one model that's been fine-tuned on customer data. I maybe don't want the intern to have access to that. Auditability. I need to know what's gone in and out of all of the models. Cost controls, which are very important, which I'll come on to in a bit. I need to monitor reliability and uptime, things like that. Maybe as a company, I want to enforce guardrail policies. I also want to be able to put in things like data controls as well, which will help me with locality. My second takeaway, which I hope I've gone some way to convince you of, is that centralization at inference time is really important, if you care about ensuring governance and optimizing costs. We want to avoid that spaghetti situation, but still give people the tools.

AI Model Gateways

This is exactly what AI model gateways were designed for. There is obviously an analogy with traditional API gateways. This is from Kong, and I think a lot of people use API gateways. It is meaningfully different to have an API gateway versus having an AI model gateway. Reasons being is that the requests that you get from AI models just have different features than normal API requests. We want to be monitoring relevant things, which you might not be doing in a traditional gateway. I might want to be enforcing things like guardrails, which there isn't an analogy for. I need features like model-aware routing, which this wouldn't be sufficient for. I have an AI model gateway that can do a bunch of different things. I'll walk through each one by one.

The first one is we want unified API access. This is actually a benefit more for the use case teams than it is for the central teams. As a use case team, I want to be able to swap models in and out really easily. Sometimes this isn't easy unless you have a model gateway, because sometimes they use slightly different schemas. For access controls, so this is especially important for self-hosted models. Let's say I've deployed a model with vLLM or I've self-hosted on my own infrastructure. vLLM doesn't come with authentication or authorization. I need to be able to build in access controls there. I want to ensure that I have logging, monitoring, and auditability built in. When in 12 months' time we ask why did we make that decision, we know exactly why.

Or if later down the line I want to be fine-tuning a model, I still have the data to be able to fine-tune on that. Model routing is a valuable thing that you can get from AI model gateways. For example, I might have a situation where I want to route models based on the difficulty of the request. Or I might have a situation where I want to route models based on the load that I'm getting. Let's say ChatGPT or OpenAI is down again, and I want to automatically route all of my things to Anthropic or to another provider so I don't get downtime. Cost controls are incredibly valuable. We want to do this by groups. I'll come on to that in a little bit as well. Rate limit controls as well by groups as well, so I don't have that intern spending $4,000 over the weekend in a way that's very annoying. I also, as a company, might want to enforce things like guardrails, failovers, or just have unified prompt management as well, which model gateways can do.

This is an overview of how AI model gateways tend to look. I have a bunch of my apps, which is this decentralized innovation that I was talking about earlier. All of those apps are free to pick whichever model that they think is best for their use case. At the bottom, I have all of the models that I made available to my team. Maybe I have some self-hosted inference, batch ASIC inference. That's what we do. Maybe OpenAI, Anthropic as well. All of these different models that they can pick from. Every single request that comes from my use cases goes through my model gateway. I'll have a prompt and some metadata associated with that as well. That metadata might tell me things like application context. What's the SLA I need to get this back for? Does this need to be real time or can it wait 10 minutes? In this metadata, I might also have data handling requirements. Do I need to delete this data after a certain period of time? I'll have routing requirements as well. What model does it need to go to? Are there failovers that are acceptable? All of my requests then go through my model gateway, which does everything like my logging, my monitoring, my authentication, checking API keys.

Then send it to the model, comes back up. What's super important in this regime is because my model gateway is touching every single request, it is completely essential that it's very low latency and should be essentially invisible to my use case development teams because I don't want to be getting in their way. I just want to be making sure that we're being sensible as an organization. This model gateway can collect a bunch of information, which I can then store, which I can use to make things like chargeback reports, alerting, do things like user ratings and stuff like that. This is a very powerful place and acts as that linchpin for all my requests.

There's a bunch of different model gateway options that you can go home and try. The most popular is LiteLLM. This is the most fully featured. Here's ours. It's open source. It's the highest performance one. We have Portkey. They focus a lot on guardrails. There's Bifrost. They also claim to be pretty high performance. These top four are all open source. Then there's OpenRouter as well, which is pretty popular as well. My personal preference would be to go for an open source one for the reason of you want to be able to self-host this ideally. Using one of the open-source ones is a good place to start. Here are various options you can go home and try out. They're incredibly easy to use and get started with. Definitely try one out. I'm going to go through some of the features that these model gateways often provide. I'm taking screenshots from all of these different providers. A lot of the features are very similar. This controlling AI usage. Here is what a screen of my model gateway might look like from the use case development point of view. As a use case developer, that decentralized team, what I see is all of the models that I have access to. This will depend on the roles that I have and the groups that I have.

In this situation, I have a bunch of models that may be self-hosted and some other models that are API providers as well. I as a user can go into here and say, I want to use my GPT-5 Nano model for this use case, get an API key for that and start developing with it in a way that I know as a central team is all going through my model gateway. This is a super handy way for your use case teams to know what models are actually available. Most of the providers provide some interface like this. Very handy feature is request logs. These can sometimes be dangerous. You need to be careful about who gets access to what. One of the reasons why we want these AI model gateways is so we can actually track every request that's gone in and out and make sure that we're not violating policies, for example. What I want is a log of all of the requests that would come in, what model served it, and other metrics like how long did it take me, how much did it cost. RBAC can be a bit of a minefield, and so permissions are really important. We and I think a lot of the providers have standardized around building their solutions around groups. These are the easiest way to map your users to different models and different permissions. They're really nice.

If I go back here, I can do things like say, this claim detection model is maybe a fine-tuned model using PII information or something like that. I only want that to be given to X, Y, Z group. We could talk about budgets and rate limits, which is one of the reasons why people essentially get most excited by model gateways. I actually want to be able to control what they can spend and how much they're able to put through. This is taken, I think, from LiteLLM. Every time I'm creating a group, I might want to say, this group is maybe the intern budget, and my intern, she can't spend more than $15. I give her a budget of $15. I also probably want to give her a rate as well. This intern, I might say that her project is not a high priority project, so I'm going to give her a very low rate limit. I might give another team who is serving our mission critical flagship use case, give them a super high budget because I actually just want the use case to always work. These kinds of budgets and rate limits can be set at the model control layer. It's a necessary part of your team applying for access to get to that model.

The impact of all of this. My decentralized use case teams, so they really like these model gateways because it helps them to innovate much faster. We're not restricting them in a way that a lot of centralization can do. A lot of centralization could say things to the effect of, we're only using AWS Bedrock now, for example, and that's a way of restricting them. They still get all of the models and tools they care about. They can make sure that they can do their use cases very well. They can easily swap between models as well. When the general model comes out, but it has a slightly different schema, they don't have to worry about re-architecting that. For the central inference team, they get the controls that they want without having to put limitations on their decentralized use case team. They can ensure that consistent access control and governance procedures, and they don't have to worry about things like runaway spend. It's a pretty nice no-brainer. All of this to say as well that these model gateways are really easy to implement. Genuinely less than half a day of work, and they're easy to use, and very lightweight, and cheap to host as well. My third takeaway is that a model gateway is the right tool to ensure that we can actually control this usage in a central way while empowering these teams.

Takeaways

My three takeaways. Teams need the freedom to pick the right tools for their use case. If they don't, they will just build worse applications. Centralization at inference time is completely essential to ensure we have that governance and we can optimize for costs as well. An AI model gateway is a really nice tool to do that, and a bunch of open-source ways to try it out, LiteLLM, Doubleword, Bifrost. I'll also finish up by saying, if you want to read the blog that this talk was inspired by, check it out here, https://fergusfinn.com/blog/control-layer/. It's very good, not too long.

Questions and Answers

Participant 1: I was thinking this automatically would extend to something like an MCP gateway as well, or would you think that that would be like a separate gateway? Especially if it's like more scalable and you have things involved and it's not like a single request, but multiple requests from agents, and tools, and stuff like that.

Meryem Arik: This model gateway, what I've talked about here is quite a simplistic gateway where it's just essentially taking the LLM calls. It's a very simplistic thing. Actually, as the use case is getting more and more complex, there's more involved at inference time. It's not just MCP. There are other things that your model gateway should be the home of. One of them is agents, an agentic gateway. Here I've talked about these requests being like simple LLM calls, but that's actually not going to be the case. It's already starting to not be the case, where instead of calling an LLM, I might actually want to call a whole agent and say, can you do this whole part of this task? That suffers from the exact same problem. I want this model gateway to also be my agent gateway as well. I also want it to be my MCP gateway as well. I, as a business, don't want to say to my teams, use any MCP server you want, because there's a lot of security risks involved in that. I might want to say, we've pre-approved these MCP servers, go easy and use them, but we still control it in the same way. These model gateways, they're very new fields. I think most of these projects are no more than 12 months old, are already evolving into being agent gateways and MCP server gateways as well. I'm sure in 12 months, there'll be something else as well. It's a really nice linchpin point.

Participant 2: Building on that a little bit, do we have to accept that for some of the slightly more abstracted endpoints, we can't really address it, for instance, Notepad which is integrated within your Office suite. Whereas that can just talk, they just talk with your model, whether it's Copilot, or ChatGPT, or Gemini. Would you realistically be able to achieve a gateway from Copilot where you could still understand what the traffic I'm routing to it. Could you understand how it's arriving at cost, given that it's integrated into the platform?

Meryem Arik: I think the answer is, when it's natively embedded, it's very difficult to. It's up to what Microsoft wants to show you. That might be a reason to work with providers that do allow you to get that access control. On the flip side, Microsoft Office does give you other kinds of controls and other ways of due governance. Maybe that's a tradeoff that you're fine with. When I talked about gateway options, there are plenty more gateway options, but most of the other gateway options are vendor-specific. For example, Databricks has one, and Microsoft has one. I specifically didn't recommend it because I almost philosophically believe that this should be independent from the underlying infrastructure. I don't want to build my gateway attached to Databricks, because one day I might want to swap this out from Databricks and move to Snowflake or move to somewhere else. Having this be a layer that you can have as an independent layer is a really good idea, in my personal opinion.

At the moment, it's been very declarative. We say, I want access to this particular model. That doesn't necessarily feel like it should always be the right thing to do. For example, when you use ChatGPT, ChatGPT is constantly changing the models in the background, so does Claude, and so do all of them. They just essentially make the judgment of, this is the model that will answer your question and answer your queries. You could see a situation where your central team is almost just deciding the best thinking model for the task or the best data labeling model for the task.

The reason why I am reluctant to advertise that capability now, although some people do and some providers do, is because I don't think your model gateway should be smart. I think your model gateway should actually do as little work as possible because of that latency hit. Also, because I think use case teams need the freedom to be able to really design the experience that they need to design for their users. There are two schools of thought. There are some people, like Not Diamond, I think, is a company that does this, that does really smart routers and routes to the right model, and you don't know where it's going to. I am actually of the school of thought that the model gateway should be smarter and the use case teams have a bit more freedom there.

Participant 4: A question about security. Would these four gateways that you recommended be a good fit for a very enterprise-y environment where we would like to federate authentication, authorization to systems like Entra ID and things like that?

Meryem Arik: Yes. Our product is not building gateways. We actually built this for our enterprise clients and so it's been in production with them for a while, just incidentally. Most of these providers have SSO integrations. We certainly have SSO integrations, and they definitely are what I would call enterprise-ready, especially if you're using one of the self-hosted ones where you can deploy it in your own environment and then they'll connect to your SSO.

Participant 5: You spoke a lot about the internal use of AI gateways.

Meryem Arik: In this scenario are you talking about externally facing applications or just like the model itself is externally facing?

Participant 5: Yes, so our customers will be interacting with our AI gateway, not with their own agentic systems or their own use of our platforms that are connected to a public facing. The extension of connecting them as Teams or Google Agentspace, for that matter, where we have a trusted public endpoint that serves our published MCP servers.

Meryem Arik: That makes sense. You certainly already have a gateway because that's how you do key distribution and stuff. Are there different considerations? If you take the school of thought, which I do, which is your model gateway should be stupid and robust. Not really. Any specific to the fact that it's externally facing and internally facing level logic. The model gateway itself, I think, is fairly similar as a concept. Because as a centralized inference team, you could almost think of your job as serving externalized use case teams.

Participant 5: Then comparing these modern AI gateways to traditional API gateways and extending or taking something like Apigee and forming it to do an AI gateway?

Meryem Arik: We've seen some customers try and do this. Typically, they decide to not. The reason being is there's just a lot of convenience features that you get by working with an AI model gateway specifically. Stuff like, for example, prompt management is something you might want your model gateway to do. Or, as new models are released, you want native support for slightly different APIs and stuff. We've seen a couple, especially enterprise customers, who try and use their API model gateway, who then move over to an AI model gateway. Especially if, as we were talking about earlier, you might see these gateways also being your MCP server gateway, maybe your vector database gateway, maybe your agentic gateway as well. It makes sense for that to be a solution native to the AI regime. That's what we've seen. That there's sufficient difference for it to make sense for it to be its own thing.

Participant 6: At least my take, the general version of that would be, when you say enterprise ready, who within the enterprise should be the one that actually configures it? Because when it comes to access control, the way you're presenting it is, this is the way. Typically, that configuration is within whatever provides the authentication for the actual employees, or the endpoint. Can they integrate with Entra at that level? Can they configure it inside Microsoft and have it automatically be reflected in this platform? Or do we have to go separately into this platform, rent separate access tokens to the individual team? Because now there's a question of, if there's a team that has to be spun up specifically to address this, how do they interact with the actual team that does the actual authentication? How does this actually play out in the real world?

Meryem Arik: The way that it plays out typically is that the person that owns this platform is your platform team. A lot of enterprises are building AI platform teams that sit within the CTO office. The way that it interacts with your IDP providers is, you know how I talked about using groups? All of this should be imported from, for example, your Entra. I can, but I ideally don't want to be creating custom groups that don't match up with my IDP. I want to import engineering from that, and that should all live there and be imported in, in the spirit of this doing as little work as possible. The things that this might do is, say, I might map those groups into what permissions they have in here. All of the authentication and groups and stuff should live in Entra, wherever else.

Participant 7: There's a point you made about the model quality is something that teams could evaluate in terms of relative values. When you say model quality in terms of its output, how much of that is up to the model? How much of that is in the prompt?

Meryem Arik: I think you should think about quality in terms of prompt use case model triplets. If I am trying to, for example, get my model to write Shakespeare, if in my prompt, I ask it to write in the style of Charles Dickens, it's obviously not going to write in the style of Shakespeare. You shouldn't think of the model as being individually good at a thing. It's almost like a racehorse and its rider. You have to think of them in terms of pairs and triplets. I should never write off a model as being just bad for my use case. I should write off the model as being, with this prompt, it was really bad for my use case. The reason why use cases are very difficult to build centrally is because I need domain specific knowledge to actually iterate on that prompt.

Participant 8: What tools do they typically have for governance? Is this something where the model gateway is a central place and you can plug it into different things, or does it actually provide tools and integrations to do governance, to do guardrails?

Meryem Arik: The answer to this will change depending on how much work you think your model gateway should do. Which provider did I mention? Portkey, I think, has integrated guardrails. We, for example, don't. The reason why is because we think things like guardrails and governance is actually very application specific and use case specific and is better outside the regime of a model gateway. You can have it built in or not have it built in. Most of these providers are OpenTel native, and so will plug straight into things like Datadog or whatever else you're using, for your observability and monitoring.

Participant 9: When you mentioned keeping it lightweight and keeping it dumb, what are some of the patterns that you use to keep it that way? Because we want to be able to scale these models as well. Any time we involve any sort of state, it becomes really difficult to scale it. Maybe offloading state is one hard process. You mentioned a few others too. What are some of the patterns architecturally that you use to keep it dumb?

Meryem Arik: Keep it really dumb? We try to not keep it stateful if we can. We really built ours to focus on being very high performance. The reason we built our own and didn't use LiteLLM is the latency was super bad with it. For example, we wrote it from the ground up in Rust, which helped us keep it very efficient as well. We don't do things like forcing our customers to go through guardrails. Things like that adds a lot of latency as well. When we thought about what needs to go in a model gateway, we just tried to think about what is definitely replicable in every single company and every single application. Then everything else can be built around that using those solid foundations.

Participant 10: What is the latency number that you can add up to keep?

Meryem Arik: It should be essentially zero. Because in theory, it should be a super lightweight piece of tech. We did a bunch of benchmarking here somewhere. Here's benchmarking that we did versus Bifrost and LiteLLM. There are pretty significant differences. In terms of requests per second, the hardware setup was about 200. Someone like LiteLLM could do like 60, where they were adding a lot of overhead. I think they've now improved this. This benchmarking is about a month old. The ideal is that this is as close to the theoretical as possible. Our belief was that this model gateway should essentially be invisible. That's why we built our own and didn't use others. It should be very lightweight. Even something like LiteLLM, we say it's slow, but it was only slow for us because we were doing it at really large scale. If you're not doing it at huge scale, we know a bunch of people who use things like LiteLLM and love it.

Participant 11: When you were talking about how the AI gateway should also be used for the agents and the server gateway. Realistically, a company may have, let's just say a few hundred models, you could be already working in, that they can allow. They could realistically have thousands if not tens of thousands of agents. How would this platform manage something like that? I can see that working when you have thousands of models. Even at some degree, I can see it working. If you're talking about granting one of 20,000 agents to a specific gateway, that just seems really messy in reality.

Meryem Arik: Isn't it more messy to not have it?

Participant 11: In an API gateway, it's different. It's the team that would be creating the API and then they would have access to that API. If you have one centralized team that has access to thousands of things, that they provision to other teams in reverse of that model.

Meryem Arik: In this case, the use case teams are building the APIs. If I'm a use case team and I'm building an agentic workload, I then go and register it to this situation, my gateway, but I, as a use case team, are building it. What we don't expect is for central teams to build all the agents and be responsible for everything that provisions downstream teams. What typically we see is a request-based system. A team saying, I really want to use the new Qwen model. Can you register it via Bedrock or something? They do that, and then you get your API key.

See more presentations with transcripts

Recorded at:

May 20, 2026

Meryem Arik

InfoQ Software Architects' Newsletter

The AI Gateway: Scaling Centralized Inference Across Decentralized Teams

Summary

Bio

About the conference

Transcript

Outline

Inference Demands

Centralized Inference Capabilities

AI Model Gateways

Takeaways

Questions and Answers

Related Sponsors

This content is in the AI, ML & Data Engineering topic

Related Topics:

Related Editorial

Popular across InfoQ