Cloudflare Builds High-Performance Infrastructure for Running LLMs

Cloudflare has recently announced new infrastructure designed to run large AI language models across its global network. Because these models rely on costly hardware and must handle large volumes of incoming and outgoing text, Cloudflare separates the model's input processing and output generation onto differently optimized systems and uses a custom inference engine to manage GPUs more efficiently.

According to the Cloudflare team, one key improvement is splitting model processing into two stages, each handled by a different machine: one stage reads and prepares the input text, while the other generates the output. Michelle Chen, principal product manager at Cloudflare, Kevin Flansburg, senior engineering manager at Cloudflare, and Vlad Krasnov, principal systems engineer at Cloudflare, write:

One hardware configuration that we use to improve performance and efficiency is disaggregated prefill. There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound.
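
To make the quoted split concrete, here is a minimal, illustrative Python sketch of disaggregated prefill. All names are hypothetical: in production each stage runs on a separate, differently optimized machine, and the KV cache consists of per-layer GPU tensors transferred between them, not a Python list.

```python
from dataclasses import dataclass, field

# Toy illustration of disaggregated prefill. Everything here is
# stdlib-only and hypothetical; it only mirrors the two-stage shape
# described in the quote above.

@dataclass
class KVCache:
    # One (key, value) entry per processed token.
    entries: list = field(default_factory=list)

def prefill(prompt_tokens: list[int]) -> KVCache:
    """Compute-bound stage: process all input tokens at once and
    populate the KV cache."""
    cache = KVCache()
    for tok in prompt_tokens:
        cache.entries.append((f"k{tok}", f"v{tok}"))
    return cache

def decode(cache: KVCache, max_new_tokens: int) -> list[int]:
    """Memory-bound stage: generate output tokens one at a time,
    reading the whole KV cache on every step."""
    out = []
    for _ in range(max_new_tokens):
        next_tok = len(cache.entries) % 1000  # stand-in for the model
        out.append(next_tok)
        cache.entries.append((f"k{next_tok}", f"v{next_tok}"))
    return out

# The cache produced on the prefill node is shipped to the decode node.
cache = prefill([101, 2023, 2003, 102])
print(decode(cache, max_new_tokens=4))  # [4, 5, 6, 7]
```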

Cloudflare also created a custom AI inference engine called Infire. Announced during Cloudflare Birthday Week 2025, Infire runs large language models across multiple GPUs more efficiently, reduces memory usage, and starts models more quickly, delivering faster responses.

Large language models such as Kimi K2.5, with over 1 trillion parameters and about 560GB of weights, are too big for a single GPU: at least eight H100s are required just to load the model into memory, before accounting for additional memory used during processing. Explaining why Infire and the hardware optimizations help run huge models more efficiently and deliver faster responses to users, Chen, Flansburg, and Krasnov add:

For pipeline parallelism, Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing. On the other hand, for tensor parallelism, Infire optimizes for reducing cross-GPU communication, making it as fast as possible. For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency.
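
As an illustration of how the two schemes compose, the hedged sketch below assigns contiguous layer groups to pipeline stages and splits each weight matrix column-wise across tensor-parallel ranks. The layer and GPU counts and helper names are hypothetical, not Infire's actual configuration.

```python
# Toy sketch of combining pipeline and tensor parallelism, in the
# spirit of the Infire description above; layer/GPU counts and helper
# names are hypothetical.

N_LAYERS = 16        # transformer layers in the model
PIPELINE_STAGES = 2  # consecutive layer groups, one per stage
TENSOR_RANKS = 4     # each layer's matrices split across 4 GPUs

def assign_layers(n_layers: int, n_stages: int) -> list[list[int]]:
    """Pipeline parallelism: balance layers evenly across stages so
    no stage starves while the others execute."""
    per_stage = n_layers // n_stages
    return [list(range(s * per_stage, (s + 1) * per_stage))
            for s in range(n_stages)]

def shard_matrix(rows: int, cols: int, n_ranks: int) -> list[tuple]:
    """Tensor parallelism: split one weight matrix column-wise across
    ranks; partial results must later be combined across GPUs, which
    is the communication path Infire tries to keep fast."""
    per_rank = cols // n_ranks
    return [(rows, per_rank) for _ in range(n_ranks)]

for stage, layers in enumerate(assign_layers(N_LAYERS, PIPELINE_STAGES)):
    print(f"pipeline stage {stage}: layers {layers[0]}..{layers[-1]}")
print("per-GPU shards of an 8192x8192 matrix:",
      shard_matrix(8192, 8192, TENSOR_RANKS))
```

With two stages of four ranks each, this toy layout spans eight GPUs, matching the eight-H100 deployments discussed above.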

In a previous article, Cloudflare explained how to run open-source models on its AI inference platform, starting with Moonshot AI’s Kimi K2.5 model on Workers AI, and highlighted how the team is using a variety of hardware configurations to best serve models.

[Image: Cloudflare extra-large language models. Source: Cloudflare blog]

Cloudflare also recently introduced Unweight, a system the company claims compresses large language model weights by about 15–22% without losing accuracy. By reducing the amount of data GPUs must load and move during inference, the compression lets models run faster and more efficiently.
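
The post does not detail how Unweight works internally, but the headline numbers are enough to estimate its effect on data movement. The snippet below simply applies the reported 15–22% compression range to Kimi K2.5's roughly 560GB of weights.

```python
# Back-of-the-envelope effect of Unweight's reported 15-22% weight
# compression on Kimi K2.5's roughly 560GB of weights. Both figures
# come from the article; the rest is arithmetic.

weights_gb = 560
for ratio in (0.15, 0.22):
    compressed = weights_gb * (1 - ratio)
    print(f"{ratio:.0%} compression: {compressed:.0f} GB to load and move "
          f"(saves {weights_gb - compressed:.0f} GB)")
```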

According to Cloudflare, the team further optimized Infire to reduce the GPU memory used by internal processes. This allows it to run Llama 4 Scout on just two H200 GPUs with large capacity for context tokens, and Kimi K2.5 on eight H100 GPUs while still leaving memory for the KV cache, configurations that "would have trouble even booting vLLM in the first place."
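
A rough memory budget makes the Kimi K2.5 claim plausible. The sketch below assumes the 80GB H100 variant; since Cloudflare's post does not break out the split between KV cache and runtime overhead, only the total headroom can be estimated.

```python
# Rough memory budget for the Kimi K2.5 configuration described above,
# assuming the 80GB H100 variant. The weight size comes from the
# article; everything else is arithmetic.

GPUS = 8
HBM_PER_GPU_GB = 80
WEIGHTS_GB = 560  # from the article

total_gb = GPUS * HBM_PER_GPU_GB
headroom_gb = total_gb - WEIGHTS_GB
print(f"total HBM: {total_gb} GB, weights: {WEIGHTS_GB} GB, "
      f"left for KV cache and runtime: {headroom_gb} GB")  # 80 GB
```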

Cloudflare is not the only provider highlighting the challenges of running LLMs in production. Cockroach Labs' recent State of AI Infrastructure report states that as companies move AI systems into everyday use, many are finding their current infrastructure is not ready to handle the scale and reliability these workloads require. Cockroach Labs' analysis summarizes:

Legacy infrastructure, built around episodic human interaction, simply wasn’t designed for this kind of pressure. To handle the pace and unpredictability of AI, companies need more than performance upgrades. They need a fundamental shift in how systems are architected.

Cloudflare has separately discussed how it optimized for efficient prompt caching.
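
Generically, prompt caching means reusing the prefill result for a previously seen prompt instead of recomputing it. The minimal sketch below shows that generic technique with hypothetical names, keying an in-memory store on a hash of the prompt; Cloudflare's actual design may differ.

```python
import hashlib

# Minimal sketch of prompt caching in general: reuse the prefill
# result for a previously seen prompt instead of recomputing it.
# Names are hypothetical and the stored value stands in for a real
# KV cache.

_kv_store: dict[str, str] = {}

def get_or_prefill(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _kv_store:
        return _kv_store[key]          # cache hit: prefill is skipped
    kv = f"kv-cache-for:{key[:8]}"     # stand-in for the real prefill
    _kv_store[key] = kv
    return kv

get_or_prefill("You are a helpful assistant.")  # computes prefill
get_or_prefill("You are a helpful assistant.")  # served from cache
```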
