
KubeCon NA 2025 - Erica Hughberg and Alexa Griffith on Tools for the Age of GenAI


Generative AI brings new workloads, traffic patterns, and infrastructure demands, and with them the need for a new set of tools. Erica Hughberg from Tetrate and Alexa Griffith from Bloomberg spoke last week at the KubeCon + CloudNativeCon North America 2025 conference about what it takes to build GenAI platforms capable of serving model inference at scale.

The new requirements for GenAI-based applications include dynamic, model-based routing; token-level rate limiting; secure, centralized credential management; and observability, resilience, and failover for AI. Existing tools fall short for these use cases because they lack AI-native logic and offer only simple, request-based rate limiting and routing. Kubernetes and tools such as KServe, vLLM, Envoy, and llm-d can be used to implement these requirements, while frameworks such as OpenTelemetry, Prometheus, and Grafana cover monitoring and observability of AI applications.

The speakers discussed their AI application architecture, built on open-source projects such as Envoy AI Gateway and KServe. Envoy AI Gateway manages traffic at the edge, providing unified access for application clients to GenAI services such as the Inference Service or the Model Context Protocol (MCP) Server. Its design follows a two-tier gateway pattern: the Tier One Gateway, referred to as the AI Gateway, functions as a centralized entry point responsible for authentication, top-level routing, a unified LLM API, and token-based rate limiting. It can also act as an MCP proxy.

The Tier Two Gateway, referred to as the Inference Gateway, manages ingress traffic to the AI models hosted on a Kubernetes cluster and handles fine-grained access control to the models. Envoy AI Gateway supports various AI providers, including OpenAI, Azure OpenAI, Google Gemini, Vertex AI, AWS Bedrock, and Anthropic.
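A defining behavior of an AI gateway is routing on the model field inside the request body, rather than on the URL path alone. The following Python sketch shows the idea behind that model-based routing; the routing table, backend URLs, and model names are hypothetical and stand in for configuration that a real gateway would hold.

```python
import json

# Hypothetical routing table: model name -> upstream backend.
# In a real AI gateway this mapping lives in gateway configuration;
# it is inlined here purely for illustration.
ROUTES = {
    "gpt-4o": "https://api.openai.com/v1",
    "claude-3-5-sonnet": "https://api.anthropic.com/v1",
    "llama-3-8b": "http://kserve-llama.default.svc.cluster.local/v1",
}

def route(request_body: str) -> str:
    """Pick an upstream by inspecting the 'model' field of an
    OpenAI-style chat-completions request body."""
    model = json.loads(request_body).get("model", "")
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"no backend configured for model {model!r}")

body = json.dumps({"model": "llama-3-8b",
                   "messages": [{"role": "user", "content": "hi"}]})
print(route(body))  # selects the self-hosted, in-cluster backend
```

Because clients speak one unified LLM API, swapping a hosted provider for a self-hosted KServe endpoint is a routing change, not a client change.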

KServe is the open-source standard for self-hosted models, providing a unified platform for generative and predictive AI inference on Kubernetes. As a single, declarative API for models, it can provide a stable, internal endpoint for each model that the Envoy AI Gateway can route traffic to. It has recently been retooled to support generative AI capabilities, including multi-framework LLM support, OpenAI-compatible APIs, LLM model caching, KV cache offloading, multi-node inference, metric-based autoscaling, and native support for Hugging Face models with streamlined deployment workflows.

KServe provides a Kubernetes custom resource definition (CRD), built on llm-d, a Kubernetes-native LLM inference framework, for serving models from different frameworks such as PyTorch, TensorFlow, ONNX, or Hugging Face. The CRD's Kubernetes YAML configuration declares the kind InferenceService and allows specifying the model metadata and the Gateway API route for external access.
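As a sketch, a minimal InferenceService manifest for serving a Hugging Face model with KServe might look like the following; the resource name, model ID, and GPU allocation are illustrative, not taken from the talk.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b            # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface     # serve a Hugging Face model
      args:
        - --model_name=llama3
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
      resources:
        limits:
          nvidia.com/gpu: "1"
```

Applying such a manifest gives the model a stable in-cluster endpoint that a gateway tier, such as Envoy AI Gateway, can route traffic to.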

Hughberg and Griffith concluded the presentation by reiterating that GenAI brings stateful, resource-intensive, token-based workloads that need AI-native capabilities such as dynamic, model-based routing and token-level rate limiting with cost control. CNCF tools like Kubernetes, Envoy AI Gateway, and KServe can help with developing GenAI-based applications.
