Kubernetes Content on InfoQ
-
NVIDIA Dynamo Addresses Multi-Node LLM Inference Challenges
Serving Large Language Models (LLMs) at scale is complex. Modern LLMs now exceed the memory and compute capacity of a single GPU or even a single multi-GPU node. As a result, inference workloads for models with 70B or 120B+ parameters, or for pipelines with large context windows, require multi-node, distributed GPU deployments.
-
How Discord Scaled its ML Platform from Single-GPU Workflows to a Shared Ray Cluster
Discord has detailed how it rebuilt its machine learning platform after hitting the limits of single-GPU training. The changes enabled daily retrains for large models and contributed to a 200% uplift in a key ads ranking metric.
-
Helm Improves Kubernetes Package Management with Biggest Release in 6 Years
Helm, the Kubernetes application package manager, has officially reached version 4.0.0. Helm 4 is the first major upgrade in six years, and also marks Helm's 10th anniversary under the guidance of the Cloud Native Computing Foundation (CNCF). The update aims to address several challenges around scalability, security, and developer workflow.
-
Kubernetes Community Retires Popular Ingress NGINX Controller
Kubernetes SIG Network and the Security Response Committee have announced the retirement of Ingress NGINX, one of the most widely deployed ingress controllers in the ecosystem. Best-effort maintenance will continue until March 2026, after which there will be no further releases, bug fixes, or security updates, according to an announcement made at KubeCon NA 2025.
-
KubeCon NA 2025 - Robert Nishihara on Open Source AI Compute with Kubernetes, Ray, PyTorch, and vLLM
AI workloads are growing more complex in terms of compute and data, and technologies like Kubernetes and PyTorch can help build production-ready AI systems to support them. Robert Nishihara from Anyscale recently spoke at the KubeCon + CloudNativeCon North America 2025 conference about how an AI compute stack comprising Kubernetes, PyTorch, vLLM, and Ray can support these new AI workloads.
-
Airbnb Adds Adaptive Traffic Control to Manage Key Value Store Spikes
Airbnb upgraded Mussel, its multi-tenant key-value store, replacing static per-client rate limits with an adaptive, resource-aware traffic control system. The redesign ensures resilience during traffic spikes, protects critical workflows, and maintains fair usage across thousands of tenants while scaling efficiently.
-
KubeCon NA 2025 - Erica Hughberg and Alexa Griffith on Tools for the Age of GenAI
Generative AI brings new workloads, traffic patterns, and infrastructure demands, and supporting them requires a new set of tools. Erica Hughberg from Tetrate and Alexa Griffith from Bloomberg spoke last week at the KubeCon + CloudNativeCon North America 2025 conference about what it takes to build GenAI platforms capable of serving model inference at scale.
-
Crossplane Reaches Production Maturity by Graduating CNCF
The Cloud Native Computing Foundation (CNCF) has graduated Crossplane, marking a major milestone for the open-source project that turns Kubernetes into a universal control plane for cloud infrastructure. For practitioners, it signals that Crossplane is no longer an experimental idea but a production-hardened foundation for building internal platforms.
-
KubeCon NA 2025 - Salesforce’s Approach to Self-Healing Using AIOps and Agentic AI
AIOps and agentic AI technologies can help build solutions that intelligently analyze Kubernetes cluster health, automatically diagnose problems, and orchestrate issue resolution with minimal human intervention. Vikram Venkataraman and Srikanth Rajan spoke at the KubeCon + CloudNativeCon NA 2025 conference about Salesforce’s approach to self-healing systems using AIOps and AI agents.
-
CNCF Highlights How vCluster Eases Kubernetes Multi-Tenancy Challenges
The Cloud Native Computing Foundation (CNCF) published a blog post discussing how vCluster, an open-source project by Loft Labs, addresses key multi-tenancy obstacles in Kubernetes clusters by enabling "virtual clusters" within a single host cluster.
-
Groupe SNCF Modernizes Infrastructure with Talos OS and Kubernetes
Groupe SNCF, a major railway operator, has successfully migrated from traditional VM-based Kubernetes deployments to a cloud-native platform built on Talos OS and OpenStack, addressing significant operational challenges while navigating complex organizational change. After his talk at TalosCon 2025, InfoQ interviewed Thomas Comtet, senior staff engineer, about this migration.
-
Airbnb’s Mussel V2: Next-Gen Key Value Storage to Unify Streaming and Bulk Ingestion
Airbnb’s engineering team re-architected its internal key-value storage system, Mussel, to unify streaming and bulk ingestion while simplifying operations. The new version achieves over 100,000 writes per second and sub-25ms read latencies on 100-terabyte tables, and it uses Kubernetes, Kafka, and a NewSQL backend to improve scalability, reliability, and operational efficiency across Airbnb's internal services.
-
Mirantis' Kubernetes Management Platform k0rdent Reaches v1.2.0
Mirantis has announced the release of version 1.2.0 of k0rdent, its open-source distributed container management platform. The company pitches k0rdent as a "super control plane" for platform engineers who manage Kubernetes infrastructure across multiple environments.
-
Talos Linux: Bringing Immutability and Security to Kubernetes Operations
Sidero Labs has been developing Talos Linux, an immutable operating system purpose-built for running Kubernetes, alongside Omni, a cluster lifecycle management platform. InfoQ met the Sidero team in Amsterdam during TalosCon 2025 and discussed their approach to simplifying Kubernetes operations through minimalism and security-first design.
-
Azure Container Storage v2.0.0 Goes GA with Major Performance Boost
Microsoft has released Azure Container Storage v2.0.0, introducing significant performance enhancements and architectural simplifications for stateful workloads on Azure Kubernetes Service (AKS). The release focuses on deeper NVMe integration, streamlined user experience, and expanded open-source availability, while removing all service fees beyond underlying storage costs.