NVIDIA Dynamo Planner Brings SLO-Driven Automation to Multi-Node LLM Inference

Microsoft and NVIDIA have released Part 2 of their collaboration on running NVIDIA Dynamo for large language model inference on Azure Kubernetes Service (AKS). While the first announcement targeted raw throughput of 1.2 million tokens per second on distributed GPU systems, this latest release focuses on developer velocity and operational efficiency through automated resource planning and dynamic scaling.

The new capabilities center on two integrated components: the Dynamo Planner Profiler and the SLO-based Dynamo Planner. Together they address the "rate matching" challenge in disaggregated serving, where inference workloads are split so that prefill operations, which process the input context, and decode operations, which generate output tokens, run on separate GPU pools. Without tooling, teams spend significant time determining the optimal GPU allocation for each phase.
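Rate matching can be illustrated with a back-of-the-envelope calculation: if the prefill pool cannot ingest prompts as fast as requests arrive, time to first token grows, while an oversized prefill pool leaves decode GPUs idle. The Python sketch below uses entirely made-up throughput numbers to show the kind of arithmetic involved; it is not derived from Dynamo's published benchmarks.

```python
# Back-of-the-envelope "rate matching": how many prefill workers are needed
# alongside decode workers so neither pool starves the other. All numbers
# below are illustrative assumptions, not measured Dynamo figures.

requests_per_second = 50          # incoming request rate (assumed)
prompt_tokens_per_request = 2000  # average input context length (assumed)
output_tokens_per_request = 150   # average generated length (assumed)

prefill_tokens_per_worker = 40_000  # tokens/s one prefill worker can ingest (assumed)
decode_tokens_per_worker = 6_000    # tokens/s one decode worker can emit (assumed)

prefill_demand = requests_per_second * prompt_tokens_per_request  # tokens/s to prefill
decode_demand = requests_per_second * output_tokens_per_request   # tokens/s to decode

prefill_workers = -(-prefill_demand // prefill_tokens_per_worker)  # ceiling division
decode_workers = -(-decode_demand // decode_tokens_per_worker)

print(f"prefill demand: {prefill_demand} tok/s -> {prefill_workers} workers")
print(f"decode demand:  {decode_demand} tok/s -> {decode_workers} workers")
```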

The Dynamo Planner Profiler is a pre-deployment simulation tool that automates the search for optimal configurations. Rather than manually testing parallelization strategies and GPU counts, which can consume hours of GPU time, developers define their requirements in a DynamoGraphDeploymentRequest (DGDR) manifest. The profiler then sweeps the configuration space, testing different tensor-parallelism sizes for both the prefill and decode stages to find settings that maximize throughput while staying within latency limits.
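The DGDR schema itself is not reproduced in the announcement, so the sketch below is a hypothetical illustration of the kind of intent a developer might express: the model, the latency targets, and a GPU budget. The field names and API version are assumptions, not the actual DynamoGraphDeploymentRequest format; consult the Dynamo documentation for the real manifest.

```python
# Hypothetical DGDR-style request expressed as a Python dict and rendered as
# YAML. Field names and apiVersion are illustrative assumptions only.
import yaml  # requires pyyaml

dgdr_request = {
    "apiVersion": "nvidia.com/v1alpha1",          # assumed API group/version
    "kind": "DynamoGraphDeploymentRequest",
    "metadata": {"name": "qwen3-32b-assistant"},
    "spec": {
        "model": "Qwen3-32B-FP8",                 # model named in the announcement
        "slo": {
            "ttftMs": 500,                        # Time to First Token budget
            "itlMs": 30,                          # Inter-Token Latency budget
        },
        "gpuBudget": 16,                          # assumed upper bound on GPUs to sweep
    },
}

print(yaml.safe_dump(dgdr_request, sort_keys=False))
```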

The profiler includes an AI Configurator mode that simulates performance in roughly 20 to 30 seconds using pre-measured performance data, letting teams iterate on configurations before allocating physical GPU resources. The output is a tuned setup that maximizes what the teams call "goodput": the highest throughput achievable while staying within the configured limits for Time to First Token and Inter-Token Latency.
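Conceptually, goodput selection is a filter-then-maximize step over the sweep results: discard configurations whose simulated TTFT or ITL exceeds the budget, then keep the remaining configuration with the highest throughput. The minimal Python sketch below uses invented candidate numbers to make that selection concrete; it does not reflect the profiler's actual data model.

```python
# Minimal sketch of goodput selection: keep only configurations that fit the
# SLO, then pick the highest-throughput survivor. Candidate data is invented.
from dataclasses import dataclass

@dataclass
class Candidate:
    prefill_tp: int      # tensor-parallel size for prefill workers
    decode_tp: int       # tensor-parallel size for decode workers
    ttft_ms: float       # simulated Time to First Token
    itl_ms: float        # simulated Inter-Token Latency
    tokens_per_s: float  # simulated throughput

def best_goodput(candidates, ttft_budget_ms, itl_budget_ms):
    within_slo = [c for c in candidates
                  if c.ttft_ms <= ttft_budget_ms and c.itl_ms <= itl_budget_ms]
    return max(within_slo, key=lambda c: c.tokens_per_s, default=None)

sweep = [
    Candidate(2, 2, ttft_ms=420, itl_ms=28, tokens_per_s=9_500),
    Candidate(4, 2, ttft_ms=310, itl_ms=27, tokens_per_s=11_200),
    Candidate(4, 4, ttft_ms=290, itl_ms=22, tokens_per_s=10_400),
    Candidate(8, 4, ttft_ms=260, itl_ms=35, tokens_per_s=13_000),  # violates ITL budget
]

print(best_goodput(sweep, ttft_budget_ms=500, itl_budget_ms=30))
```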

Once a system is in production, the SLO-based Dynamo Planner takes over as a runtime orchestration engine. Unlike a traditional load balancer, the component is "LLM-aware": it monitors cluster state such as key-value cache load in the decode pool and the depth of the prefill queue, and uses the profiler's performance bounds to scale prefill and decode workers so that service-level objectives continue to be met as traffic patterns change.
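The announcement does not publish the Planner's scaling algorithm, but the behavior it describes resembles a control loop that reacts to prefill queue depth and decode-side KV-cache pressure within the profiler's bounds. The sketch below is a simplified stand-in for that idea, with arbitrary thresholds; it is not the actual Dynamo Planner logic or API.

```python
# Simplified, hypothetical SLO-driven scaling decision. Thresholds and metric
# names are stand-ins chosen for illustration.

def plan_replicas(prefill_queue_depth: int,
                  kv_cache_utilization: float,
                  current_prefill: int,
                  current_decode: int,
                  max_prefill: int = 4,
                  max_decode: int = 4) -> tuple[int, int]:
    """Return (prefill_workers, decode_workers) for the next interval."""
    prefill, decode = current_prefill, current_decode

    # A deep prefill queue threatens the Time-to-First-Token target: add a
    # prefill worker while staying inside the profiler-derived bounds.
    if prefill_queue_depth > 100 and prefill < max_prefill:
        prefill += 1
    elif prefill_queue_depth < 10 and prefill > 1:
        prefill -= 1

    # High KV-cache pressure in the decode pool threatens Inter-Token Latency.
    if kv_cache_utilization > 0.85 and decode < max_decode:
        decode += 1
    elif kv_cache_utilization < 0.30 and decode > 1:
        decode -= 1

    return prefill, decode

# Example: a burst of long prompts fills the prefill queue while the decode
# pool stays comfortable, so only the prefill side scales out.
print(plan_replicas(prefill_queue_depth=250, kv_cache_utilization=0.55,
                    current_prefill=1, current_decode=1))  # -> (2, 1)
```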

The announcement illustrates these capabilities with a detailed airline assistant scenario in which a Qwen3-32B-FP8 model backs an airline's mobile app under strict service-level agreements: 500 milliseconds for Time to First Token and 30 milliseconds for Inter-Token Latency. During normal operations with short passenger queries, the system runs with one prefill worker and one decode worker. When a weather disruption drives 200 users to send complex rerouting requests, the Planner detects the spike and scales up to two prefill workers while keeping a single decode worker. The teams report that the new worker comes online within minutes, allowing the system to hold its latency targets through the traffic spike.

This release builds on the framework introduced in the original Dynamo announcement, which InfoQ covered in December 2024. In that article, Azure and NVIDIA explained how Dynamo's design splits compute-heavy and memory-bound work across different GPUs, letting teams optimize each phase independently and match resources to workload needs. An e-commerce app's prefill task, for example, may process thousands of tokens, while its decode task generates only short descriptions.

The move from manual setup to automated, SLO-driven resource management reflects a broader shift in how teams operate large language model deployments on Kubernetes. The Planner components translate latency requirements into GPU allocation and scaling decisions, aiming to lower the operational burden of running disaggregated inference architectures. For organizations serving reasoning-heavy or long-context LLMs, this kind of automation makes complex multi-node GPU setups easier to manage and helps keep service-level objectives intact as traffic patterns change.
