The Azure Kubernetes Service (AKS) team shared a detailed guide on using Dynamic Resource Allocation (DRA) with NVIDIA vGPU technology on AKS. This update improves control and efficiency for shared GPU use in AI and media workloads.
Dynamic Resource Allocation (DRA) is now the standard for GPU resource use in Kubernetes. Instead of static resources like nvidia.com/gpu, GPUs are allocated dynamically using DeviceClasses and ResourceClaims. This change enhances scheduling and improves integration with virtualization technologies like NVIDIA vGPU.
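In the DRA model, a workload requests a device through a ResourceClaim that references a DeviceClass, and a pod consumes that claim by name instead of setting a static `nvidia.com/gpu` count. A minimal sketch of that flow (object names and the container image are illustrative, and the `resource.k8s.io` API version may differ on older clusters):

```yaml
# Sketch: a ResourceClaim for one GPU-class device, consumed by a pod.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-vgpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com   # DeviceClass published by the NVIDIA DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: single-vgpu        # binds the pod to the claim above
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu                         # grants this container the claimed device
```

The scheduler, rather than a node-local device plugin, resolves which physical (or virtual) device satisfies the claim, which is what enables the richer placement logic described below.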
The reason for combining these technologies is clear: virtual accelerators like NVIDIA vGPU often handle smaller tasks. They allow one physical GPU to be split among many users or applications. This setup is helpful for enterprise AI/ML development, fine-tuning, and audio/visual processing. vGPU offers predictable performance while still providing CUDA capabilities to containerized workloads.
On the infrastructure side, this feature relies on Azure's NVadsA10_v5 virtual machine series. Instead of assigning a whole GPU to one VM, vGPU technology partitions it into multiple fixed-size slices at the hypervisor layer. From Kubernetes' perspective, each VM presents a single, well-defined GPU device, with capacity and memory limits enforced by the hypervisor rather than in software.
The setup requires Kubernetes 1.34 or newer, where DRA primitives such as DeviceClasses and ResourceSlices are available. Teams provision a node pool with NVadsA10_v5 instances and apply the label nvidia.com/gpu.present=true, which the NVIDIA DRA kubelet plugin uses as its node selector. They then deploy the NVIDIA DRA driver via Helm. The post highlights Helm flags that matter for vGPU scenarios: gpuResourcesEnabledOverride=true skips a check that would otherwise prevent the NVIDIA DRA driver from installing alongside the legacy device plugin due to differing GPU names, and featureGates.IMEXDaemonsWithDNSNames=false disables an IMEX feature that requires a newer GRID driver version than what's supported on the A10 series in Azure.
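A sketch of the driver installation with those overrides applied (the repo alias, release name, and namespace are illustrative; chart value names should be checked against the chart version in use):

```shell
# Add the NVIDIA Helm repo and install the DRA driver with the
# vGPU-specific overrides discussed above (names are illustrative).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver-gpu --create-namespace \
  --set gpuResourcesEnabledOverride=true \
  --set featureGates.IMEXDaemonsWithDNSNames=false
```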
Once the driver is active, it scans each node, detects the single vGPU device exposed by the Azure VM, and registers it with the Kubernetes control plane as a DRA-managed device. Each node registers exactly one allocatable device, because that is what the VM presents. Operators can verify the setup by inspecting the gpu.nvidia.com DeviceClass and the per-node ResourceSlices, confirming that the control plane has discovered the available hardware.
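That verification step can be done from the command line (the DeviceClass name follows the post; output formats vary by Kubernetes and driver version):

```shell
# Confirm the DeviceClass installed by the NVIDIA DRA driver
kubectl get deviceclass gpu.nvidia.com

# List the per-node ResourceSlices advertising the vGPU device;
# each NVadsA10_v5 node should publish exactly one device
kubectl get resourceslices
kubectl describe resourceslices
```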
Beyond the baseline one-sixth slice (Standard_NV6ads_A10_v5), the series offers a one-third profile with 8 GB of accelerator memory and a one-half profile with 12 GB. Limits are enforced at the hypervisor layer, so AKS sees a single GPU device with predictable capacity. This gives platform teams flexibility to size GPU allocation based on workload needs without overprovisioning nodes.
The AKS team frames the broader significance as directional. As GPUs become first-class resources in Kubernetes, combining virtualized GPU with DRA offers a practical way to run shared, production-grade workloads. For large AKS deployments, especially in regulated or cost-sensitive industries, optimal GPU placement and utilization directly impact job throughput and infrastructure efficiency. Using DRA with vGPU helps organizations move from coarse node-level allocation to controlled, workload-driven GPU use at scale.
Google Cloud is pursuing a similar path on GKE, focusing on DRA as a scheduling primitive for both GPUs and TPUs. GKE's DRA support lets workloads use CEL expressions to filter devices with specific attributes. This allows a single manifest to deploy to different clusters with various GPU types without changes. Specifically for vGPU, Google recently previewed fractional G4 VMs using NVIDIA vGPU technology based on the RTX PRO 6000 Blackwell GPU, managed through GKE and combined with container binpacking for higher utilization. When scheduled via Google's Dynamic Workload Scheduler, fallback priorities can improve resource access.
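The portability claim rests on CEL selectors evaluated against device attributes at scheduling time. A sketch of such a selector (the attribute domain and name shown are assumptions and vary by driver; actual names come from the driver's published ResourceSlices):

```yaml
# Sketch: a ResourceClaimTemplate whose CEL selector filters devices
# by an attribute, so the same manifest works across clusters with
# different GPU inventories. Attribute names are illustrative.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: any-a10-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
          selectors:
          - cel:
              expression: device.attributes["gpu.nvidia.com"].productName.startsWith("NVIDIA A10")
```

Because the filter lives in the claim rather than in node labels or instance types, the same workload spec can land on any cluster whose ResourceSlices advertise a matching device.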
Amazon EKS takes a different approach, using DRA mainly to manage the complexity of its high-end GPU hardware rather than for fractional sharing. Amazon EKS made DRA generally available starting in Kubernetes version 1.33. This technology is essential for P6e-GB200 UltraServer instances, where traditional static GPU scheduling cannot model the NVLink and IMEX interconnects needed for multi-node workloads. For teams running smaller workloads that want GPU sharing on EKS, DRA now supports structured, attribute-based requests. This allows schedulers to respond to requests like "a 10 GB MIG partition with at least 1/7th compute" instead of treating GPUs as simple counts. Across all three cloud providers, the shift from static device plugins to DRA is accelerating, driven by the need for more expressive, topology-aware GPU scheduling as AI infrastructure grows in complexity and cost.
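A request like the quoted one could be expressed roughly as follows; this is a sketch only, and both the MIG device class name and the profile attribute are assumptions that must be checked against what the NVIDIA DRA driver actually publishes:

```yaml
# Sketch: an attribute-based claim approximating "a 10 GB MIG
# partition with at least 1/7th compute" (the 1g.10gb profile).
# DeviceClass and attribute names are hypothetical.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: mig-10gb
spec:
  devices:
    requests:
    - name: mig
      exactly:
        deviceClassName: mig.nvidia.com   # MIG device class, if the driver exposes one
        selectors:
        - cel:
            expression: device.attributes["gpu.nvidia.com"].profile == "1g.10gb"
```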