The New Stack · March 18, 2026

Kubernetes Infrastructure for AI Workloads

This article discusses the evolving infrastructure requirements for AI workloads on Kubernetes, highlighting the challenges and community efforts to adapt Kubernetes for GPU-intensive and distributed AI tasks. It covers advancements in resource allocation, scheduling, and workload serving that are critical for efficiently running large-scale AI applications in open-source environments.


The proliferation of AI, particularly open-source models, calls for an equally open and adaptable infrastructure. Kubernetes, while not originally designed for AI, has become the de facto platform for running it. However, its core APIs and scheduling mechanisms have required significant enhancements to effectively manage GPU-bound, distributed AI workloads. The article highlights key areas where the community is closing the gap between "possible" and "first-class" support for AI within Kubernetes.

Describing AI Hardware with Dynamic Resource Allocation (DRA)

The original Kubernetes device plugin API, which simply counted available GPUs, proved insufficient for modern AI needs. Advanced scenarios require granular control, such as partitioning shared GPUs, allowing multiple pods to share a single device, or requiring high-speed interconnects across nodes for distributed training. Dynamic Resource Allocation (DRA), reaching GA in Kubernetes 1.34, addresses these limitations by allowing vendors to expose structured device information via ResourceSlices and workloads to declare their specific needs using ResourceClaims. This enables the scheduler to make smarter placement decisions based on attributes, sharing policies, and topology.
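As a concrete sketch of the DRA model, a workload can declare its GPU requirements in a ResourceClaim and reference that claim from a pod. The driver name `gpu.example.com`, the `model` attribute, and the image are hypothetical; the field layout follows the `resource.k8s.io/v1` API that went GA in Kubernetes 1.34, but should be checked against your cluster's version:

```yaml
# Hypothetical sketch: claim one GPU selected by a vendor-published
# attribute. Assumes a DRA driver "gpu.example.com" that advertises
# its devices (and a "model" attribute) via ResourceSlices.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com
        selectors:
        - cel:
            expression: device.attributes["gpu.example.com"].model == "A100"
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: train
    image: registry.example.com/trainer:latest   # illustrative image
    resources:
      claims:
      - name: gpu                                # refers to the entry below
  resourceClaims:
  - name: gpu
    resourceClaimName: training-gpu
```

The scheduler resolves the claim against the vendor's ResourceSlices before placing the pod, which is what enables attribute-, sharing-, and topology-aware decisions rather than simple device counting.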

Advanced Scheduling for AI Workloads

Distributed training and inference jobs often demand gang scheduling, ensuring all pods start simultaneously to prevent deadlocks and optimize resource utilization. Furthermore, understanding the physical topology of a cluster is crucial; placing interdependent pods on nodes sharing a network spine or high-speed interconnect dramatically reduces communication overhead. Tools like the KAI Scheduler (CNCF Sandbox project) and Topograph are emerging to provide DRA-aware gang scheduling, hierarchical queues with fairness policies, and topology-aware placement, pushing these capabilities upstream into the Kubernetes ecosystem.
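To make the gang-scheduling concept concrete, here is an illustrative sketch using the PodGroup CRD from the upstream kubernetes-sigs/scheduler-plugins coscheduling plugin. KAI Scheduler exposes the same idea through its own resources; the API group, label key, and scheduler name below are from scheduler-plugins and may differ across versions:

```yaml
# Illustrative gang-scheduling sketch: no worker starts until all 8 can
# be scheduled, preventing partial-allocation deadlocks.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8                  # the gang: all 8 pods must fit before any runs
  scheduleTimeoutSeconds: 300   # give up and release resources after 5 minutes
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  labels:
    scheduling.x-k8s.io/pod-group: distributed-training   # joins the gang
spec:
  schedulerName: scheduler-plugins-scheduler
  containers:
  - name: worker
    image: registry.example.com/trainer:latest   # illustrative image
```

Without this all-or-nothing semantic, a 64-GPU training job could hold 60 GPUs idle while waiting for the last 4, which is exactly the deadlock and utilization problem gang scheduling exists to avoid.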

Serving AI Workloads and Autoscaling Challenges

Inference workloads represent the majority of production GPU cycles. Traditional Kubernetes Horizontal Pod Autoscalers (HPAs) scale based on CPU and memory, metrics that are often inadequate for LLM inference. LLMs require scaling based on signals like KV-cache utilization, request queue depth, or time-to-first-token to optimize GPU usage and meet latency targets. Initiatives like Inference Gateway extend the Gateway API for model-aware routing, while projects like llm-d and Dynamo explore distributed serving with prefix-cache-aware routing and disaggregated prefill/decode, introducing new demands for intelligent scheduling and autoscaling mechanisms. The challenge lies in building abstractions that integrate these new primitives with higher-level control planes.
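A sketch of what ML-aware autoscaling looks like today with standard primitives: an `autoscaling/v2` HPA scaling an inference Deployment on a per-pod queue-depth metric. The metric name `inference_queue_depth`, the Deployment name, and the target value are hypothetical; such a metric would need to be exported through a custom-metrics adapter (e.g., Prometheus Adapter):

```yaml
# Hypothetical sketch: scale an LLM inference Deployment on request queue
# depth instead of CPU. "inference_queue_depth" is an illustrative metric
# name that must be exposed via a custom metrics adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 2
  maxReplicas: 16
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "8"   # scale out when avg queued requests per pod exceeds 8
```

Even this is a partial answer: queue depth says nothing about KV-cache pressure or prefill/decode imbalance, which is why the projects above are building inference-specific control loops rather than relying on the HPA alone.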

💡 Key Takeaway for System Designers

When designing systems to host AI/ML workloads on Kubernetes, consider not just the compute resources, but also the specialized needs for GPU orchestration, dynamic resource allocation, gang scheduling, and intelligent autoscaling based on ML-specific metrics. Standard Kubernetes features might require augmentation or specialized controllers.

Tags: Kubernetes, AI, MLOps, GPU, Scheduling, Resource Management, Autoscaling, Cloud Native
