Netflix Tech Blog·June 22, 2026

Netflix's Migration to Kueue for Kubernetes-Native Batch Compute

Netflix migrated its batch compute queuing and scheduling logic from its homegrown Compute Managed Batch (CMB) solution to Kueue, a cloud-native job queuing system for Kubernetes. The move was intended to leverage the evolving Kubernetes ecosystem, gain features such as preemption and multi-tenant quota management, and simplify Netflix's compute infrastructure. The migration prioritized a transparent user experience and unchanged throughput; it ultimately increased resource utilization and is informing future Kubernetes-native infrastructure efforts.

Read original on Netflix Tech Blog

Transitioning Batch Compute to Kubernetes-Native

Netflix's journey towards a more Kubernetes-native compute infrastructure led it to replace its custom batch queuing and scheduling system, Compute Managed Batch (CMB), with Kueue. CMB was a managed batch solution built in 2018 that ran on Titus, Netflix's container platform, and provided a tenant hierarchy, ordered execution with priorities, and per-tenant capacity management. The decision to migrate was driven by the maturation of open-source Kubernetes batch offerings, which provided features that CMB lacked or that were cumbersome to implement in it, such as preemption and advanced capacity management. This highlights a common architectural trade-off: building custom solutions versus adopting robust open-source alternatives as an ecosystem matures, especially when the custom solution becomes a bottleneck for new feature development.

Understanding Compute Managed Batch (CMB) Architecture

CMB featured a hierarchical tenant structure that let organizations group jobs and manage capacity. It supported two types of tenants: Internal Tenants, used for hierarchical grouping, and Leaf Tenants, to which jobs were actually submitted. Capacity came in two forms: Reserved Capacity, which guaranteed resources to specific tenants, and Shared Capacity, a global pool that tenants could burst into. A key limitation of CMB was its lack of preemption: once a job was admitted, it ran to completion, which could lead to suboptimal fair sharing and resource utilization. Titus additionally provided workload federation and federated capacity reservations across multiple Kubernetes cells/clusters, abstracting the underlying infrastructure from CMB.
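The reserved-versus-shared model can be illustrated with a minimal Python sketch. It is not CMB's implementation: the class names, the `try_admit` function, the tenant names, and the abstract capacity units are all hypothetical, and it assumes a tenant's reservation is used before the shared pool. It shows only the behavior described above, where an admitted job holds its capacity and a job that does not fit simply waits.

```python
from dataclasses import dataclass


@dataclass
class LeafTenant:
    """Hypothetical leaf tenant with a guaranteed reservation."""
    name: str
    reserved: int           # guaranteed capacity units (illustrative unit)
    reserved_used: int = 0
    shared_used: int = 0


@dataclass
class SharedPool:
    """Hypothetical global pool that any tenant may burst into."""
    capacity: int
    used: int = 0


def try_admit(tenant: LeafTenant, pool: SharedPool, demand: int) -> bool:
    """Admit a job if reserved headroom plus shared headroom covers it.

    There is no preemption: if the job does not fit, it stays queued and
    nothing that is already running is evicted.
    """
    reserved_free = tenant.reserved - tenant.reserved_used
    shared_free = pool.capacity - pool.used
    if demand > reserved_free + shared_free:
        return False
    # Assumption: consume the tenant's own reservation before bursting.
    from_reserved = min(demand, reserved_free)
    from_shared = demand - from_reserved
    tenant.reserved_used += from_reserved
    tenant.shared_used += from_shared
    pool.used += from_shared
    return True


encoding = LeafTenant("encoding", reserved=4)
research = LeafTenant("research", reserved=0)
pool = SharedPool(capacity=6)

print(try_admit(research, pool, 6))  # True: research bursts into the whole shared pool
print(try_admit(encoding, pool, 4))  # True: fits inside encoding's reservation
print(try_admit(encoding, pool, 1))  # False: no headroom left, so the job waits
```

In this sketch the last job waits until a running job finishes, even if it is more important than the work occupying the shared pool, which is the gap that preemption addresses.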

Architectural Considerations for Batch Systems

When designing batch compute systems, critical features include hierarchical tenant management for multi-tenancy, priority-based queuing for workload differentiation, and robust capacity management (reserved vs. shared). The ability to preempt lower-priority jobs for higher-priority ones is crucial for maximizing resource utilization and ensuring SLAs for critical workloads. Without preemption, resource deadlocks or starvation can occur, especially in shared environments.
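As a rough illustration of priority-based preemption, the sketch below evicts the lowest-priority running jobs until a higher-priority job fits. This is a simplified toy policy, not Kueue's actual algorithm; the `Job` class, the `admit_with_preemption` function, and the job names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # higher number means more important (assumed convention)
    demand: int    # capacity units requested


def admit_with_preemption(running: list, incoming: Job, capacity: int):
    """Try to admit `incoming`, evicting strictly lower-priority jobs if needed.

    Returns (admitted, victims). `running` is modified only on success.
    """
    free = capacity - sum(job.demand for job in running)
    victims = []
    # Consider the least important running jobs first.
    for job in sorted(running, key=lambda j: j.priority):
        if free >= incoming.demand:
            break
        if job.priority >= incoming.priority:
            break  # never evict equal or higher priority work
        victims.append(job)
        free += job.demand
    if free < incoming.demand:
        return False, []  # still does not fit, so evict nothing
    for victim in victims:
        running.remove(victim)
    running.append(incoming)
    return True, victims


running = [Job("batch-report", priority=1, demand=6),
           Job("nightly-etl", priority=5, demand=4)]

admitted, victims = admit_with_preemption(
    running, Job("critical-encode", priority=10, demand=5), capacity=10)
print(admitted, [v.name for v in victims])  # True ['batch-report']

admitted, victims = admit_with_preemption(
    running, Job("adhoc-query", priority=0, demand=3), capacity=10)
print(admitted, [v.name for v in victims])  # False []
```

A real scheduler would also requeue the evicted jobs and weigh fairness across tenants; the sketch leaves both out to keep the core idea visible.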

Why Kueue: A Strategic Architectural Choice

Netflix chose Kueue over alternatives like YuniKorn or Volcano for several key reasons: Kueue integrates seamlessly with the existing `kube-scheduler`, avoiding fragmentation of job placement. It supports multi-tenant quota management over heterogeneous hardware and handles various Kubernetes primitives and higher-level abstractions. Crucially, Kueue provided native features that were difficult to implement in CMB, such as preemption, all-or-nothing scheduling, and topology-aware scheduling. This decision underscores the value of adopting tools that align with an existing infrastructure's core components (like `kube-scheduler`) to minimize disruption and leverage a vibrant open-source ecosystem.
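To make the all-or-nothing idea concrete, the sketch below contrasts it with greedy per-pod admission for two multi-pod jobs competing for the same quota. It is a toy model, not Kueue code: the function names, job names, and quota figures are hypothetical, and each pod is assumed to cost one quota unit.

```python
def admit_partial(jobs: dict, quota: int) -> dict:
    """Greedy per-pod admission: hand out pods round-robin until quota runs out."""
    started = {name: 0 for name in jobs}
    progressed = True
    while quota > 0 and progressed:
        progressed = False
        for name, needed in jobs.items():
            if quota > 0 and started[name] < needed:
                started[name] += 1
                quota -= 1
                progressed = True
    return started


def admit_all_or_nothing(jobs: dict, quota: int) -> dict:
    """Admit a job only if every one of its pods fits in the remaining quota."""
    started = {}
    for name, needed in jobs.items():
        if needed <= quota:
            started[name] = needed
            quota -= needed
        else:
            started[name] = 0  # job waits with no pods started
    return started


# Two jobs that each need all 4 pods running to make progress.
jobs = {"train-a": 4, "train-b": 4}

print(admit_partial(jobs, quota=6))         # {'train-a': 3, 'train-b': 3}
print(admit_all_or_nothing(jobs, quota=6))  # {'train-a': 4, 'train-b': 0}
```

With partial admission both jobs hold three pods and neither can run; with all-or-nothing admission one job runs to completion while the other waits without holding resources.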

Migration Strategy and Outcomes

The migration, dubbed "Netflix Batch," followed strict tenets: zero lift for end users, no regressions in launch rate or throughput, and a complete replacement of CMB's queuing and scheduling logic. Netflix achieved this by maintaining API parity for users and converting CMB's internal tenant structures into Kueue's Cohorts, ClusterQueues, and LocalQueues. One lesson learned was that migrating the most complex use cases first builds confidence and de-risks the project. Post-migration, Netflix observed a significant increase in average resource utilization, demonstrating that Kueue's preemption-based fair sharing can dynamically allocate idle reserved capacity and give critical workloads quicker turnaround times.

  • Improved Fair Sharing: Kueue's preemption allows better utilization of reserved capacity by temporarily lending resources to other tenants when not in use.
  • Enhanced Prioritization: Lower-priority workloads can be preempted for higher-priority ones, preventing starvation and ensuring business-critical jobs run efficiently.
  • Increased Resource Utilization: The dynamic nature of Kueue's scheduling mechanisms led to a significant increase in overall cluster resource utilization.
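The conversion of CMB tenants into Kueue objects mentioned above could look roughly like the following sketch. The summary does not describe Netflix's actual mapping, so this assumes one plausible scheme: each Internal Tenant becomes a Cohort, and each Leaf Tenant becomes a ClusterQueue plus a LocalQueue that points at it. The `Tenant` class, the `to_kueue_objects` function, and the tenant names are hypothetical, and the output is a plain list of (kind, name, parent) tuples rather than real Kueue manifests.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    """Hypothetical CMB tenant; a tenant with no children is a Leaf Tenant."""
    name: str
    children: list = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


def to_kueue_objects(tenant: Tenant, parent=None):
    """Yield (kind, name, parent) tuples for a tenant subtree.

    Assumed mapping, for illustration only:
      Internal Tenant -> Cohort (parent: enclosing Cohort, if any)
      Leaf Tenant     -> ClusterQueue (parent: enclosing Cohort)
                         + LocalQueue (parent: that ClusterQueue)
    """
    if tenant.is_leaf:
        yield ("ClusterQueue", tenant.name, parent)
        yield ("LocalQueue", tenant.name, tenant.name)
    else:
        yield ("Cohort", tenant.name, parent)
        for child in tenant.children:
            yield from to_kueue_objects(child, tenant.name)


tree = Tenant("studio", [
    Tenant("encoding"),
    Tenant("analytics", [Tenant("reports")]),
])

for kind, name, parent in to_kueue_objects(tree):
    print(f"{kind:12} {name:10} parent={parent}")
```

In Kueue, a ClusterQueue can belong to a Cohort so that queues in the same Cohort can borrow unused quota from one another, and a LocalQueue is the namespaced object that users submit to. A real conversion would also need to carry over each tenant's reserved capacity as quota, which this sketch omits.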