Netflix successfully migrated its custom batch compute queuing and scheduling logic from its homegrown Compute Managed Batch (CMB) solution to Kueue, a cloud-native job queuing system for Kubernetes. This strategic move aimed to leverage the evolving Kubernetes ecosystem, gaining features like preemption and multi-tenant quota management, and simplifying their compute infrastructure. The migration focused on transparent user experience and maintaining throughput, ultimately leading to increased resource utilization and informing future Kubernetes-native infrastructure efforts.
Read original on Netflix Tech BlogNetflix's journey towards a more Kubernetes-native compute infrastructure led them to replace their custom batch queuing and scheduling system, Compute Managed Batch (CMB), with Kueue. CMB was a managed batch solution built in 2018, providing tenant hierarchy, ordered execution with priorities, and per-tenant capacity management running on Titus, Netflix's container platform. The decision to migrate was driven by the maturation of open-source Kubernetes batch offerings that provided features CMB lacked or found cumbersome to implement, such as preemption and advanced capacity management. This highlights a common architectural trade-off: building custom solutions vs. adopting robust open-source alternatives as an ecosystem matures, especially when the custom solution becomes a bottleneck for new feature development.
CMB featured a hierarchical tenant structure, allowing organizations to group jobs and manage capacity. It supported two types of tenants: Internal Tenants for hierarchical grouping and Leaf Tenants for actual job submission. Capacity was managed through two types: Reserved Capacity, which provided guaranteed resources for specific tenants, and Shared Capacity, a global pool that tenants could burst into. A key limitation of CMB was its lack of preemption; once a job was admitted, it ran to completion, which could lead to suboptimal fair sharing and resource utilization. Titus further provided workload federation and federated capacity reservations across multiple Kubernetes cells/clusters, abstracting the underlying infrastructure from CMB.
Architectural Considerations for Batch Systems
When designing batch compute systems, critical features include hierarchical tenant management for multi-tenancy, priority-based queuing for workload differentiation, and robust capacity management (reserved vs. shared). The ability to preempt lower-priority jobs for higher-priority ones is crucial for maximizing resource utilization and ensuring SLAs for critical workloads. Without preemption, resource deadlocks or starvation can occur, especially in shared environments.
Netflix chose Kueue over alternatives like YuniKorn or Volcano for several key reasons: Kueue integrates seamlessly with the existing `kube-scheduler`, avoiding fragmentation of job placement. It supports multi-tenant quota management over heterogeneous hardware and handles various Kubernetes primitives and higher-level abstractions. Crucially, Kueue provided native features that were difficult to implement in CMB, such as preemption, all-or-nothing scheduling, and topology-aware scheduling. This decision underscores the value of adopting tools that align with an existing infrastructure's core components (like `kube-scheduler`) to minimize disruption and leverage a vibrant open-source ecosystem.
The migration, dubbed "Netflix Batch," was executed with strict tenets: zero-lift for end-users, no regressions in launch rate or throughput, and a complete replacement of CMB's queuing and scheduling logic. This was achieved by maintaining API parity for users and converting CMB's internal tenant structures into Kueue's Cohorts, ClusterQueues, and LocalQueues. Lessons learned included the benefit of migrating the most complex use cases first to build confidence and derisk the project. Post-migration, Netflix observed a significant increase in average resource utilization, demonstrating the effectiveness of Kueue's preemption-based fair sharing in dynamically allocating idle reserved capacity and ensuring quicker turnaround times for critical workloads.