This article details how Salesforce, managing over 1,000 Amazon EKS clusters, successfully migrated from the traditional Kubernetes Cluster Autoscaler to Karpenter. The migration addressed significant scaling, resource utilization, and operational challenges inherent in their previous Auto Scaling Group-based architecture, leading to improved performance, cost savings, and enhanced developer experience.
Read original on AWS Architecture Blog
Kubernetes cluster scaling is a critical aspect of managing large-scale containerized applications. Traditionally, this involved managing numerous node groups and intricate auto-scaling configurations. Salesforce, with one of the world's largest Kubernetes deployments, faced substantial challenges with the Kubernetes Cluster Autoscaler, particularly around scaling performance, resource utilization, and operational complexity. These issues motivated a strategic migration to Karpenter, an open-source node provisioning project by AWS designed for more efficient and responsive node management.
Salesforce's previous architecture, reliant on AWS Auto Scaling groups and Cluster Autoscaler, struggled under the weight of thousands of node groups supporting diverse workloads across over 1,000 EKS clusters. Key problems included:
Given the scale, a highly automated, risk-mitigated transition was essential. Salesforce developed in-house tools for orchestration and validation, adhering to key design principles like zero disruption, rollback support, and CI/CD integration. The migration involved automating configuration mapping from legacy Auto Scaling Group definitions to Karpenter's <code class="language-text">EC2NodeClass</code> and <code class="language-text">NodePool</code> configurations.
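Salesforce's in-house mapping tool is not public, so the following is only a minimal sketch of what such a translation step might look like. The input dictionary shape (`name`, `node_role`, `ami_id`, `subnet_ids`, `security_group_ids`, `instance_types`, `root_volume_gib`, `labels`) and the function name `asg_to_karpenter` are hypothetical. The output field names follow the Karpenter v1 `EC2NodeClass` and `NodePool` schemas as commonly documented, and should be checked against the Karpenter version in use. The fixed `/dev/xvda` device name, `gp3` volume type, and on-demand capacity type are assumptions made for illustration.

```python
import json
from typing import Any


def asg_to_karpenter(asg: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Translate a (hypothetical) legacy node group definition into
    EC2NodeClass and NodePool manifests expressed as plain dictionaries."""
    name = asg["name"]

    # EC2NodeClass carries the AWS-specific settings: IAM role, AMI,
    # networking, and storage.
    node_class = {
        "apiVersion": "karpenter.k8s.aws/v1",
        "kind": "EC2NodeClass",
        "metadata": {"name": name},
        "spec": {
            "role": asg["node_role"],
            "amiSelectorTerms": [{"id": asg["ami_id"]}],
            "subnetSelectorTerms": [{"id": s} for s in asg["subnet_ids"]],
            "securityGroupSelectorTerms": [
                {"id": g} for g in asg["security_group_ids"]
            ],
            "blockDeviceMappings": [
                {
                    # Assumed root device name; it depends on the AMI.
                    "deviceName": "/dev/xvda",
                    "ebs": {
                        "volumeSize": f"{asg['root_volume_gib']}Gi",
                        "volumeType": "gp3",  # assumed volume type
                    },
                }
            ],
        },
    }

    # NodePool carries the scheduling constraints and points at the node class.
    node_pool = {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {"labels": dict(asg.get("labels", {}))},
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": name,
                    },
                    "requirements": [
                        {
                            "key": "node.kubernetes.io/instance-type",
                            "operator": "In",
                            "values": sorted(asg["instance_types"]),
                        },
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["on-demand"],  # assumed capacity type
                        },
                    ],
                },
            }
        },
    }
    return node_class, node_pool


if __name__ == "__main__":
    example = {
        "name": "batch-workers",
        "node_role": "example-node-role",
        "ami_id": "ami-0123456789abcdef0",
        "subnet_ids": ["subnet-aaa", "subnet-bbb"],
        "security_group_ids": ["sg-111"],
        "instance_types": ["m5.xlarge", "m5.2xlarge"],
        "root_volume_gib": 100,
        "labels": {"team": "batch"},
    }
    for manifest in asg_to_karpenter(example):
        print(json.dumps(manifest, indent=2))
```

Keeping the translation a pure function from one dictionary to two manifests makes it straightforward to unit-test in CI and to diff generated output before anything is applied to a cluster, which fits the rollback and CI/CD principles described above.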
Key Migration Learnings
During the phased rollout, Salesforce gained critical insights: managing application availability with Pod Disruption Budgets (PDBs), optimizing node maintenance workflows (e.g., sequential cordoning), understanding Kubernetes label constraints (63-character limit), protecting single-instance applications from aggressive consolidation, and precisely mapping storage requirements for I/O-intensive workloads.
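The 63-character label constraint is easy to check mechanically before a generated configuration reaches a cluster. The sketch below validates a label value against the standard Kubernetes rules (at most 63 characters, empty or starting and ending with an alphanumeric character, with dashes, underscores, and dots allowed in between). The helper `shorten_label_value`, which truncates and appends a short hash, is a hypothetical strategy for illustration and not something the source attributes to Salesforce.

```python
import hashlib
import re

MAX_LABEL_VALUE_LEN = 63

# Kubernetes label value syntax: empty, or alphanumeric at both ends with
# '-', '_' and '.' permitted in the middle.
_LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Return True if value is acceptable as a Kubernetes label value."""
    return (
        len(value) <= MAX_LABEL_VALUE_LEN
        and _LABEL_VALUE_RE.fullmatch(value) is not None
    )


def shorten_label_value(value: str) -> str:
    """Hypothetical helper: fit an over-long value into 63 characters by
    truncating it and appending an 8-character hash of the full value.

    Assumes the input is otherwise well-formed (valid characters and an
    alphanumeric first character); it only fixes the length.
    """
    if len(value) <= MAX_LABEL_VALUE_LEN:
        return value
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    # Reserve room for '-' plus the digest, and avoid ending the head
    # on a separator character.
    head = value[: MAX_LABEL_VALUE_LEN - len(digest) - 1].rstrip("-_.")
    return f"{head}-{digest}"


if __name__ == "__main__":
    long_name = "payments-" + "very-long-node-group-name-" * 3
    print(is_valid_label_value(long_name))
    print(shorten_label_value(long_name))
```

Appending a hash of the full value keeps shortened labels distinct when two long names share the same prefix, at the cost of making the label less readable.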
The transition to Karpenter delivered significant value across multiple dimensions: