AWS Architecture Blog · January 12, 2026

Salesforce's Large-Scale Kubernetes Migration from Cluster Autoscaler to Karpenter

This article details how Salesforce, managing over 1,000 Amazon EKS clusters, successfully migrated from the traditional Kubernetes Cluster Autoscaler to Karpenter. The migration addressed significant scaling, resource utilization, and operational challenges inherent in their previous Auto Scaling Group-based architecture, leading to improved performance, cost savings, and enhanced developer experience.

Cloud & Infrastructure · Distributed Systems · DevOps & SRE

Kubernetes cluster scaling is a critical aspect of managing large-scale containerized applications. Traditionally, this involved managing numerous node groups and intricate auto-scaling configurations. Salesforce, with one of the world's largest Kubernetes deployments, faced substantial challenges with the Kubernetes Cluster Autoscaler, particularly around scaling performance, resource utilization, and operational complexity. These issues motivated a strategic migration to Karpenter, an open-source node provisioning project by AWS designed for more efficient and responsive node management.

Challenges with Traditional Auto Scaling at Scale

Salesforce's previous architecture, reliant on AWS Auto Scaling groups and Cluster Autoscaler, struggled under the weight of thousands of node groups supporting diverse workloads across over 1,000 EKS clusters. Key problems included:

Operational Bottlenecks: Proliferation of node groups and Auto Scaling groups created significant management overhead.
Scaling Performance Issues: Slow node provisioning caused multi-minute delays during demand spikes, degrading the user experience.
Inefficient Resource Utilization: Poor bin-packing algorithms and conservative scale-down strategies resulted in stranded resources and underutilized infrastructure.
Architectural Limitations: Issues with Availability Zone balance and performance bottlenecks in large clusters, especially for memory-intensive workloads.
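
The stranded-capacity problem in the list above can be illustrated with a small packing exercise. The sketch below is not Cluster Autoscaler's actual algorithm; it is a plain first-fit-decreasing packer over identical nodes, and the pod sizes and node sizes are made-up numbers chosen only to show how a fixed node shape can leave capacity unused.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a fixed CPU capacity and the CPU requests placed on it."""
    cpu_capacity: float
    pods: list = field(default_factory=list)

    @property
    def used(self) -> float:
        return sum(self.pods)

    @property
    def free(self) -> float:
        return self.cpu_capacity - self.used


def first_fit_decreasing(pod_cpus, node_cpu):
    """Pack pods onto identical nodes, opening a new node only when none fits."""
    nodes = []
    for cpu in sorted(pod_cpus, reverse=True):
        if cpu > node_cpu:
            raise ValueError(f"pod needs {cpu} vCPU but nodes only have {node_cpu}")
        # Reuse the first node with enough room, otherwise add a node.
        target = next((n for n in nodes if n.free >= cpu), None)
        if target is None:
            target = Node(cpu_capacity=node_cpu)
            nodes.append(target)
        target.pods.append(cpu)
    return nodes


def utilization(nodes) -> float:
    """Fraction of provisioned CPU that is actually requested by pods."""
    total = sum(n.cpu_capacity for n in nodes)
    return sum(n.used for n in nodes) / total if total else 0.0
```

With four hypothetical pods requesting 5 vCPU each, a node group pinned to 8-vCPU nodes needs four nodes and strands 3 vCPU on every one of them, while 10-vCPU nodes would hold the same pods on two fully used nodes. Being free to choose the node shape per workload is what removes that waste.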

Karpenter Migration Strategy and Key Learnings

Given the scale, a highly automated, risk-mitigated transition was essential. Salesforce developed in-house tools for orchestration and validation, adhering to key design principles such as zero disruption, rollback support, and CI/CD integration. The migration involved automating configuration mapping from legacy Auto Scaling group definitions to Karpenter's EC2NodeClass and NodePool configurations.
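
The article does not show Salesforce's mapping tool, so the following is only a minimal sketch of what such a translation could look like. The LegacyNodeGroup shape is hypothetical, as are the AMI alias, the karpenter.sh/discovery tag selector, and the rule that derives a CPU limit from the old maximum group size. The manifest field names follow the Karpenter v1 API as commonly documented and should be checked against the Karpenter version in use.

```python
from dataclasses import dataclass


@dataclass
class LegacyNodeGroup:
    """Hypothetical summary of a legacy Auto Scaling group definition."""
    name: str
    instance_types: list
    zones: list
    max_size: int
    vcpus_per_node: int
    node_role: str
    cluster_name: str


def to_ec2_node_class(group: LegacyNodeGroup) -> dict:
    """Build an EC2NodeClass manifest (AWS-specific node settings)."""
    discovery = {"tags": {"karpenter.sh/discovery": group.cluster_name}}  # assumed tagging scheme
    return {
        "apiVersion": "karpenter.k8s.aws/v1",
        "kind": "EC2NodeClass",
        "metadata": {"name": group.name},
        "spec": {
            "role": group.node_role,
            "amiSelectorTerms": [{"alias": "al2023@latest"}],  # placeholder AMI choice
            "subnetSelectorTerms": [discovery],
            "securityGroupSelectorTerms": [discovery],
        },
    }


def to_node_pool(group: LegacyNodeGroup) -> dict:
    """Build a NodePool manifest that references the EC2NodeClass above."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": group.name},
        "spec": {
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": group.name,
                    },
                    "requirements": [
                        {"key": "node.kubernetes.io/instance-type",
                         "operator": "In",
                         "values": sorted(group.instance_types)},
                        {"key": "topology.kubernetes.io/zone",
                         "operator": "In",
                         "values": sorted(group.zones)},
                    ],
                }
            },
            # Assumption: cap the pool at the CPU the old group could reach.
            "limits": {"cpu": str(group.max_size * group.vcpus_per_node)},
        },
    }
```

Generating both manifests from one input record keeps the NodePool and its EC2NodeClass reference in sync, and the resulting dictionaries can be serialized to YAML or JSON and reviewed through a normal CI/CD pipeline before being applied.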

Key Migration Learnings

During the phased rollout, Salesforce gained critical insights: managing application availability with Pod Disruption Budgets (PDBs), optimizing node maintenance workflows (e.g., sequential cordoning), understanding Kubernetes label constraints (63-character limit), protecting single-instance applications from aggressive consolidation, and precisely mapping storage requirements for I/O-intensive workloads.
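
The label constraint is easy to trip over when names are generated automatically. Kubernetes limits a label value to 63 characters and requires it to start and end with an alphanumeric character, with only dashes, underscores, and dots allowed in between. The article does not say how Salesforce handled over-long names; the sketch below assumes one common approach, truncating the value and appending a short hash so that distinct long names stay distinct. The function names are illustrative.

```python
import hashlib
import re

MAX_LABEL_LEN = 63
# Kubernetes label value syntax: empty, or alphanumeric at both ends
# with '-', '_' and '.' allowed in between.
_LABEL_VALUE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    """Return True if value is usable as a Kubernetes label value."""
    return len(value) <= MAX_LABEL_LEN and _LABEL_VALUE.fullmatch(value) is not None


def shorten_label_value(value: str, hash_len: int = 8) -> str:
    """Return value unchanged if valid, else a sanitized prefix plus a hash suffix."""
    if is_valid_label_value(value):
        return value
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:hash_len]
    # Replace disallowed characters, then trim punctuation from both ends.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "-", value).strip("-_.")
    prefix = cleaned[: MAX_LABEL_LEN - hash_len - 1].rstrip("-_.")
    return f"{prefix}-{digest}" if prefix else digest
```

Because the suffix is derived from the full original value, the mapping is deterministic, so a migration tool that is rerun produces the same labels each time.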

Impact and Benefits of Karpenter Adoption

The transition to Karpenter delivered significant value across multiple dimensions:

Operational Efficiency: 80% reduction in manual overhead by eliminating thousands of node groups and enabling self-service for developers.
Performance Gains: Scaling latency dropped from minutes to seconds because nodes are provisioned in real time based on pending pods. Node utilization improved, and Auto Scaling group 'thrashing' was eliminated.
Cost Optimization: 5% cost savings in FY2026, with a projected additional 5-10% in FY2027, through better bin-packing and reduced idle capacity.
Enhanced Developer Experience: True self-service infrastructure, allowing developers to define capacity needs and supporting heterogeneous instance types.
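
The idea of provisioning from pending pods rather than from a predefined node group can be sketched in a few lines. This is a deliberately simplified illustration, not Karpenter's scheduling logic: it sums the requests of the pending pods and picks the cheapest instance type that can hold them all. The catalog entries, their names, and their prices are invented for the example, and real provisioning also has to account for system overhead, daemonsets, and scheduling constraints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    """A hypothetical catalog entry; names and prices are illustrative."""
    name: str
    vcpu: float
    memory_gib: float
    hourly_cost: float


@dataclass(frozen=True)
class PodRequest:
    """Resource requests of one pending pod."""
    cpu: float
    memory_gib: float


def cheapest_fit(pending, catalog) -> Optional[InstanceType]:
    """Pick the lowest-cost instance type that fits all pending pods together."""
    need_cpu = sum(p.cpu for p in pending)
    need_mem = sum(p.memory_gib for p in pending)
    candidates = [
        t for t in catalog
        if t.vcpu >= need_cpu and t.memory_gib >= need_mem
    ]
    # None signals that no single type fits and the batch must be split.
    return min(candidates, key=lambda t: t.hourly_cost, default=None)
```

Given a catalog with a general-purpose type and a memory-heavy type of the same vCPU count, a batch of memory-hungry pods selects the memory-heavy type automatically, which is the kind of heterogeneous choice a single fixed node group cannot make.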

Kubernetes · Karpenter · EKS · Auto Scaling · Cloud Migration · Node Management · Cost Optimization · DevOps

Architecture Design

Design this yourself

Design a highly scalable and cost-efficient Kubernetes infrastructure for a multi-tenant SaaS platform that needs to support diverse workloads, similar to Salesforce's challenge. Focus on automating node provisioning and scaling, mitigating the challenges of traditional auto-scalers, and ensuring zero disruption during infrastructure changes. Include considerations for resource utilization, availability, and developer self-service.

Other design angles

· Design an automated node lifecycle management system for a large Kubernetes fleet, focusing on dynamic provisioning, graceful node termination, and handling diverse application requirements like single-replica pods and custom storage.
· Evaluate and design a strategy for migrating a large-scale, production Kubernetes environment from a traditional Cluster Autoscaler to a more modern, pod-aware autoscaling solution like Karpenter, detailing the tooling, rollout phases, and potential pitfalls.
· Propose a solution for optimizing cloud infrastructure costs in a Kubernetes environment with fluctuating workloads, using advanced bin-packing, consolidation, and dynamic node provisioning techniques, addressing both performance and cost trade-offs.