Auto-scaling pitfalls: our cluster scaled up but never back down

We had a pretty rough incident last week where our ECS service scaled up like crazy but refused to scale back down. During a peak traffic spike, our main API service went from its usual 10 containers to 50 in a matter of minutes, which was great for handling the load. The problem was that hours later we were still running 50 containers, burning money.

It turned out our scale-in thresholds were way too conservative. We had a `CPU utilization < 30% for 15 minutes` rule, which sounded reasonable, but with our traffic patterns CPU would frequently dip for 5-10 minutes and then spike again, so the 15-minute window was never satisfied. Compounding this, our health checks take about 90 seconds to fully register a healthy container, which seemed to add to the overall inertia of the system.

Has anyone else wrestled with similar auto-scaling 'stickiness'? What strategies have you found effective for more aggressive, yet safe, scale-down behavior, especially with services that have longer startup or health check times?
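For context, here's roughly the direction I've been experimenting with since: replacing the fixed step alarm with a target-tracking policy, which manages its own scale-out and scale-in alarms, so a brief CPU dip doesn't have to survive a long fixed window before anything happens. This is just a minimal boto3 sketch, not our production setup; the cluster/service names, capacities, target value, and cooldowns are all placeholders:

```python
import boto3

# Hypothetical resource ID for illustration: service/<cluster>/<service>.
RESOURCE_ID = "service/prod-cluster/main-api"

client = boto3.client("application-autoscaling")

# Register the ECS service's desired count as a scalable target.
client.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=50,
)

# Target tracking tries to hold average CPU near TargetValue and creates
# both scale-out and scale-in CloudWatch alarms on our behalf.
client.put_scaling_policy(
    PolicyName="main-api-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # aim for ~50% average CPU (placeholder)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        # Seconds between scale-in steps; set generously here to absorb
        # our ~90-second health check registration lag.
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```

`ScaleInCooldown` seems like the knob most relevant to our health check lag, since it throttles how quickly successive scale-in steps can fire, but I'd love to hear from anyone who's tuned this in anger.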