Auto-scaling pitfalls: our cluster scaled up but never back down


We had a pretty frustrating auto-scaling incident recently. Our ECS cluster scaled up from 10 to 50 containers during a sudden traffic spike, which was great. The problem was that it stayed at 50 containers for hours after the spike subsided. We dug into it and found a combination of factors: our scale-in thresholds were too conservative and, more critically, a 90-second health check grace period was delaying termination of 'unhealthy' instances even though they were fine, just not receiving traffic. It cost us a fair bit of money.

What are the common auto-scaling pitfalls people have encountered, and what strategies or metrics do you use to ensure your clusters scale both up and back down efficiently without over-provisioning for extended periods?
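To make the "too conservative scale-in threshold" part concrete, here is a rough Python sketch comparing two ways of deciding the task count after a spike. It is not our actual policy and not any AWS API: the function names, the 12% average CPU figure, the 60% target, and the 10% scale-in threshold are all hypothetical numbers chosen for illustration. The proportional rule is only an approximation of how a target-tracking style policy behaves.

```python
import math


def desired_count(current: int, metric: float, target: float,
                  min_count: int, max_count: int) -> int:
    """Proportional rule: size the fleet so the average metric lands near the target."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    raw = math.ceil(current * metric / target)
    # Clamp to the configured bounds of the service.
    return max(min_count, min(max_count, raw))


def step_scale_in(current: int, metric: float, scale_in_below: float,
                  step: int, min_count: int) -> int:
    """Step rule: remove `step` tasks only when the metric is below the threshold."""
    if metric < scale_in_below:
        return max(min_count, current - step)
    return current


# Hypothetical post-spike state: 50 tasks averaging 12% CPU.
print(desired_count(50, 12.0, 60.0, min_count=10, max_count=50))
print(step_scale_in(50, 12.0, scale_in_below=10.0, step=5, min_count=10))
```

With these made-up numbers the proportional rule asks for 10 tasks, while the step rule with a 10% threshold never fires at 12% CPU and leaves the cluster at 50, which is roughly the behavior we saw.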

Comments