Auto-scaling pitfalls: our cluster scaled up but never back down
Zara Kumar
We had a rough incident last month: during a traffic spike, our ECS cluster auto-scaled up to 50 containers, which was expected, but it then sat at 50 containers for hours after the spike subsided. It scaled up fine but never scaled back down. It turned out our scale-in thresholds were too conservative, and a 90-second health check grace period was significantly delaying the scale-in process. The cost impact was substantial.

What are your best practices for configuring auto-scaling, especially on ECS or Kubernetes, so that it's aggressive enough on scale-up but still scales down efficiently? Any war stories, or specific metrics/configurations you've found useful for preventing over-provisioning after a spike?
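For the Kubernetes side of the question, the knob I have in mind is the HPA v2 `behavior` field, which lets you tune scale-up and scale-down independently. A rough sketch of the direction I'm considering (the Deployment name `web`, the replica bounds, and all the numbers are placeholders, not our actual config): no stabilization window on scale-up so spikes are absorbed quickly, but a stabilization window plus a per-minute removal cap on scale-down so the cluster drains back without flapping.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web            # placeholder workload name
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
      policies:
      - type: Percent
        value: 100                     # allow doubling every period
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 120  # wait 2 min of sustained low load
      policies:
      - type: Percent
        value: 25                      # remove at most 25% of pods per minute
        periodSeconds: 60
```

Is this roughly the shape people land on, or do you tune scale-down more aggressively than this?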