MongoDB Atlas's journey into predictive auto-scaling highlights a critical challenge in cloud infrastructure: efficiently managing dynamic resource allocation. Traditional reactive auto-scaling, while widely deployed, often leads to periods of under- or over-provisioning because of the inherent latency in detecting and responding to load changes. This experiment explores a proactive approach that addresses these limitations.
The Problem with Reactive Auto-Scaling
Reactive auto-scaling algorithms typically scale resources only after an overload or underload condition has been detected. For database services like MongoDB Atlas, this means:
- Performance Degradation: Servers can be overloaded for several minutes while the system detects the spike and initiates scaling, impacting user experience.
- Cost Inefficiency: Underloaded servers remain over-provisioned until a scale-down triggers, so customers pay for capacity they are not using.
- Slow Adaptation: Scaling often proceeds in incremental steps (e.g., M40 to M50, then to M60), so a drastic change in demand requires multiple operations to reach the optimal size, prolonging the impact of sub-optimal sizing.
- Scaling Interference: Extreme overload can sometimes interfere with the scaling operation itself.
Architecture of the Predictive Auto-Scaling Prototype
The MongoDB Atlas predictive auto-scaling prototype comprises three main components designed to forecast demand and select optimal instance sizes proactively; a minimal sketch of how they fit together follows the list:
- Forecaster: Predicts future workload for each replica set. This is further split into a Long-Term Forecaster (using MSTL for daily/weekly seasonality) and a Short-Term Forecaster (for local trends in non-seasonal workloads). A key architectural decision here is to forecast "customer-driven metrics" (e.g., queries per second) which are assumed to be independent of scaling actions, avoiding circular dependencies.
- Estimator: Estimates CPU utilization for any given workload and instance size. This component uses a regression model based on boosted decision trees trained on millions of samples, mapping forecasted demand to expected resource consumption.
- Planner: Selects the cheapest instance size that can handle the forecasted demand without exceeding a target CPU utilization (e.g., 75%). This involves evaluating different tier sizes based on the Forecaster's predictions and the Estimator's CPU projections.
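To make the division of labor concrete, here is a minimal Python sketch of the three components wired together. The 75% CPU target, the MSTL-based seasonal forecast, the boosted-tree Estimator, and the cheapest-tier Planner come from the description above; the tier prices, the (QPS, tier index) feature layout, and all function names are illustrative assumptions, not MongoDB's implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.tsa.seasonal import MSTL  # multi-seasonal decomposition

TARGET_CPU = 0.75                                   # Planner's utilization ceiling
# (tier, $/hr) in ascending price order; the prices here are made up
TIERS = [("M40", 1.0), ("M50", 2.0), ("M60", 4.0)]

def forecast_qps(history: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Long-Term Forecaster: decompose hourly QPS into trend plus daily (24h)
    and weekly (168h) seasonality, then project naively by holding the last
    trend level flat and replaying the final week of seasonality.
    Needs at least two weeks of history."""
    res = MSTL(history, periods=(24, 168)).fit()
    seasonal = np.asarray(res.seasonal).reshape(len(history), -1).sum(axis=1)
    last_week = seasonal[-168:]   # phase-aligned: the next hour maps to last_week[0]
    return np.asarray(res.trend)[-1] + np.resize(last_week, horizon)

def train_estimator(features: np.ndarray, cpu_util: np.ndarray):
    """Estimator: boosted decision trees mapping (workload, tier) features to
    observed CPU utilization, trained offline on fleet telemetry."""
    return GradientBoostingRegressor().fit(features, cpu_util)

def plan(estimator, qps_forecast: np.ndarray) -> str:
    """Planner: cheapest tier whose projected peak CPU stays under the target;
    fall back to the largest tier if none qualifies."""
    peak_qps = qps_forecast.max()
    for tier_idx, (name, _price) in enumerate(TIERS):
        if estimator.predict([[peak_qps, tier_idx]])[0] <= TARGET_CPU:
            return name
    return TIERS[-1][0]
```

Note that the Forecaster's input is QPS rather than CPU, which keeps the prediction exogenous to scaling decisions, the decoupling discussed in the callout below.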
💡 Decoupling Forecast from Impact
A crucial system design insight from this work is the choice to forecast customer-driven metrics (like QPS) rather than directly forecasting CPU utilization. Forecasting CPU directly could lead to a feedback loop where scaling actions (which reduce CPU) invalidate the forecast. By predicting metrics exogenous to scaling, the system gains a more stable and reliable prediction input.
Challenges and Learnings
The experiment revealed several key insights for designing such a system:
- Predictability Varies: Not all replica sets exhibit predictable patterns. The system needs mechanisms, such as a "self-censoring" confidence score for the Long-Term Forecaster, to identify when predictions are reliable enough to act on (see the gating sketch after this list).
- Short-Term vs. Long-Term: Often, short-term trends are more reliable signals for immediate scaling decisions than daily or weekly cycles, especially given the rapid changes in cloud workloads.
- Estimator Accuracy: Achieving high accuracy in the Estimator is challenging due to the opaque nature of customer queries and data. Error rates must be managed, and less predictable workloads might be excluded from predictive scaling.
- Conservative Rollout: The production version initially scales up proactively but relies on reactive scaling for scaling down, a cautious approach to deploying complex predictive systems where over-provisioning (extra cost) is a safer failure mode than under-provisioning (degraded performance); a sketch of this asymmetric policy follows.
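One plausible shape for that "self-censoring" confidence score, sketched here under our own assumptions: backtest the forecaster on a recent held-out window and refuse to emit a forecast when the backtest error is too high. The MAPE metric, the 20% threshold, and the function names are hypothetical; the article does not specify how the score is computed. The sketch reuses forecast_qps from the architecture example.

```python
import numpy as np

MAX_BACKTEST_MAPE = 0.20  # hypothetical cut-off for trusting the forecaster

def forecast_confidence(history: np.ndarray, forecaster, holdout: int = 24) -> float:
    """Backtest: forecast from everything except the last `holdout` hours,
    then compare against what actually happened in that window."""
    predicted = forecaster(history[:-holdout], horizon=holdout)
    actual = history[-holdout:]
    mape = np.mean(np.abs(predicted - actual) / np.maximum(actual, 1e-9))
    return 1.0 - min(mape, 1.0)  # 1.0 = perfect hindsight, 0.0 = unusable

def censored_forecast(history: np.ndarray, forecaster, horizon: int = 24):
    """Emit a forecast only when the backtest says it can be trusted;
    otherwise return None so the caller falls back to reactive scaling."""
    if forecast_confidence(history, forecaster) < 1.0 - MAX_BACKTEST_MAPE:
        return None
    return forecaster(history, horizon=horizon)
```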
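The conservative rollout can then be captured in a few lines: predictions may only trigger scale-ups, and scale-downs are left to the existing reactive path. next_action is a hypothetical helper that reuses the TIERS list from the architecture sketch.

```python
from typing import Optional

def next_action(current_tier: str, predicted_tier: Optional[str]) -> Optional[str]:
    """Return a proactive scale-up target, or None to defer to reactive scaling."""
    if predicted_tier is None:            # forecast was self-censored above
        return None
    order = [name for name, _ in TIERS]   # TIERS from the architecture sketch
    if order.index(predicted_tier) > order.index(current_tier):
        return predicted_tier             # proactive scale-up only
    return None                           # scale-down stays reactive
```

A caller would wire the pieces together by setting predicted_tier = plan(estimator, fc) when censored_forecast returns a forecast, and None otherwise.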