High Scalability · July 16, 2023

Scaling Presto at Meta: Lessons in Deployment, Automation, and Robustness for Large-Scale Data Warehouses

This article details Meta's ten years of experience running and scaling Presto, a SQL query engine, highlighting key architectural and operational lessons. It focuses on strategies for ensuring high availability during frequent deployments, automating cluster lifecycle management, implementing advanced debugging and remediation, and building a robust load balancer for handling immense query volumes. The insights offer valuable considerations for anyone operating analytical query engines at a massive scale.


Introduction to Scaling Presto at Meta

Operating a data warehouse query engine like Presto at Meta's scale presents significant challenges, particularly concerning deployment, cluster management, debugging, and traffic handling. The article shares critical lessons learned from over a decade of continuous operation, emphasizing automation and resilience as foundational pillars for maintaining performance and availability for interactive and batch query workloads.

Ensuring Availability During Rapid Deployments

Meta frequently deploys new Presto releases (1-2 times per month) across a global fleet of clusters. To ensure continuous availability, a load balancer (Gateway) is crucial. When a cluster is updated, it's gracefully drained by the Gateway, allowing existing queries to complete before the update. After the update, the cluster is brought back online and registered with the Gateway. Automation ensures that a sufficient number of clusters remain available in each data center, balancing rapid deployment with uninterrupted service.
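The drain-then-update cycle above can be sketched in a few lines. This is a minimal illustration, not Meta's actual tooling: the `Gateway` and `Cluster` classes and the `min_available` guard are assumptions made for the example.

```python
import time

class Gateway:
    """Minimal in-memory stand-in for the Presto load balancer."""
    def __init__(self):
        self.active = set()

    def register(self, cluster):
        self.active.add(cluster.name)

    def drain(self, cluster):
        # Stop routing new queries to this cluster; in-flight queries keep running.
        self.active.discard(cluster.name)

class Cluster:
    def __init__(self, name):
        self.name = name
        self.running_queries = 0

def rolling_update(gateway, clusters, min_available):
    """Update clusters one at a time, never dropping below min_available."""
    for cluster in clusters:
        if len(gateway.active) - 1 < min_available:
            raise RuntimeError("not enough spare capacity to drain safely")
        gateway.drain(cluster)            # stop new traffic
        while cluster.running_queries:    # wait for existing queries to finish
            time.sleep(1)
        # ... apply the new Presto release here ...
        gateway.register(cluster)         # bring the cluster back into rotation
```

The `min_available` check is what lets the automation balance rapid deployment against availability: a drain is refused outright if it would leave a data center short of capacity.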

Automating Cluster Lifecycle Management

The dynamic nature of Meta's data warehouse requires constant provisioning and decommissioning of Presto clusters. This process, initially manual, was fully automated by standardizing cluster configurations and integrating with company-wide infrastructure services. New clusters are spun up by generating configurations from templates, running test queries, and then registering with the Gateway. Decommissioning follows a reverse, automated process, which significantly reduces operational overhead and human error.
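The provisioning flow described above (template-generated config, test query, Gateway registration) might look roughly like this. The template fields, the `SELECT 1` smoke test, and the callback names are illustrative assumptions, not Meta's actual configuration schema.

```python
# Shared defaults that every cluster config is rendered from (illustrative values).
CLUSTER_TEMPLATE = {
    "coordinator_memory_gb": 64,
    "worker_count": 100,
    "query_max_runtime": "2h",
}

def generate_config(name, region, overrides=None):
    """Render a concrete cluster config from the standardized template."""
    config = dict(CLUSTER_TEMPLATE)
    config.update(overrides or {})
    config["name"] = name
    config["region"] = region
    return config

def provision(name, region, run_query, register):
    """Spin up a cluster: render config -> smoke test -> register with the Gateway."""
    config = generate_config(name, region)
    # Verify the new cluster actually answers queries before it takes traffic.
    if run_query(config, "SELECT 1") != 1:
        raise RuntimeError(f"smoke test failed for {name}")
    register(config)
    return config
```

Decommissioning would run the same steps in reverse: deregister from the Gateway first, then tear the cluster down, so no query is ever routed to a half-removed cluster.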

Automated Debugging and Remediation for Operational Efficiency

💡 Proactive Problem Solving

At scale, manual debugging becomes a bottleneck. Investing in automated analyzers that aggregate data from various monitoring systems and logs, infer root causes, and even trigger automated remediations (e.g., draining 'bad' hosts) is essential for maintaining SLAs and reducing on-call burden.

With a large Presto deployment, automated tooling is critical for on-call teams. Meta developed 'analyzers' that correlate data from monitoring systems, event logs, and host logs to identify root causes for issues like 'bad hosts' or queueing problems. This automation extends to self-healing mechanisms, such as automatically draining hosts causing excessive query failures, thereby improving system reliability and reducing MTTR.
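A bad-host analyzer of the kind described reduces, at its core, to correlating failure events by host and acting on outliers. The sketch below is a deliberately simplified illustration; the threshold, event shape, and `drain_host` callback are assumptions, not Meta's analyzer internals.

```python
from collections import Counter

# Illustrative threshold: failures attributed to one host before it is drained.
FAILURE_THRESHOLD = 10

def find_bad_hosts(failure_events):
    """Count query failures per host and flag hosts at or above the threshold."""
    per_host = Counter(event["host"] for event in failure_events)
    return sorted(h for h, n in per_host.items() if n >= FAILURE_THRESHOLD)

def remediate(failure_events, drain_host):
    """Self-healing step: drain every host the analyzer flags, return what was drained."""
    drained = []
    for host in find_bad_hosts(failure_events):
        drain_host(host)
        drained.append(host)
    return drained
```

In a real deployment the events would come from aggregated monitoring data and host logs rather than an in-memory list, and the drain would go through the same Gateway mechanism used for deployments, which is what keeps the remediation safe for in-flight queries.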

Building a Robust and Scalable Load Balancer

The Gateway, acting as the central load balancer for all Presto queries, evolved significantly to handle Meta's scale. Initially simple, it encountered stability issues under heavy, unpredictable loads that amounted to unintentional DDoS from legitimate clients. Enhancements included throttling across several dimensions (per user, per source, per IP, and globally) to reject excess traffic, and integration with a Meta-wide autoscaling service. These measures keep the Gateway resilient and available even during traffic spikes.
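Multi-dimension throttling of this sort can be sketched as a fixed-window counter keyed by (dimension, value), with a separate limit per dimension. The limits and dimension names below are illustrative assumptions; a production gateway would more likely use sliding windows or token buckets and shared state across replicas.

```python
import time
from collections import defaultdict

# Illustrative per-window limits for each throttling dimension.
DEFAULT_LIMITS = {"user": 100, "source": 500, "ip": 200, "global": 5000}

class Throttler:
    def __init__(self, window_seconds=60, limits=None):
        self.window = window_seconds
        self.limits = limits or dict(DEFAULT_LIMITS)
        self.counts = defaultdict(int)
        self.window_start = time.monotonic()

    def allow(self, user, source, ip):
        now = time.monotonic()
        if now - self.window_start >= self.window:  # start a fresh window
            self.counts.clear()
            self.window_start = now
        keys = [("user", user), ("source", source), ("ip", ip), ("global", "*")]
        # Reject if ANY dimension is already at its limit for this window.
        if any(self.counts[key] >= self.limits[key[0]] for key in keys):
            return False
        for key in keys:
            self.counts[key] += 1
        return True
```

Checking every dimension before admitting a query is what contains a single misbehaving user or IP without penalizing the rest of the traffic, while the global limit caps aggregate load on the fleet.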

Key Takeaways for Scaling Data Lakehouses with Presto

  • Well-defined SLAs: Crucial for prioritizing and mitigating production issues and for tracking customer pain points.
  • Comprehensive monitoring and automated debugging: Essential for early detection, root-cause analysis, and reducing manual intervention at scale.
  • Good load balancing: Vital for efficient routing and resilience against uneven traffic patterns.
  • Configuration management: Standardizing configurations and enabling hot reloads minimizes disruption during updates.
Tags: Presto, Meta, SQL Engine, Scalability, Automation, Deployment, Load Balancing, Monitoring
