Introduction to Scalability Insights
This article serves as a curated collection of real-world statistics, anecdotal evidence, and expert opinions on various aspects of system scalability, performance, and architectural choices. It highlights how different companies and individuals approach challenges related to high traffic, data volumes, and cost optimization, providing valuable lessons for system designers.
Key System Design Takeaways
- Efficiency at Scale: Stack Overflow runs ~200 sites serving 6,000 requests/second and 2 billion views/month on just 9 on-prem servers, each running at under 10% utilization, leaving substantial headroom. This points to careful optimization and suggests a monolithic architecture can be an advantage for certain workloads.
- Distributed Data Processing: Riot Games runs 20+ shards for millions of users, handling peaks of 500,000 events/second and generating 8 TB of data daily, buffering events in on-prem Kafka before moving them to AWS. This illustrates a common pattern for handling massive event streams and data ingestion.
- Caching at Scale: Pinterest's Memcached fleet spans more than 5,000 EC2 instances of various types, serving up to ~180 million requests/second and ~220 GB/s of network throughput over a ~460 TB active dataset. This emphasizes the critical role of distributed caching in high-volume systems.
- Serverless Economics: AWS Lambda is highlighted as a cost-effective fit for stateless, on-demand workloads: because you pay only for execution time, billed utilization is effectively 100%. The example of Branch's AWS bill growing to $10K/month while supporting 15x YoY growth, with DynamoDB ($4,000/mo) and Lambda ($2,200/mo) as the primary costs, showcases the financial scalability of serverless.
- Infrastructure Migration Challenges: Epic Games migrated its Store from raw EC2 to Kubernetes, including domain changes and new AWS accounts, aiming to improve SEO, reduce single points of failure, and deliver significant engineering quality-of-life gains (e.g., prod deploy time dropped from 45 minutes to 6 minutes). This case study provides insights into complex re-platforming initiatives.
- Database Selection: Plausible's move from PostgreSQL to ClickHouse enabled them to count over a billion page views per month while maintaining a fast-loading dashboard, indicating the importance of selecting specialized databases for analytical workloads.
- Importance of Algorithms: Leslie Lamport stresses that designing concurrent systems without thinking in terms of algorithms leads to bugs, emphasizing the foundational role of theoretical computer science in practical system development.
- API Design Trade-offs: The discussion on GraphQL points out the trade-off between flexible query capabilities (effectively exposing a generic graph-database-style query surface) and a potentially unbounded stream of performance work if queries are not locked down, in contrast with traditional fixed-endpoint APIs.
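The buffer-then-ship pattern Riot Games applies with Kafka reduces to a simple idea: accumulate events locally and flush them in batches by size or age. Below is a toy in-memory sketch of that pattern; the class name and thresholds are illustrative, and a real deployment would hand batches to a Kafka producer (which performs this kind of batching internally).

```python
import time


class EventBuffer:
    """Accumulates events and flushes them in batches — the
    buffer-then-ship pattern used when fronting a pipeline with Kafka."""

    def __init__(self, sink, max_batch=500, max_wait_s=1.0):
        self._sink = sink            # callable that receives a list of events
        self._max_batch = max_batch  # flush when this many events are queued
        self._max_wait_s = max_wait_s  # ... or when the oldest event is this stale
        self._batch = []
        self._oldest = None

    def add(self, event):
        if self._oldest is None:
            self._oldest = time.monotonic()
        self._batch.append(event)
        if (len(self._batch) >= self._max_batch
                or time.monotonic() - self._oldest >= self._max_wait_s):
            self.flush()

    def flush(self):
        # Ship the current batch downstream and start a fresh one.
        if self._batch:
            self._sink(self._batch)
            self._batch = []
            self._oldest = None
```

Batching trades a little latency for far fewer downstream calls, which is what makes 500,000 events/second peaks tractable.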
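Spreading keys across a cache fleet the size of Pinterest's typically relies on consistent hashing, so that adding or removing a node remaps only a small slice of keys. A minimal sketch, with node names and virtual-node count purely illustrative (the source does not describe Pinterest's actual implementation):

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes: removing a cache node only
    remaps the keys that pointed at that node."""

    def __init__(self, nodes, vnodes=100):
        pairs = []
        for node in nodes:
            for i in range(vnodes):
                # Each physical node owns many points on the ring.
                pairs.append((self._hash(f"{node}#{i}"), node))
        pairs.sort()
        self._hashes = [h for h, _ in pairs]
        self._nodes = [n for _, n in pairs]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or past the key.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]
```

With plain modulo hashing (`hash(key) % n`), changing `n` reshuffles nearly every key; the ring limits that blast radius, which matters when instances churn constantly in a 5,000-node fleet.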
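Lamport's own logical clocks are a good illustration of what "thinking in terms of algorithms" buys in concurrent systems: a few lines of merge logic yield a causally consistent event ordering without synchronized wall clocks. A minimal sketch:

```python
class LamportClock:
    """Lamport's logical clock: orders events across processes
    without relying on synchronized physical time."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending counts as an event; stamp the outgoing message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Merge rule: jump strictly past the sender's timestamp,
        # so the receive event is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The invariant — if event A causally precedes event B, then A's timestamp is smaller — is exactly the kind of property that is easy to prove about the algorithm and easy to get wrong when coding by intuition.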
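One common way to "lock down" GraphQL queries is a depth limit. The sketch below approximates nesting depth by counting curly braces; the function names are illustrative, and a production server would walk the parsed query AST rather than raw text:

```python
def query_depth(query: str) -> int:
    """Approximate the nesting depth of a GraphQL selection set
    by tracking brace nesting (a deliberate simplification)."""
    depth = max_depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth


def enforce_depth_limit(query: str, limit: int = 5) -> None:
    # Reject queries that nest deeper than the server is willing to resolve.
    d = query_depth(query)
    if d > limit:
        raise ValueError(f"query depth {d} exceeds limit {limit}")
```

Depth limits (along with query cost analysis and persisted queries) are the standard levers for bounding the performance work a flexible query surface can demand.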
💡 Scaling Fallacies
A common misconception is that all apps need to prepare for extreme scale. Chris Munns notes that 99% of apps never break 1000 RPS, which can easily be handled by serverless functions (Lambda, Fargate) or even modest EC2 setups. The focus should often be on efficient resource utilization, caching, and async patterns rather than over-provisioning for theoretical scale.
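Munns's 1,000 RPS threshold is easy to put in perspective with back-of-envelope arithmetic; the Stack Overflow figures are the ones quoted earlier, and the headroom calculation from their sub-10% utilization is illustrative:

```python
# Back-of-envelope: what "1,000 RPS" means per day, and how
# Stack Overflow's published load breaks down per server.
SECONDS_PER_DAY = 86_400

rps = 1_000
requests_per_day = rps * SECONDS_PER_DAY   # 86.4 million requests/day

# Stack Overflow: 6,000 req/s across 9 servers at under 10% utilization.
so_rps, so_servers = 6_000, 9
per_server = so_rps / so_servers           # ~667 req/s per server
implied_capacity = per_server / 0.10       # headroom implies ~6,667 req/s each

print(requests_per_day, round(per_server), round(implied_capacity))
```

In other words, the load 99% of apps never reach fits comfortably on a fraction of one well-tuned server, which is why efficient utilization usually beats provisioning for theoretical scale.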
Key Technologies Referenced
- Kafka: Used by Riot Games for high-volume event buffering and data movement.
- Memcached: Critical distributed caching layer for Pinterest's ~180M requests/sec.
- AWS Lambda/Fargate/DynamoDB: Key components in serverless architectures for scalable, cost-efficient operations.
- Kubernetes: Utilized by Epic Games for improved deployment, autoscaling, and developer experience during a major migration.
- ClickHouse: Chosen by Plausible for high-performance analytical queries over large datasets.
- GraphQL: Presented as an alternative API design approach with considerations for flexibility vs. performance management.