Introduction to Scalability Insights
This article serves as a curated collection of real-world statistics, anecdotal evidence, and expert opinions on various aspects of system scalability, performance, and architectural choices. It highlights how different companies and individuals approach challenges related to high traffic, data volumes, and cost optimization, providing valuable lessons for system designers.
Key System Design Takeaways
- Efficiency at Scale: Stack Overflow runs ~200 sites serving 6,000 requests/second and 2 billion views/month on just 9 on-prem servers, each running at under 10% utilization, leaving substantial headroom. This points to careful optimization and suggests a monolithic architecture can be an advantage for certain workloads.
- Distributed Data Processing: Riot Games runs 20+ shards for millions of users, handling peaks of 500,000 events/second and generating 8 TB of data daily, buffering events in on-prem Kafka before moving them to AWS. This illustrates a common pattern for handling massive event streams and data ingestion.
- Caching at Scale: Pinterest's Memcached fleet spans more than 5,000 EC2 instances of various types, serving up to ~180 million requests/second and ~220 GB/s of network throughput over a ~460 TB active dataset. This emphasizes the critical role of distributed caching in high-volume systems.
- Serverless Economics: AWS Lambda is highlighted as a cost-effective fit for stateless, on-demand workloads: because you pay only for execution time, billed utilization is effectively 100%. The example of Branch's AWS bill growing to $10K/month while supporting 15x YoY growth, with DynamoDB ($4,000/mo) and Lambda ($2,200/mo) as the primary costs, showcases the financial scalability of serverless.
- Infrastructure Migration Challenges: Epic Games migrated its Store from raw EC2 to Kubernetes, including domain changes and new AWS accounts, aiming to improve SEO, reduce single points of failure, and deliver significant engineering quality-of-life gains (e.g., prod deploy time dropped from 45 minutes to 6 minutes). This case study provides insights into complex re-platforming initiatives.
- Database Selection: Plausible's move from PostgreSQL to ClickHouse enabled them to count over a billion page views per month while maintaining a fast-loading dashboard, indicating the importance of selecting specialized databases for analytical workloads.
- Importance of Algorithms: Leslie Lamport stresses that designing concurrent systems without thinking in terms of algorithms leads to bugs, emphasizing the foundational role of theoretical computer science in practical system development.
- API Design Trade-offs: The discussion on GraphQL points out the trade-off between flexible query capabilities (effectively exposing a generic graph-database-style query surface) and a potentially unbounded stream of performance work if queries are not locked down, in contrast with traditional fixed-endpoint APIs.
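The buffer-then-ship pattern Riot Games applies with Kafka reduces to a simple idea: accumulate events locally and flush them in batches by size or age. Below is a toy in-memory sketch of that pattern; the class name and thresholds are illustrative, and a real deployment would hand batches to a Kafka producer (which performs this kind of batching internally).

```python
import time


class EventBuffer:
    """Accumulates events and flushes them in batches — the
    buffer-then-ship pattern used when fronting a pipeline with Kafka."""

    def __init__(self, sink, max_batch=500, max_wait_s=1.0):
        self._sink = sink            # callable that receives a list of events
        self._max_batch = max_batch  # flush when this many events are queued
        self._max_wait_s = max_wait_s  # ... or when the oldest event is this stale
        self._batch = []
        self._oldest = None

    def add(self, event):
        if self._oldest is None:
            self._oldest = time.monotonic()
        self._batch.append(event)
        if (len(self._batch) >= self._max_batch
                or time.monotonic() - self._oldest >= self._max_wait_s):
            self.flush()

    def flush(self):
        # Ship the current batch downstream and start a fresh one.
        if self._batch:
            self._sink(self._batch)
            self._batch = []
            self._oldest = None
```

Batching trades a little latency for far fewer downstream calls, which is what makes 500,000 events/second peaks tractable.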
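Spreading keys across a cache fleet the size of Pinterest's typically relies on consistent hashing, so that adding or removing a node remaps only a small slice of keys. A minimal sketch, with node names and virtual-node count purely illustrative (the source does not describe Pinterest's actual implementation):

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes: removing a cache node only
    remaps the keys that pointed at that node."""

    def __init__(self, nodes, vnodes=100):
        pairs = []
        for node in nodes:
            for i in range(vnodes):
                # Each physical node owns many points on the ring.
                pairs.append((self._hash(f"{node}#{i}"), node))
        pairs.sort()
        self._hashes = [h for h, _ in pairs]
        self._nodes = [n for _, n in pairs]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or past the key.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]
```

With plain modulo hashing (`hash(key) % n`), changing `n` reshuffles nearly every key; the ring limits that blast radius, which matters when instances churn constantly in a 5,000-node fleet.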
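Lamport's own logical clocks are a good illustration of what "thinking in terms of algorithms" buys in concurrent systems: a few lines of merge logic yield a causally consistent event ordering without synchronized wall clocks. A minimal sketch:

```python
class LamportClock:
    """Lamport's logical clock: orders events across processes
    without relying on synchronized physical time."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending counts as an event; stamp the outgoing message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Merge rule: jump strictly past the sender's timestamp,
        # so the receive event is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The invariant — if event A causally precedes event B, then A's timestamp is smaller — is exactly the kind of property that is easy to prove about the algorithm and easy to get wrong when coding by intuition.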
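One common way to "lock down" GraphQL queries is a depth limit. The sketch below approximates nesting depth by counting curly braces; the function names are illustrative, and a production server would walk the parsed query AST rather than raw text:

```python
def query_depth(query: str) -> int:
    """Approximate the nesting depth of a GraphQL selection set
    by tracking brace nesting (a deliberate simplification)."""
    depth = max_depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth


def enforce_depth_limit(query: str, limit: int = 5) -> None:
    # Reject queries that nest deeper than the server is willing to resolve.
    d = query_depth(query)
    if d > limit:
        raise ValueError(f"query depth {d} exceeds limit {limit}")
```

Depth limits (along with query cost analysis and persisted queries) are the standard levers for bounding the performance work a flexible query surface can demand.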
💡 Scaling Fallacies
A common misconception is that all apps need to prepare for extreme scale. Chris Munns notes that 99% of apps never break 1000 RPS, which can easily be handled by serverless functions (Lambda, Fargate) or even modest EC2 setups. The focus should often be on efficient resource utilization, caching, and async patterns rather than over-provisioning for theoretical scale.
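Munns's 1,000 RPS threshold is easy to put in perspective with back-of-envelope arithmetic; the Stack Overflow figures are the ones quoted earlier, and the headroom calculation from their sub-10% utilization is illustrative:

```python
# Back-of-envelope: what "1,000 RPS" means per day, and how
# Stack Overflow's published load breaks down per server.
SECONDS_PER_DAY = 86_400

rps = 1_000
requests_per_day = rps * SECONDS_PER_DAY   # 86.4 million requests/day

# Stack Overflow: 6,000 req/s across 9 servers at under 10% utilization.
so_rps, so_servers = 6_000, 9
per_server = so_rps / so_servers           # ~667 req/s per server
implied_capacity = per_server / 0.10       # headroom implies ~6,667 req/s each

print(requests_per_day, round(per_server), round(implied_capacity))
```

In other words, the load 99% of apps never reach fits comfortably on a fraction of one well-tuned server, which is why efficient utilization usually beats provisioning for theoretical scale.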
Key Technologies Referenced
- Kafka: Used by Riot Games for high-volume event buffering and data movement.
- Memcached: Critical distributed caching layer for Pinterest's ~180M requests/sec.
- AWS Lambda/Fargate/DynamoDB: Key components in serverless architectures for scalable, cost-efficient operations.
- Kubernetes: Utilized by Epic Games for improved deployment, autoscaling, and developer experience during a major migration.
- ClickHouse: Chosen by Plausible for high-performance analytical queries over large datasets.
- GraphQL: Presented as an alternative API design approach with considerations for flexibility vs. performance management.