
Software Architecture and System Design News

Latest curated articles from top engineering blogs

Netflix, Uber, Meta, LinkedIn, Spotify, GitHub, Airbnb, Pinterest, Slack, Dropbox, Cloudflare, Stripe, Datadog, Figma, Shopify, AWS, Google Cloud, Azure, Werner Vogels & 15+ more

354 articles

InfoQ Architecture·8h ago

AWS Aurora Serverless v4: Faster Scaling and Improved Throughput for Database Workloads

AWS has significantly enhanced Aurora Serverless with Platform Version 4, offering 45% faster ramp-up during demand spikes and 30% higher throughput. These improvements stem from smarter scaling algorithms and better resource scheduling, making Aurora Serverless a more compelling option for dynamic and bursty workloads that benefit from automatic capacity adjustments.

Databases & Storage · Performance & Scaling
Airbnb Engineering·8h ago

Migrating a High-Volume Metrics Pipeline to OpenTelemetry and Prometheus

This article details Airbnb's migration of a large-scale metrics pipeline from StatsD to OpenTelemetry and a Prometheus-based backend. It covers the architectural decisions, dual-write strategy, performance benefits of OTLP, the introduction of a streaming aggregation layer using vmagent for cost control and scalability, and a novel 'zero injection' solution for sparse counter accuracy issues.

Performance & Scaling · Distributed Systems
The New Stack·8h ago

Compute as the New Moat: Scaling AI Systems with Massive GPU Infrastructure

This article highlights how massive compute capacity, particularly GPU infrastructure, has become the critical limiting factor and competitive differentiator for frontier AI companies like Anthropic. It details Anthropic's strategic acquisition of significant compute resources, including a large-scale deal with SpaceX for NVIDIA GPUs, to support its ambitious product roadmap for AI agents. The core system design implication is the shift from model-centric development to infrastructure-centric scaling for advanced AI workloads.

AI & ML Infrastructure · Cloud & Infrastructure
Pinterest Engineering·8h ago

Diagnosing CPU Bottlenecks and Network Driver Resets in Kubernetes on AWS

This article details Pinterest's complex journey to identify and resolve intermittent network connectivity issues in their Ray-based ML training jobs running on Kubernetes clusters backed by AWS EC2. The investigation uncovered CPU starvation affecting AWS ENA network drivers, leading to device resets and job crashes. The process highlights systematic debugging, profiling techniques, and the challenges of diagnosing transient performance bottlenecks in large-scale distributed systems.

Distributed Systems · Performance & Scaling
Pinterest Engineering·14h ago

Smarter URL Normalization for Content Deduplication at Scale

Pinterest engineered the Minimal Important Query Param Set (MIQPS) algorithm to dynamically identify and strip irrelevant URL parameters, crucial for deduplicating content at their vast scale. This system reduces redundant processing by distinguishing between parameters that affect content (e.g., product ID) and those that are purely for tracking, ultimately improving efficiency and catalog quality. The solution leverages content fingerprinting and a multi-layer normalization strategy combining static rules with learned dynamic ones.

Distributed Systems · Performance & Scaling
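
The summary above describes two normalization layers: static rules plus rules learned by checking whether a parameter affects the content. The following is a minimal Python sketch of that general idea, not Pinterest's MIQPS implementation. The deny-list entries, the function names, and the `fingerprint` callable are all hypothetical, and the learner tests one parameter at a time, so it would miss parameters that only matter in combination.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical static rules; these names are illustrative, not from the article.
STATIC_TRACKING_PREFIXES = ("utm_",)
STATIC_TRACKING_NAMES = {"fbclid", "gclid"}


def is_static_tracking(name):
    """Return True if a parameter name matches the static deny-list."""
    lowered = name.lower()
    return lowered in STATIC_TRACKING_NAMES or lowered.startswith(STATIC_TRACKING_PREFIXES)


def learn_irrelevant_params(url, fingerprint):
    """Return parameters whose removal leaves the content fingerprint unchanged.

    `fingerprint` is an assumed callable mapping a URL to a hashable
    summary of the page content behind it.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    baseline = fingerprint(url)
    irrelevant = set()
    for name in {key for key, _ in params}:
        reduced = [(k, v) for k, v in params if k != name]
        candidate = urlunsplit(
            (parts.scheme, parts.netloc, parts.path, urlencode(reduced), parts.fragment)
        )
        if fingerprint(candidate) == baseline:
            irrelevant.add(name)
    return irrelevant


def normalize_url(url, learned_irrelevant=frozenset()):
    """Drop static and learned irrelevant parameters, then sort what remains."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not is_static_tracking(k) and k not in learned_irrelevant
    ]
    kept.sort()
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))
```

With a fingerprint function that depends only on a product `id` parameter, two URLs that differ in `utm_source` or a session token would normalize to the same string and could be treated as duplicates.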
Meta Engineering·14h ago

Meta's AI Agent Platform for Hyperscale Capacity Efficiency

Meta developed a unified AI agent platform to automate finding and fixing performance issues across its vast infrastructure, enabling significant power savings and freeing up engineering time. This platform uses a two-layered architecture of standardized tools and encoded domain expertise (skills) to tackle both proactive optimization (offense) and reactive regression mitigation (defense). By centralizing these capabilities, Meta has built a self-sustaining efficiency engine that scales without proportionally increasing headcount, recovering hundreds of megawatts of power.

AI & ML Infrastructure · Performance & Scaling
Martin Fowler·20h ago

Conceptual Integrity in System Design: Lessons from The Mythical Man-Month

This article highlights the enduring relevance of Fred Brooks's _The Mythical Man-Month_ in software development, particularly emphasizing Brooks's Law on adding manpower to late projects and, crucially, the importance of conceptual integrity in system design. It argues that a system's coherence and consistency, driven by a single set of design ideas, are paramount for its success and long-term maintainability, even if it means omitting some features.

Distributed Systems · Microservices
Azure Architecture Blog·20h ago

Microsoft Azure's Strategy for Scaling Cloud and AI Infrastructure in Europe

This article outlines Microsoft's significant investments in expanding Azure's cloud and AI infrastructure across Europe. It highlights the strategic focus on building scalable, resilient, and compliant data center regions to meet growing customer demand, emphasizing data residency, low-latency access, and sovereign cloud solutions. The expansion supports diverse workloads from critical business systems to advanced AI applications.

Cloud & Infrastructure · Distributed Systems
Dev.to #architecture·20h ago

Designing Peer-to-Peer Multi-Agent Architectures without a Central Hub

This article explores an alternative to traditional multi-agent system architectures that rely on a central coordinator or message hub. It highlights the scalability and reliability issues of centralized hubs and proposes a peer-to-peer approach using a session-layer protocol like Pilot Protocol. The core idea is to enable agents to discover and communicate directly, bypassing common bottlenecks associated with single points of failure.

Distributed Systems · Performance & Scaling
Meta Engineering·20h ago

Facebook Groups Search: Hybrid Retrieval Architecture with LLM Evaluation

Meta modernized Facebook Groups Search with a hybrid retrieval architecture, combining traditional lexical search with dense vector embeddings to improve discovery and relevance. This system addresses limitations of keyword-only search by understanding natural language intent and leverages multi-task learning for ranking and LLMs for automated evaluation, leading to significant improvements in user engagement.

Distributed Systems · AI & ML Infrastructure
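
To make "hybrid retrieval" concrete, here is a toy Python sketch that ranks documents once by keyword overlap and once by cosine similarity over embeddings, then merges the two rankings. The summary does not say how Meta combines the lexical and dense results; reciprocal rank fusion is used here only as one common, swapped-in choice, and every function name, the constant `k=60`, and the hand-made vectors are assumptions for illustration.

```python
import math


def lexical_rank(query, docs):
    """Rank doc ids by the number of query terms they share (toy keyword match)."""
    q_terms = set(query.lower().split())
    scores = {doc_id: len(q_terms & set(text.lower().split())) for doc_id, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_rank(query_vec, doc_vecs):
    """Rank doc ids by embedding similarity to the query vector."""
    return sorted(doc_vecs, key=lambda d: (-cosine(query_vec, doc_vecs[d]), d))


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each doc scores sum(1 / (k + rank)) over the lists."""
    fused = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused, key=lambda d: (-fused[d], d))
```

A document that ranks well in either list can surface in the fused result, which is how a hybrid system can return a group that matches the intent of a query without sharing its exact keywords.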
Airbnb Engineering·20h ago

Designing a Fault-Tolerant, Multi-Tenant Metrics Storage System at Scale

This article details Airbnb's journey in building a highly scalable and fault-tolerant metrics storage system capable of ingesting 50 million samples per second and managing 2.5 petabytes of time series data. It explores architectural decisions for multi-tenancy, operational challenges, and strategies for ensuring reliability and performance at immense scale, including single and multi-cluster deployments.

Distributed Systems · Performance & Scaling
Dev.to #systemdesign·20h ago

CAP Theorem Explained: Consistency, Availability, and Partition Tolerance in Distributed Systems

This article explains the CAP Theorem, which states that a distributed system can guarantee at most two of Consistency, Availability, and Partition Tolerance at the same time. It clarifies that Partition Tolerance is unavoidable in distributed systems, so the real design choice during a network partition is between Consistency (CP) and Availability (AP). Using MySQL master-slave replication as an example, the article shows how replication lag illustrates the AP tradeoff, where availability is prioritized over strong consistency.

Distributed Systems · Databases & Storage
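
The replication-lag tradeoff in the summary above can be shown with a small in-memory Python simulation. This is a sketch and does not use MySQL: the `Primary` and `Replica` classes and their methods are hypothetical, and the replica inspects the primary's log directly to measure lag, which a real replica cut off by a partition could not do.

```python
class Primary:
    """Accepts writes and records them in an ordered replication log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) entries

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Applies the primary's log asynchronously, so it can fall behind."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def lag(self):
        return len(self.primary.log) - self.applied

    def replicate(self, max_entries=None):
        """Apply pending log entries, optionally only the first few."""
        pending = self.primary.log[self.applied:]
        if max_entries is not None:
            pending = pending[:max_entries]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)

    def read_available(self, key):
        """AP-style read: always answers, possibly with a stale value."""
        return self.data.get(key)

    def read_consistent(self, key):
        """CP-style read: refuses to answer while the replica is behind."""
        if self.lag() > 0:
            raise RuntimeError("replica is behind the primary; refusing stale read")
        return self.data.get(key)
```

If a write reaches the primary but has not yet been replicated, `read_available` still returns the old value, while `read_consistent` raises until `replicate()` catches up. The first behavior favors availability and the second favors consistency.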