Software Architecture and System Design News

Latest curated articles from top engineering blogs

Scaling Real-Time Treasure Hunts: Solving Unbounded State with a Two-Tier Architecture

This article details Veltrix's architectural evolution to support 20,000 concurrent players in a real-time treasure hunt, focusing on overcoming unbounded state issues from long-lived WebSocket connections. The solution involved a two-tier architecture, separating ephemeral WebSocket handling from stateful processing using Rust, Kafka, and RocksDB to significantly reduce memory footprint and improve stability.

Distributed Systems · Performance & Scaling

Meta Engineering·just now

SilverTorch: Unifying Recommendation Retrieval into a Single Neural Network

Meta's SilverTorch redefines recommendation system retrieval by consolidating disparate microservices into a unified, single neural network architecture. This "Index as Model" paradigm overcomes limitations of traditional microservice-based systems, such as latency due to data movement and version inconsistency, by integrating all retrieval components (ANN search, filtering, and scoring) directly into a PyTorch model. The new design significantly boosts throughput and cost efficiency while enabling more complex modeling and higher-quality recommendations within strict latency budgets.

AI & ML Infrastructure · Distributed Systems

DZone Microservices·just now

Designing a Stateless JWT Authentication Microservice with Redis Sentinel

This article details the architecture of a stateless JWT authentication microservice built with Spring Boot 3, focusing on high availability and performance. It emphasizes a cache-first approach using Redis to reduce database load and integrates Redis Sentinel for robust failover capabilities, ensuring the authentication service remains highly available in a microservice ecosystem.
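
The cache-first approach in this summary can be sketched in a few lines. This is a minimal illustration and not the article's Spring Boot implementation: a plain dictionary stands in for Redis, and `CacheFirstUserStore`, `load_from_db`, and the TTL value are hypothetical names chosen for the example. Sentinel failover is not modeled.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CacheFirstUserStore:
    """Cache-first lookup: consult the cache, fall back to the database on a miss."""

    def __init__(
        self,
        load_from_db: Callable[[str], Optional[dict]],  # hypothetical database loader
        ttl_seconds: float = 300.0,                     # assumed cache lifetime
    ) -> None:
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        # In-memory stand-in for Redis: username -> (expiry time, user record).
        self._cache: Dict[str, Tuple[float, dict]] = {}
        self.db_hits = 0

    def get(self, username: str, now: Optional[float] = None) -> Optional[dict]:
        now = time.monotonic() if now is None else now
        entry = self._cache.get(username)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        # Cache miss or expired entry: load from the database and repopulate.
        self.db_hits += 1
        user = self._load_from_db(username)
        if user is not None:
            self._cache[username] = (now + self._ttl, user)
        return user
```

Repeated lookups for the same user within the TTL are served from the cache, so only the first one (and the first after expiry) reaches the database.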

Microservices · Security

Dev.to #systemdesign·just now

Designing a Resilient SMS Gateway with Intelligent Carrier Routing

This article outlines the architectural considerations for building a robust SMS gateway that intelligently routes messages across multiple carriers. It emphasizes the importance of an asynchronous message flow, dynamic carrier selection based on real-time and historical data, and comprehensive delivery tracking to ensure high delivery rates and compliance.
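
As a rough illustration of dynamic carrier selection from real-time and historical data, the sketch below blends two delivery rates into one score and picks the best healthy carrier. The `CarrierStats` fields, the 0.7 weighting, and the function names are assumptions made for this example; the article's actual routing logic may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CarrierStats:
    name: str
    recent_delivery_rate: float      # real-time signal, 0.0 to 1.0 (assumed)
    historical_delivery_rate: float  # long-term signal, 0.0 to 1.0 (assumed)
    healthy: bool = True             # e.g. set to False by a circuit breaker


def score(carrier: CarrierStats, recent_weight: float = 0.7) -> float:
    """Weighted blend of recent and historical delivery rates."""
    return (
        recent_weight * carrier.recent_delivery_rate
        + (1.0 - recent_weight) * carrier.historical_delivery_rate
    )


def pick_carrier(
    carriers: Iterable[CarrierStats], recent_weight: float = 0.7
) -> Optional[CarrierStats]:
    """Return the highest-scoring healthy carrier, or None if none is usable."""
    candidates = [c for c in carriers if c.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda c: score(c, recent_weight))
```

Weighting recent data more heavily lets the router react quickly to a carrier outage, while the historical term keeps one noisy measurement from dominating the choice.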

Distributed Systems · API Design

InfoQ Architecture·just now

Identifying and Resolving Kernel Lock Contention in High-Scale Systems using eBPF

LinkedIn engineers diagnosed critical, short-lived system freezes in the database behind their user feed, which were caused by kernel lock contention during large memory allocations. The breakthrough came from off-CPU profiling with eBPF combined with automated diagnostic tooling. This case study highlights the importance of deep OS-level observability and careful memory management in high-performance distributed systems.

Distributed Systems · Performance & Scaling

ByteByteGo·just now

Airtable's Semantic Search Architecture for AI Features

Airtable engineered a scalable and performant semantic search system to power its AI features, focusing on handling diverse customer database sizes and multi-tenancy. The architecture leverages Milvus for vector storage and search, with critical design decisions made around data partitioning, index selection, and managing hot/cold data to meet strict latency, throughput, and privacy requirements.

Databases & Storage · Distributed Systems

The New Stack·just now

Snowflake's Strategic Cloud Infrastructure Investment for AI Expansion

Snowflake's $6 billion commitment to AWS for Graviton and GPU instances signals a major strategic shift towards AI, focusing on leveraging cost-efficient compute for data warehousing and high-performance resources for AI model training and inference. This investment highlights critical architectural considerations for large-scale data platforms expanding into AI, particularly around cloud vendor strategy, infrastructure cost optimization, and data residency.

Cloud & Infrastructure · AI & ML Infrastructure

Stripe Blog·just now

Stripe Radar's AI-Powered Fraud Prevention System Enhancements

Stripe Radar has significantly expanded its AI-powered fraud prevention capabilities, moving beyond traditional credit card fraud to address new vectors like multi-account abuse, pay-as-you-go fraud, and malicious bots across various payment methods and processors. The system leverages global network data, custom models, and real-time evaluation to provide comprehensive risk assessment and dispute management. These enhancements highlight the evolving complexity of fraud detection in distributed payment systems.

Distributed Systems · Security

Martin Fowler·just now

Leveraging AI for Codebase Refactoring and Architectural Improvement

This article discusses the practical application of AI in refactoring a legacy codebase, emphasizing how establishing strong architectural patterns, tests, and static analysis enables more autonomous and effective AI assistance. It highlights a shift in developer roles from writer to curator, focusing on defining patterns and strategic decisions while AI handles code generation. The piece also touches on the cognitive load of AI-augmented programming and broader societal impacts of AI.

DevOps & SRE · Tools & Frameworks

Medium #system-design·just now

Designing for Failure in Distributed Systems: Lessons from Production

This article distills 15 years of experience with distributed system failures into key lessons for system designers. It emphasizes that robust systems anticipate failures and handle them gracefully, even when monitoring paints an overly optimistic picture. The core focus is on building resilient architectures by embracing chaos and designing fault-tolerant components.

Distributed Systems · DevOps & SRE

The Pragmatic Engineer·just now

OpenCode's Growth and the Evolving Role of AI in Software Engineering

This article discusses OpenCode's rapid growth as an AI coding tool and explores the broader implications of AI on software engineering practices and architectural decisions. It highlights how AI can impact development speed, product quality, tech debt management, and the continuing relevance of established design patterns.

AI & ML Infrastructure · Distributed Systems

Dev.to #architecture·just now

Rethinking Event Sourcing: Consolidating Events and Aggregates in PostgreSQL

This article presents a crucial system design lesson learned from a CQRS implementation where events and aggregate roots were stored in separate systems (Kafka and PostgreSQL). The initial distributed architecture led to severe performance issues and operational overhead. The authors describe their journey to consolidate events and aggregates into a single PostgreSQL database, leveraging logical replication as an event bus, dramatically improving latency and reducing costs.
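
One benefit of keeping events and aggregates in the same database is that both can be written in a single transaction. The sketch below illustrates that idea under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, the table layout and the `apply_delta` helper are hypothetical, and logical replication as an event bus is not modeled.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database holding both events and aggregates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " aggregate_id TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " payload TEXT NOT NULL,"
        " UNIQUE (aggregate_id, version))"
    )
    conn.execute(
        "CREATE TABLE aggregates ("
        " aggregate_id TEXT PRIMARY KEY,"
        " version INTEGER NOT NULL,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.commit()
    return conn


def apply_delta(conn: sqlite3.Connection, aggregate_id: str, delta: int) -> int:
    """Append an event and update the aggregate atomically; return the new version."""
    with conn:  # commits on success, rolls back if any statement raises
        row = conn.execute(
            "SELECT version, balance FROM aggregates WHERE aggregate_id = ?",
            (aggregate_id,),
        ).fetchone()
        version, balance = row if row else (0, 0)
        new_version = version + 1
        conn.execute(
            "INSERT INTO events (aggregate_id, version, payload) VALUES (?, ?, ?)",
            (aggregate_id, new_version, json.dumps({"type": "BalanceChanged", "delta": delta})),
        )
        # If this write violates the CHECK constraint, the event above is rolled back too.
        conn.execute(
            "INSERT OR REPLACE INTO aggregates (aggregate_id, version, balance)"
            " VALUES (?, ?, ?)",
            (aggregate_id, new_version, balance + delta),
        )
    return new_version
```

With events and aggregates in separate systems, this all-or-nothing behavior would need extra coordination such as an outbox or compensating writes; in a single database the transaction boundary provides it directly.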

Distributed Systems · Databases & Storage
