Latest curated articles from top engineering blogs
1031 articles
This article details Veltrix's architectural evolution to support 20,000 concurrent players in a real-time treasure hunt, focusing on unbounded state growth caused by long-lived WebSocket connections. The solution is a two-tier architecture that separates ephemeral WebSocket handling from stateful processing, built with Rust, Kafka, and RocksDB, which significantly reduced memory footprint and improved stability.
Meta's SilverTorch redefines recommendation system retrieval by consolidating disparate microservices into a unified, single neural network architecture. This "Index as Model" paradigm overcomes limitations of traditional microservice-based systems, such as latency due to data movement and version inconsistency, by integrating all retrieval components—ANN search, filtering, and scoring—directly into a PyTorch model. The new design significantly boosts throughput and cost efficiency while enabling more complex modeling and higher-quality recommendations within strict latency budgets.
This article details the architecture of a stateless JWT authentication microservice built with Spring Boot 3, focusing on high availability and performance. It takes a cache-first approach, using Redis to reduce database load, and relies on Redis Sentinel for failover so that the authentication service stays available to the rest of the microservice ecosystem.
This article outlines the architectural considerations for building a robust SMS gateway that intelligently routes messages across multiple carriers. It emphasizes the importance of an asynchronous message flow, dynamic carrier selection based on real-time and historical data, and comprehensive delivery tracking to ensure high delivery rates and compliance.
LinkedIn engineers diagnosed a critical, ephemeral system freeze in the database behind their user feed, caused by kernel lock contention during large memory allocations. The breakthrough came from off-CPU profiling with eBPF combined with automated diagnostic tooling. This case study highlights the importance of deep OS-level observability and careful memory management in high-performance distributed systems.
Airtable engineered a scalable and performant semantic search system to power its AI features, focusing on handling diverse customer database sizes and multi-tenancy. The architecture leverages Milvus for vector storage and search, with critical design decisions made around data partitioning, index selection, and managing hot/cold data to meet strict latency, throughput, and privacy requirements.
Snowflake's $6 billion commitment to AWS for Graviton and GPU instances signals a major strategic shift towards AI, focusing on leveraging cost-efficient compute for data warehousing and high-performance resources for AI model training and inference. This investment highlights critical architectural considerations for large-scale data platforms expanding into AI, particularly around cloud vendor strategy, infrastructure cost optimization, and data residency.
Stripe Radar has significantly expanded its AI-powered fraud prevention capabilities, moving beyond traditional credit card fraud to address new vectors like multi-account abuse, pay-as-you-go fraud, and malicious bots across various payment methods and processors. The system leverages global network data, custom models, and real-time evaluation to provide comprehensive risk assessment and dispute management. These enhancements highlight the evolving complexity of fraud detection in distributed payment systems.
This article discusses the practical application of AI in refactoring a legacy codebase, emphasizing how establishing strong architectural patterns, tests, and static analysis enables more autonomous and effective AI assistance. It highlights a shift in developer roles from writer to curator, focusing on defining patterns and strategic decisions while AI handles code generation. The piece also touches on the cognitive load of AI-augmented programming and broader societal impacts of AI.
This article distills 15 years of experience with distributed system failures into key lessons for system designers. It emphasizes that robust systems anticipate and gracefully handle failures, including those that overly optimistic monitoring fails to surface. The core focus is on building resilient architectures by embracing chaos and designing fault-tolerant components.
This article discusses OpenCode's rapid growth as an AI coding tool and explores the broader implications of AI on software engineering practices and architectural decisions. It highlights how AI can impact development speed, product quality, tech debt management, and the continuing relevance of established design patterns.
This article presents a system design lesson from a CQRS implementation in which events and aggregate roots were stored in separate systems (Kafka and PostgreSQL). The initial distributed architecture led to severe performance issues and operational overhead. The authors describe how they consolidated events and aggregates into a single PostgreSQL database, using logical replication as an event bus, which dramatically improved latency and reduced costs.
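The consolidation described in the CQRS summary above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table names, column names, and the apply_event helper are all hypothetical. It shows only the core idea that an aggregate's new state and the event describing the change are written in one transaction, so they cannot drift apart; the logical replication step that turns the events table into an event bus is not modeled here.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store (SQLite stand-in for a single PostgreSQL database)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE aggregates (
            id      TEXT PRIMARY KEY,
            version INTEGER NOT NULL,
            state   TEXT NOT NULL
        );
        CREATE TABLE events (
            seq          INTEGER PRIMARY KEY AUTOINCREMENT,
            aggregate_id TEXT NOT NULL,
            version      INTEGER NOT NULL,
            payload      TEXT NOT NULL,
            UNIQUE (aggregate_id, version)
        );
        """
    )
    return conn


def apply_event(conn, aggregate_id, expected_version, new_state, event):
    """Hypothetical helper: store the new aggregate state and its event atomically.

    The UNIQUE (aggregate_id, version) constraint rejects a writer that used a
    stale expected_version, and the surrounding transaction then rolls back the
    aggregate write as well.
    """
    new_version = expected_version + 1
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT OR REPLACE INTO aggregates (id, version, state) VALUES (?, ?, ?)",
            (aggregate_id, new_version, json.dumps(new_state)),
        )
        conn.execute(
            "INSERT INTO events (aggregate_id, version, payload) VALUES (?, ?, ?)",
            (aggregate_id, new_version, json.dumps(event)),
        )
    return new_version


if __name__ == "__main__":
    store = open_store()
    apply_event(store, "order-1", 0, {"status": "created"}, {"type": "OrderCreated"})
    print(store.execute("SELECT aggregate_id, version, payload FROM events").fetchall())
```

In a PostgreSQL deployment, consumers would presumably read the events table through a logical replication stream rather than by polling it, which is what would let the database double as the event bus.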