
Software Architecture and System Design News

Latest curated articles from top engineering blogs

Netflix, Uber, Meta, LinkedIn, Spotify, GitHub, Airbnb, Pinterest, Slack, Dropbox, Cloudflare, Stripe, Datadog, Figma, Shopify, AWS, Google Cloud, Azure, Werner Vogels & 15+ more

333 articles

Dev.to #architecture·15d ago

Implementing a Circuit Breaker for AI Tool Calls to Prevent Cascading Failures

This article details the design and implementation of an MCP (Model Context Protocol) circuit breaker to prevent cascading failures in AI agent workflows. It focuses on how the circuit breaker pattern, a key distributed systems concept, can be applied to isolate flaky external tool calls and keep the rest of the system responsive. The post covers the state machine, failure handling, and configuration needed to operate at scale.

Distributed Systems · Performance & Scaling
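The closed / open / half-open state machine that the circuit breaker summary refers to can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the class name `CircuitBreaker`, the parameters `failure_threshold` and `reset_timeout`, and the injectable `clock` are all assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through normally
    OPEN = "open"            # calls are rejected without reaching the tool
    HALF_OPEN = "half_open"  # one trial call is allowed through


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # All names and defaults here are hypothetical, chosen for illustration.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, tool, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("tool call rejected: circuit is open")
            self.state = State.HALF_OPEN  # timeout elapsed, allow a trial call
        try:
            result = tool(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED
```

Wrapping each external tool in its own breaker instance keeps one flaky tool from consuming the agent's retries and time budget for every other tool.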
Medium #system-design·15d ago

System Design Glossary: Essential Concepts for Architects

This article serves as a crucial glossary, defining fundamental terms and concepts frequently encountered in system design and software architecture. It provides a shared reference for understanding complex distributed systems, architectural patterns, and scalability considerations, ensuring clarity across various system design discussions and analyses.

Distributed Systems · Performance & Scaling
DZone Microservices·15d ago

Optimizing Hadoop Big Data Workloads on Arm-based AmpereOne Processors

This article explores the setup, tuning, and performance evaluation of Hadoop on AmpereOne Arm-based processors, highlighting their power efficiency and cost advantages for big data workloads. It delves into the architectural benefits of AmpereOne processors, Hadoop's compatibility with Arm, and provides practical guidance for deploying and optimizing Hadoop clusters on this infrastructure. The focus is on leveraging modern hardware for scalable and cost-effective big data processing.

Cloud & Infrastructure · Databases & Storage
GitHub Engineering·15d ago

Optimizing Large Pull Request Diffs at Scale

GitHub Engineering details their strategies for improving the performance of the 'Files changed' tab, particularly for large pull requests. This involved a multi-pronged approach combining component-level optimizations, UI virtualization, and broader rendering improvements to reduce DOM nodes, memory usage, and interaction latency, showcasing practical front-end architecture for highly interactive web applications at scale.

Performance & Scaling · Tools & Frameworks
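UI virtualization, one of the techniques named in the GitHub summary, comes down to rendering only the rows near the viewport. The arithmetic can be sketched as below. This assumes fixed-height rows and is not GitHub's code: the function `visible_range` and its parameters, including `overscan`, are hypothetical.

```python
import math


def visible_range(scroll_top, viewport_height, row_height, total_rows, overscan=5):
    """Return the half-open range [first, last) of rows worth rendering.

    Assumes every row has the same height (an assumption of this sketch).
    `overscan` adds a few extra rows on each side to reduce flicker
    while scrolling.
    """
    first_visible = scroll_top // row_height
    last_visible = math.ceil((scroll_top + viewport_height) / row_height)
    first = max(0, first_visible - overscan)
    last = min(total_rows, last_visible + overscan)
    return first, last
```

With these inputs the number of rendered rows depends on the viewport size and `overscan`, not on the size of the diff, which is what keeps the DOM node count bounded.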
InfoQ Cloud·15d ago

Replacing Database Sequences at Scale: A Distributed ID Generation System

This article details Coupang's journey to replace legacy database sequences with a highly available, low-latency distributed ID generation system without breaking over 100 existing services. The solution leverages local application caching, server-side caching, and DynamoDB as the source of truth, optimizing for performance and availability over strict global ordering and gap-free IDs. It highlights practical design principles for large-scale migrations, emphasizing simplicity and backward compatibility.

Distributed Systems · Databases & Storage
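One common way to combine local caching with a durable source of truth for ID generation is block reservation: each application instance reserves a range of IDs and serves them from memory. The sketch below illustrates that general technique and is not Coupang's design. `InMemoryCounterStore` is a hypothetical stand-in for the durable store (it does not use DynamoDB's API), and `BlockIdGenerator` and `block_size` are likewise assumptions.

```python
import threading


class InMemoryCounterStore:
    """Hypothetical stand-in for the durable source of truth."""

    def __init__(self):
        self._next = 1
        self._lock = threading.Lock()

    def reserve(self, size):
        # Atomically hand out the half-open range [start, start + size).
        with self._lock:
            start = self._next
            self._next += size
            return start, start + size


class BlockIdGenerator:
    """Serves IDs from a locally cached block and refills it when empty."""

    def __init__(self, store, block_size=1000):
        self._store = store
        self._block_size = block_size
        self._next = 0
        self._end = 0  # empty block forces a reservation on first use
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next >= self._end:
                # Only this path touches the shared store.
                self._next, self._end = self._store.reserve(self._block_size)
            value = self._next
            self._next += 1
            return value
```

IDs stay unique across instances, but they are not globally ordered, and any IDs left unused in a block when an instance restarts become gaps. That matches the trade-off the summary describes: performance and availability over strict ordering and gap-free sequences.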
Dev.to #systemdesign·15d ago

Architectural Deep Dive into Claude Code's LLM Agent Loop

This article dissects the core `while(true)` loop powering Claude Code's AI coding agent, revealing its state machine architecture for managing complex interactions with large language models and tools. It highlights critical design decisions like avoiding recursion for stack overflow prevention and implementing streaming tool execution for significant performance gains, showcasing a robust approach to building interactive AI agents.

AI & ML Infrastructure · Distributed Systems
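The idea of driving an agent with a flat loop rather than recursion can be illustrated with a small sketch. This is not Claude Code's source: `ModelTurn`, `AgentState`, `run_agent`, and `max_turns` are hypothetical names, the model is any callable supplied by the caller, and streaming tool execution is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelTurn:
    text: str
    tool_name: Optional[str] = None  # None means the model has finished
    tool_input: str = ""


@dataclass
class AgentState:
    messages: List[str] = field(default_factory=list)
    turns: int = 0


def run_agent(prompt, model, tools, max_turns=50):
    """Run model and tool calls in one loop; all state lives in AgentState."""
    state = AgentState(messages=[f"user: {prompt}"])
    while True:
        if state.turns >= max_turns:
            raise RuntimeError("agent exceeded max_turns")
        state.turns += 1

        turn = model(state.messages)
        state.messages.append(f"assistant: {turn.text}")

        if turn.tool_name is None:
            return turn.text  # terminal state: no further tool requested

        # Tool result is appended to the history and the loop continues,
        # so the call stack stays flat however many turns are needed.
        result = tools[turn.tool_name](turn.tool_input)
        state.messages.append(f"tool[{turn.tool_name}]: {result}")
```

Because each iteration reads and updates explicit state instead of calling itself, a long session costs memory for the message history but never deepens the call stack.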
Medium #system-design·15d ago

Ensuring Data Integrity in Observability Platforms

This article discusses common pitfalls in observability platforms that lead to inaccurate data and offers practical strategies to ensure the integrity and reliability of monitoring and logging systems. It emphasizes the importance of understanding data lifecycles, proper instrumentation, and architectural considerations to prevent 'lying' platforms.

DevOps & SRE · Distributed Systems
InfoQ Architecture·15d ago

Replacing Database Sequences at Scale: A Cached, Distributed ID Generation System

This article details Coupang's journey to replace traditional database sequences with a highly scalable, available, and low-latency distributed ID generation system. It highlights critical design decisions, such as prioritizing eventual consistency and local caching over strict global ordering and network calls, to support over 100 services and facilitate a seamless migration from relational databases to NoSQL.

Distributed Systems · Databases & Storage
Dev.to #systemdesign·16d ago

Scaling Challenges with Misused Vector Databases

This article highlights a common architectural pitfall: a system broke during scaling not because of performance bottlenecks but because of incorrect database selection. The author mistakenly used a vector database for both similarity search and general data storage, leading to poor performance and scalability issues. The solution was a hybrid architecture that uses a vector database for its strength (semantic search) and a traditional database for its strengths (exact-match queries and structured data storage).

Databases & Storage · Distributed Systems
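The hybrid split described in the vector database summary can be sketched as a thin facade that routes each query type to the store suited for it. This is an illustration under assumptions, not the author's system: `HybridStore` and its methods are hypothetical, SQLite stands in for the traditional database, and a plain dictionary with brute-force cosine similarity stands in for a real vector database.

```python
import math
import sqlite3


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HybridStore:
    def __init__(self):
        # Relational side: structured records and exact-match lookups.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE docs (id TEXT PRIMARY KEY, owner TEXT, body TEXT)"
        )
        # Vector side: id -> embedding (stand-in for a vector database).
        self._vectors = {}

    def put(self, doc_id, owner, body, embedding):
        self._db.execute(
            "INSERT OR REPLACE INTO docs VALUES (?, ?, ?)", (doc_id, owner, body)
        )
        self._vectors[doc_id] = embedding

    def by_owner(self, owner):
        """Exact-match query, answered by the relational store only."""
        rows = self._db.execute(
            "SELECT id FROM docs WHERE owner = ? ORDER BY id", (owner,)
        )
        return [row[0] for row in rows]

    def similar(self, query_embedding, k=3):
        """Similarity search, answered by the vector side only."""
        ranked = sorted(
            self._vectors,
            key=lambda doc_id: cosine(query_embedding, self._vectors[doc_id]),
            reverse=True,
        )
        return ranked[:k]
```

The vector side holds only IDs and embeddings, so filters such as "all documents for this owner" never have to run against the similarity index.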
Meta Engineering·16d ago

KernelEvolve: Optimizing AI Infrastructure through Autonomous Kernel Generation

This article introduces KernelEvolve, Meta's agentic kernel authoring system that autonomously generates and optimizes low-level hardware kernels for diverse AI models and heterogeneous hardware. It addresses the scalability bottleneck of manual kernel tuning by leveraging AI agents, search algorithms, and a feedback loop to significantly improve inference and training throughput.

AI & ML Infrastructure · Performance & Scaling
Dev.to #systemdesign·16d ago

Designing Multi-Region Architectures on AWS

This article explores the considerations, benefits, and challenges of implementing multi-region architectures, focusing on AWS services. It breaks the approach into distinct layers (networking, compute, application, data, and security), highlights the architectural decisions that affect fault tolerance, latency, and regulatory compliance, and emphasizes the role of Infrastructure as Code in successful deployment.

Distributed Systems · Cloud & Infrastructure
ByteByteGo·16d ago

Database Performance Optimization Trade-offs

This article explores various strategies for optimizing database performance, emphasizing the inherent trade-offs associated with each. It highlights that while optimizations like indexing and caching improve specific aspects such as read speed, they can negatively impact others like write performance or data consistency. The core message is to understand the costs and benefits of each strategy to make informed architectural decisions based on application requirements.

Databases & Storage · Performance & Scaling
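The indexing trade-off mentioned in the database optimization summary can be observed with a small SQLite experiment. This is a sketch with a made-up schema: the `orders` table, the index name `idx_orders_customer`, and the helper `query_plan` are assumptions for illustration, not taken from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(connection, sql, params=()):
    """Return SQLite's plan description for a statement as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


LOOKUP = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index the lookup has to scan the whole table.
plan_before = query_plan(conn, LOOKUP, (42,))

# The index lets reads seek directly to matching rows, but every later
# INSERT, UPDATE, or DELETE on orders must now maintain the index too.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, LOOKUP, (42,))

print(plan_before)
print(plan_after)
```

Comparing the two printed plans shows the read path changing from a table scan to an index search. The write-side cost does not appear in the plan, which is why it is easy to overlook when adding indexes.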