Software Architecture and System Design News

Latest curated articles from top engineering blogs

Netflix, Uber, Meta, LinkedIn, Spotify, GitHub, Airbnb, Pinterest, Slack, Dropbox, Cloudflare, Stripe, Datadog, Figma, Shopify, AWS, Google Cloud, Azure, Werner Vogels & 15+ more

300 articles

InfoQ Architecture·2h ago

Identifying and Resolving Kernel Lock Contention in High-Scale Systems using eBPF

LinkedIn engineers successfully diagnosed a critical, ephemeral system freeze issue in their user feed's database, caused by kernel lock contention during large memory allocations. The breakthrough involved pioneering off-CPU profiling with eBPF and implementing automated diagnostic tooling. This case study highlights the importance of deep OS-level observability and careful memory management in high-performance distributed systems.

Distributed Systems · Performance & Scaling

Martin Fowler·2h ago

Leveraging AI for Codebase Refactoring and Architectural Improvement

This article discusses the practical application of AI in refactoring a legacy codebase, emphasizing how establishing strong architectural patterns, tests, and static analysis enables more autonomous and effective AI assistance. It highlights a shift in developer roles from writer to curator, focusing on defining patterns and strategic decisions while AI handles code generation. The piece also touches on the cognitive load of AI-augmented programming and broader societal impacts of AI.

DevOps & SRE · Tools & Frameworks

Medium #system-design·2h ago

Designing for Failure in Distributed Systems: Lessons from Production

This article distills 15 years of experience with distributed system failures into key lessons for system designers. It emphasizes that robust systems anticipate failures and handle them gracefully, even when monitoring paints an overly optimistic picture. The core focus is on building resilient architectures by embracing chaos and designing fault-tolerant components.

Distributed Systems · DevOps & SRE

The New Stack·14h ago

Designing a Context Lake for AI Agents: Bridging the Knowledge Gap

This article introduces the concept of a 'Context Lake' as a crucial architectural component for scaling AI agents within an organization. It highlights the challenges of security approvals, tool overload, and lack of organizational understanding that current AI agent integrations face. A Context Lake provides a unified, structured layer of organizational knowledge, enabling agents to query business context, relationships, and operational definitions beyond raw API access.

AI & ML Infrastructure · Distributed Systems

Datadog Blog·1d ago

Measuring AI's Impact on Software Delivery Performance

This article discusses how to measure the impact of AI coding tools on software delivery performance using DORA metrics. It emphasizes evaluating AI tools based on their effect on key metrics like deployment frequency, lead time for changes, change failure rate, and time to restore service. This approach provides a data-driven framework for integrating and optimizing AI tools within the software development lifecycle.
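As a rough illustration of the four DORA metrics named above, the following Python sketch computes them from a list of deployment records. The `Deployment` record, its field names, and the `dora_metrics` helper are hypothetical and not taken from the article; the use of medians and a per-day frequency is also an assumption, since teams define these aggregations differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, assumed for illustration only.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments, window_days):
    """Compute the four DORA metrics over a window of `window_days` days."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(
            lt.total_seconds() for lt in lead_times
        ) / 3600,
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has a recorded restore time.
        "median_time_to_restore_hours": (
            median(rt.total_seconds() for rt in restore_times) / 3600
            if restore_times
            else None
        ),
    }
```

One way to use such a helper is to compute the metrics for a window before an AI tool was adopted and for a window after, then compare the two dictionaries side by side.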

DevOps & SRE · Performance & Scaling

Dev.to #systemdesign·1d ago

Spotify's Evolution: From Autonomous Squads to Internal Developer Platforms with Golden Paths

This article details Spotify's architectural evolution, addressing developer experience challenges as the company scaled. It highlights the shift from highly autonomous squads, which led to infrastructure fragmentation, to a platform engineering model centered on "Golden Paths" and the Backstage developer portal. This strategic pivot significantly improved developer velocity and operational standardization by providing recommended, opinionated, and automated infrastructure solutions.

DevOps & SRE · Microservices

The New Stack·1d ago

Governing AI-Assisted Development with the AC/DC Framework

This article introduces the Agent Centric Development Cycle (AC/DC) framework, a systematic approach for governing AI coding agents at scale. It emphasizes that while code generation speed is important, establishing trust and preventing downstream risks in machine-produced code requires robust guidance, verification, and remediation mechanisms. The framework focuses on shifting the engineering effort from human code authoring to designing a reliable system for steering and correcting AI-generated code.

AI & ML Infrastructure · DevOps & SRE

The New Stack·2d ago

GitLab 19.0: Enhancing DevSecOps with Granular Secrets Management and AI-Driven Workflows

GitLab 19.0 introduces significant advancements in DevSecOps, focusing on reducing the 'AI paradox' through improved automation and security. Key architectural updates include a new Secrets Manager that enforces least-privilege access for CI/CD variables and an expanded Developer Flow that uses AI agents for project-specific workflow automation. Together, these changes aim to improve software supply chain security and efficiency.

DevOps & SRE · Security

DZone Microservices·2d ago

Distributed GPU Training Debugging with eBPF and Client-Side Fan-Out

This article discusses an approach to debugging distributed GPU training stalls across multiple nodes without requiring a central observability service. It highlights how an eBPF-based agent uses client-side fan-out queries and offline merging of data to efficiently pinpoint performance bottlenecks, specifically a straggler node.
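The fan-out-and-merge pattern described above can be sketched in a few lines of Python. This is not the article's eBPF agent: `query_node` is a hypothetical callable standing in for whatever per-node query the agent exposes, and the straggler rule (a node whose median step time exceeds 1.5x the cluster median) is an assumed heuristic chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def fan_out(nodes, query_node, max_workers=8):
    """Query every node directly from the client, in parallel.

    `query_node` is a hypothetical callable: node name -> list of step
    times (ms). No central observability service is involved.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(query_node, nodes))
    return dict(zip(nodes, results))


def find_stragglers(step_times_by_node, threshold=1.5):
    """Merge per-node samples offline and flag unusually slow nodes.

    A node is flagged when its median step time exceeds `threshold`
    times the median of all node medians (an assumed heuristic).
    """
    node_medians = {
        node: median(samples)
        for node, samples in step_times_by_node.items()
        if samples
    }
    if not node_medians:
        return []
    cluster_median = median(node_medians.values())
    return sorted(
        node
        for node, m in node_medians.items()
        if m > threshold * cluster_median
    )
```

Because the merge step works on plain dictionaries, the collected samples could just as well be written to disk on each node and merged later, which is closer to the offline workflow the summary describes.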

Distributed Systems · DevOps & SRE

The New Stack·3d ago

Monitoring and Observability for AI Agent Systems

This article discusses the emerging operational challenges of multi-agent AI systems in production, highlighting a critical lack of visibility compared to traditional microservices. It emphasizes the need for specialized monitoring to understand dynamic execution graphs, data flow, and deviations from normal agent behavior, which are essential for debugging performance, cost, and correctness issues.

AI & ML Infrastructure · DevOps & SRE

Dev.to #architecture·3d ago

Rewriting an Event-Driven System Under Pressure: Lessons from a Treasure Hunt Engine

This article details a critical incident where a 'Treasure Hunt Engine' experienced severe event backlogs and cascading failures due to an inadequate event-driven architecture during peak loads. It outlines the architectural decisions made under immense pressure to rewrite the system within 48 hours, focusing on improving event processing throughput and system reliability. The key takeaway emphasizes the importance of robust event processing, proactive monitoring, and careful design for scalability in distributed systems.

Distributed Systems · Performance & Scaling

MongoDB Blog·3d ago

Integrating MongoDB Atlas Logs for Enhanced Observability

This article introduces MongoDB Atlas's new log integration feature, allowing system and audit logs to be streamed directly to external observability platforms like Datadog, Splunk, or cloud storage solutions. It emphasizes the importance of unified telemetry for faster troubleshooting, improved compliance, and better operational efficiency in distributed systems by bridging the gap between metrics and granular log data.

DevOps & SRE · Databases & Storage