Latest curated articles from top engineering blogs
300 articles
LinkedIn engineers diagnosed a critical, ephemeral system freeze in their user feed's database, caused by kernel lock contention during large memory allocations. The breakthrough involved pioneering off-CPU profiling with eBPF and implementing automated diagnostic tooling. This case study highlights the importance of deep OS-level observability and careful memory management in high-performance distributed systems.
This article discusses the practical application of AI in refactoring a legacy codebase, emphasizing how establishing strong architectural patterns, tests, and static analysis enables more autonomous and effective AI assistance. It highlights a shift in developer roles from writer to curator, focusing on defining patterns and strategic decisions while AI handles code generation. The piece also touches on the cognitive load of AI-augmented programming and broader societal impacts of AI.
This article distills 15 years of experience with distributed system failures into key lessons for system designers. It emphasizes that robust systems anticipate and gracefully handle failures, even when monitoring paints an overly optimistic picture. The core focus is on building resilient architectures by embracing chaos and designing fault-tolerant components.
This article introduces the concept of a 'Context Lake' as a crucial architectural component for scaling AI agents within an organization. It highlights the challenges of security approvals, tool overload, and lack of organizational understanding that current AI agent integrations face. A Context Lake provides a unified, structured layer of organizational knowledge, enabling agents to query business context, relationships, and operational definitions beyond raw API access.
This article discusses how to measure the impact of AI coding tools on software delivery performance using DORA metrics. It emphasizes evaluating AI tools based on their effect on key metrics like deployment frequency, lead time for changes, change failure rate, and time to restore service. This approach provides a data-driven framework for integrating and optimizing AI tools within the software development lifecycle.
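As a rough illustration of the four DORA metrics named above, the sketch below computes them from a list of deployment records. The `Deployment` record, its field names, and the choice of median and mean as aggregates are assumptions made for this example, not something the article specifies. Comparing the output for a period before and after adopting an AI tool is one way to apply the framework.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a failure in production?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: List[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    # Lead time for changes: commit -> production, in hours.
    lead_times = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    # Time to restore service: failed deployment -> restoration, in hours.
    restore_times = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore_hours": mean(restore_times) if restore_times else None,
    }
```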
This article details Spotify's architectural evolution, addressing developer experience challenges as the company scaled. It highlights the shift from highly autonomous squads, which led to infrastructure fragmentation, to a platform engineering model centered on "Golden Paths" and the Backstage developer portal. This strategic pivot significantly improved developer velocity and operational standardization by providing recommended, opinionated, and automated infrastructure solutions.
This article introduces the Agent Centric Development Cycle (AC/DC) framework, a systematic approach for governing AI coding agents at scale. It emphasizes that while code generation speed is important, establishing trust and preventing downstream risks in machine-produced code requires robust guidance, verification, and remediation mechanisms. The framework focuses on shifting the engineering effort from human code authoring to designing a reliable system for steering and correcting AI-generated code.
GitLab 19.0 introduces significant advancements in DevSecOps, focusing on reducing the 'AI paradox' through improved automation and security. Key architectural updates include a new Secrets Manager that enforces least-privilege access for CI/CD variables and an expanded Developer Flow that leverages AI agents for project-specific workflow automation, enhancing overall software supply chain security and efficiency.
This article discusses an innovative approach to debugging distributed GPU training stalls across multiple nodes without requiring a central observability service. It highlights how an eBPF-based agent uses client-side fan-out queries and offline merging of data to efficiently identify performance bottlenecks, specifically a straggler node.
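The fan-out-and-merge idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's agent: `fetch` is a hypothetical stand-in for whatever per-node query the agent exposes, per-step durations in milliseconds are an assumed metric, and the 1.5x threshold is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable, Dict, List, Optional


def fan_out(nodes: List[str],
            fetch: Callable[[str], List[float]]) -> Dict[str, List[float]]:
    """Query every node in parallel from the client; no central service involved.

    `fetch` is a hypothetical callable returning step durations (ms) for a node.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # pool.map preserves input order, so results line up with `nodes`.
        return dict(zip(nodes, pool.map(fetch, nodes)))


def find_straggler(samples: Dict[str, List[float]],
                   threshold: float = 1.5) -> Optional[str]:
    """Merge per-node samples offline and flag the slowest node.

    A node counts as a straggler if its median step time exceeds
    `threshold` times the median of all nodes' medians.
    """
    per_node = {node: median(vals) for node, vals in samples.items() if vals}
    if not per_node:
        return None
    cluster_median = median(per_node.values())
    slowest = max(per_node, key=per_node.get)
    if per_node[slowest] > threshold * cluster_median:
        return slowest
    return None
```

In this sketch the merge step needs only the collected samples, so it can run later on a laptop from saved per-node dumps rather than against live nodes.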
This article discusses the emerging operational challenges of multi-agent AI systems in production, highlighting a critical lack of visibility compared to traditional microservices. It emphasizes the need for specialized monitoring to understand dynamic execution graphs, data flow, and deviations from normal agent behavior, which are essential for debugging performance, cost, and correctness issues.
This article details a critical incident in which a 'Treasure Hunt Engine' suffered severe event backlogs and cascading failures during peak loads because of an inadequate event-driven architecture. It outlines the architectural decisions made under immense pressure to rewrite the system within 48 hours, focusing on improving event processing throughput and system reliability. The key takeaway is the importance of robust event processing, proactive monitoring, and careful design for scalability in distributed systems.
This article introduces MongoDB Atlas's new log integration feature, allowing system and audit logs to be streamed directly to external observability platforms like Datadog, Splunk, or cloud storage solutions. It emphasizes the importance of unified telemetry for faster troubleshooting, improved compliance, and better operational efficiency in distributed systems by bridging the gap between metrics and granular log data.