Software Architecture and System Design News

Latest curated articles from top engineering blogs

Netflix, Uber, Meta, LinkedIn, Spotify, GitHub, Airbnb, Pinterest, Slack, Dropbox, Cloudflare, Stripe, Datadog, Figma, Shopify, AWS, Google Cloud, Azure, Werner Vogels & 15+ more

579 articles

Meta Engineering·8h ago

Meta's Post-Quantum Cryptography Migration Strategy

This article details Meta's comprehensive strategy for migrating to post-quantum cryptography (PQC) to protect against future quantum attacks. It outlines a multi-year, phased approach, emphasizing risk assessment, cryptographic inventory, and the adoption of PQC Maturity Levels to guide organizational readiness and deployment. The framework provides practical guidance for other organizations on transitioning critical systems to quantum-resistant standards.

Security, Distributed Systems

Spotify Engineering·8h ago

Building Natural Language Interfaces with Large Language Models and OpenAPI Specs

This article explores how Spotify used Large Language Models (LLMs) and OpenAPI specifications to create a natural language interface for their Ads API without requiring extensive compiled code. It details the architecture and the process of transforming API definitions into a conversational tool, and highlights the implications for API design, developer experience, and system integration.

API Design, AI & ML Infrastructure

Azure Architecture Blog·8h ago

Azure Integrated HSM: Hardware-Enforced Cryptographic Trust at Scale

This article discusses Azure Integrated HSM, a Microsoft-built hardware security module integrated into every new Azure server. It extends cryptographic trust from silicon to services, enhancing key protection by ensuring keys never leave the hardware boundary during use. This architecture shifts security enforcement from policy to hardware, addressing scalability and performance challenges of traditional centralized HSMs.

Security, Cloud & Infrastructure

Airbnb Engineering·8h ago

Migrating a High-Volume Metrics Pipeline to OpenTelemetry and Prometheus

This article details Airbnb's migration of a large-scale metrics pipeline from StatsD to OpenTelemetry and a Prometheus-based backend. It covers the architectural decisions, dual-write strategy, performance benefits of OTLP, the introduction of a streaming aggregation layer using vmagent for cost control and scalability, and a novel 'zero injection' solution for sparse counter accuracy issues.

Performance & Scaling, Distributed Systems

Pinterest Engineering·8h ago

Diagnosing CPU Bottlenecks and Network Driver Resets in Kubernetes on AWS

This article details how Pinterest identified and resolved intermittent network connectivity issues in its Ray-based ML training jobs running on Kubernetes clusters backed by AWS EC2. The investigation uncovered CPU starvation affecting AWS ENA network drivers, which led to device resets and job crashes. The process highlights systematic debugging, profiling techniques, and the challenges of diagnosing transient performance bottlenecks in large-scale distributed systems.

Distributed Systems, Performance & Scaling

Dev.to #systemdesign·8h ago

Designing Reliable Embedded Bootloaders for System Recovery

This article discusses the critical role of bootloaders in embedded systems, emphasizing their importance for system reliability and recovery from firmware corruption or update failures. It compares architectural approaches across MCUs, Linux, and FPGA platforms, highlighting common pitfalls and best practices for robust bootloader design to ensure product resilience.

Distributed Systems, Security

Pinterest Engineering·14h ago

Smarter URL Normalization for Content Deduplication at Scale

Pinterest engineered the Minimal Important Query Param Set (MIQPS) algorithm to dynamically identify and strip irrelevant URL parameters, a step that is crucial for deduplicating content at Pinterest's scale. The system reduces redundant processing by distinguishing between parameters that affect content (e.g., product ID) and those that exist purely for tracking, which improves efficiency and catalog quality. The solution uses content fingerprinting and a multi-layer normalization strategy that combines static rules with dynamically learned ones.
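
As a rough illustration of the static-rule layer only, the sketch below strips query parameters that are assumed to be tracking-only and sorts the rest so that equivalent URLs compare equal. It is not Pinterest's MIQPS algorithm: the `TRACKING_PARAMS` list, the `normalize_url` function, and the `important_params` argument (a stand-in for a learned per-domain set) are hypothetical names chosen for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical static rules: parameters assumed to be tracking-only.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(url, important_params=None):
    """Drop query parameters that are assumed not to affect page content.

    important_params stands in for a learned per-domain set; when given,
    only those parameters are kept. Otherwise the static list is applied.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if important_params is not None:
        kept = [(k, v) for k, v in pairs if k in important_params]
    else:
        kept = [(k, v) for k, v in pairs if k.lower() not in TRACKING_PARAMS]
    kept.sort()  # fixed parameter order so equivalent URLs compare equal
    # The fragment is dropped because it never reaches the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )
```

Calling `normalize_url` with `important_params={"id"}` keeps only the `id` parameter, which mimics what a learned set might do for a product page; deciding which parameters belong in that set is the part the article attributes to content fingerprinting.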

Distributed Systems, Performance & Scaling

The New Stack·14h ago

Architecting AI Coding Agents: Remote Control, Plugin Systems, and Cloud-Native CLI

This article explores the architectural shift in AI coding agents, moving from local, editor-bound sessions to more autonomous, cloud-based operations. It highlights Amp's Neo CLI redesign, which facilitates remote control, leverages a plugin system, and adopts a "compaction-first" architecture to manage long-running agent workflows efficiently, emphasizing the terminal's evolving role as a control surface for distributed agent systems.

Distributed Systems, AI & ML Infrastructure

AWS Architecture Blog·14h ago

Designing a Multi-Tenant Data Exchange Platform for Supply Chain Carbon Footprint Tracking on AWS

This article details the system design of PACIFIC, a multi-tenant SaaS platform built on AWS for exchanging product carbon footprint (PCF) data across complex automotive supply chains. It highlights architectural decisions focused on achieving strict data sovereignty, multi-tenancy without dedicated AWS accounts, and interoperability with external data spaces like Catena-X, using services such as Amazon ECS, AWS Fargate, Amazon Cognito, and AWS IAM.

Distributed Systems, Cloud & Infrastructure

Meta Engineering·14h ago

Meta's AI Agent Platform for Hyperscale Capacity Efficiency

Meta developed a unified AI agent platform to automate finding and fixing performance issues across its vast infrastructure, enabling significant power savings and freeing up engineering time. This platform uses a two-layered architecture of standardized tools and encoded domain expertise (skills) to tackle both proactive optimization (offense) and reactive regression mitigation (defense). By centralizing these capabilities, Meta has built a self-sustaining efficiency engine that scales without proportionally increasing headcount, recovering hundreds of megawatts of power.

AI & ML Infrastructure, Performance & Scaling

Azure Architecture Blog·14h ago

Azure IaaS Security: Defense in Depth and Secure-by-Design Principles

This article outlines how Microsoft Azure IaaS implements a robust security architecture based on defense-in-depth and Secure Future Initiative (SFI) principles: secure by design, secure by default, and secure in operation. It details how security is embedded across hardware, hypervisor, networking, storage, and operations, ensuring a multi-layered and continuously adapting protection strategy. The focus is on architectural decisions that minimize attack surfaces and mitigate threats at every level of the infrastructure stack.

Security, Cloud & Infrastructure

InfoQ Architecture·14h ago

GKE Agent Sandbox and Hypercluster: Scaling Kubernetes for AI Workloads

Google's latest GKE updates, Agent Sandbox and Hypercluster, address critical challenges in deploying and scaling AI workloads on Kubernetes. Agent Sandbox provides kernel-level isolation for untrusted agent code, crucial for multi-agent AI workflows, while Hypercluster offers a single control plane to manage up to a million accelerator chips, simplifying large-scale AI infrastructure management.

Cloud & Infrastructure, Distributed Systems