
Software Architecture and System Design News

Latest curated articles from top engineering blogs

Netflix · Uber · Meta · LinkedIn · Spotify · GitHub · Airbnb · Pinterest · Slack · Dropbox · Cloudflare · Stripe · Datadog · Figma · Shopify · AWS · Google Cloud · Azure · Werner Vogels & 15+ more

45 articles

🔹Azure Architecture Blog·9h ago

Agentic Cloud Operations: AI-Powered Automation for Cloud Management

This article introduces agentic cloud operations, a new paradigm for managing complex cloud environments using AI-powered agents. It highlights how these agents can automate and optimize various operational tasks across the cloud lifecycle, from migration and deployment to optimization and troubleshooting, ensuring continuous improvement and adaptability.

DevOps & SRE · Cloud & Infrastructure
☁️Cloudflare Blog·9h ago

Reimagining Next.js Architecture with Vite and AI for Serverless Environments

This article discusses Cloudflare's project, Vinext, a re-implementation of the Next.js API surface directly on Vite, aimed at improving deployment to serverless platforms like Cloudflare Workers. It highlights architectural challenges with traditional Next.js deployments in serverless environments and proposes a new approach leveraging Vite's ecosystem and AI for rapid development and optimized performance.

Cloud & Infrastructure · Microservices
🟠AWS Architecture Blog·15h ago

Migrating a Monolith to Event-Driven Architecture at Amazon Key

This article details Amazon Key's migration from a tightly coupled monolithic system to a resilient event-driven architecture using Amazon EventBridge. It highlights the challenges of the legacy system, including service coupling and inconsistent event management, and presents the design of a modern solution focusing on schema governance, client-side validation, and efficient multi-service integration.
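The client-side validation mentioned in this summary can be illustrated with a minimal sketch. It is not Amazon Key's implementation: the schema, the field names, and the `send` callable (a stand-in for whatever client publishes to the event bus) are all assumptions made for illustration.

```python
# Hypothetical schema for illustration; the field names are assumptions,
# not taken from the article.
DOOR_EVENT_SCHEMA = {
    "event_type": str,
    "device_id": str,
    "timestamp": int,
}


def validate_event(event, schema):
    """Return a list of problems; an empty list means the event matches the schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    for field in event:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def publish(event, schema, send):
    """Validate on the client, then hand the event to `send`.

    `send` is a hypothetical stand-in for the real event bus client.
    """
    problems = validate_event(event, schema)
    if problems:
        raise ValueError("; ".join(problems))
    send(event)
```

Rejecting a malformed event before it leaves the producer keeps bad payloads off the bus, so consumers do not each need their own defensive checks.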

Distributed Systems · Microservices
👩‍💻Dev.to #architecture·15h ago

Designing Shared File Storage with Azure Files for Geo-Distributed Offices

This article outlines the architecture and deployment of a highly available and secure shared file storage solution using Azure Files for geographically dispersed corporate offices. It emphasizes balancing performance with security, leveraging Azure's Zone-Redundant Storage (ZRS) for resilience, snapshots for data integrity, and Virtual Networks for zero-trust access control.

Cloud & Infrastructure · Databases & Storage
📦Dropbox Tech·15h ago

Low-Bit Inference for Efficient AI Model Deployment at Scale

This article from Dropbox Tech explores low-bit inference techniques, specifically quantization, as a critical strategy for making large AI models more efficient, faster, and cheaper to run in production. It delves into how reducing numerical precision impacts memory, compute, and energy, and the architectural considerations for deploying these optimized models on modern hardware like GPUs, addressing latency and throughput constraints for real-world AI applications such as Dropbox Dash.
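To make "reducing numerical precision" concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common low-bit scheme. It is an illustration only, not necessarily the method Dropbox describes; the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Rounding bounds the per-weight error by about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each int8 code occupies one byte where a float32 weight occupies four, which is where the memory saving comes from. The cost is the rounding error captured in `max_error`.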

AI & ML Infrastructure · Performance & Scaling
📰InfoQ Cloud·21h ago

Impact of LocalStack's Community Edition Discontinuation on Local AWS Development

LocalStack, a popular AWS cloud emulator for local development, has discontinued its free open-source Community Edition, moving to a single image that requires registration and introduces a credit-based system. This shift raises concerns among developers about the future of local AWS service emulation, highlighting the importance of resilient local development environments and the challenges of open-source project sustainability.

DevOps & SRE · Cloud & Infrastructure
📰InfoQ Cloud·1d ago

Proactive Autoscaling for Latency-Sensitive Edge Applications in Kubernetes

This article discusses the limitations of the Kubernetes Horizontal Pod Autoscaler (HPA) for dynamic, latency-sensitive edge workloads and proposes a solution built on a Custom Pod Autoscaler (CPA). It highlights how HPA's reactive nature and rigid algorithm lead to inefficiencies at the edge, and advocates a proactive, multi-signal approach that combines CPU headroom, latency SLOs, and pod startup compensation to keep performance stable and resource use efficient in constrained edge environments.
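A multi-signal scaling decision of this kind can be sketched as follows. The starting point is the documented HPA rule, desired = ceil(current × currentMetric / targetMetric). Everything beyond that rule is an assumption for illustration: the function name, the parameters, and the way the signals are combined are hypothetical, not the article's algorithm.

```python
import math


def desired_replicas(current, cpu_util, cpu_target, p95_latency_ms,
                     latency_slo_ms, load_growth_per_s, pod_startup_s,
                     max_replicas):
    """Hypothetical multi-signal replica calculation.

    cpu_target is set below saturation so that some CPU headroom remains.
    """
    cpu_pressure = cpu_util / cpu_target
    latency_pressure = p95_latency_ms / latency_slo_ms
    # Scale on whichever signal is further over its target.
    pressure = max(cpu_pressure, latency_pressure)
    # Startup compensation: new pods only help after pod_startup_s seconds,
    # so project the load forward by that long before sizing.
    projected = pressure * (1 + load_growth_per_s * pod_startup_s)
    wanted = math.ceil(current * projected)
    return max(1, min(max_replicas, wanted))


# Latency is 20% over its SLO and load grows 1% per second; pods take 20 s to start.
replicas = desired_replicas(current=4, cpu_util=0.6, cpu_target=0.7,
                            p95_latency_ms=120, latency_slo_ms=100,
                            load_growth_per_s=0.01, pod_startup_s=20,
                            max_replicas=10)
```

Taking the maximum of the two pressures means a latency breach triggers scale-out even when CPU looks healthy, which a CPU-only rule would miss.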

Performance & Scaling · Distributed Systems
📰DZone Microservices·1d ago

Architecting Automated ML Pipelines with Amazon Q Developer

This article explores how Amazon Q Developer, a generative AI assistant, automates the architecture and deployment of machine learning (ML) infrastructure on AWS. It focuses on streamlining complex MLOps tasks like Infrastructure as Code (IaC) generation for GPU clusters, optimizing data engineering layers, and ensuring security and compliance, transforming the role of ML architects into high-level system designers.

AI & ML Infrastructure · DevOps & SRE
🟠AWS Architecture Blog·1d ago

Designing Fine-Grained API Authorization with Amazon Verified Permissions

This article details how Convera implemented a fine-grained API authorization system using Amazon Verified Permissions for its global cross-border payments platform. It covers the architecture, policy definition in the Cedar policy language, and integration with AWS services such as Cognito and API Gateway to enforce attribute-based and role-based access control for customer-facing applications, internal applications, and service-to-service communication.

Security · API Design
🔹Azure Architecture Blog·1d ago

Azure Sovereign Cloud: Designing for Disconnected, Highly Regulated Environments

Microsoft's Sovereign Cloud targets highly regulated, sensitive, and potentially disconnected environments. It extends Azure's governance and productivity capabilities, including support for large AI models, to on-premises deployments that can operate completely isolated from the public cloud. The approach emphasizes operational continuity, data sovereignty, and consistent management under challenging connectivity conditions.

Cloud & Infrastructure · Distributed Systems
🔹Azure Architecture Blog·1d ago

Designing for Reliability, Resiliency, and Recoverability in Cloud Systems

This article clarifies the distinctions between reliability, resiliency, and recoverability in cloud system design, particularly within the Azure ecosystem. It frames reliability as the ultimate goal, achieved through deliberate architectural choices: resiliency to withstand disruptions, and recoverability strategies for when a system's limits are exceeded. Understanding these concepts is fundamental to making informed design trade-offs and building highly available cloud applications.

Performance & Scaling · Cloud & Infrastructure
📌Pinterest Engineering·1d ago

Pinterest's Auto Memory Retries for Apache Spark OOMs

Pinterest engineered "Auto Memory Retries" to mitigate out-of-memory (OOM) errors in their large-scale Apache Spark deployment, enhancing resource efficiency and reliability. This system automatically identifies Spark tasks with high memory demands and retries them on executors with larger memory profiles, dynamically adjusting resource allocation. The solution involves extending core Spark classes to support task-level resource profiles and a hybrid retry strategy, showcasing a practical approach to optimizing distributed data processing.
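The retry idea in this summary can be shown with a small simulation. This is plain Python, not Spark code and not Pinterest's implementation: the memory profiles, `run_with_memory_retries`, and `make_task` are all hypothetical names chosen for illustration.

```python
# Hypothetical executor memory profiles in GB, smallest first.
MEMORY_PROFILES_GB = (4, 8, 16)


def run_with_memory_retries(task, profiles=MEMORY_PROFILES_GB):
    """Run `task` on the smallest profile, retrying on larger ones after an OOM.

    Returns the task result together with the profile that succeeded.
    """
    last_error = None
    for memory_gb in profiles:
        try:
            return task(memory_gb), memory_gb
        except MemoryError as err:
            last_error = err  # Simulated OOM: escalate to the next profile.
    raise MemoryError(
        f"task failed on all profiles up to {profiles[-1]} GB"
    ) from last_error


def make_task(required_gb):
    """Build a fake task that fails unless it is given enough memory."""
    def task(memory_gb):
        if memory_gb < required_gb:
            raise MemoryError(f"needs {required_gb} GB, got {memory_gb} GB")
        return "done"
    return task


result, used_gb = run_with_memory_retries(make_task(6))
```

Starting every task on the smallest profile and escalating only after a failure means most tasks never reserve the larger allocation, which is the resource-efficiency argument the summary makes.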

Distributed Systems · Performance & Scaling