Latest curated articles from top engineering blogs
45 articles
This article introduces agentic cloud operations, an approach to managing complex cloud environments with AI-powered agents. It describes how these agents can automate and optimize operational tasks across the cloud lifecycle, from migration and deployment to optimization and troubleshooting, with the aim of continuous improvement and adaptability.
This article discusses Vinext, a Cloudflare project that re-implements the Next.js API surface directly on Vite to improve deployment to serverless platforms such as Cloudflare Workers. It covers the architectural challenges of traditional Next.js deployments in serverless environments and proposes an approach that leverages Vite's ecosystem and AI for rapid development and optimized performance.
This article details Amazon Key's migration from a tightly coupled monolithic system to a resilient event-driven architecture using Amazon EventBridge. It highlights the challenges of the legacy system, including service coupling and inconsistent event management, and presents the design of a modern solution focusing on schema governance, client-side validation, and efficient multi-service integration.
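As a rough illustration of the client-side validation idea mentioned above, the sketch below checks an event against a schema before handing it to a publisher. The schema, the field names, and the `send` callable are all hypothetical and are not taken from the article; a real client would publish through the EventBridge API and would likely use a fuller schema format.

```python
# Minimal sketch of client-side event validation before publishing.
# DOOR_EVENT_SCHEMA and its fields are hypothetical examples.
DOOR_EVENT_SCHEMA = {"deviceId": str, "eventType": str, "timestamp": int}


def validate_event(detail, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in detail:
            errors.append(f"missing field: {field}")
        elif not isinstance(detail[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return errors


def publish(detail, schema, send):
    """Validate first, and only call `send` (a stand-in publisher) when valid."""
    errors = validate_event(detail, schema)
    if errors:
        raise ValueError("; ".join(errors))
    send(detail)


sent = []  # stand-in for the event bus
publish({"deviceId": "d-1", "eventType": "unlock", "timestamp": 1700000000},
        DOOR_EVENT_SCHEMA, sent.append)
```

Rejecting malformed events at the producer keeps bad payloads off the bus, so consumers do not each need their own defensive checks.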
This article outlines the architecture and deployment of a highly available and secure shared file storage solution using Azure Files for geographically dispersed corporate offices. It emphasizes balancing performance with security, leveraging Azure's Zone-Redundant Storage (ZRS) for resilience, snapshots for data integrity, and Virtual Networks for zero-trust access control.
This article from Dropbox Tech explores low-bit inference techniques, specifically quantization, as a critical strategy for making large AI models more efficient, faster, and cheaper to run in production. It delves into how reducing numerical precision impacts memory, compute, and energy, and the architectural considerations for deploying these optimized models on modern hardware like GPUs, addressing latency and throughput constraints for real-world AI applications such as Dropbox Dash.
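To make the precision trade-off concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a generic textbook scheme, not necessarily the method Dropbox uses, and the sample weights are made up for illustration.

```python
# Minimal sketch of symmetric int8 quantization (generic scheme, assumed here
# for illustration). Storing int8 instead of float32 cuts memory per value
# from 4 bytes to 1 byte, at the cost of a bounded rounding error.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
# Each restored value differs from the original by at most about scale / 2.
```

Lower bit widths follow the same pattern with a narrower integer range, which is why the error grows as precision drops.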
LocalStack, a popular AWS cloud emulator for local development, has discontinued its free open-source Community Edition, moving to a single image that requires registration and introduces a credit-based system. This shift raises concerns among developers about the future of local AWS service emulation, highlighting the importance of resilient local development environments and the challenges of open-source project sustainability.
This article discusses the limitations of Kubernetes Horizontal Pod Autoscaler (HPA) for dynamic, latency-sensitive edge workloads and proposes a custom autoscaler (CPA) solution. It highlights how HPA's reactive nature and rigid algorithm lead to inefficiencies at the edge, advocating for a more proactive, multi-signal approach incorporating CPU headroom, latency SLOs, and pod startup compensation to ensure stable performance and efficient resource utilization in constrained edge environments.
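The contrast between the two approaches can be sketched as follows. `hpa_replicas` follows the ratio formula documented for the HPA; `multi_signal_replicas` is a hypothetical illustration of combining CPU, a latency SLO, and startup compensation, and its parameters and growth model are assumptions, not the article's actual algorithm.

```python
import math


def hpa_replicas(current_replicas, current_metric, target_metric):
    """Documented HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def multi_signal_replicas(current_replicas, cpu_util, cpu_target,
                          p95_latency_ms, latency_slo_ms,
                          startup_seconds=0.0, load_growth_per_second=0.0,
                          max_replicas=50):
    """Hypothetical multi-signal rule: take the most demanding signal, then
    scale for the load expected by the time new pods are ready."""
    by_cpu = hpa_replicas(current_replicas, cpu_util, cpu_target)
    by_latency = hpa_replicas(current_replicas, p95_latency_ms, latency_slo_ms)
    desired = max(by_cpu, by_latency)
    # Startup compensation (assumed linear growth): pods requested now only
    # help after startup_seconds, so size for the load at that point.
    growth = 1.0 + load_growth_per_second * startup_seconds
    desired = math.ceil(desired * growth)
    return max(1, min(max_replicas, desired))


# CPU alone suggests no change, but the latency SLO is being missed.
cpu_only = hpa_replicas(4, 50, 60)
combined = multi_signal_replicas(4, 50, 60, p95_latency_ms=300, latency_slo_ms=200)
```

Taking the maximum across signals means a latency breach can trigger scaling even when CPU utilization looks healthy, which is the gap a CPU-only HPA configuration leaves open.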
This article explores how Amazon Q Developer, a generative AI assistant, automates the architecture and deployment of machine learning (ML) infrastructure on AWS. It focuses on streamlining complex MLOps tasks such as Infrastructure as Code (IaC) generation for GPU clusters, optimizing data engineering layers, and ensuring security and compliance, shifting the ML architect's role toward high-level system design.
This article details how Convera implemented a fine-grained API authorization system using Amazon Verified Permissions for their global cross-border payments platform. It highlights the architecture, policy definition using Cedar language, and integration with AWS services like Cognito and API Gateway to enforce attribute-based and role-based access control for both customer-facing and internal applications, as well as service-to-service communication.
Microsoft's Sovereign Cloud offers a unique architecture for highly regulated, sensitive, and potentially disconnected environments. It extends Azure's governance and productivity capabilities, including support for large AI models, to on-premises deployments that can operate completely isolated from the public cloud. This approach emphasizes maintaining operational continuity, data sovereignty, and consistent management in challenging connectivity conditions.
This article clarifies the distinctions between reliability, resiliency, and recoverability in cloud system design, particularly within the Azure ecosystem. It argues that reliability is the ultimate goal, achieved through deliberate architectural choices for resiliency (withstanding disruptions) and well-defined strategies for recoverability (restoring service when those limits are exceeded). Understanding these concepts is fundamental to making informed design trade-offs and building highly available cloud applications.
Pinterest engineered "Auto Memory Retries" to mitigate out-of-memory (OOM) errors in their large-scale Apache Spark deployment, enhancing resource efficiency and reliability. This system automatically identifies Spark tasks with high memory demands and retries them on executors with larger memory profiles, dynamically adjusting resource allocation. The solution involves extending core Spark classes to support task-level resource profiles and a hybrid retry strategy, showcasing a practical approach to optimizing distributed data processing.
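The core retry idea can be sketched in a few lines. This is a simplified simulation, not Pinterest's implementation and not Spark code: the memory profiles, the `TaskOutOfMemory` exception, and the `make_task` helper are all hypothetical stand-ins.

```python
# Simplified simulation of retrying an out-of-memory task on a larger
# memory profile. All names and sizes here are hypothetical.

class TaskOutOfMemory(Exception):
    """Raised by a simulated task when its executor has too little memory."""


MEMORY_PROFILES_GB = [4, 8, 16]  # assumed escalation ladder, smallest first


def run_with_memory_retries(task, profiles=MEMORY_PROFILES_GB):
    """Try each profile in order; return (result, memory_gb) on first success."""
    if not profiles:
        raise ValueError("at least one memory profile is required")
    last_error = None
    for memory_gb in profiles:
        try:
            return task(memory_gb), memory_gb
        except TaskOutOfMemory as err:
            last_error = err  # escalate to the next, larger profile
    raise last_error


def make_task(required_gb):
    """Build a fake task that fails unless given at least required_gb."""
    def task(memory_gb):
        if memory_gb < required_gb:
            raise TaskOutOfMemory(f"needs {required_gb} GB, got {memory_gb} GB")
        return "ok"
    return task


result, used_gb = run_with_memory_retries(make_task(6))
```

Starting every task on the smallest profile and escalating only on failure keeps the default allocation small, so only the memory-hungry minority of tasks pays for larger executors.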