Software Architecture and System Design News

Latest curated articles from top engineering blogs


Meta's Petabyte-Scale Data Ingestion Migration

Meta successfully migrated its petabyte-scale MySQL social graph data ingestion platform to a centralized, self-managed warehouse service, significantly improving reliability and operational efficiency. The transition involved techniques like staged migrations, reverse shadowing, and continuous checksum monitoring to ensure zero downtime and data consistency for thousands of pipelines supporting analytics and machine learning workloads.

Databases & Storage · Distributed Systems
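The checksum monitoring mentioned in the summary above can be illustrated with a minimal sketch. This is not Meta's implementation: the function names (`row_checksum`, `diff_tables`), the `id` key, and the sample rows are all hypothetical, and a real pipeline would presumably checksum partitions or batches rather than individual in-memory rows.

```python
import hashlib

def row_checksum(row: dict) -> str:
    """Hash a row's fields in a stable key order so both sides agree."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def table_checksums(rows, key="id"):
    """Map each primary key to the checksum of its row."""
    return {row[key]: row_checksum(row) for row in rows}

def diff_tables(source_rows, target_rows, key="id"):
    """Report keys missing on either side and keys whose contents differ."""
    src = table_checksums(source_rows, key)
    dst = table_checksums(target_rows, key)
    return {
        "missing": sorted(src.keys() - dst.keys()),   # in source only
        "extra": sorted(dst.keys() - src.keys()),     # in target only
        "mismatched": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }

# Hypothetical sample data: the legacy system versus the migrated copy.
legacy = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
migrated = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}]
report = diff_tables(legacy, migrated)
```

Running a comparison like this on a schedule while both systems receive the same writes is one way a migration could detect drift before cutover.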


Medium #system-design·1d ago

Choosing Databases Based on Core Data Structures

This article argues that database selection should be driven by the underlying data structures and their operational characteristics rather than by marketing hype. Databases are essentially optimized implementations of fundamental data structures, and those structures determine performance, scalability, and suitability for a given use case.

Databases & Storage · Distributed Systems
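To make the data-structure argument concrete, here is a small sketch contrasting a hash-based index with a sorted index. Both classes (`HashIndex`, `SortedIndex`) are hypothetical teaching stand-ins, not any database's actual storage engine: the sorted list only approximates the ordered access a B-tree provides.

```python
import bisect

class HashIndex:
    """Hash table: fast point lookups, but no key ordering."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class SortedIndex:
    """Sorted keys with binary search: supports range scans as well."""
    def __init__(self):
        self._keys = []      # kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def range(self, low, high):
        """Return (key, value) pairs with low <= key < high, in key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_left(self._keys, high)
        return [(k, self._values[k]) for k in self._keys[start:end]]

idx = SortedIndex()
for k in [30, 10, 20, 40]:
    idx.put(k, f"row-{k}")
scan = idx.range(15, 40)  # [(20, 'row-20'), (30, 'row-30')]
```

The hash index has no efficient equivalent of `range`: answering the same query would require visiting every key, which is the kind of trade-off the article suggests examining before choosing a database.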


Dev.to #architecture·1d ago

Avoiding Kafka Over-Reliance: Lessons from a Treasure Hunt Engine

This article details the architectural evolution of a treasure hunt engine that initially struggled because it relied on Kafka for all event processing. Routing diverse events through a single Kafka topic led to bottlenecks and consistency issues. The solution introduced an event store (EventStoreDB) to decouple event production from consumption, improving performance, reliability, and auditability.

Distributed Systems · Databases & Storage


Cloudflare Blog·2d ago

Building Cloudflare's Unified Data Lakehouse and AI Data Agent

Cloudflare tackled data sprawl by creating Town Lake, a unified data lakehouse built on Apache Trino and Iceberg on R2, providing a single SQL interface for diverse data sources. They also developed Skipper, an AI data agent for natural language querying, emphasizing governed access, PII detection, and Cloudflare's own platform services for infrastructure. This architecture addresses challenges like disparate data systems, sampling issues, and tribal knowledge, enabling comprehensive and secure data insights.

Databases & Storage · Distributed Systems


Dev.to #architecture·3d ago

Scaling Real-Time Treasure Hunts: Solving Unbounded State with a Two-Tier Architecture

This article details Veltrix's architectural evolution to support 20,000 concurrent players in a real-time treasure hunt, focusing on overcoming unbounded state issues from long-lived WebSocket connections. The solution involved a two-tier architecture, separating ephemeral WebSocket handling from stateful processing using Rust, Kafka, and RocksDB to significantly reduce memory footprint and improve stability.

Distributed Systems · Performance & Scaling


ByteByteGo·3d ago

Airtable's Semantic Search Architecture for AI Features

Airtable engineered a scalable and performant semantic search system to power its AI features, focusing on handling diverse customer database sizes and multi-tenancy. The architecture leverages Milvus for vector storage and search, with critical design decisions made around data partitioning, index selection, and managing hot/cold data to meet strict latency, throughput, and privacy requirements.

Databases & Storage · Distributed Systems


The New Stack·3d ago

Snowflake's Strategic Cloud Infrastructure Investment for AI Expansion

Snowflake's $6 billion commitment to AWS for Graviton and GPU instances signals a major strategic shift towards AI, focusing on leveraging cost-efficient compute for data warehousing and high-performance resources for AI model training and inference. This investment highlights critical architectural considerations for large-scale data platforms expanding into AI, particularly around cloud vendor strategy, infrastructure cost optimization, and data residency.

Cloud & Infrastructure · AI & ML Infrastructure


Dev.to #architecture·3d ago

Rethinking Event Sourcing: Consolidating Events and Aggregates in PostgreSQL

This article presents a system design lesson from a CQRS implementation in which events and aggregate roots were stored in separate systems (Kafka and PostgreSQL). The initial distributed architecture caused severe performance issues and operational overhead. The authors describe how they consolidated events and aggregates into a single PostgreSQL database and used logical replication as an event bus, which dramatically improved latency and reduced costs.

Distributed Systems · Databases & Storage
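The main benefit of keeping events and aggregates in one database is that both can change in a single transaction. The sketch below illustrates that idea under loud assumptions: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, the table layout and the `append_event` helper are hypothetical rather than taken from the article, and the logical-replication event bus is not modeled at all.

```python
import json
import sqlite3

def open_store():
    """Create an in-memory store (SQLite here; the article uses PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE aggregates (
            id TEXT PRIMARY KEY,
            version INTEGER NOT NULL,
            state TEXT NOT NULL
        );
        CREATE TABLE events (
            aggregate_id TEXT NOT NULL,
            version INTEGER NOT NULL,
            payload TEXT NOT NULL,
            PRIMARY KEY (aggregate_id, version)
        );
    """)
    return conn

def append_event(conn, aggregate_id, expected_version, payload, new_state):
    """Append an event and update its aggregate atomically.

    Raises RuntimeError when the caller's expected version is stale.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT version FROM aggregates WHERE id = ?", (aggregate_id,)
        ).fetchone()
        current = row[0] if row else 0
        if current != expected_version:
            raise RuntimeError(
                f"version conflict: expected {expected_version}, found {current}"
            )
        new_version = current + 1
        conn.execute(
            "INSERT INTO events (aggregate_id, version, payload) VALUES (?, ?, ?)",
            (aggregate_id, new_version, json.dumps(payload)),
        )
        conn.execute(
            "INSERT OR REPLACE INTO aggregates (id, version, state) VALUES (?, ?, ?)",
            (aggregate_id, new_version, json.dumps(new_state)),
        )
    return new_version

conn = open_store()
v1 = append_event(conn, "order-1", 0, {"type": "created"}, {"status": "new"})
v2 = append_event(conn, "order-1", 1, {"type": "paid"}, {"status": "paid"})
```

Because the event insert and the aggregate update share one transaction, a reader never sees an aggregate version without its event. With two separate systems, that guarantee has to be rebuilt with extra coordination.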


Dev.to #architecture·3d ago

Application-Level Envelope Encryption for SOC 2 Compliance

This article details an architectural strategy for implementing application-level envelope encryption to achieve robust data security and SOC 2 compliance, moving beyond basic RBAC and database encryption. It outlines a hybrid cryptographic solution using AES for content and RSA for key wrapping, and presents the data modeling and service contracts necessary for a Symfony application. The focus is on cryptographic isolation at the record level and secure handling of encryption keys.

Security · Distributed Systems


DZone Microservices·4d ago

Liquid Clustering: An Adaptive Data Layout for Delta Lake

This article explores Databricks Liquid Clustering, a data layout strategy in Delta Lake 3.0 that replaces traditional partitioning and Z-Ordering. It introduces a self-tuning, flexible approach to organizing data, particularly for Unity Catalog managed tables, to improve query performance and reduce maintenance overhead. The core idea is to dynamically cluster data based on specified keys, adapting to evolving query patterns without rigid partitions or costly data rewrites.

Databases & Storage · Performance & Scaling


Dev.to #architecture·5d ago

Scaling an In-Memory Metadata Layer: Lessons from Veltrix Feature Store

This article details the challenges and solutions encountered while scaling an in-memory metadata layer for the Veltrix feature store, highlighting critical performance bottlenecks related to garbage collection and disk I/O with RocksDB. It presents a successful architectural pivot to a custom mmap-based sharded hash map, showcasing specific optimizations for latency, memory management, and NUMA awareness to achieve high throughput and low latency.

Performance & Scaling · Distributed Systems
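The sharding half of that design can be sketched briefly. This example shows only how keys might be spread across independently locked shards. It does not use mmap, does not address garbage collection or NUMA placement, and is not the Veltrix implementation; the `ShardedMap` class, the CRC32 routing, and the shard count are all assumptions chosen for illustration.

```python
import threading
import zlib

class ShardedMap:
    """Hash map split into shards, each guarded by its own lock."""

    def __init__(self, shard_count=8):
        self._shards = [{} for _ in range(shard_count)]
        self._locks = [threading.Lock() for _ in range(shard_count)]

    def _shard_index(self, key: str) -> int:
        # CRC32 is stable across processes, unlike Python's built-in str hash.
        return zlib.crc32(key.encode("utf-8")) % len(self._shards)

    def put(self, key: str, value):
        i = self._shard_index(key)
        with self._locks[i]:
            self._shards[i][key] = value

    def get(self, key: str, default=None):
        i = self._shard_index(key)
        with self._locks[i]:
            return self._shards[i].get(key, default)

    def shard_sizes(self):
        """Number of entries per shard, useful for spotting skew."""
        return [len(shard) for shard in self._shards]

m = ShardedMap(shard_count=4)
for i in range(100):
    m.put(f"feature:{i}", i)
```

Writers touching different shards do not contend for the same lock, which is the general reason sharding helps throughput. The latency gains described in the article presumably also depend on the mmap-backed storage that this sketch leaves out.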


ByteByteGo·5d ago

Building Vector Indexing in a Distributed SQL Database: CockroachDB's C-SPANN

This article details how CockroachDB integrated vector indexing directly into its distributed SQL database by developing C-SPANN. It highlights the architectural constraints faced by a distributed, transactional database when adding a new feature like vector search, emphasizing the need for no central coordinator, real-time updates, sharding compatibility, and hot spot avoidance. The solution treats the vector index as ordinary table data, leveraging CockroachDB's existing distributed mechanisms for scalability and reliability.

Databases & Storage · Distributed Systems
