Meta Engineering·July 1, 2026

Meta's AI Storage Architecture for Maximizing GPU Utilization and Research Velocity

This article details Meta's evolution of its BLOB storage architecture to address the unique challenges of large-scale AI workloads, focusing on maximizing GPU utilization and accelerating AI research velocity. It discusses a foundational redesign that includes a unified metadata schema, elimination of data plane proxies, and regional deployments to achieve low-latency and high-throughput data access. The article also covers solutions for handling data spikes and hot spots, along with protocol optimizations.

The Challenge: AI Workloads and Storage Bottlenecks

AI model training and development are increasingly bottlenecked by storage and interconnect performance, despite rapid advances in GPU compute. Traditional storage architectures, often optimized for cost per byte and global replication, introduce significant latency and inefficiency for data-hungry AI workloads. The resulting GPU stalls raise compute cost and delay time-to-market, and research velocity suffers when large volumes of data must be moved to reach geo-distributed GPUs.

Legacy BLOB Storage Architecture Limitations

Layered Metadata Lookups: The previous `getObject` API involved multiple stateful metadata lookups across several layers (namelayer, volumeslayer, containerlayer). These lookups could cross regions, resulting in hundreds of milliseconds of latency, which is unacceptable for AI workloads requiring millisecond access.
Dataplane Proxy: All data requests were proxied through API servers. The extra hop added latency and consumed power, and power is a critical constraint in GPU-heavy data centers.
Global Replication: Data and metadata were globally replicated by default. This optimized for durability and availability during region outages, but it did not match AI's need for regional data locality.
HDD Optimization: The architecture was built for HDDs and optimized for cost per byte, so it was ill-suited to the IOPS demands of AI workloads served from flash, where the cost of storage is negligible compared with the cost of GPUs.

Rebuilding the Foundation: Key Design Choices

Meta rebuilt its BLOB storage foundation to directly address the AI workload requirements, making several crucial design choices:

Unified Metadata Schema: Consolidated disparate metadata stores into a single, flat schema backed by ZippyDB. This enables O(1) lookup complexity for resolving paths to storage addresses, significantly reducing metadata access latency.
Elimination of Dataplane Proxy: A "fat client" SDK was developed to stream bytes directly from Tectonic storage servers to clients. This reduces latency, increases throughput, and improves power efficiency by removing proxy servers from the data path.
Regional Deployment: The BLOB-storage stack is now deployed regionally and co-located with GPUs. This aligns with the data locality requirements of AI training jobs, minimizing cross-region data transfers and improving performance and research velocity.
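
The first two design choices can be illustrated with a minimal sketch. It assumes an in-memory dictionary as a stand-in for the flat metadata table and a toy storage server; `ReadPlan`, `FlatMetadataStore`, `StorageServer`, `FatClient`, and the example path are hypothetical names chosen for illustration, not Meta's actual schema or SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPlan:
    """Hypothetical record describing where an object's bytes live."""
    server: str
    chunk_id: str
    length: int


class FlatMetadataStore:
    """Stand-in for a key-value store holding one flat path -> ReadPlan table."""

    def __init__(self):
        self._table = {}

    def put(self, path, plan):
        self._table[path] = plan

    def get(self, path):
        # A single hash lookup keyed by the full path, with no per-layer hops.
        return self._table[path]


class StorageServer:
    """Stand-in for a storage node that serves chunk bytes."""

    def __init__(self):
        self._chunks = {}

    def write(self, chunk_id, data):
        self._chunks[chunk_id] = data

    def read(self, chunk_id, length):
        return self._chunks[chunk_id][:length]


class FatClient:
    """Resolves metadata itself, then reads from the storage server directly."""

    def __init__(self, metadata, servers):
        self._metadata = metadata
        self._servers = servers

    def get_object(self, path):
        plan = self._metadata.get(path)      # one metadata lookup
        server = self._servers[plan.server]  # no proxy between client and storage
        return server.read(plan.chunk_id, plan.length)


def build_demo_client():
    """Wire up one storage server and one object for demonstration."""
    server = StorageServer()
    server.write("chunk-42", b"tensor-bytes")
    metadata = FlatMetadataStore()
    metadata.put("/datasets/train/shard-0001",
                 ReadPlan(server="storage-a", chunk_id="chunk-42", length=12))
    return FatClient(metadata, {"storage-a": server})


if __name__ == "__main__":
    client = build_demo_client()
    print(client.get_object("/datasets/train/shard-0001"))
```

The point of the sketch is the shape of the read path: one lookup to resolve a path to a storage address, followed by a direct read, rather than a chain of layered lookups followed by a proxied transfer.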

Optimizations for Spikes, Hot Spots, and Tail Latencies

Distributed Data Cache: Leverages spare memory on GPU hosts as a distributed cache for frequently accessed data, integrating with Meta's Owl subsystem. This absorbs traffic spikes, reduces I/O load on storage, and improves p50/p99 latencies.
Readplan Metadata Cache: Caches the mapping from paths to storage addresses in a distributed memory store (similar to memcache). This provides 1-2ms access to metadata and mitigates metadata hot shards.
Hedged Reads: Implements hedged reads on the client side to combat tail latencies caused by slow storage nodes.
Dynamic Concurrency Control: The client SDK dynamically tunes parallelism based on application-level congestion signals to prevent egress spikes during checkpointing, which can lead to congestion, timeouts, and GPU stalls.
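
Hedged reads, mentioned in the list above, can be sketched as follows. This is a generic illustration of the technique using Python's standard thread pool, not the SDK's implementation; the `hedged_read` function, the `hedge_after_s` parameter, and the demo replica functions are assumptions made for the example.

```python
import concurrent.futures as cf
import time


def hedged_read(replicas, key, hedge_after_s=0.05):
    """Read `key` from the first replica. If no answer arrives within
    `hedge_after_s`, also ask the next replica, and return the first
    successful result. `replicas` is a list of callables: key -> bytes."""
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(replicas)))
    pending = set()
    last_error = None
    remaining = iter(replicas)
    try:
        while True:
            replica = next(remaining, None)
            if replica is not None:
                pending.add(pool.submit(replica, key))
            elif not pending:
                # Every replica was tried and none succeeded.
                raise last_error or RuntimeError("no replicas available")
            # While replicas remain, wait only for the hedge delay;
            # once all are in flight, wait until one of them finishes.
            timeout = hedge_after_s if replica is not None else None
            done, pending = cf.wait(pending, timeout=timeout,
                                    return_when=cf.FIRST_COMPLETED)
            for future in done:
                if future.exception() is None:
                    return future.result()
                last_error = future.exception()
    finally:
        # Do not block on the slow replica; drop any work not yet started.
        pool.shutdown(wait=False, cancel_futures=True)


def slow_replica(key):
    """Demo replica that simulates a slow storage node."""
    time.sleep(0.3)
    return b"slow:" + key.encode()


def fast_replica(key):
    """Demo replica that answers immediately."""
    return b"fast:" + key.encode()


def broken_replica(key):
    """Demo replica that always fails."""
    raise IOError("replica unavailable")


if __name__ == "__main__":
    print(hedged_read([slow_replica, fast_replica], "block-7", hedge_after_s=0.02))
```

The hedge delay trades extra load for lower tail latency: a short delay issues more duplicate requests, while a long delay leaves more of the slow node's latency exposed. A common choice is to set it near a high percentile of observed read latency so that only the slowest reads are duplicated.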

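Dynamic concurrency control can be sketched in a similar spirit. The summary does not say which algorithm the client SDK uses, so the example below substitutes a well-known scheme, additive increase and multiplicative decrease (AIMD), purely as an illustration; the class name, defaults, and the boolean congestion signal are all assumptions.

```python
class AimdConcurrencyLimiter:
    """Adjusts how many parallel requests a client may issue.

    Assumption: the application reports, per completed request, whether it
    observed a congestion signal (for example a timeout or a throttling reply).
    """

    def __init__(self, initial=8, minimum=1, maximum=64, backoff=0.5):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.backoff = backoff

    def on_success(self):
        # Additive increase: probe for more bandwidth one slot at a time.
        self.limit = min(self.maximum, self.limit + 1)

    def on_congestion(self):
        # Multiplicative decrease: back off quickly when congestion appears.
        self.limit = max(self.minimum, int(self.limit * self.backoff))


def simulate(limiter, congestion_signals):
    """Feed a sequence of booleans (True = congested) and record the limit."""
    history = []
    for congested in congestion_signals:
        if congested:
            limiter.on_congestion()
        else:
            limiter.on_success()
        history.append(limiter.limit)
    return history


if __name__ == "__main__":
    limiter = AimdConcurrencyLimiter(initial=8, maximum=10)
    print(simulate(limiter, [False, False, False, True, True, False]))
```

During a checkpoint burst, a limiter of this kind lets each client ramp up gradually and then cut its parallelism sharply once congestion signals appear, which spreads egress over time instead of producing a single spike.
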
Key System Design Takeaway

The transition from a globally replicated, HDD-optimized architecture to a regionally deployed, flash-optimized one with a simplified data path and robust caching mechanisms highlights a crucial trade-off: sacrificing global-by-default durability for localized performance and reduced latency, which is essential for modern AI workloads.

Architecture Design

Design this yourself

Design a high-performance, low-latency, and cost-efficient distributed BLOB storage system optimized for large-scale AI model training and inference workloads. The system must support massive datasets, maximize GPU utilization by minimizing I/O stalls, and accelerate AI research velocity. Include a unified metadata layer, direct client-to-storage data streaming, regional deployments, distributed caching for hot data and metadata, and mechanisms to handle traffic spikes and tail latencies.

Other design angles

· Design a specialized metadata service for an AI storage system, focusing on achieving O(1) lookup complexity for object-to-block mapping and handling high query volumes.
· Design a distributed caching layer that leverages idle GPU host memory to accelerate data access for AI training jobs, considering cache coherence, eviction policies, and network efficiency.
· Design an end-to-end data pipeline for AI researchers to ingest, prepare, and access large datasets across geo-distributed GPU clusters, balancing data locality, consistency, and iteration speed.