This article details Pinterest's approach to scaling large-scale recommendation systems by implementing request-level deduplication across the entire ML lifecycle. It highlights how techniques like optimized data storage, synchronized batch normalization, user-level masking, and a custom transformer architecture (DCAT) enabled significant cost savings and performance gains in storage, training, and serving while deploying significantly larger models.
Pinterest faced substantial infrastructure pressure when scaling its recommendation models, specifically a 100x increase in transformer dense parameters. To avoid a proportional growth in storage, training, and serving costs, the team implemented request-level deduplication: a family of techniques ensuring that request-level data (especially user sequence data) is processed and stored once per request, rather than redundantly for every item scored within that request.
Recommendation funnels retrieve a large candidate set and then rank it. Crucially, the same massive user data (e.g., 16K tokens encoding user actions) flows through every stage and is duplicated across the hundreds or thousands of items scored per request. This redundancy inflates storage, training, and serving costs in proportion to the number of candidates.
Pinterest applied request-level deduplication across storage, training, and serving, combining optimized data storage, synchronized batch normalization, user-level masking, and the custom DCAT transformer architecture.
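One way the training-side deduplication can play out is sketched below. This is a hedged illustration under assumed names, not Pinterest's exact implementation: within a batch where many rows share the same user/request, the expensive sequence encoder runs once per unique user and the result is scattered back to every row.

```python
def encode_sequences_deduped(user_ids, sequences, encoder):
    """Encode each unique user's sequence once, then scatter to all rows.

    user_ids:  one id per training row (many rows share a user/request)
    sequences: dict mapping user_id -> that user's action sequence
    encoder:   the expensive per-sequence model forward pass

    Names are illustrative; this sketches the dedup-then-scatter pattern.
    """
    cache = {}
    encoded_rows = []
    for uid in user_ids:
        if uid not in cache:            # first row for this user: pay once
            cache[uid] = encoder(sequences[uid])
        encoded_rows.append(cache[uid])  # later rows reuse the cached result
    return encoded_rows
```

The same dedup-then-scatter pattern applies at serving time: the user representation is computed once per request and broadcast to every candidate item's scoring pass.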
Key Takeaways for System Designers
Request-level deduplication is a powerful, cross-cutting technique for ML systems, offering simultaneous improvements in storage efficiency, training speed, and serving throughput. Simple architectural fixes like SyncBatchNorm and user-level masking can unlock significant gains. The impact of such optimizations compounds across the entire ML stack.