Pinterest Engineering · April 13, 2026

Scaling Recommendation Systems with Request-Level Deduplication at Pinterest

This article details Pinterest's approach to scaling large-scale recommendation systems by implementing request-level deduplication across the entire ML lifecycle. It highlights how techniques such as optimized data storage, synchronized batch normalization, user-level masking, and a custom transformer architecture (DCAT) delivered substantial cost savings and performance gains across storage, training, and serving while deploying significantly larger models.


Pinterest faced substantial infrastructure pressure when scaling its recommendation models, specifically a 100x increase in transformer dense parameters. To prevent proportional growth in storage, training, and serving costs, they implemented request-level deduplication. This family of techniques ensures that request-level data (especially user sequence data) is processed and stored once, rather than redundantly for every item scored within a single user request.

The Challenge of Redundant Request-Level Data

Recommendation funnels involve retrieving a large set of items and then ranking them. Crucially, the same massive user data (e.g., 16K tokens encoding user actions) flows through every stage and is duplicated across hundreds or thousands of items scored per request. This redundancy leads to:

  • Massive Data Footprint: User sequences are stored identically for every candidate item scored, creating a huge storage burden.
  • Expensive Processing: User tower computation in retrieval and user sequence understanding in ranking account for a significant share of total compute, and both are repeated for every candidate in a request.

Deduplication Strategies Across the ML Lifecycle

Pinterest applied request-level deduplication across storage, training, and serving:

  • Storage Optimization (Apache Iceberg): Sorting Apache Iceberg tables by user ID and request ID co-locates rows that share the same request, so columnar compression automatically deduplicates the redundant user sequence data, achieving 10-50x compression on user-heavy feature columns. This layout also enables efficient bucket joins, backfills, incremental feature engineering, and stratified sampling (see the storage sketch after this list).
  • Training Correctness & Speedups: Synchronized Batch Normalization (SyncBatchNorm) addressed the disruption of the IID (independent and identically distributed) assumption caused by request-sorted batches, restoring stable normalization statistics. User-level masking prevented false negatives in the InfoNCE loss by ensuring in-batch negatives belong to different users. For retrieval models, user tower computation runs once per unique request; for ranking, a custom Deduplicated Cross-Attention Transformer (DCAT) separates context computation (once per user sequence) from item-specific cross-attention (see the training sketch after this list).
  • Serving Throughput Gains: The DCAT architecture provides the same deduplication benefits at serving time, processing the user's action sequence once and caching intermediate representations for reuse across all candidate items (see the serving sketch after this list). This resulted in a 7x increase in ranking serving throughput, enabling the deployment of 100x larger models without proportional cost increases.
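
Below is a minimal sketch of the storage-side idea: declaring a request-level sort order on an Iceberg table so that rows from the same request land next to each other in data files, where columnar encodings collapse the repeated user-sequence columns. The table and column names are hypothetical, and the snippet assumes a Spark session with the Iceberg runtime and SQL extensions enabled; it is not Pinterest's actual pipeline code.

```python
# Hypothetical example: request-level sort order on an Iceberg training table.
# Assumes the Iceberg runtime and SQL extensions are configured on the Spark
# session; table and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("request-sorted-writes").getOrCreate()

# Iceberg's SQL extension lets a table declare a write sort order; every
# subsequent write clusters rows by these keys, so all candidates scored in
# the same request sit next to each other inside the data files.
spark.sql("""
    ALTER TABLE ml.ranking_training_examples
    WRITE ORDERED BY user_id, request_id
""")

# With identical user-sequence values now adjacent, columnar run-length and
# dictionary encoding deduplicates them inside each column chunk, which is
# where the large compression wins on user-heavy columns come from.
```

Because the deduplication happens in the columnar encoding rather than in the schema, readers still see one logical row per candidate and downstream queries need no changes.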
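The training-side fixes lend themselves to short PyTorch sketches. Converting a model's BatchNorm layers to SyncBatchNorm is PyTorch's standard one-line call; the per-request deduplication and user-level masking helpers below use hypothetical tensor shapes and a placeholder user_tower module, and are a sketch of the ideas rather than Pinterest's implementation.

```python
# Sketches of the training-side techniques described above (PyTorch).
import torch
import torch.nn.functional as F

# 1) Request-sorted batches break the IID assumption behind per-worker
#    BatchNorm statistics; pooling statistics across all data-parallel
#    workers restores stable estimates:
# model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

def dedup_user_tower(user_tower, user_features, request_ids):
    """Run the user tower once per unique request, then broadcast the
    resulting embedding back to every candidate row in the batch."""
    uniq, inverse = torch.unique(request_ids, return_inverse=True)
    # Find the first row index belonging to each unique request.
    order = torch.arange(request_ids.numel(), device=request_ids.device)
    first = torch.full((uniq.numel(),), request_ids.numel(),
                       dtype=torch.long, device=request_ids.device)
    first.scatter_reduce_(0, inverse, order, reduce="amin")
    unique_emb = user_tower(user_features[first])  # computed once per request
    return unique_emb[inverse]                     # reused for every candidate

def user_masked_infonce(user_emb, item_emb, user_ids, temperature=0.07):
    """InfoNCE with user-level masking: in-batch negatives that come from
    the same user are excluded so they cannot act as false negatives."""
    logits = user_emb @ item_emb.t() / temperature               # [B, B]
    same_user = user_ids.unsqueeze(0) == user_ids.unsqueeze(1)   # [B, B]
    off_diag = ~torch.eye(len(user_ids), dtype=torch.bool,
                          device=user_ids.device)
    logits = logits.masked_fill(same_user & off_diag, float("-inf"))
    labels = torch.arange(len(user_ids), device=user_ids.device)
    return F.cross_entropy(logits, labels)
```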
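Finally, the compute-once, reuse-many serving pattern can be sketched as a simplified stand-in for the DCAT idea: encode the long user sequence a single time per request, then let every candidate run only a lightweight cross-attention against the cached context. Layer sizes, module layout, and the scoring head here are illustrative assumptions, not the published architecture.

```python
# Minimal sketch of deduplicated cross-attention ranking at serving time.
import torch
import torch.nn as nn

class DedupCrossAttentionRanker(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.user_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)

    def encode_user(self, user_seq):
        # user_seq: [1, seq_len, d_model]; the expensive sequence encoder
        # runs exactly once per request and its output is cached.
        return self.user_encoder(user_seq)

    def score_candidates(self, user_ctx, item_emb):
        # item_emb: [num_candidates, d_model]; each candidate cross-attends
        # to a broadcast view of the cached user context.
        n = item_emb.size(0)
        queries = item_emb.unsqueeze(1)                     # [n, 1, d_model]
        keys = user_ctx.expand(n, -1, -1)                   # shared context view
        attended, _ = self.cross_attn(queries, keys, keys)  # item-specific attention
        return self.score(attended.squeeze(1)).squeeze(-1)  # [n] ranking scores

# Usage: one request with a long user sequence and many candidate items.
ranker = DedupCrossAttentionRanker()
user_ctx = ranker.encode_user(torch.randn(1, 512, 128))            # once per request
scores = ranker.score_candidates(user_ctx, torch.randn(400, 128))  # all candidates
```

The key property is that encode_user runs once per request no matter how many candidates are scored, which is where the serving throughput gain comes from.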

Key Takeaways for System Designers

Request-level deduplication is a powerful, cross-cutting technique for ML systems, offering simultaneous improvements in storage efficiency, training speed, and serving throughput. Simple architectural fixes like SyncBatchNorm and user-level masking can unlock significant gains. The impact of such optimizations compounds across the entire ML stack.

deduplication · recommendation systems · machine learning · scaling · apache iceberg · transformer · batch normalization · system design
