This article details Pinterest's approach to scaling large-scale recommendation systems by implementing request-level deduplication across the entire ML lifecycle. It highlights how techniques like optimized data storage, synchronized batch normalization, user-level masking, and a custom transformer architecture (DCAT) enabled significant cost savings and performance gains in storage, training, and serving while deploying significantly larger models.
Pinterest faced substantial infrastructure pressure when scaling its recommendation models, specifically a 100x increase in transformer dense parameters. To avoid a proportional growth in storage, training, and serving costs, the team implemented request-level deduplication: a family of techniques ensuring that request-level data (especially user sequence data) is processed and stored once per request, rather than redundantly for every item scored within that request.
Recommendation funnels retrieve a large candidate set and then rank it. Crucially, the same massive user data (e.g., 16K tokens encoding user actions) flows through every stage and is duplicated across the hundreds or thousands of items scored per request. This redundancy inflates storage, training, and serving costs in proportion to the number of candidates.
Pinterest applied request-level deduplication across storage, training, and serving, combining optimized data storage, synchronized batch normalization, user-level masking, and the custom DCAT transformer architecture.
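One way the training-side deduplication can play out is sketched below. This is a hedged illustration under assumed names, not Pinterest's exact implementation: within a batch where many rows share the same user/request, the expensive sequence encoder runs once per unique user and the result is scattered back to every row.

```python
def encode_sequences_deduped(user_ids, sequences, encoder):
    """Encode each unique user's sequence once, then scatter to all rows.

    user_ids:  one id per training row (many rows share a user/request)
    sequences: dict mapping user_id -> that user's action sequence
    encoder:   the expensive per-sequence model forward pass

    Names are illustrative; this sketches the dedup-then-scatter pattern.
    """
    cache = {}
    encoded_rows = []
    for uid in user_ids:
        if uid not in cache:            # first row for this user: pay once
            cache[uid] = encoder(sequences[uid])
        encoded_rows.append(cache[uid])  # later rows reuse the cached result
    return encoded_rows
```

The same dedup-then-scatter pattern applies at serving time: the user representation is computed once per request and broadcast to every candidate item's scoring pass.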
Key Takeaways for System Designers
Request-level deduplication is a powerful, cross-cutting technique for ML systems, offering simultaneous improvements in storage efficiency, training speed, and serving throughput. Simple architectural fixes like SyncBatchNorm and user-level masking can unlock significant gains. The impact of such optimizations compounds across the entire ML stack.