Pinterest Engineering·February 2, 2026

Re-architecting Pinterest's Ads Serving Stack for Next-Gen Ranking Models

Pinterest re-architected their ad serving stack to support more expressive, GPU-based lightweight ranking models, moving beyond the limitations of the Two-Tower architecture. This involved significant optimizations across feature fetching, business logic execution, GPU inference, and data flow to maintain end-to-end latency despite increased model complexity. The re-architecture highlights critical trade-offs and techniques for integrating complex ML models into high-scale, low-latency serving systems.

Read original on Pinterest Engineering

The article details Pinterest's journey to overcome the limitations of the traditional Two-Tower model architecture for lightweight ad ranking. While efficient, Two-Tower models struggle with interaction features and advanced architectural patterns like target attention or early feature crossing, which are crucial for higher quality recommendations. To address this, Pinterest decided to integrate more complex, general-purpose neural networks requiring dedicated GPU-based inference. The core challenge was to introduce this computationally heavier stage into their highly optimized serving stack without increasing end-to-end latency.
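To make the contrast concrete, here is a minimal sketch (illustrative, not from the article) of the structural difference: a Two-Tower model embeds user and item independently and interacts them only through a final dot product, whereas an interaction-aware model consumes user and item features jointly, enabling early feature crossing. All function and variable names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-Tower: user and item are embedded *independently*; the only
# interaction is a final dot product, so no feature crossing is possible.
def two_tower_score(user_emb: np.ndarray, item_embs: np.ndarray) -> np.ndarray:
    return item_embs @ user_emb  # shape: (num_items,)

# Interaction-aware model (sketch): user and item features are concatenated
# and passed through a shared MLP, so features can cross in the hidden layer.
def interaction_score(user_feat, item_feats, w1, w2):
    x = np.concatenate(
        [np.repeat(user_feat[None, :], len(item_feats), axis=0), item_feats],
        axis=1,
    )
    h = np.maximum(x @ w1, 0.0)  # ReLU hidden layer
    return (h @ w2).ravel()

d = 8
user = rng.normal(size=d)
items = rng.normal(size=(100, d))
w1 = rng.normal(size=(2 * d, 16))
w2 = rng.normal(size=(16, 1))

tt = two_tower_score(user, items)
ia = interaction_score(user, items, w1, w2)
```

The Two-Tower form is what makes CPU serving cheap (item embeddings can be precomputed), and the joint form is what demands dedicated GPU inference.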

Architectural Shift: From CPU-centric to GPU-accelerated Ranking

The existing serving funnel involved feature expansion, retrieval & lightweight ranking (Two-Tower dot product on CPU), and downstream heavy ranking. Simply inserting a GPU inference box would cause severe latency bottlenecks due to data volume, serialization, and network transfer. The re-architecture required a holistic approach, focusing on multiple optimization fronts to make the integration feasible at Pinterest's scale.

Key Optimizations for Latency and Throughput

  • **Feature Fetching Dilemma:** For high-value, frequently accessed documents (Segment 1), features were bundled directly into the PyTorch model file as registered buffers, residing in GPU HBM to eliminate network I/O and host-to-GPU transfers. For the long tail (Segment 2), a high-performance KV store with in-host caching was used.
  • **Moving Business Logic into the Model:** Utility calculations, diversity rules, and top-k sorting were moved from the CPU onto the GPU, inside the PyTorch model. This parallelized execution and drastically reduced device-to-host transmission, since only the final 'winners' (O(1K) documents out of O(100K) inputs) are returned.
  • **Taming GPU Inference:** An initial p90 latency of 4000ms was reduced to 20ms through multi-stream CUDA to overlap operations, worker alignment to physical CPU cores to avoid context switching, kernel fusion via Triton to reduce memory-bandwidth pressure, and BF16 precision for faster math and a lower memory footprint.
  • **Rethinking Retrieval Data Flow:** The legacy row-wise, heavy Thrift metadata structure was replaced with a column-wise, lightweight structure for initial retrieval (Phase 1), fetching only essential IDs and bids. Heavy metadata for the final O(1K) top documents was fetched lazily and in parallel (Phase 2); roughly 1/3 of the fields were deprecated and another 1/3 moved to later stages, reducing metadata size by 3x.
  • **Graph Execution & Targeting:** Feature expansion was split into parallel paths (targeting-only features vs. full features), allowing targeting and filtering to start earlier and overlap with heavier feature fetching, shaving off 10ms.
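The Segment 1 technique above maps directly onto PyTorch's `register_buffer` API. Below is a minimal sketch (assumed shapes and names, not Pinterest's actual code) showing how bundling features as buffers makes them part of the serialized model, so loading the model onto a GPU places the features in HBM and a per-request feature "fetch" becomes a plain on-device gather.

```python
import torch

class LightweightRanker(torch.nn.Module):
    """Sketch: Segment 1 features baked into the model artifact."""

    def __init__(self, doc_ids: torch.Tensor, doc_features: torch.Tensor):
        super().__init__()
        # Buffers are serialized with the model's state_dict and moved by
        # .to(device), but are not trained as parameters.
        self.register_buffer("doc_ids", doc_ids)
        self.register_buffer("doc_features", doc_features)
        self.scorer = torch.nn.Linear(doc_features.shape[1], 1)

    def forward(self, doc_idx: torch.Tensor) -> torch.Tensor:
        # Feature lookup is an on-device indexing op: no network I/O,
        # no host-to-GPU copy at request time.
        feats = self.doc_features[doc_idx]
        return self.scorer(feats).squeeze(-1)

model = LightweightRanker(torch.arange(1000), torch.randn(1000, 16))
scores = model(torch.tensor([3, 7, 42]))
```

The trade-off, implied by the article's segmentation, is artifact size: only the high-value head of the corpus can afford to live inside the model file, which is why the long tail falls back to a KV store.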
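The "business logic in the model" point can be sketched as follows. This is a hypothetical toy (a real utility function would combine many signals): the utility computation and top-k selection both run on-device, so only the winners cross the device-to-host boundary.

```python
import torch

def rank_on_device(pctr: torch.Tensor, bids: torch.Tensor, k: int = 1000):
    """Toy utility + top-k entirely on-device (hypothetical logic).

    Only O(k) winners are copied back to the host, instead of the full
    O(100K) scored candidate set.
    """
    utility = pctr * bids  # toy utility: expected value per document
    top_util, top_idx = torch.topk(utility, k=min(k, utility.numel()))
    return top_idx, top_util  # only the winners leave the GPU

pctr = torch.rand(100_000)
bids = torch.rand(100_000)
idx, util = rank_on_device(pctr, bids, k=1000)
```

Because `torch.topk` returns results in descending order, the host receives an already-sorted winner list and needs no further CPU-side sorting.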
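The two-phase retrieval data flow can be sketched in plain Python (field and function names here are illustrative, not Pinterest's schema): Phase 1 carries only column-wise IDs and bids, and Phase 2 lazily hydrates heavy metadata for the small winner set.

```python
from dataclasses import dataclass

@dataclass
class Phase1Candidates:
    """Column-wise, lightweight retrieval payload: parallel arrays only."""
    doc_ids: list
    bids: list

def phase1_retrieve(index):
    # Fetch only what ranking needs; heavy metadata stays behind.
    return Phase1Candidates(
        doc_ids=[d["id"] for d in index],
        bids=[d["bid"] for d in index],
    )

def phase2_hydrate(winner_ids, metadata_store):
    # Lazily fetch heavy metadata for just the O(1K) winners; in the real
    # system this runs in parallel with downstream stages.
    return [metadata_store[doc_id] for doc_id in winner_ids]

index = [{"id": i, "bid": float(i)} for i in range(5)]
store = {i: {"title": f"ad-{i}"} for i in range(5)}
cands = phase1_retrieve(index)
hydrated = phase2_hydrate([4, 2], store)
```

The column-wise layout also serializes more compactly than row-wise structs, which is what made the 3x metadata-size reduction cited above compound with lazy fetching.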

Impact of Local vs. Global Ranking

A crucial lesson from A/B experiments was the subtle shift from 'local ranking' (aggregating locally ranked winners from partitioned document sets) in the old system to 'global ranking' (scoring all eligible documents in a single batch) in the new GPU-based system. While theoretically superior, global ranking changed the distribution of the candidate set, leading to unexpected online metric movements. This highlighted the importance of analyzing not just performance, but also the qualitative impact of architectural changes on content distribution and user experience.
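The distribution shift described above is easy to reproduce with a toy example (illustrative scores, not the article's data): when one partition holds many strong candidates, local per-partition top-k caps its contribution, while global top-k does not, so the two schemes surface different candidate sets even with identical scores.

```python
import heapq

def local_then_merge(partitions, k_local, k_final):
    """Old scheme: top-k within each partition, then merge the winners."""
    winners = []
    for part in partitions:
        winners.extend(heapq.nlargest(k_local, part))
    return heapq.nlargest(k_final, winners)

def global_rank(partitions, k_final):
    """New scheme: single top-k over the union of all candidates."""
    pool = [score for part in partitions for score in part]
    return heapq.nlargest(k_final, pool)

# Partition 0 holds many strong candidates; local ranking caps it at k_local.
parts = [[0.9, 0.8, 0.7, 0.6], [0.5, 0.4]]
local = local_then_merge(parts, k_local=2, k_final=3)
glob = global_rank(parts, k_final=3)
```

Here `local` is `[0.9, 0.8, 0.5]` while `glob` is `[0.9, 0.8, 0.7]`: global ranking drops the second partition entirely, which is the kind of candidate-set shift that showed up as unexpected online metric movement.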

This re-architecture demonstrates that integrating advanced ML models at scale requires a deep understanding of the entire serving pipeline, from feature management and data transfer to GPU utilization and business logic placement. It emphasizes the need for close collaboration between modeling and infrastructure teams to identify and optimize bottlenecks holistically, rather than focusing solely on model performance.

Machine Learning Serving · GPU Inference · Low Latency · Recommendation Systems · Feature Engineering · System Re-architecture · CUDA · Pinterest Engineering
