Pinterest Engineering·May 8, 2026

Architecting Real-Time Context Integration in Sequential Recommender Systems

This article details Pinterest's approach to integrating real-time online context into its sequential ad recommender models to enhance relevance. It covers the architectural changes, a training method based on synthetic context data, and a hybrid online/offline inference serving flow. The solution significantly improved ad relevance and conversion metrics by dynamically incorporating the user's immediate intent.

Read original on Pinterest Engineering

The Challenge of Real-Time Context in Recommender Systems

Traditional sequential recommender models, while effective in leveraging historical user behavior, often lack the ability to incorporate real-time, online context. This limitation is critical for surfaces where immediate user intent is paramount, such as 'Related Pins' or 'Search' on Pinterest. Without understanding what a user is currently viewing or searching, recommendations can fall short in relevance, leading to poor user experience and lower engagement. The article describes how their initial Transformer-based model, relying solely on offline historical data, struggled to perform on these highly contextual surfaces, necessitating an architectural evolution.

Contextual Sequential Two-Tower Model Architecture

To address the context gap, Pinterest evolved its two-tower model by integrating a context layer directly into the query tower. This architectural change allows the model to concatenate the output of the historical Transformer encoder with real-time context features. The combined representation is then fed into a Multi-Layer Perceptron (MLP) to generate a dynamic user embedding. For 'Related Pins', context features are derived from the currently viewed Pin, enhancing personalization with additional user demographic embeddings.
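The query-tower change described above can be sketched in a few lines. This is a minimal NumPy illustration, not Pinterest's implementation: the dimensions, weight initialization, and the two-layer MLP are all assumptions made for the example, standing in for the trained context layer and MLP head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64-d historical encoding, 32-d context
# features, 48-d output user embedding, 128-d MLP hidden layer.
D_HIST, D_CTX, D_OUT, D_HID = 64, 32, 48, 128

# Random arrays stand in for trained MLP weights.
w1 = rng.normal(size=(D_HIST + D_CTX, D_HID)) * 0.02
b1 = np.zeros(D_HID)
w2 = rng.normal(size=(D_HID, D_OUT)) * 0.02
b2 = np.zeros(D_OUT)

def mlp(x):
    """Two-layer MLP with ReLU: the final head of the query tower."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def query_tower(hist_encoding, context_features):
    """Concatenate the Transformer encoder output with real-time context
    (e.g. the currently viewed Pin's features) and feed the combined
    representation through the MLP to get a dynamic user embedding."""
    combined = np.concatenate([hist_encoding, context_features], axis=-1)
    return mlp(combined)

hist = rng.normal(size=D_HIST)   # last hidden state of the sequence encoder
ctx = rng.normal(size=D_CTX)     # subject-Pin / search context features
user_embedding = query_tower(hist, ctx)
print(user_embedding.shape)      # (48,)
```

The key structural point is that the historical encoding and the context features meet only at the concatenation step, which is what later allows the two halves of the computation to run on different schedules.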

Hybrid Inference for Dynamic User Embeddings

A critical system design aspect is the hybrid user embedding inference approach. Since context features are only available at request time (online), the system splits the computation:

  • Offline Inference: The Transformer encoder, processing historical sequences, runs daily. Its last hidden state (encoded historical user signals) is pre-computed and stored in a feature store, providing a foundation for user representation.
  • Online Inference: The context layer and final MLP head are computed in real-time at serving time. This step takes the pre-computed offline signal and combines it with live context features (e.g., the subject Pin's features), creating a dynamic, context-aware user embedding.
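The split above can be sketched as two code paths sharing a feature store. Everything concrete here is an assumption for illustration: the dict standing in for the feature store, the mean-pooling "encoder", the `tanh` projection standing in for the context layer and MLP head, and all dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
feature_store = {}  # stands in for the real feature store, keyed by user id

def offline_transformer_encode(user_sequence):
    """Hypothetical stand-in for the daily batch Transformer pass; here we
    simply mean-pool the interaction sequence into one encoding."""
    return user_sequence.mean(axis=0)

def run_daily_batch(user_sequences):
    """Offline path: pre-compute and store each user's historical encoding."""
    for user_id, seq in user_sequences.items():
        feature_store[user_id] = offline_transformer_encode(seq)

def serve_request(user_id, context_features, w):
    """Online path: fetch the pre-computed historical signal and combine it
    with request-time context features (e.g. the subject Pin's features)
    through a stand-in for the context layer + MLP head."""
    hist = feature_store[user_id]                  # offline signal, looked up
    combined = np.concatenate([hist, context_features])
    return np.tanh(combined @ w)                   # dynamic user embedding

# Daily batch over two users' 10-step, 16-d interaction sequences.
run_daily_batch({u: rng.normal(size=(10, 16)) for u in ("u1", "u2")})

w = rng.normal(size=(16 + 8, 32)) * 0.1
emb = serve_request("u1", context_features=rng.normal(size=8), w=w)
print(emb.shape)  # (32,)
```

Only the cheap tail of the network runs per request; the expensive sequence encoding is amortized across the daily batch job.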

System Design Takeaway: Hybrid Architectures

This hybrid approach exemplifies a common pattern in large-scale machine learning systems where combining offline batch processing (for stability and efficiency) with online real-time computation (for freshness and responsiveness) can yield significant performance gains and address specific data availability challenges. It allows for leveraging deep historical insights while remaining agile enough to react to immediate user signals.

Training with Synthetic Context Data

A notable challenge was enabling the model to learn from real-time context during offline training, as this data isn't available until serving. Pinterest solved this by using synthetic augmented data. Pseudo-context derived from positive conversion events is injected into the input sequence during training, encouraging the model to retrieve items semantically related to the session's context. A high dropout rate in the context layer during training ensures the model doesn't over-rely on synthetic context and still leverages historical sequences.
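A sketch of how such a training example might be assembled. The context-derivation step (using the positive item's own features as pseudo-context) and the dropout rate are assumptions for illustration; the source only states that pseudo-context comes from positive conversion events and that the context layer uses a high dropout rate during training.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_training_example(history, positive_item, ctx_dropout=0.5):
    """Build one training input: pseudo-context is derived from the positive
    conversion event (here, simply its feature vector), then randomly zeroed
    out so the model cannot over-rely on context and still learns to use
    the historical sequence."""
    pseudo_context = positive_item.copy()  # hypothetical context derivation
    if rng.random() < ctx_dropout:
        pseudo_context = np.zeros_like(pseudo_context)  # context dropped
    return np.concatenate([history.mean(axis=0), pseudo_context])

hist = rng.normal(size=(10, 16))   # historical interaction sequence
pos = rng.normal(size=16)          # positive conversion event's features
x = make_training_example(hist, pos)
print(x.shape)  # (32,)
```

At serving time the same input slot is filled with real request-time context, so the model sees a consistent interface in training and inference.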

recommender systems · machine learning · real-time processing · two-tower model · hybrid architecture · feature engineering · ad tech · personalization
