Pinterest Engineering details its approach to scaling search relevance assessment for A/B experiments using fine-tuned Large Language Models (LLMs). This methodology addresses the limitations of human annotation by significantly reducing labeling costs, improving evaluation efficiency, and enabling more granular analysis of experimental results through stratified sampling designs. The system uses a cross-encoder LLM architecture to predict relevance across multiple languages, integrating various textual features to enhance prediction accuracy.
Evaluating search relevance is crucial for personalized search systems, ensuring that displayed content meets user needs. Traditionally, this relies on human annotations, which are expensive, time-consuming, and limit the scale and granularity of evaluations. These constraints often restrict A/B experiment sample sizes, making it difficult to detect small or heterogeneous treatment effects.
Pinterest developed a system in which a fine-tuned open-source LLM acts as a cross-encoder, scoring a query and a Pin jointly to predict the Pin's relevance. This is framed as a 5-level multiclass classification problem (from Highly Relevant to Highly Irrelevant), trained by minimizing a point-wise cross-entropy loss. To support multilingual search, they leveraged multilingual LLMs and integrated a comprehensive set of textual features for each Pin, including titles, descriptions, image captions, linked page data, and user-curated board titles.
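The setup above can be sketched in a few lines. This is a minimal illustration, not Pinterest's implementation: the field names, the `[SEP]` concatenation convention, and the label ordering are assumptions; the real system fine-tunes a multilingual LLM to produce the logits.

```python
import math

# Assumed 5-level scale; the source names only the endpoints
# (Highly Relevant ... Highly Irrelevant).
NUM_CLASSES = 5

def build_cross_encoder_input(query, pin_features):
    """Concatenate the query with the Pin's textual features into one
    sequence, since a cross-encoder scores the (query, Pin) pair jointly.
    Feature keys here are hypothetical placeholders."""
    pin_text = " [SEP] ".join(
        pin_features.get(key, "")
        for key in ("title", "description", "image_caption",
                    "link_text", "board_titles")
    )
    return f"{query} [SEP] {pin_text}"

def cross_entropy(logits, target_class):
    """Point-wise multiclass cross-entropy for a single (query, Pin)
    example: log-sum-exp of the logits minus the target logit."""
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_class]
```

With uniform logits the loss is `log(5)`, the entropy of a uniform 5-way guess, which is a quick sanity check when wiring up training.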
The reduced cost and time of LLM labeling enabled a shift from simple random sampling (SRS) to a stratified query sampling design. This stratification, based on an in-house query-to-interest model and popularity segments, significantly reduces Minimum Detectable Effects (MDEs) by reducing variance. Optimal allocation is used to distribute sample units across strata, enhancing the ability to measure heterogeneous treatment effects and detect smaller impact changes in A/B tests.
System Design Implication: A/B Testing Infrastructure
Integrating LLM-based evaluation into an A/B testing platform requires robust data pipelines for LLM inference, efficient storage of generated labels, and an experimentation framework capable of handling stratified sampling and complex metric aggregations like sDCG@K. The choice of LLM (e.g., XLM-RoBERTa-large vs. Llama-3-8B) also involves trade-offs between prediction quality, inference speed, and cost, directly impacting the operational efficiency of the A/B testing system.
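The summary does not define sDCG@K precisely; a plausible reading is a DCG@K computed from the LLM's graded labels, aggregated across strata with population weights. The sketch below shows a conventional gain/discount DCG@K plus that weighted aggregation, both of which are assumptions about the metric's form.

```python
import math

def dcg_at_k(relevance_labels, k):
    """Standard DCG@K over graded relevance labels (higher = more
    relevant): gain 2^rel - 1, discount log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevance_labels[:k]))

def stratified_aggregate(per_stratum_metric, stratum_sizes):
    """Recombine per-stratum metric averages into an overall estimate by
    weighting each stratum by its population share, as required when the
    sample was drawn with stratified (non-uniform) allocation."""
    total = sum(stratum_sizes.values())
    return sum(per_stratum_metric[s] * n / total
               for s, n in stratum_sizes.items())
```

The aggregation step is the part that complicates the experimentation framework: per-stratum means must be stored and reweighted, rather than averaging all sampled queries directly.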