Pinterest Engineering details its approach to scaling search relevance assessment for A/B experiments using fine-tuned Large Language Models (LLMs). This methodology addresses the limitations of human annotation by significantly reducing labeling costs, improving evaluation efficiency, and enabling more granular analysis of experimental results through stratified sampling designs. The system uses a cross-encoder LLM architecture to predict relevance across multiple languages, integrating various textual features to enhance prediction accuracy.
Evaluating search relevance is crucial for personalized search systems, ensuring that displayed content meets user needs. Traditionally, this relies on human annotations, which are expensive, time-consuming, and limit the scale and granularity of evaluations. These constraints often restrict A/B experiment sample sizes, making it difficult to detect small or heterogeneous treatment effects.
Pinterest developed a system in which a fine-tuned open-source LLM acts as a cross-encoder, scoring a query and a Pin jointly to predict the Pin's relevance. This is framed as a 5-level multiclass classification problem (from Highly Relevant to Highly Irrelevant), trained by minimizing a point-wise cross-entropy loss. To support multilingual search, they leveraged multilingual LLMs and integrated a comprehensive set of textual features for each Pin, including titles, descriptions, image captions, linked page data, and user-curated board titles.
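The setup above can be sketched in a few lines. This is a minimal illustration, not Pinterest's implementation: the field names, the `[SEP]` concatenation convention, and the label ordering are assumptions; the real system fine-tunes a multilingual LLM to produce the logits.

```python
import math

# Assumed 5-level scale; the source names only the endpoints
# (Highly Relevant ... Highly Irrelevant).
NUM_CLASSES = 5

def build_cross_encoder_input(query, pin_features):
    """Concatenate the query with the Pin's textual features into one
    sequence, since a cross-encoder scores the (query, Pin) pair jointly.
    Feature keys here are hypothetical placeholders."""
    pin_text = " [SEP] ".join(
        pin_features.get(key, "")
        for key in ("title", "description", "image_caption",
                    "link_text", "board_titles")
    )
    return f"{query} [SEP] {pin_text}"

def cross_entropy(logits, target_class):
    """Point-wise multiclass cross-entropy for a single (query, Pin)
    example: log-sum-exp of the logits minus the target logit."""
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_class]
```

With uniform logits the loss is `log(5)`, the entropy of a uniform 5-way guess, which is a quick sanity check when wiring up training.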
The reduced cost and time of LLM labeling enabled a shift from simple random sampling (SRS) to a stratified query sampling design. This stratification, based on an in-house query-to-interest model and popularity segments, significantly reduces Minimum Detectable Effects (MDEs) by reducing variance. Optimal allocation is used to distribute sample units across strata, enhancing the ability to measure heterogeneous treatment effects and detect smaller impact changes in A/B tests.
System Design Implication: A/B Testing Infrastructure
Integrating LLM-based evaluation into an A/B testing platform requires robust data pipelines for LLM inference, efficient storage of generated labels, and an experimentation framework capable of handling stratified sampling and complex metric aggregations like sDCG@K. The choice of LLM (e.g., XLM-RoBERTa-large vs. Llama-3-8B) also involves trade-offs between prediction quality, inference speed, and cost, directly impacting the operational efficiency of the A/B testing system.
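The summary does not define sDCG@K precisely; a plausible reading is a DCG@K computed from the LLM's graded labels, aggregated across strata with population weights. The sketch below shows a conventional gain/discount DCG@K plus that weighted aggregation, both of which are assumptions about the metric's form.

```python
import math

def dcg_at_k(relevance_labels, k):
    """Standard DCG@K over graded relevance labels (higher = more
    relevant): gain 2^rel - 1, discount log2(rank + 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevance_labels[:k]))

def stratified_aggregate(per_stratum_metric, stratum_sizes):
    """Recombine per-stratum metric averages into an overall estimate by
    weighting each stratum by its population share, as required when the
    sample was drawn with stratified (non-uniform) allocation."""
    total = sum(stratum_sizes.values())
    return sum(per_stratum_metric[s] * n / total
               for s, n in stratum_sizes.items())
```

The aggregation step is the part that complicates the experimentation framework: per-stratum means must be stored and reweighted, rather than averaging all sampled queries directly.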