Dropbox Dash utilizes a RAG pattern for enterprise search, where the quality of search ranking is critical for accurate AI responses. This article details an architecture for improving search relevance models by combining human labeling with LLM-assisted evaluation. The approach leverages LLMs as a force multiplier for generating high-quality training data, addressing the scalability and consistency challenges of purely human-driven annotation.
Dropbox Dash's AI-powered search operates on a Retrieval-Augmented Generation (RAG) pattern. This means it first retrieves relevant company information using an enterprise search index and then uses a Large Language Model (LLM) to generate a response. The effectiveness of this system heavily relies on the quality of the search ranking model, which determines the most relevant documents to pass to the LLM.
Modern search ranking models are machine learning-trained, not hand-tuned. They learn from query-document pairs annotated with human relevance judgments (typically on a 1-5 scale). The core challenge is generating a sufficient volume of high-quality, consistent relevance labels, especially in enterprises with millions or billions of documents. Traditional human labeling is expensive, slow, and can struggle with sensitive data or diverse content types.
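The training-data unit described above can be sketched as a simple record type. The field names and validation below are illustrative, not Dash's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelevanceJudgment:
    """One annotated query-document pair on the 1-5 relevance scale."""
    query: str
    doc_id: str
    doc_snippet: str
    label: int  # 1 = irrelevant ... 5 = perfectly relevant

    def __post_init__(self):
        if not 1 <= self.label <= 5:
            raise ValueError("relevance label must be in 1..5")

# A tiny synthetic sample of the kind of data a ranking model learns from.
judgments = [
    RelevanceJudgment("q3 revenue report", "doc-17", "Q3 FY24 revenue summary...", 5),
    RelevanceJudgment("q3 revenue report", "doc-42", "Office seating chart", 1),
]
```

A ranking model is then trained so that, for a given query, documents with higher judged relevance are scored above documents with lower judged relevance.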
To overcome the limitations of human labeling, Dash employs LLMs to amplify the process. Instead of replacing traditional ranking models at query time (due to latency and context window constraints), LLMs are used offline to generate vast amounts of training data. A small, high-quality human-labeled dataset is used to tune the LLM's prompts and model parameters. Once validated, the LLM generates hundreds of thousands or millions of relevance labels, effectively acting as a 'teacher' for smaller, production-scale relevance models.
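A minimal sketch of the offline labeling step: the LLM is prompted to score query-document relevance on the same 1-5 scale the human annotators use, and its output is parsed and clamped to the valid range. `call_llm` and the prompt wording are hypothetical stand-ins, not Dash's actual prompt or API; here the model call is stubbed so the sketch is self-contained:

```python
import json

# Hypothetical rating prompt; a real prompt would be tuned against the
# human-labeled dataset as the article describes.
PROMPT_TEMPLATE = """You are a search relevance rater.
Query: {query}
Document: {doc}
Rate relevance from 1 (irrelevant) to 5 (perfect).
Answer with JSON: {{"label": <int>, "reason": "<short justification>"}}"""

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call a hosted LLM here.
    return '{"label": 4, "reason": "document addresses the query topic"}'

def llm_relevance_label(query: str, doc: str) -> int:
    raw = call_llm(PROMPT_TEMPLATE.format(query=query, doc=doc))
    label = int(json.loads(raw)["label"])
    return min(5, max(1, label))  # clamp to the 1-5 scale
```

Because this runs as an offline batch job, it can be applied to hundreds of thousands of pairs without affecting query-time latency.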
LLM for Offline Data Generation
Using LLMs for offline data generation (synthesizing training data) rather than direct online inference is a powerful pattern for integrating LLMs into systems where real-time performance is critical. This decouples the expensive LLM inference from the low-latency requirements of the production system.
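The teacher/student split implied here can be illustrated end to end: the expensive LLM labels data offline, and a cheap model is fitted to those labels for low-latency serving. The sketch below uses a one-feature linear regressor trained by gradient descent on synthetic data; a real production ranker would use far richer features and model classes:

```python
def fit_student(features, teacher_labels, lr=0.1, steps=2000):
    """Fit label ~ w * feature + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(features, teacher_labels):
            err = (w * x + b) - y
            grad_w += 2 * err * x / n
            grad_b += 2 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic example: feature = a lexical-overlap score for the pair,
# target = the LLM teacher's relevance label (1-5).
overlap_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
teacher_labels = [1, 2, 3, 4, 5]
w, b = fit_student(overlap_scores, teacher_labels)

def predict(x):
    # Cheap enough to run per-document inside the online ranking path.
    return w * x + b
```

The LLM never appears in the serving path; only the small fitted model does, which is what keeps online latency unaffected.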
Improving LLM accuracy involves an iterative evaluation loop: measure performance against human judgments (using metrics like Mean Squared Error), adjust prompts or models, and remeasure. Dash focuses evaluation on cases where mistakes are more likely, identified by discrepancies between user behavior (clicks, skips) and LLM predictions. Furthermore, LLMs are provided with 'tools' to research query context (internal terminology, acronyms) to make more accurate, context-aware relevance judgments, mimicking how human evaluators would resolve ambiguity.
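The two evaluation mechanisms above can be sketched directly, under assumed data shapes: an MSE comparison of LLM labels against human judgments, and a filter that flags pairs where click behavior disagrees with the LLM's prediction (field names and thresholds are illustrative):

```python
def mse(llm_labels, human_labels):
    """Mean squared error between LLM and human relevance labels."""
    return sum((p - h) ** 2 for p, h in zip(llm_labels, human_labels)) / len(llm_labels)

def discrepancy_cases(records, low=2, high=4):
    """records: (query, doc_id, llm_label, clicked) tuples.

    Flag clicked documents the LLM rated low, and skipped documents it
    rated high; these are the likely-mistake cases worth re-reviewing.
    """
    flagged = []
    for query, doc_id, llm_label, clicked in records:
        if clicked and llm_label <= low:
            flagged.append((query, doc_id, "clicked-but-rated-low"))
        elif not clicked and llm_label >= high:
            flagged.append((query, doc_id, "skipped-but-rated-high"))
    return flagged
```

Each loop iteration then becomes: measure MSE, inspect the flagged discrepancies, adjust the prompt or model, and remeasure.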