Dropbox Dash utilizes a RAG pattern for enterprise search, where the quality of search ranking is critical for accurate AI responses. This article details an architecture for improving search relevance models by combining human labeling with LLM-assisted evaluation. The approach leverages LLMs as a force multiplier for generating high-quality training data, addressing the scalability and consistency challenges of purely human-driven annotation.
Dropbox Dash's AI-powered search operates on a Retrieval-Augmented Generation (RAG) pattern. This means it first retrieves relevant company information using an enterprise search index and then uses a Large Language Model (LLM) to generate a response. The effectiveness of this system heavily relies on the quality of the search ranking model, which determines the most relevant documents to pass to the LLM.
Modern search ranking models are machine learning-trained, not hand-tuned. They learn from query-document pairs annotated with human relevance judgments (typically on a 1-5 scale). The core challenge is generating a sufficient volume of high-quality, consistent relevance labels, especially in enterprises with millions or billions of documents. Traditional human labeling is expensive, slow, and can struggle with sensitive data or diverse content types.
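The training-data unit described above can be sketched as a simple record type. The field names and validation below are illustrative, not Dash's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RelevanceJudgment:
    """One annotated query-document pair on the 1-5 relevance scale."""
    query: str
    doc_id: str
    doc_snippet: str
    label: int  # 1 = irrelevant ... 5 = perfectly relevant

    def __post_init__(self):
        if not 1 <= self.label <= 5:
            raise ValueError("relevance label must be in 1..5")

# A tiny synthetic sample of the kind of data a ranking model learns from.
judgments = [
    RelevanceJudgment("q3 revenue report", "doc-17", "Q3 FY24 revenue summary...", 5),
    RelevanceJudgment("q3 revenue report", "doc-42", "Office seating chart", 1),
]
```

A ranking model is then trained so that, for a given query, documents with higher judged relevance are scored above documents with lower judged relevance.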
To overcome the limitations of human labeling, Dash employs LLMs to amplify the process. Instead of replacing traditional ranking models at query time (due to latency and context window constraints), LLMs are used offline to generate vast amounts of training data. A small, high-quality human-labeled dataset is used to tune the LLM's prompts and model parameters. Once validated, the LLM generates hundreds of thousands or millions of relevance labels, effectively acting as a 'teacher' for smaller, production-scale relevance models.
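A minimal sketch of the offline labeling step: the LLM is prompted to score query-document relevance on the same 1-5 scale the human annotators use, and its output is parsed and clamped to the valid range. `call_llm` and the prompt wording are hypothetical stand-ins, not Dash's actual prompt or API; here the model call is stubbed so the sketch is self-contained:

```python
import json

# Hypothetical rating prompt; a real prompt would be tuned against the
# human-labeled dataset as the article describes.
PROMPT_TEMPLATE = """You are a search relevance rater.
Query: {query}
Document: {doc}
Rate relevance from 1 (irrelevant) to 5 (perfect).
Answer with JSON: {{"label": <int>, "reason": "<short justification>"}}"""

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call a hosted LLM here.
    return '{"label": 4, "reason": "document addresses the query topic"}'

def llm_relevance_label(query: str, doc: str) -> int:
    raw = call_llm(PROMPT_TEMPLATE.format(query=query, doc=doc))
    label = int(json.loads(raw)["label"])
    return min(5, max(1, label))  # clamp to the 1-5 scale
```

Because this runs as an offline batch job, it can be applied to hundreds of thousands of pairs without affecting query-time latency.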
LLM for Offline Data Generation
Using LLMs for offline data generation (synthesizing training data) rather than direct online inference is a powerful pattern for integrating LLMs into systems where real-time performance is critical. This decouples the expensive LLM inference from the low-latency requirements of the production system.
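The teacher/student split implied here can be illustrated end to end: the expensive LLM labels data offline, and a cheap model is fitted to those labels for low-latency serving. The sketch below uses a one-feature linear regressor trained by gradient descent on synthetic data; a real production ranker would use far richer features and model classes:

```python
def fit_student(features, teacher_labels, lr=0.1, steps=2000):
    """Fit label ~ w * feature + b by minimizing mean squared error."""
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(features, teacher_labels):
            err = (w * x + b) - y
            grad_w += 2 * err * x / n
            grad_b += 2 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic example: feature = a lexical-overlap score for the pair,
# target = the LLM teacher's relevance label (1-5).
overlap_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
teacher_labels = [1, 2, 3, 4, 5]
w, b = fit_student(overlap_scores, teacher_labels)

def predict(x):
    # Cheap enough to run per-document inside the online ranking path.
    return w * x + b
```

The LLM never appears in the serving path; only the small fitted model does, which is what keeps online latency unaffected.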
Improving LLM accuracy involves an iterative evaluation loop: measure performance against human judgments (using metrics like Mean Squared Error), adjust prompts or models, and remeasure. Dash focuses evaluation on cases where mistakes are more likely, identified by discrepancies between user behavior (clicks, skips) and LLM predictions. Furthermore, LLMs are provided with 'tools' to research query context (internal terminology, acronyms) to make more accurate, context-aware relevance judgments, mimicking how human evaluators would resolve ambiguity.
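The two evaluation mechanisms above can be sketched directly, under assumed data shapes: an MSE comparison of LLM labels against human judgments, and a filter that flags pairs where click behavior disagrees with the LLM's prediction (field names and thresholds are illustrative):

```python
def mse(llm_labels, human_labels):
    """Mean squared error between LLM and human relevance labels."""
    return sum((p - h) ** 2 for p, h in zip(llm_labels, human_labels)) / len(llm_labels)

def discrepancy_cases(records, low=2, high=4):
    """records: (query, doc_id, llm_label, clicked) tuples.

    Flag clicked documents the LLM rated low, and skipped documents it
    rated high; these are the likely-mistake cases worth re-reviewing.
    """
    flagged = []
    for query, doc_id, llm_label, clicked in records:
        if clicked and llm_label <= low:
            flagged.append((query, doc_id, "clicked-but-rated-low"))
        elif not clicked and llm_label >= high:
            flagged.append((query, doc_id, "skipped-but-rated-high"))
    return flagged
```

Each loop iteration then becomes: measure MSE, inspect the flagged discrepancies, adjust the prompt or model, and remeasure.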