Dropbox engineers have developed a method using Large Language Models (LLMs) to significantly improve the quality and scale of data labeling for Retrieval-Augmented Generation (RAG) systems. This approach addresses the bottleneck of document retrieval quality in RAG by amplifying human judgment for training search ranking models. It offers a scalable and cost-effective solution for enterprises dealing with vast document repositories, combining automated LLM labeling with human calibration and oversight.
Retrieval-Augmented Generation (RAG) systems are critical for generating relevant responses from vast document repositories. A core bottleneck in these systems is the quality of document retrieval, specifically how well the system identifies and ranks relevant content before passing it to an LLM. For systems like Dropbox Dash, which deals with millions or billions of documents, the initial search ranking directly impacts the quality of the final generated answer.
RAG System Bottleneck
The quality of search ranking and the labeled relevance data used to train it are paramount for the overall effectiveness of a RAG system. If the retrieval step fails to identify the most relevant documents, even a powerful LLM will produce suboptimal results.
Traditionally, relevance labeling for training search ranking models is done by human judges, a process that is expensive, slow, and can be inconsistent. Dropbox introduced a human-calibrated LLM labeling approach to overcome these limitations. This method leverages LLMs to generate relevance judgments at scale, amplifying human effort by roughly 100x.
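The labeling-plus-calibration loop can be sketched roughly as follows. Everything here is illustrative: the `llm_judge` function, the 0–2 label scale, and the agreement check are assumptions standing in for whatever prompts, rubric, and calibration process Dropbox actually uses.

```python
# Hypothetical stand-in for an LLM relevance judge. In practice this would
# prompt a real model with the query and document and parse a graded label
# (e.g., 0 = irrelevant, 1 = partially relevant, 2 = highly relevant).
def llm_judge(query: str, document: str) -> int:
    # Toy heuristic so the sketch is runnable: grade by term overlap.
    overlap = len(set(query.lower().split()) & set(document.lower().split()))
    return min(overlap, 2)

def label_corpus(pairs):
    """Generate relevance labels at scale with the LLM judge."""
    return [(q, d, llm_judge(q, d)) for q, d in pairs]

def calibration_agreement(human_labels, llm_labels):
    """Fraction of a small human-judged calibration set the LLM agrees with.
    If this drops below a threshold, the prompt or judge model is revised
    before the bulk labels are trusted."""
    matches = sum(1 for h, m in zip(human_labels, llm_labels) if h == m)
    return matches / len(human_labels)

# A small human-labeled calibration slice keeps the LLM judge honest,
# while the bulk of the corpus is labeled automatically.
calib_pairs = [
    ("quarterly sales report", "Q3 sales report with quarterly figures"),
    ("vacation policy", "Quarterly sales figures and projections"),
]
human = [2, 0]
llm = [llm_judge(q, d) for q, d in calib_pairs]
agreement = calibration_agreement(human, llm)
```

The key design point is that humans move from labeling every example to auditing a small calibration sample, which is where the roughly 100x amplification of human effort comes from.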
Notably, the LLMs are used for *labeling* and *evaluation*, not for real-time query-time ranking. Running an LLM directly in the ranking path would be too slow and would be limited by context-window constraints, which is why the ranking model remains a separate, lightweight component trained on the augmented label data.
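That separation can be pictured with a minimal sketch: the expensive LLM labeling happens offline, a cheap ranker is fit to those labels, and only the cheap ranker runs at query time. The feature set and hand-written weights below are assumptions for illustration; the elided training step would in practice be something like a gradient-boosted or neural ranker fit to the LLM-generated labels.

```python
# Offline: LLM-generated labels train a lightweight ranking model.
# Online: only that cheap model scores documents; no LLM in the query path.

def extract_features(query: str, document: str) -> list[float]:
    """Cheap lexical features the ranker can compute in microseconds."""
    q_terms = set(query.lower().split())
    d_terms = document.lower().split()
    overlap = len(q_terms & set(d_terms))
    return [float(overlap), overlap / max(len(d_terms), 1)]

# Hypothetical weights, standing in for parameters learned offline by
# fitting the ranker to the LLM-labeled training data.
WEIGHTS = [1.0, 0.5]

def rank(query: str, documents: list[str]) -> list[str]:
    """Query-time ranking: a linear score over cheap features."""
    def score(doc: str) -> float:
        return sum(w * f for w, f in zip(WEIGHTS, extract_features(query, doc)))
    return sorted(documents, key=score, reverse=True)

docs = [
    "Company holiday schedule for next year",
    "Quarterly sales report with revenue figures",
]
top = rank("quarterly sales report", docs)
```

The point of the sketch is the boundary, not the model: the LLM's judgment reaches the query path only indirectly, distilled into the trained ranker's parameters.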