InfoQ Architecture·March 7, 2026

Scaling Human Judgment with LLMs for RAG Systems

Dropbox engineers have developed a method using Large Language Models (LLMs) to significantly improve the quality and scale of data labeling for Retrieval-Augmented Generation (RAG) systems. This approach addresses the bottleneck of document retrieval quality in RAG by amplifying human judgment for training search ranking models. It offers a scalable and cost-effective solution for enterprises dealing with vast document repositories, combining automated LLM labeling with human calibration and oversight.


The Challenge of RAG Systems and Document Retrieval

Retrieval-Augmented Generation (RAG) systems are critical for generating relevant responses from vast document repositories. A core bottleneck in these systems is the quality of document retrieval, specifically how well the system identifies and ranks relevant content before passing it to an LLM. For systems like Dropbox Dash, which deals with millions or billions of documents, the initial search ranking directly impacts the quality of the final generated answer.


RAG System Bottleneck

The quality of search ranking and the labeled relevance data used to train it are paramount for the overall effectiveness of a RAG system. If the retrieval step fails to identify the most relevant documents, even a powerful LLM will produce suboptimal results.
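The dependency described above can be made concrete with a minimal sketch of the retrieval step: documents are ranked for a query, and only the top-k reach the generator. The scoring function here is a toy stand-in (term overlap), not Dropbox's ranker; in production this would be a trained ranking model.

```python
# Minimal sketch of RAG's retrieval bottleneck: the generator only ever
# sees the top-k ranked documents, so ranking quality bounds answer quality.
# The `score` function is a toy stand-in for a trained ranking model.

def score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query terms found in the document."""
    terms = query.lower().split()
    text = doc.lower()
    return sum(t in text for t in terms) / len(terms)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank the corpus by relevance and return the top-k documents."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

corpus = [
    "Q3 revenue report for the sales team",
    "Onboarding checklist for new engineers",
    "Engineering onboarding: setting up your dev environment",
]
top = retrieve("engineer onboarding setup", corpus)
# If the right documents are not in `top`, even a powerful LLM
# cannot recover the missing context downstream.
```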

Human-Calibrated LLM Labeling Architecture

Traditionally, relevance labeling for training search ranking models is done by human judges, a process that is expensive, slow, and can be inconsistent. Dropbox introduced a human-calibrated LLM labeling approach to overcome these limitations. This method leverages LLMs to generate relevance judgments at scale, amplifying human effort by roughly 100x.

  1. Human Labeling of Small Dataset: A small, high-quality dataset of query-document pairs is initially labeled by human experts.
  2. LLM Calibration: This human-labeled dataset is used to calibrate the LLM evaluator, teaching it to align with human judgment.
  3. LLM Labeling at Scale: The calibrated LLM then generates hundreds of thousands or millions of additional labels for new query-document pairs.
  4. Human Oversight and Evaluation: LLM-generated labels are not used blindly. A critical evaluation step compares LLM ratings with human judgments on a separate test subset. Focus is placed on 'hardest mistakes' where LLM and human judgments disagree or conflict with user behavior (e.g., clicks/skips).
  5. Contextual Understanding: To improve accuracy, LLMs are designed to perform additional searches and understand internal terminology, which is crucial for contextual relevance (e.g., 'diet sprite' having a specific internal meaning).
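Steps 1 through 3 can be sketched as a simple loop: human judgments seed a few-shot prompt that calibrates the LLM, which then labels new pairs at scale. The names here (`call_llm`, the example data) are illustrative assumptions, not Dropbox's pipeline, and the LLM call is stubbed out; a real system would hit a model endpoint.

```python
# Sketch of human-calibrated LLM labeling, under illustrative assumptions.
# `call_llm` is a hypothetical stub standing in for a real model API call.

HUMAN_LABELED = [  # step 1: small, high-quality human-labeled seed set
    {"query": "vacation policy", "doc": "PTO and leave guidelines", "label": "relevant"},
    {"query": "vacation policy", "doc": "Q3 sales figures", "label": "not_relevant"},
]

def build_prompt(query: str, doc: str) -> str:
    """Step 2: calibrate by embedding human judgments as few-shot examples."""
    examples = "\n".join(
        f"Query: {ex['query']}\nDoc: {ex['doc']}\nLabel: {ex['label']}"
        for ex in HUMAN_LABELED
    )
    return f"{examples}\n\nQuery: {query}\nDoc: {doc}\nLabel:"

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model endpoint here.
    return "relevant"

def label_at_scale(pairs):
    """Step 3: the calibrated LLM labels many new query-document pairs."""
    return [(q, d, call_llm(build_prompt(q, d))) for q, d in pairs]

labels = label_at_scale([("parental leave", "PTO and leave guidelines")])
```

Few-shot calibration is only one way to align an LLM judge with human raters; fine-tuning on the seed set is another, and the article does not specify which Dropbox uses.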

It's important to note that LLMs are used for *labeling* and *evaluation*, not for query-time ranking. Running an LLM directly in the ranking path would be too slow and constrained by context windows, which underscores the architectural decision: keep the ranking model separate and train it on this augmented label set.

LLM · RAG · Data Labeling · Search Ranking · Machine Learning · Scalability · Human-in-the-Loop · Information Retrieval
