InfoQ Architecture·March 7, 2026

Scaling Human Judgment with LLMs for RAG Systems

Dropbox engineers have developed a method using Large Language Models (LLMs) to significantly improve the quality and scale of data labeling for Retrieval-Augmented Generation (RAG) systems. This approach addresses the bottleneck of document retrieval quality in RAG by amplifying human judgment for training search ranking models. It offers a scalable and cost-effective solution for enterprises dealing with vast document repositories, combining automated LLM labeling with human calibration and oversight.


The Challenge of RAG Systems and Document Retrieval

Retrieval-Augmented Generation (RAG) systems are critical for generating relevant responses from vast document repositories. A core bottleneck in these systems is the quality of document retrieval, specifically how well the system identifies and ranks relevant content before passing it to an LLM. For systems like Dropbox Dash, which deals with millions or billions of documents, the initial search ranking directly impacts the quality of the final generated answer.


RAG System Bottleneck

The quality of search ranking and the labeled relevance data used to train it are paramount for the overall effectiveness of a RAG system. If the retrieval step fails to identify the most relevant documents, even a powerful LLM will produce suboptimal results.
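The dependency described above can be made concrete with a minimal sketch of the retrieval step: documents are ranked for a query, and only the top-k reach the generator. The scoring function here is a toy stand-in (term overlap), not Dropbox's ranker; in production this would be a trained ranking model.

```python
# Minimal sketch of RAG's retrieval bottleneck: the generator only ever
# sees the top-k ranked documents, so ranking quality bounds answer quality.
# The `score` function is a toy stand-in for a trained ranking model.

def score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query terms found in the document."""
    terms = query.lower().split()
    text = doc.lower()
    return sum(t in text for t in terms) / len(terms)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank the corpus by relevance and return the top-k documents."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

corpus = [
    "Q3 revenue report for the sales team",
    "Onboarding checklist for new engineers",
    "Engineering onboarding: setting up your dev environment",
]
top = retrieve("engineer onboarding setup", corpus)
# If the right documents are not in `top`, even a powerful LLM
# cannot recover the missing context downstream.
```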

Human-Calibrated LLM Labeling Architecture

Traditionally, relevance labeling for training search ranking models is done by human judges, a process that is expensive, slow, and can be inconsistent. Dropbox introduced a human-calibrated LLM labeling approach to overcome these limitations. This method leverages LLMs to generate relevance judgments at scale, amplifying human effort by roughly 100x.

  1. Human Labeling of Small Dataset: A small, high-quality dataset of query-document pairs is initially labeled by human experts.
  2. LLM Calibration: This human-labeled dataset is used to calibrate the LLM evaluator, teaching it to align with human judgment.
  3. LLM Labeling at Scale: The calibrated LLM then generates hundreds of thousands or millions of additional labels for new query-document pairs.
  4. Human Oversight and Evaluation: LLM-generated labels are not used blindly. A critical evaluation step compares LLM ratings with human judgments on a separate test subset. Focus is placed on 'hardest mistakes' where LLM and human judgments disagree or conflict with user behavior (e.g., clicks/skips).
  5. Contextual Understanding: To improve accuracy, LLMs are designed to perform additional searches and understand internal terminology, which is crucial for contextual relevance (e.g., 'diet sprite' having a specific internal meaning).
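Steps 1 through 3 can be sketched as a simple loop: human judgments seed a few-shot prompt that calibrates the LLM, which then labels new pairs at scale. The names here (`call_llm`, the example data) are illustrative assumptions, not Dropbox's pipeline, and the LLM call is stubbed out; a real system would hit a model endpoint.

```python
# Sketch of human-calibrated LLM labeling, under illustrative assumptions.
# `call_llm` is a hypothetical stub standing in for a real model API call.

HUMAN_LABELED = [  # step 1: small, high-quality human-labeled seed set
    {"query": "vacation policy", "doc": "PTO and leave guidelines", "label": "relevant"},
    {"query": "vacation policy", "doc": "Q3 sales figures", "label": "not_relevant"},
]

def build_prompt(query: str, doc: str) -> str:
    """Step 2: calibrate by embedding human judgments as few-shot examples."""
    examples = "\n".join(
        f"Query: {ex['query']}\nDoc: {ex['doc']}\nLabel: {ex['label']}"
        for ex in HUMAN_LABELED
    )
    return f"{examples}\n\nQuery: {query}\nDoc: {doc}\nLabel:"

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model endpoint here.
    return "relevant"

def label_at_scale(pairs):
    """Step 3: the calibrated LLM labels many new query-document pairs."""
    return [(q, d, call_llm(build_prompt(q, d))) for q, d in pairs]

labels = label_at_scale([("parental leave", "PTO and leave guidelines")])
```

Few-shot calibration is only one way to align an LLM judge with human raters; fine-tuning on the seed set is another, and the article does not specify which Dropbox uses.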

It's important to note that LLMs are used for *labeling* and *evaluation*, not for query-time ranking. Running an LLM directly in the ranking path would be too slow and constrained by context windows, which underscores the architectural decision: keep the ranking model separate and train it on this augmented label set.

LLM · RAG · Data Labeling · Search Ranking · Machine Learning · Scalability · Human-in-the-Loop · Information Retrieval
