This article from Dropbox details their process of optimizing the relevance judge for Dropbox Dash, which relies on LLMs. They address challenges like prompt brittleness, cost, and operational reliability when migrating from powerful, expensive models to smaller, more cost-effective ones. The core of their solution involves using DSPy to systematically optimize prompts, improving human alignment and ensuring consistent, machine-readable outputs at scale.
Integrating Large Language Models (LLMs) as relevance judges in production systems, such as Dropbox Dash's search and knowledge retrieval, presents significant architectural challenges. While powerful models offer high accuracy, their cost and latency often prohibit large-scale deployment. Migrating to smaller, cheaper models introduces issues like prompt brittleness, where carefully tuned prompts for one model don't transfer, leading to quality degradation and extensive manual re-tuning. This highlights a critical system design trade-off between model performance/cost and engineering effort/reliability.
Why System Design Matters for LLM Integration
When building systems dependent on LLMs, consider the entire lifecycle: initial prototyping, scaling, cost management, and operational stability. A powerful model might work well in a prototype, but a production system requires a robust strategy for model adaptation, prompt engineering, and output validation to ensure reliability and cost-effectiveness at scale. Manual prompt tuning quickly becomes a bottleneck and a source of regressions as models evolve or are swapped out.
Dropbox leveraged DSPy, an open-source framework, to systematically optimize their LLM-based relevance judge. DSPy transforms manual, fragile prompt tuning into a repeatable optimization loop by defining a clear objective function (e.g., minimizing disagreement with human judgments, ensuring valid output format). This allows engineers to adapt prompts for different models while maintaining performance and improving operational reliability.
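In DSPy, that objective is expressed as an ordinary Python metric function scoring a (gold example, model prediction) pair, which the optimizer then tries to maximize. A minimal sketch of such a metric, using hypothetical field names (`human_score`, a 1–5 relevance rating, and a JSON `judgment` string) rather than Dropbox's actual schema, with plain dicts standing in for DSPy's example and prediction objects:

```python
import json

def judge_metric(example, prediction, trace=None):
    """Score one prediction: require valid JSON output, then reward
    closeness to the human relevance rating. (Hypothetical schema.)"""
    # Hard requirement: the output must parse as JSON, or the score is zero.
    try:
        parsed = json.loads(prediction["judgment"])
    except (json.JSONDecodeError, TypeError):
        return 0.0

    # Soft requirement: penalize distance from the human 1-5 rating.
    predicted = float(parsed.get("relevance", 0))
    human = float(example["human_score"])
    max_error = 4.0  # ratings span 1..5, so 4 is the worst possible miss
    return 1.0 - min(abs(predicted - human), max_error) / max_error
```

Folding both human agreement and format validity into one scalar is what lets a single optimization run improve alignment and reliability together, instead of trading one off against the other by hand.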
Key aspects of DSPy's approach include:
- Defining a clear, measurable objective function (e.g., minimizing disagreement with human judgments, ensuring a valid output format) instead of hand-tuning prompt wording.
- Treating prompt adaptation as a repeatable optimization loop, so prompts can be re-derived whenever the underlying model is swapped.
- Evaluating candidate prompts against the same objective across models, maintaining performance while improving operational reliability.
By applying DSPy, Dropbox achieved significant improvements when migrating their relevance judge from a powerful, expensive model (OpenAI o3) to a cheaper, open-weight model (gpt-oss-120b). They reduced NMSE by 45%, meaning the judge's scores tracked human ratings more closely, and slashed model adaptation time from weeks to days. This allowed them to label 10-100 times more data at the same cost, increasing coverage and enabling larger experiments.
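The 45% figure refers to normalized mean squared error between the judge's scores and human ratings. One common normalization, assumed here since the article does not spell out the formula, divides the MSE by the variance of the human ratings, so NMSE = 1.0 means the judge is no better than always predicting the mean human rating:

```python
def nmse(judge_scores, human_scores):
    """Mean squared error normalized by the variance of the human ratings.
    NMSE < 1.0 means the judge beats a constant mean-rating baseline."""
    n = len(human_scores)
    mean_human = sum(human_scores) / n
    mse = sum((j - h) ** 2 for j, h in zip(judge_scores, human_scores)) / n
    variance = sum((h - mean_human) ** 2 for h in human_scores) / n
    return mse / variance
```

Under this definition, a 45% reduction means the squared deviation from human ratings (relative to that baseline) nearly halved after optimization.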
Beyond human alignment, operational reliability was a critical concern. Smaller models are prone to producing malformed outputs (e.g., invalid JSON), which can break downstream systems. By optimizing with DSPy against a smaller, more brittle model (gemma-3-12b), Dropbox reduced malformed JSON outputs by over 97%, making the judge consistently usable in automated pipelines. This demonstrates that robust system design for LLM integration must account for both qualitative performance and quantitative operational metrics.
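Tracking that reliability metric is straightforward: attempt to parse every judge response and report the malformed fraction. A generic sketch, not Dropbox's pipeline code:

```python
import json

def malformed_rate(responses):
    """Fraction of raw judge responses that fail to parse as JSON objects."""
    bad = 0
    for raw in responses:
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                bad += 1  # parses, but isn't a judgment object (e.g., bare list)
        except json.JSONDecodeError:
            bad += 1  # truncated or otherwise invalid JSON
    return bad / len(responses)
```

Because this check is cheap and fully automated, it can run on every batch of judge outputs, turning "the judge broke the pipeline again" into a monitored, optimizable number.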