This article from Dropbox details their process of optimizing the relevance judge for Dropbox Dash, which relies on LLMs. They address challenges like prompt brittleness, cost, and operational reliability when migrating from powerful, expensive models to smaller, more cost-effective ones. The core of their solution involves using DSPy to systematically optimize prompts, improving human alignment and ensuring consistent, machine-readable outputs at scale.
Integrating Large Language Models (LLMs) as relevance judges in production systems, such as Dropbox Dash's search and knowledge retrieval, presents significant architectural challenges. While powerful models offer high accuracy, their cost and latency often prohibit large-scale deployment. Migrating to smaller, cheaper models introduces issues like prompt brittleness, where carefully tuned prompts for one model don't transfer, leading to quality degradation and extensive manual re-tuning. This highlights a critical system design trade-off between model performance/cost and engineering effort/reliability.
Why System Design Matters for LLM Integration
When building systems dependent on LLMs, consider the entire lifecycle: initial prototyping, scaling, cost management, and operational stability. A powerful model might work well in a prototype, but a production system requires a robust strategy for model adaptation, prompt engineering, and output validation to ensure reliability and cost-effectiveness at scale. Manual prompt tuning quickly becomes a bottleneck and a source of regressions as models evolve or are swapped out.
Dropbox leveraged DSPy, an open-source framework, to systematically optimize their LLM-based relevance judge. DSPy transforms manual, fragile prompt tuning into a repeatable optimization loop by defining a clear objective function (e.g., minimizing disagreement with human judgments, ensuring valid output format). This allows engineers to adapt prompts for different models while maintaining performance and improving operational reliability.
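In DSPy, that objective is expressed as an ordinary Python metric function scoring a (gold example, model prediction) pair, which the optimizer then tries to maximize. A minimal sketch of such a metric, using hypothetical field names (`human_score`, a 1–5 relevance rating, and a JSON `judgment` string) rather than Dropbox's actual schema, with plain dicts standing in for DSPy's example and prediction objects:

```python
import json

def judge_metric(example, prediction, trace=None):
    """Score one prediction: require valid JSON output, then reward
    closeness to the human relevance rating. (Hypothetical schema.)"""
    # Hard requirement: the output must parse as JSON, or the score is zero.
    try:
        parsed = json.loads(prediction["judgment"])
    except (json.JSONDecodeError, TypeError):
        return 0.0

    # Soft requirement: penalize distance from the human 1-5 rating.
    predicted = float(parsed.get("relevance", 0))
    human = float(example["human_score"])
    max_error = 4.0  # ratings span 1..5, so 4 is the worst possible miss
    return 1.0 - min(abs(predicted - human), max_error) / max_error
```

Folding both human agreement and format validity into one scalar is what lets a single optimization run improve alignment and reliability together, instead of trading one off against the other by hand.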
Key aspects of DSPy's approach include:
- Defining a clear, measurable objective function (e.g., minimizing disagreement with human judgments, ensuring a valid output format) instead of hand-tuning prompt wording.
- Treating prompt adaptation as a repeatable optimization loop, so prompts can be re-derived whenever the underlying model is swapped.
- Evaluating candidate prompts against the same objective across models, maintaining performance while improving operational reliability.
By applying DSPy, Dropbox achieved significant improvements when migrating their relevance judge from a powerful, expensive model (OpenAI o3) to a cheaper, open-weight model (gpt-oss-120b). They reduced NMSE by 45%, meaning the judge's scores tracked human ratings more closely, and slashed model adaptation time from weeks to days. This allowed them to label 10-100 times more data at the same cost, increasing coverage and enabling larger experiments.
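The 45% figure refers to normalized mean squared error between the judge's scores and human ratings. One common normalization, assumed here since the article does not spell out the formula, divides the MSE by the variance of the human ratings, so NMSE = 1.0 means the judge is no better than always predicting the mean human rating:

```python
def nmse(judge_scores, human_scores):
    """Mean squared error normalized by the variance of the human ratings.
    NMSE < 1.0 means the judge beats a constant mean-rating baseline."""
    n = len(human_scores)
    mean_human = sum(human_scores) / n
    mse = sum((j - h) ** 2 for j, h in zip(judge_scores, human_scores)) / n
    variance = sum((h - mean_human) ** 2 for h in human_scores) / n
    return mse / variance
```

Under this definition, a 45% reduction means the squared deviation from human ratings (relative to that baseline) nearly halved after optimization.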
Beyond human alignment, operational reliability was a critical concern. Smaller models are prone to producing malformed outputs (e.g., invalid JSON), which can break downstream systems. By optimizing with DSPy against a smaller, more brittle model (gemma-3-12b), Dropbox reduced malformed JSON outputs by over 97%, making the judge consistently usable in automated pipelines. This demonstrates that robust system design for LLM integration must account for both qualitative performance and quantitative operational metrics.
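Tracking that reliability metric is straightforward: attempt to parse every judge response and report the malformed fraction. A generic sketch, not Dropbox's pipeline code:

```python
import json

def malformed_rate(responses):
    """Fraction of raw judge responses that fail to parse as JSON objects."""
    bad = 0
    for raw in responses:
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                bad += 1  # parses, but isn't a judgment object (e.g., bare list)
        except json.JSONDecodeError:
            bad += 1  # truncated or otherwise invalid JSON
    return bad / len(responses)
```

Because this check is cheap and fully automated, it can run on every batch of judge outputs, turning "the judge broke the pipeline again" into a monitored, optimizable number.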