This article details how Dropbox improved its Dash chat AI agent by establishing a robust evaluation layer and leveraging DSPy for optimization. It focuses on the system and process of calibrating LLM judges against human labels and then using these improved judges to refine the agent's system prompt, creating a continuous feedback loop for AI system improvement. This architecture significantly reduced incomplete answers and token usage.
Read original on Dropbox Tech
Evaluating AI agents, especially conversational ones like Dropbox's Dash chat, is significantly more complex than traditional single-output evaluations (e.g., search relevance). An agent's response is the culmination of a multi-step process involving intent interpretation, context gathering, tool usage, information synthesis, and multi-turn interactions. This necessitates a comprehensive evaluation framework that assesses not just the final answer, but also the underlying decisions and component interactions, such as intent understanding, context selection, tool use, and grounding.
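As an illustration of evaluating the underlying decisions rather than only the final answer, here is a minimal sketch of a per-trace evaluation record. The four dimension names come from the paragraph above; the TraceEvaluation schema, the 0.0 to 1.0 score scale, and the example values are assumptions made for illustration, not Dropbox's actual data model.

```python
from dataclasses import dataclass, field
from statistics import mean

# Dimension names follow the text; the 0.0-1.0 scale is an assumption.
DIMENSIONS = ("intent_understanding", "context_selection", "tool_use", "grounding")


@dataclass
class TraceEvaluation:
    """Per-dimension judge scores for one conversation trace (hypothetical schema)."""

    trace_id: str
    scores: dict[str, float] = field(default_factory=dict)

    def weakest_dimension(self) -> str:
        # The lowest-scoring dimension points at the step most likely at fault.
        return min(DIMENSIONS, key=lambda d: self.scores[d])

    def overall(self) -> float:
        # Unweighted mean across dimensions; a real system might weight them.
        return mean(self.scores[d] for d in DIMENSIONS)


# Illustrative values only, not measured results.
example = TraceEvaluation(
    trace_id="trace-001",
    scores={
        "intent_understanding": 0.9,
        "context_selection": 0.4,
        "tool_use": 0.8,
        "grounding": 0.7,
    },
)
print(example.weakest_dimension())
print(round(example.overall(), 2))
```

Keeping the scores separate per dimension means a poor final answer can be traced to a specific step (here, context selection) instead of being recorded as a single opaque failure.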
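To ensure reliable agent improvements, Dropbox first focused on the quality of its LLM-based evaluation judges. The process centered on calibrating the judges against human labels, so that judge scores could be trusted before they were used to steer changes to the agent.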
Feedback Loop Principle
This approach highlights a crucial system design principle: before optimizing the core system, ensure your measurement tools (in this case, the LLM judges) are accurate and reliable. A flawed evaluation system will lead to misguided optimizations.
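One way to check that a judge is reliable before optimizing against it is to measure its agreement with human labels on the same examples. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The label names, the sample data, and the 0.6 kappa threshold are assumptions for illustration; the source does not state which agreement metric or threshold Dropbox used.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of examples where the judge label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Probability that both raters pick the same label by chance.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


def judge_is_trustworthy(human, judge, min_kappa=0.6):
    # min_kappa is an assumed threshold, not a value from the source.
    return cohens_kappa(human, judge) >= min_kappa


# Hypothetical labels for eight agent responses.
human_labels = ["complete", "complete", "incomplete", "incomplete",
                "complete", "incomplete", "complete", "incomplete"]
judge_labels = ["complete", "complete", "incomplete", "complete",
                "complete", "incomplete", "complete", "incomplete"]

print(agreement_rate(human_labels, judge_labels))
print(cohens_kappa(human_labels, judge_labels))
print(judge_is_trustworthy(human_labels, judge_labels))
```

Gating the optimization step on a check like judge_is_trustworthy encodes the principle directly: the agent is only tuned against judges that have first been shown to agree with people.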
Once the judges were calibrated, they formed the foundation for an automated, evaluation-driven agent optimization loop. Instead of manual prompt engineering, DSPy was used to optimize the chat agent's system prompt, with the calibrated judges providing the evaluation signal for each refinement.
This automated feedback loop significantly increased experimentation velocity, leading to a 26% reduction in incomplete answers and a 13% reduction in missed key aspects, while also reducing token usage.
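The shape of such an evaluation-driven loop can be sketched in plain Python. This is deliberately not DSPy's API: DSPy automates generating and searching over prompt candidates, whereas this stand-in scores a fixed list of candidates with a judge and keeps the best one. The functions toy_agent and toy_judge are hypothetical stubs in place of real LLM calls, and the prompts and questions are invented for illustration.

```python
def score_prompt(prompt, dev_set, run_agent, judge):
    """Average judge score for one system prompt over a development set."""
    scores = [judge(question, run_agent(prompt, question)) for question in dev_set]
    return sum(scores) / len(scores)


def select_best_prompt(candidates, dev_set, run_agent, judge):
    """Score every candidate prompt and return the best one with its score."""
    scored = [(score_prompt(p, dev_set, run_agent, judge), p) for p in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score


def toy_agent(prompt, question):
    # Stub: a real agent would call an LLM with the system prompt.
    if "every part" in prompt:
        return f"full answer to: {question}"
    return "partial answer"


def toy_judge(question, answer):
    # Stub: a real judge would be a calibrated LLM scoring completeness.
    return 1.0 if answer.startswith("full answer") else 0.0


candidates = [
    "Answer briefly.",
    "Answer every part of the question, citing sources.",
]
dev_set = [
    "What changed in the Q3 plan and who owns it?",
    "Summarize the design doc.",
]

best_prompt, best_score = select_best_prompt(candidates, dev_set, toy_agent, toy_judge)
print(best_prompt, best_score)
```

Because the judge is just a function argument, swapping in a better-calibrated judge changes which prompt wins without touching the loop, which is why judge quality has to be established first.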