Dropbox Tech · June 25, 2026

Optimizing AI Agent Performance with Evaluation Feedback Loops at Dropbox

This article describes how Dropbox improved its Dash chat AI agent by building a reliable evaluation layer and using DSPy for optimization. It covers how the team calibrated LLM judges against human labels and then used those judges to refine the agent's system prompt, creating a continuous feedback loop. The resulting loop cut incomplete answers by 26% and missed key aspects by 13%, and also lowered token usage.

The Challenge of AI Agent Evaluation

Evaluating conversational AI agents such as Dropbox's Dash chat is considerably harder than traditional single-output evaluation (e.g., search relevance). An agent's response is the product of a multi-step process: interpreting intent, gathering context, calling tools, synthesizing information, and carrying state across turns. An evaluation framework therefore has to assess not only the final answer but also the intermediate decisions behind it, namely intent understanding, context selection, tool use, and grounding.
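To make the idea of component-level evaluation concrete, here is a minimal sketch of a record that stores a score per component alongside the final-answer score. The class name, field names, and the 0.0 to 1.0 scale are assumptions for illustration; the article does not describe Dropbox's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TraceEvaluation:
    """Scores for one agent response, broken down by component (hypothetical schema)."""

    final_answer_score: float  # assumed scale: 0.0 (bad) to 1.0 (good)
    component_scores: Dict[str, float] = field(default_factory=dict)

    def weakest_component(self) -> Optional[str]:
        """Return the lowest-scoring component, or None if nothing was scored."""
        if not self.component_scores:
            return None
        return min(self.component_scores, key=self.component_scores.get)


# Illustrative values only: a weak final answer traced back to context selection.
evaluation = TraceEvaluation(
    final_answer_score=0.4,
    component_scores={
        "intent_understanding": 0.9,
        "context_selection": 0.3,
        "tool_use": 0.8,
        "grounding": 0.7,
    },
)
print(evaluation.weakest_component())  # context_selection
```

Scoring components separately is what lets a low final score be attributed to a specific step rather than to the response as a whole.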

Building a Reliable LLM-as-Judge System

To make agent improvements trustworthy, Dropbox first focused on the quality of its LLM-based evaluation judges. The process involved:

Human-Labeled Data: A small, internal dataset of chat interactions was human-labeled across dimensions like user intent following, semantic relevance, tool calling, instruction following, and context selection.
Structured Rubric: Human evaluators followed a consistent rubric and review process, providing not only scores but also reasoning notes and failure codes (e.g., stale evidence, missing context) to pinpoint issues.
DSPy Optimization: The open-source DSPy framework, with algorithms like GEPA and MIPROv2, was used to optimize the LLM judges' prompts. This calibrated the judges to align more closely with human judgment and accurately reflect the structured evaluation process.
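One way to check how closely a judge tracks the human labels is to compute an agreement rate per evaluation dimension. The sketch below is a simplified illustration in plain Python, not DSPy code and not Dropbox's implementation: the `Label` record, its pass/fail simplification, and the failure-code strings are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Label:
    """One verdict on one chat for one dimension (hypothetical schema)."""

    chat_id: str
    dimension: str          # e.g. "tool_calling", "context_selection"
    passed: bool            # simplification: real rubrics may use graded scores
    failure_code: str = ""  # e.g. "stale_evidence", "missing_context"


def agreement_by_dimension(human: Iterable[Label], judge: Iterable[Label]) -> Dict[str, float]:
    """Fraction of (chat, dimension) pairs where the judge matches the human label."""
    judge_index = {(l.chat_id, l.dimension): l.passed for l in judge}
    matches: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for h in human:
        key = (h.chat_id, h.dimension)
        if key not in judge_index:
            continue  # the judge did not score this pair; skip it
        totals[h.dimension] += 1
        matches[h.dimension] += int(judge_index[key] == h.passed)
    return {dim: matches[dim] / totals[dim] for dim in totals}


human_labels = [
    Label("c1", "tool_calling", True),
    Label("c2", "tool_calling", False, "missing_context"),
    Label("c1", "context_selection", False, "stale_evidence"),
    Label("c2", "context_selection", True),
]
judge_labels = [
    Label("c1", "tool_calling", True),
    Label("c2", "tool_calling", True),  # disagrees with the human label
    Label("c1", "context_selection", False, "stale_evidence"),
    Label("c2", "context_selection", True),
]
rates = agreement_by_dimension(human_labels, judge_labels)
print(rates)  # {'tool_calling': 0.5, 'context_selection': 1.0}
```

A per-dimension breakdown shows where the judge prompt needs work; in a setup like the one described, a function of this kind could also serve as the metric that a prompt optimizer tries to maximize.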

Feedback Loop Principle

This approach highlights a crucial system design principle: before optimizing the core system, ensure your measurement tools (in this case, the LLM judges) are accurate and reliable. A flawed evaluation system will lead to misguided optimizations.
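The principle can be turned into an explicit gate: measure chance-corrected agreement between judge and human labels, and only start optimizing the agent once it clears a threshold. The sketch below uses Cohen's kappa; the 0.6 threshold and the function names are assumptions for illustration, and the article does not say which agreement statistic Dropbox used.

```python
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("label sequences must be non-empty and of equal length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((list(human).count(l) / n) * (list(judge).count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def judge_is_trustworthy(kappa: float, threshold: float = 0.6) -> bool:
    """Hypothetical gate: the threshold is an illustrative choice, not a standard."""
    return kappa >= threshold


human = [1, 1, 0, 0, 1, 0, 1, 1]  # 1 = pass, 0 = fail (illustrative data)
judge = [1, 1, 0, 1, 1, 0, 1, 0]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))              # 0.467
print(judge_is_trustworthy(kappa))  # False
```

In this example raw agreement is 75%, but kappa is only about 0.47 because both raters pass most items, which is why a chance-corrected statistic is a stricter gate than a simple match rate.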

From Evaluation to Agent Improvement

Once the judges were calibrated, they formed the foundation for an automated, evaluation-driven agent optimization loop. Instead of manual prompt engineering, DSPy was used to optimize the chat agent's system prompt. This involved:

Offline Counterfactual Replay: Candidate agent prompts were tested against historical production chats, and the resulting agent outputs were scored by the refined LLM judges.
Automated Prompt Generation: DSPy's optimization algorithms used these scores and structured judge reasoning as feedback to automatically propose new prompt updates.
Targeted Optimization: The process focused on concrete failure modes identified by the judges, such as wrong context selection, incomplete answers, or incorrect tool use.

This automated feedback loop significantly increased experimentation velocity, leading to a 26% reduction in incomplete answers and a 13% reduction in missed key aspects, while also reducing token usage.
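The replay-and-score loop described above can be sketched in a few lines. This is a simplified stand-in rather than DSPy itself: it scores a fixed list of candidate prompts instead of generating new ones, and `run_agent`, `judge`, `Verdict`, and the failure-code string are hypothetical names with stub implementations.

```python
from collections import Counter
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Verdict(NamedTuple):
    score: float       # assumed scale: 0.0 to 1.0
    failure_code: str  # "" when the response passed


def replay(
    prompt: str,
    chats: Sequence[str],
    run_agent: Callable[[str, str], str],
    judge: Callable[[str, str], Verdict],
) -> Tuple[float, Counter]:
    """Re-run historical chats under a candidate prompt and aggregate judge verdicts."""
    if not chats:
        raise ValueError("need at least one historical chat to replay")
    verdicts = [judge(chat, run_agent(prompt, chat)) for chat in chats]
    mean_score = sum(v.score for v in verdicts) / len(verdicts)
    failures = Counter(v.failure_code for v in verdicts if v.failure_code)
    return mean_score, failures


def pick_best(candidates: List[str], chats, run_agent, judge) -> Tuple[str, Counter]:
    """Return the highest-scoring candidate prompt and its remaining failure modes."""
    results = {p: replay(p, chats, run_agent, judge) for p in candidates}
    best = max(results, key=lambda p: results[p][0])
    return best, results[best][1]


# Stubs standing in for the real agent and the calibrated LLM judge.
def fake_agent(prompt: str, chat: str) -> str:
    return "complete" if "cover every part" in prompt else "partial"


def fake_judge(chat: str, response: str) -> Verdict:
    if response == "complete":
        return Verdict(1.0, "")
    return Verdict(0.0, "incomplete_answer")


chats = ["where is the Q3 plan?", "summarize last week's design review"]
candidates = ["Answer briefly.", "Answer briefly and cover every part of the question."]
baseline_score, baseline_failures = replay(candidates[0], chats, fake_agent, fake_judge)
best_prompt, remaining_failures = pick_best(candidates, chats, fake_agent, fake_judge)
print(baseline_failures)  # Counter({'incomplete_answer': 2})
print(best_prompt)
```

In the system the article describes, the aggregated failure codes and judge reasoning would be fed back to the optimizer so it can propose the next round of prompts, rather than choosing from a hand-written list.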
