ByteByteGo·May 30, 2026

DoorDash's LLM Testing System: The Simulation and Evaluation Flywheel

DoorDash built a 'simulation and evaluation flywheel' to test its LLM-based customer support chatbot, whose outputs are non-deterministic. The system simulates multi-turn customer conversations and automatically grades the chatbot's responses against defined policies, so prompt and architectural changes can be tested and iterated on quickly and offline instead of in production.


Moving from deterministic, rule-based systems to non-deterministic LLM-based applications makes testing and quality assurance much harder. DoorDash ran into this when its customer support chatbot exhibited subtle hallucinations. Traditional testing methods (manual scenario testing, A/B testing in production) were infeasible because of time constraints and the risk to customer experience.

The Simulation and Evaluation Flywheel Architecture

DoorDash's solution is a simulation and evaluation flywheel, a continuous iteration loop designed to improve LLM-based chatbots. It comprises two core components:

Offline Simulator: Generates realistic, multi-turn customer conversations using an LLM to play the customer role, dynamically responding based on detailed test scenarios derived from historical support transcripts.
Evaluation Framework: Automatically grades chatbot performance in simulated conversations using another LLM. This 'judge' LLM verifies narrowly defined behaviors (e.g., policy adherence) against human-calibrated benchmarks.
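The simulator half of the loop can be sketched as below. This is a minimal illustration, not DoorDash's implementation: the `Scenario` fields, the `simulate` function, and the two stand-in models are hypothetical names, and the "LLMs" are plain Python functions so the sketch runs offline. In a real setup, each callable would presumably wrap an LLM client.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A "model" is any callable mapping the transcript so far to the next message.
# Real LLM clients would be wrapped to fit this signature (assumption).
Transcript = List[Tuple[str, str]]  # (role, message) pairs
Model = Callable[[Transcript], str]


@dataclass
class Scenario:
    """Hypothetical test scenario, e.g. distilled from a historical transcript."""
    name: str
    opening_message: str
    max_turns: int = 4


def simulate(scenario: Scenario, customer: Model, chatbot: Model) -> Transcript:
    """Alternate chatbot and customer turns until the turn budget is spent
    or the simulated customer ends the conversation with an empty message."""
    transcript: Transcript = [("customer", scenario.opening_message)]
    for _ in range(scenario.max_turns):
        transcript.append(("chatbot", chatbot(transcript)))
        reply = customer(transcript)
        if not reply:
            break
        transcript.append(("customer", reply))
    return transcript


# Stand-ins for LLM calls (illustrative only).
def scripted_customer(transcript: Transcript) -> str:
    customer_turns = sum(1 for role, _ in transcript if role == "customer")
    follow_ups = ["It never arrived. Can I get a refund?", ""]
    return follow_ups[min(customer_turns - 1, len(follow_ups) - 1)]


def stub_chatbot(transcript: Transcript) -> str:
    last_message = transcript[-1][1].lower()
    if "refund" in last_message:
        return "I have issued a refund for the missing order."
    return "Sorry to hear that. Can you tell me more?"


scenario = Scenario(name="missing_order", opening_message="My order is missing.")
conversation = simulate(scenario, scripted_customer, stub_chatbot)
for role, message in conversation:
    print(f"{role}: {message}")
```

Keeping both roles behind the same callable signature means the chatbot under test can be swapped (for example, a new prompt version) without touching the scenario or the simulated customer.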

💡

Generator-Verifier Gap

The system leverages the principle that while generating complex, open-ended responses is hard for an LLM (the 'generator'), verifying a specific, binary condition is a much simpler and more reliable task (the 'verifier'). This allows an LLM to effectively evaluate another LLM's output by focusing on narrow policy checks.
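To make the verifier side concrete, the sketch below treats each judge as a narrow yes/no check and measures how often a judge agrees with human labels. Everything here is an assumption for illustration: the policy names, the `evaluate` and `agreement_rate` helpers, and the toy labeled examples are hypothetical, and the judges are simple string checks standing in for LLM calls.

```python
from typing import Callable, Dict, List, Tuple

# A judge answers one narrow yes/no question about a chatbot reply.
# In practice this would wrap an LLM call; here it is a plain function (assumption).
Judge = Callable[[str], bool]


def apologizes(reply: str) -> bool:
    """Hypothetical check: the reply contains an apology."""
    return "sorry" in reply.lower()


def no_unverified_refund(reply: str) -> bool:
    """Hypothetical policy: a refund may only be mentioned after verification."""
    text = reply.lower()
    return "refund" not in text or "verified" in text


def evaluate(reply: str, checks: Dict[str, Judge]) -> Dict[str, bool]:
    """Run every binary check and report pass/fail per policy."""
    return {name: judge(reply) for name, judge in checks.items()}


def agreement_rate(judge: Judge, labeled: List[Tuple[str, bool]]) -> float:
    """Fraction of human-labeled examples where the judge matches the human verdict."""
    matches = sum(1 for reply, human_verdict in labeled if judge(reply) == human_verdict)
    return matches / len(labeled)


# Toy human-labeled examples for the refund policy (made up for illustration).
labeled = [
    ("Sorry about that. I verified your order and issued a refund.", True),
    ("I issued a refund right away.", False),
    ("Sorry, I cannot help with that.", True),
    ("Your refund is on the way.", True),  # the human and the judge disagree here
]

checks = {"apologizes": apologizes, "no_unverified_refund": no_unverified_refund}
results = evaluate(labeled[0][0], checks)
agreement = agreement_rate(no_unverified_refund, labeled)
print(results)
print(f"agreement with human labels: {agreement:.2f}")
```

Because each check returns a single boolean, a disagreement with the human label points at one specific policy and one specific reply, which is what makes calibrating the judge against a human benchmark tractable.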

LLM testing, AI/ML ops, chatbot, system design, non-determinism, simulation, evaluation, DoorDash


Architecture Design

Design this yourself

Design a robust, scalable testing and evaluation system for an LLM-powered customer support chatbot, incorporating components for realistic conversation simulation (using LLMs for customer roles), automated evaluation (using LLMs for policy adherence checks), and a feedback loop for continuous improvement and regression testing.


Other design angles

· Design a system to continuously monitor and evaluate the performance of multiple LLM models in production, including drift detection and automated re-training triggers.
· Design a comprehensive MLOps pipeline for LLM-based applications, covering data preparation, model training, deployment, and a feedback loop for model refinement.
· Design a synthetic data generation platform to create diverse and realistic test scenarios for AI systems, including methods to capture edge cases and rare events.
