ByteByteGo·May 30, 2026

DoorDash's LLM Testing System: The Simulation and Evaluation Flywheel

DoorDash developed a 'simulation and evaluation flywheel' to address the non-deterministic nature of LLM-based customer support chatbots. This system enables rapid, offline testing and iteration of prompt and architectural changes by simulating multi-turn customer conversations and automatically evaluating chatbot performance against defined policies. It highlights a critical architectural shift for reliable LLM integration.


Moving from deterministic, rule-based systems to non-deterministic LLM-based applications introduces significant testing and quality assurance challenges. DoorDash ran into this when its customer support chatbot began exhibiting subtle hallucinations. Traditional testing methods (manual scenario testing, A/B testing in production) were infeasible because of time constraints and the risk to customer experience.

The Simulation and Evaluation Flywheel Architecture

DoorDash's solution is a simulation and evaluation flywheel, a continuous iteration loop designed to improve LLM-based chatbots. It comprises two core components:

  • Offline Simulator: Generates realistic, multi-turn customer conversations using an LLM to play the customer role, dynamically responding based on detailed test scenarios derived from historical support transcripts.
  • Evaluation Framework: Automatically grades chatbot performance in simulated conversations using another LLM. This 'judge' LLM verifies narrowly defined behaviors (e.g., policy adherence) against human-calibrated benchmarks.
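
The loop formed by these two components can be sketched roughly as follows. This is a minimal illustration, not DoorDash's implementation: the `Scenario` type, the scripted customer, the stub chatbot, and the keyword-based judge are all hypothetical stand-ins for what would be LLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]                    # (speaker, text)
Model = Callable[[List[Turn]], str]       # conversation history -> next message
Judge = Callable[[List[Turn]], bool]      # transcript -> pass/fail on one policy


@dataclass
class Scenario:
    """Hypothetical test scenario; a real one would be derived from transcripts."""
    name: str
    customer_script: List[str]
    max_turns: int = 5


def scripted_customer(script: List[str]) -> Model:
    """Stub customer that replays a script. A real simulator would call an LLM."""
    def respond(history: List[Turn]) -> str:
        turns_taken = sum(1 for speaker, _ in history if speaker == "customer")
        return script[turns_taken] if turns_taken < len(script) else ""
    return respond


def simulate(scenario: Scenario, chatbot: Model) -> List[Turn]:
    """Alternate customer and chatbot turns until the customer has nothing to add."""
    customer = scripted_customer(scenario.customer_script)
    history: List[Turn] = []
    for _ in range(scenario.max_turns):
        message = customer(history)
        if not message:
            break
        history.append(("customer", message))
        history.append(("chatbot", chatbot(history)))
    return history


def pass_rate(scenarios: List[Scenario], chatbot: Model, judge: Judge) -> float:
    """Run every scenario offline and report the fraction the judge passes."""
    results = [judge(simulate(s, chatbot)) for s in scenarios]
    return sum(results) / len(results)


def stub_chatbot(history: List[Turn]) -> str:
    """Stand-in for the chatbot under test: offers a refund only when asked."""
    if "refund" in history[-1][1].lower():
        return "I can offer a refund for the missing item."
    return "Could you tell me more about the issue?"


def eager_chatbot(history: List[Turn]) -> str:
    """A regressed variant that offers a refund unprompted."""
    return "I have issued a refund."


def refund_only_on_request(transcript: List[Turn]) -> bool:
    """Stub judge for one illustrative policy: no refund offer before the customer asks."""
    customer_asked = False
    for speaker, text in transcript:
        mentions_refund = "refund" in text.lower()
        if speaker == "customer" and mentions_refund:
            customer_asked = True
        elif speaker == "chatbot" and mentions_refund and not customer_asked:
            return False
    return True


scenarios = [
    Scenario("missing item", ["My order is missing an item.", "I'd like a refund."]),
    Scenario("late order", ["Where is my order?"]),
]
baseline = pass_rate(scenarios, stub_chatbot, refund_only_on_request)
regressed = pass_rate(scenarios, eager_chatbot, refund_only_on_request)
```

Comparing `baseline` with `regressed` is the flywheel step: a prompt or architecture change is kept only if the offline pass rate holds up, without exposing real customers to the candidate.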

Generator-Verifier Gap

The system leverages the principle that while generating complex, open-ended responses is hard for an LLM (the 'generator'), verifying a specific, binary condition is a much simpler and more reliable task (the 'verifier'). This allows an LLM to effectively evaluate another LLM's output by focusing on narrow policy checks.
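
One way to picture a narrow verifier is a judge that answers a single yes/no question and is scored against human labels. The sketch below assumes a hypothetical `LLM` callable (prompt in, raw text out); the prompt wording, the example policy, and the keyword stub standing in for the judge model are illustrative only and do not come from the source.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]   # hypothetical interface: prompt -> raw completion


def build_check_prompt(policy: str, transcript: str) -> str:
    """Ask about exactly one policy and demand a binary answer."""
    return (
        "You are verifying one policy in a support conversation.\n"
        f"Policy: {policy}\n"
        f"Transcript:\n{transcript}\n"
        "Did the agent follow the policy? Answer with exactly YES or NO."
    )


def parse_verdict(raw: str) -> bool:
    """Accept only YES or NO so that vague judge output fails loudly."""
    answer = raw.strip().upper()
    if answer not in ("YES", "NO"):
        raise ValueError(f"Judge returned a non-binary answer: {raw!r}")
    return answer == "YES"


def check(llm: LLM, policy: str, transcript: str) -> bool:
    return parse_verdict(llm(build_check_prompt(policy, transcript)))


def agreement(llm: LLM, policy: str, labeled: List[Tuple[str, bool]]) -> float:
    """Fraction of human-labeled transcripts where the judge matches the label."""
    matches = sum(check(llm, policy, text) == label for text, label in labeled)
    return matches / len(labeled)


def keyword_stub(prompt: str) -> str:
    """Stand-in for the judge model: flags any promised delivery time."""
    return "NO" if "will arrive by" in prompt.lower() else "YES"


policy = "The agent must not promise a delivery time."
human_labeled = [
    ("Agent: Your order will arrive by 6pm.", False),
    ("Agent: I can't promise an exact time, but I'll check the status.", True),
]
score = agreement(keyword_stub, policy, human_labeled)
```

Keeping each check binary makes the judge's output easy to compare with human labels, so a low `score` on the labeled set signals that the judge prompt needs recalibration before its grades are trusted.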

Tags: LLM testing, AI/ML ops, chatbot, system design, non-determinism, simulation, evaluation, DoorDash
