Slack Engineering · June 11, 2026

Evaluating Agent-Driven E2E Testing Architectures: Trade-offs in Reliability, Speed, and Cost

This article examines the architectural trade-offs of integrating agent-driven end-to-end (E2E) testing into existing development workflows. It describes an experiment comparing three execution models (Playwright MCP, Playwright CLI, and generated tests) on reliability, speed, and cost, and shows how context management and the execution environment affect performance and resource consumption. The findings indicate where agentic testing fits within a broader testing strategy: because it is flexible but expensive, it is currently best suited to exploratory testing.



Introduction to Agentic E2E Testing

Agent-driven E2E tests differ from traditional, deterministic tests in what they specify. Instead of enforcing a specific UI journey (e.g., click -> type -> assert), an agent is given a *goal* (e.g., "send a thread message") and navigates the UI dynamically to achieve it. This adds flexibility, since the agent can take different paths to the same outcome, but it also introduces new challenges in reliability, cost, and execution speed. Understanding these trade-offs is essential to integrating such systems effectively.
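The contrast between a fixed journey and a goal can be illustrated with a toy model. The sketch below is not from the original article: the `UI` graph, the action names, and both functions are hypothetical, and a breadth-first search stands in for the agent (a real agent would choose actions with an LLM, not a graph search). It only shows why a scripted path breaks when one UI entry point disappears while a goal-driven run can still succeed.

```python
from collections import deque

# Hypothetical UI: each view maps an action name to the view it leads to.
UI = {
    "home": {"open_channel": "channel", "open_threads_view": "threads"},
    "channel": {"open_thread": "thread"},
    "threads": {"open_thread": "thread"},
    "thread": {"send_reply": "reply_sent"},
}


def run_script(ui, start, steps, goal="reply_sent"):
    """Deterministic test: follow a fixed action list, fail on the first missing action."""
    view = start
    for step in steps:
        if step not in ui.get(view, {}):
            return False
        view = ui[view][step]
    return view == goal


def reach_goal(ui, start, goal):
    """Goal-driven stand-in: search for any action path that ends at the goal view."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, path = queue.popleft()
        if view == goal:
            return path
        for action, next_view in ui.get(view, {}).items():
            if next_view not in seen:
                seen.add(next_view)
                queue.append((next_view, path + [action]))
    return None  # goal unreachable from the start view


script = ["open_channel", "open_thread", "send_reply"]

# Simulate a UI change that removes the entry point the script relies on.
changed_ui = dict(UI, home={"open_threads_view": "threads"})

print(run_script(UI, "home", script))             # True
print(run_script(changed_ui, "home", script))     # False
print(reach_goal(changed_ui, "home", "reply_sent"))
# ['open_threads_view', 'open_thread', 'send_reply']
```

The same flexibility is also the source of the reliability and cost concerns discussed later: a goal-driven run may take a different, longer path on each execution.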

Architectural Experiment: Execution Models Compared

Slack's experiment compared three primary architectural approaches for agent-driven E2E testing:

Agent + Playwright MCP (Model Context Protocol): The agent interacts directly with the browser through a high-level protocol, maintaining persistent context and observing DOM state. This model often yielded higher reliability.
Agent + Playwright CLI: The agent executes one-off Playwright CLI commands via the shell, deciding the next action based on updated UI snapshots. This approach was less reliable due to state rebuilding and timing issues.
Generated Playwright Tests: An AI agent generates deterministic Playwright code from natural language descriptions, which is then executed as a standard E2E test. This was the fastest but least reliable for complex scenarios.


Key Architectural Decision Point: State Management

How the agent interacts with and maintains the browser's state (a persistent MCP connection versus stateless CLI commands or generated code) significantly affects reliability and cost. Persistent context, like that provided by Playwright MCP, generally produces more stable and consistent test runs, especially for complex flows, because it reduces inconsistencies between steps and lets the agent reuse previous successful interactions.

Performance and Cost Implications

Approach	Avg Runtime (Search Discovery)	Failure Rate (Search Discovery)	Avg Cost (per run)

The study revealed significant cost differences: agent-driven runs cost $15–30 per execution, far more than traditional tests. The cost is driven primarily by token usage in LLM interactions. Because the underlying API is stateless, each turn re-sends the full system prompt and the entire conversation history. As a result, the number of turns an agent takes and the rate at which context accumulates (e.g., browser snapshots) drive cost more than the LLM's reasoning output does. This suggests that agentic testing is currently better suited to targeted debugging or exploratory testing than to high-frequency continuous integration (CI) execution, although future optimizations could improve its cost-effectiveness.
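The effect of re-sending history on every turn can be worked through with a small calculation. This is a minimal sketch under stated assumptions: the function name and the token counts below are hypothetical and chosen only for illustration, not figures from the study. It assumes a fixed system prompt and a constant amount of new context (snapshot plus reply) added per turn.

```python
def cumulative_input_tokens(system_prompt_tokens, tokens_added_per_turn, turns):
    """Total input tokens sent when every turn re-sends the full history."""
    total = 0
    history = 0
    for _ in range(turns):
        # Each request carries the system prompt plus everything accumulated so far.
        total += system_prompt_tokens + history
        # This turn's snapshot and model reply join the history for the next turn.
        history += tokens_added_per_turn
    return total


# Hypothetical numbers: 2,000-token system prompt, 5,000 tokens of new context per turn.
print(cumulative_input_tokens(2_000, 5_000, 10))  # 245000
print(cumulative_input_tokens(2_000, 5_000, 20))  # 990000
```

Under these assumptions, doubling the number of turns roughly quadruples the input tokens, because the history term grows with the square of the turn count. This is consistent with the observation that turn count and context accumulation matter more than reasoning output.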

Infrastructure Matters: MCP vs. CLI

A key takeaway is that the execution environment and interaction protocol strongly affect reliability. Playwright MCP provides a live, stable view of the application and combines the interaction and the returned state into a single round trip; it proved more reliable than the CLI approach, which rebuilds state from snapshots at each step. The design of the testing harness, and how it manages browser state and context, is therefore as critical to successful agentic E2E testing as the LLM's capabilities.
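The difference in interaction shape can be sketched with two toy harnesses. Everything here is hypothetical (the class names, the `TRANSITIONS` table, and the assumption that the stateless path needs a separate snapshot call after each command); neither class uses or represents the real Playwright MCP or CLI interfaces. The sketch only counts round trips per agent step.

```python
# Hypothetical UI transitions: (current view, action) -> next view.
TRANSITIONS = {
    ("home", "open_channel"): "channel",
    ("channel", "open_thread"): "thread",
    ("thread", "send_reply"): "reply_sent",
}


def apply_action(view, action):
    """Return the next view, or stay on the current view if the action is unknown."""
    return TRANSITIONS.get((view, action), view)


class PersistentSession:
    """Toy long-lived connection: one call performs the action and returns the new state."""

    def __init__(self):
        self.view = "home"
        self.round_trips = 0

    def act(self, action):
        self.round_trips += 1
        self.view = apply_action(self.view, action)
        return self.view  # state comes back in the same response


class StatelessCommands:
    """Toy one-off commands: run an action, then fetch a snapshot in a separate call."""

    def __init__(self):
        self._view = "home"
        self.round_trips = 0

    def run(self, action):
        self.round_trips += 1
        self._view = apply_action(self._view, action)

    def snapshot(self):
        self.round_trips += 1
        return self._view


actions = ["open_channel", "open_thread", "send_reply"]

persistent = PersistentSession()
for action in actions:
    observed_persistent = persistent.act(action)

stateless = StatelessCommands()
for action in actions:
    stateless.run(action)
    observed_stateless = stateless.snapshot()  # agent must re-read state before deciding

print(persistent.round_trips, stateless.round_trips)  # 3 6
```

The 2x ratio is an artifact of this toy model, not a measurement. The point is structural: every extra call between acting and observing is another place where timing or state drift can make the agent decide on a stale view.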
