This article explores the architectural considerations and trade-offs of integrating agent-driven end-to-end (E2E) testing into existing development workflows. It details an experiment comparing different execution models (Playwright MCP, Playwright CLI, Generated Tests) in terms of reliability, speed, and cost, highlighting the impact of context management and execution environment on performance and resource consumption. The findings offer insights into where agentic testing best fits within a comprehensive testing strategy, emphasizing its role in exploratory testing due to higher costs and flexibility.
Read original on Slack Engineering
Agent-driven E2E tests represent a paradigm shift from traditional, deterministic tests. Instead of enforcing a specific UI journey (e.g., click -> type -> assert), agents aim to achieve a *goal* (e.g., "send a thread message") by dynamically navigating the UI. This introduces flexibility, as agents can take varied paths to reach the same outcome, but also presents new challenges in terms of reliability, cost, and execution speed. Understanding these trade-offs is crucial for integrating such systems effectively.
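The difference between a scripted journey and a goal can be made concrete with a toy model. The sketch below is not Slack's harness and involves no LLM: the `UI` graph, its screen and action names, and the breadth-first search standing in for the agent's decision-making are all hypothetical, chosen only to show why a goal-driven run can survive a UI change that breaks a fixed script.

```python
from collections import deque

# Hypothetical UI: each screen maps the actions available on it to the next screen.
UI = {
    "channel": {"hover_message": "message_toolbar", "open_context_menu": "context_menu"},
    "message_toolbar": {"click_reply": "thread_pane"},
    "context_menu": {"choose_reply_in_thread": "thread_pane"},
    "thread_pane": {"type_and_send": "thread_message_sent"},
}


def run_script(ui, start, steps):
    """Deterministic test: follow the exact steps, fail on the first missing action."""
    screen = start
    for action in steps:
        if action not in ui.get(screen, {}):
            return None  # the scripted step is unavailable, so the test fails
        screen = ui[screen][action]
    return screen


def reach_goal(ui, start, goal):
    """Goal-driven run: find any action sequence that ends on the goal screen.

    Breadth-first search is a stand-in for the agent; a real agent would
    pick actions from browser snapshots rather than a known graph.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, path = queue.popleft()
        if screen == goal:
            return path
        for action, next_screen in ui.get(screen, {}).items():
            if next_screen not in seen:
                seen.add(next_screen)
                queue.append((next_screen, path + [action]))
    return None  # goal unreachable


SCRIPT = ["hover_message", "click_reply", "type_and_send"]

# Simulate a redesign that removes the hover toolbar entry point.
REDESIGNED_UI = {screen: dict(actions) for screen, actions in UI.items()}
del REDESIGNED_UI["channel"]["hover_message"]

scripted_result = run_script(REDESIGNED_UI, "channel", SCRIPT)
agent_path = reach_goal(REDESIGNED_UI, "channel", "thread_message_sent")
```

On the redesigned graph the script returns `None`, while the goal-driven search still finds a route through the context menu. The same flexibility is what makes run-to-run behavior harder to predict.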
Slack's experiment compared three primary architectural approaches for agent-driven E2E testing: Playwright MCP, Playwright CLI, and generated tests.
Key Architectural Decision Point: State Management
The choice of how the agent interacts with and maintains the browser's state (e.g., via a persistent MCP connection vs. stateless CLI commands or generated code) significantly impacts reliability and cost. Persistent context, like that provided by Playwright MCP, generally leads to more stable and consistent test runs, especially for complex flows, by reducing inconsistencies and allowing the agent to reuse previous successful interactions.
| Approach | Avg Runtime (Search Discovery) | Failure Rate (Search Discovery) | Avg Cost (per run) |
|---|---|---|---|
The study revealed significant cost differences, with agent-driven runs costing $15–30 per execution compared to much cheaper traditional tests. This cost is primarily driven by token usage in LLM interactions. The underlying API's stateless nature means each turn re-sends the full system prompt and entire conversation history. Therefore, factors like the number of turns an agent takes and the rate of context accumulation (e.g., browser snapshots) are more critical cost drivers than the LLM's reasoning output. This suggests that agentic testing is currently better suited for targeted debugging or exploratory testing rather than high-frequency continuous integration (CI) execution, although future optimizations could improve its cost-effectiveness.
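A rough model shows why turn count and context growth dominate. The sketch below assumes every turn re-sends the system prompt plus all history so far, ignores output tokens and any prompt caching, and uses made-up token counts and a made-up price; none of these numbers come from the experiment.

```python
def cumulative_input_tokens(system_tokens: int, tokens_added_per_turn: int, turns: int) -> int:
    """Total input tokens sent over a run when each turn re-sends
    the system prompt plus all history accumulated so far."""
    total = 0
    history = 0
    for _ in range(turns):
        total += system_tokens + history   # full prompt for this turn
        history += tokens_added_per_turn   # snapshot, tool output, and reply join the history
    return total


def run_cost_usd(system_tokens: int, tokens_added_per_turn: int, turns: int,
                 usd_per_million_input_tokens: float) -> float:
    """Input-token cost of one run at a hypothetical flat price."""
    tokens = cumulative_input_tokens(system_tokens, tokens_added_per_turn, turns)
    return tokens * usd_per_million_input_tokens / 1_000_000


# Illustrative values only: 2,000-token system prompt, 5,000 tokens added per turn.
ten_turns = cumulative_input_tokens(2_000, 5_000, 10)     # 245,000 tokens
twenty_turns = cumulative_input_tokens(2_000, 5_000, 20)  # 990,000 tokens
```

Under these assumptions, doubling the number of turns roughly quadruples the input tokens, because the history term grows with the square of the turn count. Shrinking what each turn adds to the context (for example, smaller browser snapshots) scales that quadratic term down directly.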
A crucial takeaway is that the execution environment and interaction protocol profoundly affect reliability. Playwright MCP, by providing a live, stable view of the application and combining interaction and state return into a single round trip, proved more reliable than the CLI approach, which rebuilds state from snapshots at each step. This highlights that the architectural design of the testing harness and how it manages browser state and context is as critical as the LLM's capabilities for successful agentic E2E testing.
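The round-trip difference can be illustrated with a deliberately simplified model. The classes below are hypothetical stand-ins, not the Playwright MCP or Playwright CLI interfaces: one harness returns the new state from the same call that performs the action, the other needs a separate call to observe state after acting.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBrowser:
    """Stand-in for a real browser: its state is just the list of applied actions."""
    actions: list = field(default_factory=list)

    def apply(self, action: str) -> None:
        self.actions.append(action)

    def snapshot(self) -> tuple:
        return tuple(self.actions)


class PersistentSession:
    """Hypothetical harness: acting and observing happen in one round trip."""

    def __init__(self, browser: ToyBrowser) -> None:
        self.browser = browser
        self.round_trips = 0

    def act(self, action: str) -> tuple:
        self.round_trips += 1
        self.browser.apply(action)
        return self.browser.snapshot()  # fresh state comes back with the action


class StatelessCommands:
    """Hypothetical harness: the action returns nothing, so state is fetched separately."""

    def __init__(self, browser: ToyBrowser) -> None:
        self.browser = browser
        self.round_trips = 0

    def act(self, action: str) -> None:
        self.round_trips += 1
        self.browser.apply(action)

    def snapshot(self) -> tuple:
        self.round_trips += 1
        return self.browser.snapshot()


steps = ["open_search", "type_query", "open_result"]

persistent = PersistentSession(ToyBrowser())
for step in steps:
    state_a = persistent.act(step)

stateless = StatelessCommands(ToyBrowser())
for step in steps:
    stateless.act(step)
    state_b = stateless.snapshot()  # extra call to learn what the action did
```

Both harnesses end in the same browser state, but the stateless one needs twice as many round trips in this model. In a real run, each extra call is also another turn in which the page can change between acting and observing.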