This article introduces an AI agent-driven framework for rigorously testing distributed and stateful systems. It emphasizes a claim-driven methodology, moving beyond traditional integration testing to identify complex bugs related to partial network partitions, concurrency, and crash-recovery. The system leverages AI agents to design comprehensive test plans, execute scenarios with fault injection, and generate detailed findings reports with blame classification, enhancing the reliability of distributed systems.
Testing distributed systems reliably is a critical challenge in software architecture due to inherent complexities like concurrency, network partitions, and state management. Traditional testing often falls short, missing subtle bugs that manifest in production. This article presents an innovative approach using AI agents to automate and enhance the testing process for these complex systems.
The core of this framework is a claim-driven approach, which shifts the focus from test-driven development to verifying product promises. Each scenario is designed to falsify a specific product claim under a given fault, making tests more robust and less susceptible to being weakened over time. Coverage adequacy is itself a deliverable: the test plan must argue explicitly what is covered and state what remains unverified.
Key Principles of Claim-Driven Testing
- Start from what the product promises (claims).
- Every scenario attempts to falsify one claim under one fault.
- Name tests after their claim for clarity and resistance to weakening.
- Explicitly argue for coverage adequacy, detailing what remains unverified.
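The coverage-adequacy principle can be checked mechanically once claims and scenarios are written down. The following is a minimal sketch, not the framework's own code: the `Scenario` type, the `unverified_claims` function, and the claim ID `C9` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record: one scenario tries to falsify claims under one fault."""
    name: str          # named after the claim it targets
    falsifies: tuple   # claim IDs this scenario can falsify
    fault: str         # the single fault injected


def unverified_claims(claims, scenarios):
    """Return the claim IDs that no scenario attempts to falsify, sorted."""
    covered = {cid for s in scenarios for cid in s.falsifies}
    return sorted(set(claims) - covered)


# Illustrative data only; C9 is an invented placeholder claim.
claims = {
    "C1": "every acknowledged append is durable and linearizable",
    "C5": "leader election completes within 5s",
    "C9": "hypothetical claim with no scenario yet",
}
scenarios = [
    Scenario("linearizable_append_under_partition", ("C1", "C5"), "asymmetric-partition"),
]

print(unverified_claims(claims, scenarios))  # -> ['C9']
```

Listing the uncovered claims this way turns "what remains unverified" into an explicit artifact of the test plan rather than something a reader has to infer.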
For consistency-critical aspects (safety, durability, idempotency, isolation, ordering, membership), each scenario binds an abstract model (e.g., register, queue, log) to an operation-history schema and a named checker. This moves beyond mere chaos engineering by combining fault injection with formal verification through models and checkers (e.g., linearizability, serializability). Every test verdict is a 9-state classification, preventing silent passes and pinpointing blame (SUT, harness, checker, environment).
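To make the model, history, and checker binding concrete, here is a deliberately naive sketch of a linearizability check for a single register. It is a brute-force stand-in for a real checker such as Porcupine, not a reimplementation of it. The `Op` type and its fields are assumptions for illustration and do not reproduce the framework's operation-history schema.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    invoke_ts: float
    complete_ts: float


def respects_real_time(order):
    """An op that completed before another was invoked must be ordered first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.complete_ts < earlier.invoke_ts:
                return False
    return True


def legal_for_register(order, initial=None):
    """Replay the order against one register; each read must see the last write."""
    state = initial
    for op in order:
        if op.kind == "write":
            state = op.value
        elif op.value != state:
            return False
    return True


def is_linearizable(history, initial=None):
    """Brute force over all orderings: exponential, only usable on tiny histories."""
    return any(
        respects_real_time(order) and legal_for_register(order, initial)
        for order in permutations(history)
    )


# A read overlapping a write may observe the new value.
concurrent = [Op("write", 1, 0.0, 2.0), Op("read", 1, 1.0, 3.0)]
# A read that starts after write(2) completed must not return the older value.
stale = [Op("write", 1, 0.0, 1.0), Op("write", 2, 2.0, 3.0), Op("read", 1, 4.0, 5.0)]

print(is_linearizable(concurrent), is_linearizable(stale))
```

The first history is linearizable and the second is not, because real-time order forces the read after `write(2)`. Production checkers avoid the factorial search with pruning and per-key partitioning, which is why the scenario below checks per-key histories.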
### Scenario S3: linearizable_append_under_partition
- Falsifies if it FAILs: C1 (every acknowledged append is durable and linearizable), C5 (leader election completes within 5s)
- Workload: 8 clients, 70% append / 30% read, 5min, key-skew zipf
- Faults: asymmetric partition isolating current leader at T+60s for 30s
- Oracle: linearizability via Porcupine over per-key histories
§7.M (model / history / checker discipline)
- Model under test: log
- Operation history: default 11-field schema (...)
- Checker: linearizability (Porcupine) per-key, then no-lost-ack against final state
- Nemesis + landing: asymmetric-partition (iptables drop one direction). Landing evidence = iptables drop counter goes 0 → 14,712 over the 30s window AND raft log emits "leader-lost; starting election" within 2s of injection.
- Ambiguous outcomes: timeouts → timeout_marker=true, complete_ts=null, treated as could-have-succeeded; retries are separate ops sharing input
- Reduction plan: if FAIL, bisect fault window + fix seed, then classify SUT / harness / checker / environment per references/test-case-reduction.md
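
The no-lost-ack step and the handling of ambiguous outcomes in the scenario above can be sketched as follows. This is an illustrative approximation: `timeout_marker` and `complete_ts` come from the scenario text, while `AppendOp`, `op_id`, `value`, `acked`, and `check_no_lost_ack` are hypothetical names. It also assumes each logical append carries a unique input value, so retries sharing an input collapse to one value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppendOp:
    op_id: str
    value: str                           # assumed unique per logical append
    acked: bool                          # server acknowledged the append
    timeout_marker: bool = False         # client timed out: outcome unknown
    complete_ts: Optional[float] = None  # None for timed-out ops


def check_no_lost_ack(history, final_log):
    """Compare the append history against the final log.

    Returns (lost, phantom):
      lost    = acknowledged values missing from the final log (claim violated)
      phantom = values in the final log that no client ever attempted
    Timed-out appends could have succeeded, so they may appear or be absent.
    """
    final = set(final_log)
    attempted = {op.value for op in history}
    lost = sorted(op.value for op in history if op.acked and op.value not in final)
    phantom = sorted(final - attempted)
    return lost, phantom


history = [
    AppendOp("a1", "x", acked=True, complete_ts=1.0),
    AppendOp("a2", "y", acked=False, timeout_marker=True),  # ambiguous outcome
    AppendOp("a3", "z", acked=True, complete_ts=2.0),
]
final_log = ["x", "y"]

print(check_no_lost_ack(history, final_log))  # -> (['z'], [])
```

Here the timed-out append `y` landing in the log is acceptable, while the acknowledged append `z` going missing is a violation of C1. A non-empty `phantom` list would point away from the SUT and toward the harness or checker, which is the kind of distinction the blame classification is meant to capture.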