This article from Pinterest Engineering details an approach to systematically test and improve the reliability of AI agent skill invocation within engineering workflows. It outlines the development of a test harness to quantify skill loading rates for domain-specific tasks, such as understanding iOS architecture patterns. The findings highlight various optimization techniques, including contextual descriptions and explicit instructions, to significantly enhance agent performance and ensure consistent skill utilization.
The increasing adoption of AI agents in engineering workflows, particularly for automating tasks and providing domain-specific knowledge, introduces challenges around their reliability. Pinterest engineers observed that agents, specifically internal forks of OpenAI's Codex (Pin-agent) and Claude Code, sometimes failed to invoke custom skills, leading to inefficiencies. This prompted the development of a testing methodology to understand and optimize skill invocation performance.
To address the unreliability, a test harness was built to systematically evaluate agent performance. This harness consists of a Bash script that orchestrates automated testing by piping prompts to agents and capturing verbose output logs. The core execution flow involves running a suite of categorized prompts (positive and negative cases) multiple times to account for the non-deterministic nature of AI agents.
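The orchestration loop described above can be sketched as follows. This is a minimal illustration rather than Pinterest's actual harness: the prompt directory layout, run count, and `AGENT_CMD` indirection are assumptions, and `AGENT_CMD` defaults to `cat` here so the sketch runs without an agent installed (in practice it would be the `claude` or Pin-agent CLI, invoked as in the snippet below).

```shell
# Minimal harness sketch: run each categorized prompt several times,
# capturing one log per run for later parsing. All names are illustrative.
PROMPT_DIR="$(mktemp -d)"
LOG_DIR="$(mktemp -d)"
RUNS_PER_PROMPT=3
AGENT_CMD="${AGENT_CMD:-cat}"   # stand-in for the real agent CLI

mkdir -p "$PROMPT_DIR/positive" "$PROMPT_DIR/negative"
echo "Explain our iOS architecture patterns" > "$PROMPT_DIR/positive/ios_arch.txt"
echo "What is the weather today?" > "$PROMPT_DIR/negative/offtopic.txt"

for category in positive negative; do
  for prompt_file in "$PROMPT_DIR/$category"/*.txt; do
    name="$(basename "$prompt_file" .txt)"
    for run in $(seq 1 "$RUNS_PER_PROMPT"); do
      log_file="$LOG_DIR/${category}_${name}_run${run}.log"
      # Pipe the prompt to the agent and capture all output for parsing.
      "$AGENT_CMD" < "$prompt_file" > "$log_file" 2>&1
    done
  done
done
echo "wrote $(ls "$LOG_DIR" | wc -l) logs to $LOG_DIR"
```

Running each prompt several times matters because an agent may load a skill on one run and skip it on the next; per-run logs let the parser score every attempt independently.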
```shell
if echo "$prompt" | claude --print --verbose --output-format stream-json > "$log_file" 2>&1; then
  command_success=true
fi
```

Log parsing heuristics are then applied to the JSON-streamed debug output to detect successful skill invocations, searching for specific patterns that indicate the agent has loaded and used the intended skill. Key metrics tracked include `CORE_SUCCESS_RATE`, `EDGE_FALSE_POSITIVE_RATE`, and `OVERALL_ACCURACY`, which together give a quantifiable measure of performance.
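A self-contained sketch of this parsing-and-scoring step is below. The `"Loading skill:"` marker and the rate formulas are illustrative assumptions: the article names the metrics but does not spell out the exact log patterns or definitions.

```shell
# Sketch of log parsing and metric computation over fake per-run logs.
# "core" = positive prompts that should invoke the skill;
# "edge" = negative prompts that should not.
LOG_DIR="$(mktemp -d)"
echo '{"type":"system","text":"Loading skill: ios-architecture"}' > "$LOG_DIR/core_run1.log"
echo '{"type":"assistant","text":"answered without the skill"}'   > "$LOG_DIR/core_run2.log"
echo '{"type":"assistant","text":"plain answer"}'                 > "$LOG_DIR/edge_run1.log"
echo '{"type":"system","text":"Loading skill: ios-architecture"}' > "$LOG_DIR/edge_run2.log"

# Heuristic: a run counts as a skill invocation if the marker appears.
invoked() { grep -q "Loading skill:" "$1"; }

core_total=0; core_hits=0; edge_total=0; edge_fp=0
for f in "$LOG_DIR"/core_run*.log; do
  core_total=$((core_total + 1))
  invoked "$f" && core_hits=$((core_hits + 1))
done
for f in "$LOG_DIR"/edge_run*.log; do
  edge_total=$((edge_total + 1))
  invoked "$f" && edge_fp=$((edge_fp + 1))
done

# Assumed formulas (integer percentages):
CORE_SUCCESS_RATE=$((100 * core_hits / core_total))
EDGE_FALSE_POSITIVE_RATE=$((100 * edge_fp / edge_total))
OVERALL_ACCURACY=$((100 * (core_hits + edge_total - edge_fp) / (core_total + edge_total)))
echo "CORE_SUCCESS_RATE=${CORE_SUCCESS_RATE}%"
echo "EDGE_FALSE_POSITIVE_RATE=${EDGE_FALSE_POSITIVE_RATE}%"
echo "OVERALL_ACCURACY=${OVERALL_ACCURACY}%"
```

With the fake logs above, one of two core runs invokes the skill and one of two edge runs invokes it spuriously, so all three rates come out to 50%.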
Initial testing revealed overall skill-invocation accuracy of 73% (Codex) and 62% (Claude), which was deemed unacceptable for critical workflows. Several optimization techniques were identified and implemented, including:

- Contextual skill descriptions that make clear what domain a skill covers and when it applies
- Explicit instructions directing the agent to invoke the relevant skill
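As one illustration of enriching a skill description, the contrast below shows a vague description versus a contextual one in the YAML frontmatter of a Claude Code-style `SKILL.md` file. The skill name and wording are hypothetical, not Pinterest's actual skill.

```yaml
---
name: ios-architecture
# Vague description -- agents often fail to match it to user prompts:
# description: Helps with iOS code.
# Contextual description -- names the domain, triggers, and task:
description: >
  Explains Pinterest iOS architecture patterns. Use when the user asks
  about iOS app structure, view models, dependency injection, or
  module layout.
---
```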
## System Design Implication
While AI agents offer significant automation potential, their reliability in invoking specific skills can be a critical bottleneck. Implementing a robust testing framework and applying systematic optimization strategies, such as providing rich context and clear instructions, are essential architectural considerations for integrating agents into production engineering workflows.