HubSpot's Sidekick is an internal AI code review agent that uses large language models to automate pull request feedback. Initially built on a containerized platform, it evolved into a Java-based agent framework for improved efficiency and control. A key architectural decision was the introduction of a "judge agent" to refine feedback quality, leading to significantly faster code reviews and high engineer approval.
HubSpot developed Sidekick, an AI-powered code review agent, to address bottlenecks in manual code review processes. The system's architecture underwent a significant evolution from its initial prototype to a more integrated and scalable solution. Understanding this evolution highlights common challenges and solutions in building AI-driven internal tools.
The first version of Sidekick leveraged large language models (LLMs) running as containerized agents within a Kubernetes environment, orchestrated by an internal platform called Crucible. These agents interacted with GitHub repositories via the command line, retrieving pull request changes and generating review comments based on predefined prompts. While this approach proved the concept, it ran into architectural limitations around efficiency and control.
To overcome the limitations of the initial design, HubSpot migrated Sidekick to Aviator, a Java-based agent framework. This strategic shift gave the team tighter control over agent execution and improved efficiency compared with the containerized prototype.
A crucial architectural pattern introduced to address feedback quality was the "judge agent." Early versions of Sidekick sometimes produced verbose or low-value comments. The judge agent acts as an intermediary, evaluating comments generated by the primary review agent before they are posted to pull request discussions. This "evaluator pattern" significantly improved the signal-to-noise ratio of Sidekick's feedback, contributing to an 80% approval rate from engineers.
Architectural Takeaway: The Evaluator Pattern
When designing AI-powered systems that generate content or feedback, consider implementing an additional AI layer (a "judge" or "evaluator" agent) to filter, refine, or validate the primary output. This pattern helps maintain quality, reduce noise, and increase user trust by ensuring only high-value information is presented.
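The structure of the evaluator pattern can be sketched in a few lines of Java. This is a minimal illustration, not HubSpot's actual Aviator code: `ReviewAgent`, `JudgeAgent`, and `ReviewComment` are hypothetical names, and the stub judge uses a numeric threshold where a real system would make a second LLM call to score each candidate comment.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the "judge agent" evaluator pattern described above.
// None of these types are HubSpot's real Aviator APIs.
public class EvaluatorPatternSketch {

    /** A candidate review comment with an estimated usefulness score. */
    record ReviewComment(String file, String body, double value) {}

    /** Primary agent: generates candidate comments for a pull request diff. */
    interface ReviewAgent {
        List<ReviewComment> review(String diff);
    }

    /** Judge agent: decides whether a candidate comment is worth posting. */
    interface JudgeAgent {
        boolean approve(ReviewComment comment);
    }

    /** Only comments that pass the judge reach the pull request discussion. */
    static List<ReviewComment> reviewWithJudge(ReviewAgent reviewer,
                                               JudgeAgent judge,
                                               String diff) {
        return reviewer.review(diff).stream()
                .filter(judge::approve) // evaluator filters noise before posting
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stub reviewer emits one useful and one low-value comment.
        ReviewAgent reviewer = diff -> List.of(
                new ReviewComment("Foo.java", "Possible NPE when input is empty", 0.9),
                new ReviewComment("Foo.java", "Nice variable name!", 0.1));
        // Stub judge: in practice this would be a second LLM evaluating
        // the comment's relevance, specificity, and actionability.
        JudgeAgent judge = c -> c.value() >= 0.5;

        List<ReviewComment> posted = reviewWithJudge(reviewer, judge, "example diff");
        System.out.println(posted.size() + " comment(s) posted");
    }
}
```

The key design choice is that the judge sits between generation and publication, so low-value output is discarded before users ever see it, which is how such a layer raises perceived quality without changing the primary agent.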