HubSpot's Sidekick is an internal AI code review agent that uses large language models to automate pull request feedback. Initially built on a containerized platform, it evolved into a Java-based agent framework for improved efficiency and control. A key architectural decision was the introduction of a "judge agent" to refine feedback quality, leading to significantly faster code reviews and high engineer approval.
HubSpot developed Sidekick, an AI-powered code review agent, to address bottlenecks in manual code review processes. The system's architecture underwent a significant evolution from its initial prototype to a more integrated and scalable solution. Understanding this evolution highlights common challenges and solutions in building AI-driven internal tools.
The first version of Sidekick leveraged large language models (LLMs) running as containerized agents within a Kubernetes environment, orchestrated by an internal platform called Crucible. These agents interacted with GitHub repositories via the command line, retrieving pull request changes and generating review comments based on predefined prompts. While this approach proved the concept, it ran into architectural limitations around efficiency and control.
To overcome the limitations of the initial design, HubSpot migrated Sidekick to Aviator, a Java-based agent framework. This strategic shift gave the team tighter control over agent execution and improved efficiency compared with the containerized prototype.
A crucial architectural pattern introduced to address feedback quality was the "judge agent." Early versions of Sidekick sometimes produced verbose or low-value comments. The judge agent acts as an intermediary, evaluating comments generated by the primary review agent before they are posted to pull request discussions. This "evaluator pattern" significantly improved the signal-to-noise ratio of Sidekick's feedback, contributing to an 80% approval rate from engineers.
Architectural Takeaway: The Evaluator Pattern
When designing AI-powered systems that generate content or feedback, consider implementing an additional AI layer (a "judge" or "evaluator" agent) to filter, refine, or validate the primary output. This pattern helps maintain quality, reduce noise, and increase user trust by ensuring only high-value information is presented.
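The structure of the evaluator pattern can be sketched in a few lines of Java. This is a minimal illustration, not HubSpot's actual Aviator code: `ReviewAgent`, `JudgeAgent`, and `ReviewComment` are hypothetical names, and the stub judge uses a numeric threshold where a real system would make a second LLM call to score each candidate comment.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the "judge agent" evaluator pattern described above.
// None of these types are HubSpot's real Aviator APIs.
public class EvaluatorPatternSketch {

    /** A candidate review comment with an estimated usefulness score. */
    record ReviewComment(String file, String body, double value) {}

    /** Primary agent: generates candidate comments for a pull request diff. */
    interface ReviewAgent {
        List<ReviewComment> review(String diff);
    }

    /** Judge agent: decides whether a candidate comment is worth posting. */
    interface JudgeAgent {
        boolean approve(ReviewComment comment);
    }

    /** Only comments that pass the judge reach the pull request discussion. */
    static List<ReviewComment> reviewWithJudge(ReviewAgent reviewer,
                                               JudgeAgent judge,
                                               String diff) {
        return reviewer.review(diff).stream()
                .filter(judge::approve) // evaluator filters noise before posting
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stub reviewer emits one useful and one low-value comment.
        ReviewAgent reviewer = diff -> List.of(
                new ReviewComment("Foo.java", "Possible NPE when input is empty", 0.9),
                new ReviewComment("Foo.java", "Nice variable name!", 0.1));
        // Stub judge: in practice this would be a second LLM evaluating
        // the comment's relevance, specificity, and actionability.
        JudgeAgent judge = c -> c.value() >= 0.5;

        List<ReviewComment> posted = reviewWithJudge(reviewer, judge, "example diff");
        System.out.println(posted.size() + " comment(s) posted");
    }
}
```

The key design choice is that the judge sits between generation and publication, so low-value output is discarded before users ever see it, which is how such a layer raises perceived quality without changing the primary agent.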