The New Stack · March 25, 2026

Leveraging AI Agents for Proactive IT Operations and Root Cause Analysis

This article explores how HPE's AI agents aim to reduce operational fatigue and accelerate root cause analysis in IT operations. It focuses on the shift from human-centric incident response to an 'Agentic Era' in which specialized AI agents collaborate with human operators, improving efficiency and reducing burnout in complex hybrid cloud environments. The core system design challenge is building AI-driven operational systems that are auditable, explainable, and trustworthy.

DevOps & SRE · AI & ML Infrastructure · Distributed Systems


The increasing complexity of enterprise IT environments, especially in hybrid and multi-cloud setups, leads to significant operational fatigue, alert sprawl, and burnout among SRE and DevOps teams. Traditional manual incident response struggles to keep pace with the volume and speed of changes, including those introduced by AI-produced code. This scenario highlights a critical need for advanced operational tools that can augment human capabilities without introducing more noise or false positives.

The Agentic Era in IT Operations

HPE introduces the concept of an 'Agentic Era' in which AI agents use specialized knowledge and skills to reason toward goals and act autonomously, always with a human in the loop for orchestration and verification. These agents are designed to bridge data and operational silos, improving proactive operations and incident response. Unlike general-purpose LLMs, the approach relies on domain-specific 'skills' to improve accuracy and reduce hallucinations.

Persona-based explainability: Overcoming operational silos by providing context-aware insights.
Data silo reduction: Bridging disparate datasets and minimizing data duplication.
Proactive operations: Utilizing multi-variate predictive analytics for adaptive thresholds and early warning.
Reduced operator burnout: Automating tedious tasks and reducing alert noise.
Enhanced auditability: Tracking changes and providing transparent reasoning for actions.
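To make the "adaptive thresholds" idea concrete, here is a minimal sketch in Python. It is a deliberately simplified, univariate stand-in for the multi-variate predictive analytics the article mentions, not HPE's method. The class name AdaptiveThreshold and its parameters (window, k, min_samples, min_spread) are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class AdaptiveThreshold:
    """Flags a sample that exceeds mean + k * stdev of a sliding window.

    Hypothetical helper: a univariate simplification for illustration only.
    """

    def __init__(self, window: int = 20, k: float = 3.0,
                 min_samples: int = 5, min_spread: float = 1.0):
        self.samples = deque(maxlen=window)  # recent "normal" values
        self.k = k
        self.min_samples = min_samples
        self.min_spread = min_spread  # floor so flat signals do not flap

    def observe(self, value: float) -> bool:
        """Return True if value breaches the current adaptive threshold."""
        breach = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = max(stdev(self.samples), self.min_spread)
            breach = value > baseline + self.k * spread
        if not breach:
            # Only normal samples update the baseline, so a spike
            # does not raise the threshold for the next sample.
            self.samples.append(value)
        return breach


if __name__ == "__main__":
    detector = AdaptiveThreshold(window=10)
    for latency_ms in [100, 102, 98, 101, 99, 100, 150, 101]:
        if detector.observe(latency_ms):
            print(f"early warning: {latency_ms} ms")
```

Because the threshold follows the recent baseline instead of a fixed number, the same detector can watch a service that normally runs at 100 ms and one that normally runs at 900 ms without per-service tuning.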

Architecture of Agentic Root Cause Analysis

Agentic AI for root cause analysis operates on a feedback loop similar to OODA (Observe, Orient, Decide, Act). When an issue arises, the agent generates hypotheses, dispatches specialized skills (e.g., a 'trace analysis skill' for microservices, a 'metrics analysis skill' for patterns), and synthesizes a narrative that identifies likely culprits and ruled-out possibilities. The system also tracks changes and draws on cross-organizational memory to correlate them with the incident, which shortens investigation time.
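The hypothesize, dispatch, and synthesize steps described above can be sketched roughly as follows. This is an illustrative sketch, not the article's implementation: the Hypothesis and Finding types, the skill functions, the telemetry dictionary shape, and the 500 ms and 90% cutoffs are all assumptions. In a real agent, an LLM would generate the hypotheses and the skills would query an observability backend.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Hypothesis:
    description: str
    skill: str  # name of the skill that can test this hypothesis


@dataclass
class Finding:
    hypothesis: str
    supported: bool
    evidence: str


# A skill inspects telemetry and returns (supported, evidence).
Skill = Callable[[dict], Tuple[bool, str]]


def trace_analysis(telemetry: dict) -> Tuple[bool, str]:
    """Hypothetical skill: look for spans slower than 500 ms."""
    slow = sorted(s for s, ms in telemetry["span_latency_ms"].items() if ms > 500)
    evidence = f"slow spans: {', '.join(slow)}" if slow else "no slow spans"
    return bool(slow), evidence


def metrics_analysis(telemetry: dict) -> Tuple[bool, str]:
    """Hypothetical skill: check for CPU saturation above 90%."""
    cpu = telemetry["cpu_percent"]
    return cpu > 90, f"cpu at {cpu}%"


SKILLS: Dict[str, Skill] = {
    "trace_analysis": trace_analysis,
    "metrics_analysis": metrics_analysis,
}


def investigate(hypotheses: List[Hypothesis], telemetry: dict) -> List[Finding]:
    """Dispatch each hypothesis to its skill and collect the findings."""
    findings = []
    for h in hypotheses:
        supported, evidence = SKILLS[h.skill](telemetry)
        findings.append(Finding(h.description, supported, evidence))
    return findings


def synthesize(findings: List[Finding]) -> str:
    """Build a short narrative of likely culprits and ruled-out causes."""
    likely = [f"{f.hypothesis} ({f.evidence})" for f in findings if f.supported]
    ruled_out = [f"{f.hypothesis} ({f.evidence})" for f in findings if not f.supported]
    return (
        "Likely: " + ("; ".join(likely) or "none")
        + "\nRuled out: " + ("; ".join(ruled_out) or "none")
    )


if __name__ == "__main__":
    telemetry = {"span_latency_ms": {"checkout": 1200, "cart": 40}, "cpu_percent": 35}
    hypotheses = [
        Hypothesis("Slow downstream call", "trace_analysis"),
        Hypothesis("CPU saturation", "metrics_analysis"),
    ]
    print(synthesize(investigate(hypotheses, telemetry)))
```

Keeping ruled-out hypotheses in the narrative, with their evidence, is what lets an operator check the agent's reasoning rather than only its conclusion.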


Key Components for Trust in Agentic AI

For enterprise-grade adoption, agentic AI systems require robust mechanisms for trust and transparency. These include a full audit trail (conversations, user attribution, API calls), transparent reasoning (showing hypotheses, step-by-step plans, cited sources), and observability/traceability (OpenTelemetry traces, decision path logging, reproducible evaluations). Together, they provide accountability and allow human operators to build confidence in the AI's recommendations.
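One way to picture the audit trail is as an append-only log in which every entry records who did what and is linked to the previous entry by a hash. The sketch below makes that assumption explicit. Hash chaining is an illustrative technique chosen here, not something the article prescribes, and the AuditLog class and its field names are hypothetical. A production log would also carry timestamps and trace IDs, which are left out to keep the example deterministic.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


@dataclass
class AuditLog:
    """Hypothetical append-only, tamper-evident audit trail."""

    entries: List[dict] = field(default_factory=list)

    def append(self, actor: str, action: str, detail: dict) -> dict:
        """Record an action with user attribution, chained to the prior entry."""
        body = {
            "seq": len(self.entries),
            "actor": actor,      # human user or agent identity
            "action": action,    # e.g. "api_call", "hypothesis", "approval"
            "detail": detail,
            "prev_hash": self.entries[-1]["hash"] if self.entries else GENESIS,
        }
        body["hash"] = _digest(body)  # digest is computed before "hash" is set
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    log = AuditLog()
    log.append("agent:rca", "hypothesis", {"text": "Slow downstream call"})
    log.append("user:alice", "approval", {"remediation": "restart checkout"})
    print("chain intact:", log.verify())
```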

AI agents · IT Operations · Root Cause Analysis · SRE · DevOps · Observability · Incident Response · Hybrid Cloud


Architecture Design

Design this yourself

Design an enterprise-grade, multi-domain agentic operations system that leverages AI agents for proactive IT operations and significantly reduces root cause analysis time. Your design should incorporate specialized AI skills, maintain a human-in-the-loop orchestration model, ensure full auditability and transparent reasoning, and integrate with existing observability and incident management platforms to handle complex hybrid cloud environments. Focus on how agents observe, orient, decide, and suggest actions, while human operators retain final control over remediation.
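As a starting point for the human-in-the-loop requirement, the sketch below shows one possible approval gate: the agent can only propose a remediation, and execution is refused until a named operator approves it. The Remediation and Status types and the decide and execute functions are hypothetical names introduced for illustration. A real design would persist these records and connect them to an incident management platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTED = "executed"


@dataclass
class Remediation:
    summary: str             # what the agent suggests doing
    rationale: str           # why, in terms the operator can check
    run: Callable[[], str]   # the action itself, never called by the agent
    status: Status = Status.PROPOSED
    decided_by: str = ""


def decide(item: Remediation, operator: str, approve: bool) -> None:
    """Record a human decision; each proposal can be decided only once."""
    if item.status is not Status.PROPOSED:
        raise ValueError("decision already recorded")
    item.status = Status.APPROVED if approve else Status.REJECTED
    item.decided_by = operator


def execute(item: Remediation) -> str:
    """Run the remediation only if a human has approved it."""
    if item.status is not Status.APPROVED:
        raise PermissionError(f"cannot execute a {item.status.value} remediation")
    result = item.run()
    item.status = Status.EXECUTED
    return result


if __name__ == "__main__":
    proposal = Remediation(
        summary="Restart checkout pods",
        rationale="Trace analysis shows slow spans in checkout",
        run=lambda: "restarted",
    )
    decide(proposal, operator="alice", approve=True)
    print(execute(proposal), "approved by", proposal.decided_by)
```

Putting the check inside execute, instead of trusting callers to ask first, means no code path reaches the action without a recorded human decision.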


Focus: AI-powered agentic system for root cause analysis and proactive operations with auditability

Other design angles

· Design the auditability and explainability framework for an AI-powered operations agent, ensuring compliance and operator trust in automated root cause analysis.
· Design a 'skill dispatch' and hypothesis generation engine for an agentic AI system, optimizing for quick and accurate problem identification across distributed microservices and infrastructure.
· Architect a multi-tenant platform for managing and orchestrating various specialized AI agents, ensuring secure data isolation and efficient resource allocation for different enterprise operational teams.
