The New Stack·March 25, 2026

Leveraging AI Agents for Proactive IT Operations and Root Cause Analysis

This article explores how HPE's AI agents are transforming IT operations by reducing operational fatigue and accelerating root cause analysis. It focuses on the shift from human-centric incident response to an 'Agentic Era' where specialized AI agents collaborate with human operators, improving efficiency and reducing burnout in complex, hybrid cloud environments. The core system design challenge lies in building auditable, explainable, and trustworthy AI-driven operational systems.

Read original on The New Stack

The growing complexity of enterprise IT environments, especially hybrid and multi-cloud setups, causes operational fatigue, alert sprawl, and burnout among SRE and DevOps teams. Manual incident response struggles to keep pace with the volume and speed of change, including changes introduced by AI-generated code. The result is a need for operational tools that augment human operators without adding more noise or false positives.

The Agentic Era in IT Operations

HPE introduces the concept of an 'Agentic Era' in which AI agents use specialized knowledge and skills to perform goal-oriented reasoning and autonomous actions, always with a human in the loop for orchestration and verification. These agents are designed to bridge data and operational silos, improving proactive operations and incident response. Unlike general-purpose LLMs, the approach relies on domain-specific 'skills' to improve accuracy and reduce hallucinations. The summary lists the following focus areas:

  • Persona-based explainability: Overcoming operational silos by providing context-aware insights.
  • Data silo reduction: Bridging disparate datasets and minimizing data duplication.
  • Proactive operations: Using multivariate predictive analytics for adaptive thresholds and early warning.
  • Reduced operator burnout: Automating tedious tasks and reducing alert noise.
  • Enhanced auditability: Tracking changes and providing transparent reasoning for actions.
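To make the "adaptive thresholds" item concrete, here is a minimal sketch in Python. It is deliberately a simplified, single-metric stand-in (rolling mean plus k standard deviations), not the multivariate predictive analytics the article describes. The class name `AdaptiveThreshold` and its parameters are hypothetical and are not taken from any HPE product.

```python
from collections import deque
from statistics import mean, stdev


class AdaptiveThreshold:
    """Hypothetical helper: rolling mean + k * stddev over a sliding window."""

    def __init__(self, window: int = 20, k: float = 3.0, min_samples: int = 5):
        # min_samples must be at least 2, because stdev() needs two points.
        self.values = deque(maxlen=window)
        self.k = k
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if `value` breaches the current threshold."""
        breached = False
        if len(self.values) >= self.min_samples:
            limit = mean(self.values) + self.k * stdev(self.values)
            breached = value > limit
        # Learn only from normal points so an anomaly does not inflate the baseline.
        if not breached:
            self.values.append(value)
        return breached


if __name__ == "__main__":
    detector = AdaptiveThreshold()
    for latency_ms in [10, 11, 10, 12, 11, 10, 11, 50, 11]:
        if detector.observe(latency_ms):
            print(f"early warning: {latency_ms} ms is above the learned baseline")
```

Because the limit is recomputed from recent data, it follows gradual drift in the metric rather than relying on a fixed, hand-tuned alert value.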

Architecture of Agentic Root Cause Analysis

Agentic AI for root cause analysis operates on a feedback loop similar to OODA (Observe, Orient, Decide, Act). When an issue arises, the agent generates hypotheses, dispatches specialized skills (e.g., 'trace analysis skill' for microservices, 'metrics analysis skill' for patterns), and synthesizes a narrative identifying likely culprits and ruled-out possibilities. Crucially, the system tracks changes and leverages cross-organizational memory to identify correlations, significantly cutting down investigation time.
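The loop described above can be sketched roughly as follows. This is an illustrative outline only: the hypotheses are hard-coded here (a real agent would generate them), and the skill functions, the `Finding` type, and the context keys such as `span_latency_ms` are all hypothetical names chosen for the example rather than anything documented by HPE.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A "skill" inspects incident context and returns (supported, evidence).
Skill = Callable[[dict], Tuple[bool, str]]


@dataclass
class Finding:
    hypothesis: str
    skill: str
    supported: bool
    evidence: str


def trace_analysis_skill(ctx: dict) -> Tuple[bool, str]:
    """Hypothetical skill: flag spans slower than the latency SLO."""
    slo = ctx.get("latency_slo_ms", 500)
    slow = [name for name, ms in ctx.get("span_latency_ms", {}).items() if ms > slo]
    return bool(slow), f"slow spans: {slow}" if slow else "no spans above SLO"


def metrics_analysis_skill(ctx: dict) -> Tuple[bool, str]:
    """Hypothetical skill: flag CPU saturation."""
    cpu = ctx.get("cpu_percent", 0)
    return cpu > 90, f"cpu at {cpu}%"


def investigate(ctx: dict, hypotheses: List[Tuple[str, str]],
                skills: Dict[str, Skill]) -> List[Finding]:
    """Dispatch one skill per hypothesis and collect the results."""
    findings = []
    for text, skill_name in hypotheses:
        supported, evidence = skills[skill_name](ctx)
        findings.append(Finding(text, skill_name, supported, evidence))
    return findings


def synthesize(findings: List[Finding]) -> str:
    """Build a narrative listing likely culprits and ruled-out hypotheses."""
    def fmt(f: Finding) -> str:
        return f"  - {f.hypothesis} ({f.skill}: {f.evidence})"
    likely = [fmt(f) for f in findings if f.supported] or ["  - none"]
    ruled_out = [fmt(f) for f in findings if not f.supported] or ["  - none"]
    return "\n".join(["Likely culprits:", *likely, "Ruled out:", *ruled_out])


if __name__ == "__main__":
    context = {"span_latency_ms": {"checkout": 1200, "cart": 80},
               "latency_slo_ms": 500, "cpu_percent": 40}
    skills = {"trace": trace_analysis_skill, "metrics": metrics_analysis_skill}
    hypotheses = [("A downstream service is slow", "trace"),
                  ("The host is CPU-saturated", "metrics")]
    print(synthesize(investigate(context, hypotheses, skills)))
```

Keeping ruled-out hypotheses in the output mirrors the narrative described above: the operator sees not only the suspected cause but also what was checked and dismissed.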


Key Components for Trust in Agentic AI

For enterprise-grade adoption, agentic AI systems need robust mechanisms for trust and transparency. These include a full audit trail (conversations, user attribution, API calls), transparent reasoning (hypotheses, step-by-step plans, cited sources), and observability and traceability (OpenTelemetry traces, decision-path logging, reproducible evaluations). Together they provide accountability and let human operators build confidence in the AI's recommendations.
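As a rough illustration of what such an audit trail might record, the sketch below uses only the Python standard library. The `AuditEvent` and `AuditTrail` names, the event kinds, and the example API call and source are assumptions made for this example. The trace ID is a random 32-character hex string standing in for an ID that a real deployment would presumably take from its OpenTelemetry trace context.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AuditEvent:
    trace_id: str   # ties every event of one investigation together
    user: str       # user attribution
    kind: str       # e.g. "conversation", "hypothesis", "plan_step", "api_call"
    detail: str
    sources: List[str] = field(default_factory=list)  # cited sources
    timestamp: str = field(default_factory=_now)


class AuditTrail:
    """Hypothetical append-only log of one agent investigation."""

    def __init__(self, user: str):
        self.trace_id = uuid.uuid4().hex  # placeholder for a real trace ID
        self.user = user
        self._events: List[AuditEvent] = []

    def record(self, kind: str, detail: str,
               sources: Optional[List[str]] = None) -> AuditEvent:
        event = AuditEvent(self.trace_id, self.user, kind, detail, list(sources or []))
        self._events.append(event)
        return event

    def export_jsonl(self) -> str:
        """Serialize events as JSON Lines for storage or later review."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)


if __name__ == "__main__":
    trail = AuditTrail(user="alice")
    trail.record("hypothesis", "checkout latency caused by a slow dependency",
                 sources=["trace analysis skill"])
    trail.record("api_call", "GET /metrics?service=checkout")  # illustrative call
    print(trail.export_jsonl())
```

Recording hypotheses and API calls under one trace ID is one simple way to let a reviewer reconstruct the decision path after the fact.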

AI agents · IT Operations · Root Cause Analysis · SRE · DevOps · Observability · Incident Response · Hybrid Cloud
