Datadog Blog·May 29, 2026

Observability for LLM Agents: Monitoring LangGraph with Datadog

This article explains why observability matters for LangGraph-based LLM agents and how Datadog's LLM Observability SDK can provide tracing, logging, and metrics for them. It covers the main design considerations for monitoring complex AI workflows: understanding agent state, tool usage, and overall performance so that agents run reliably and efficiently.

The Need for Observability in LLM Agents

Monitoring LLM-powered applications, especially those built with agent frameworks like LangGraph, presents unique challenges compared to traditional microservices. LLM agents often involve complex, non-deterministic execution flows, external tool interactions, and multiple decision points. Effective observability is crucial for debugging, performance optimization, and understanding the emergent behavior of these systems in production.

Key Observability Pillars for LangGraph

Tracing: Following the execution path of an agent across different nodes (e.g., LLM calls, tool invocations, state transitions) provides a holistic view of its operation.
Logging: Capturing detailed events, inputs, and outputs at each step is essential for post-mortem analysis and debugging specific failures.
Metrics: Aggregating quantitative data like latency of LLM calls, success rates of tool usage, and token consumption helps in identifying bottlenecks and overall system health.
Agent State Monitoring: Understanding the internal state and decision-making process of the agent is vital for diagnosing unexpected behavior and ensuring it adheres to design principles.
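
As a rough illustration of how these four pillars relate, the sketch below wraps a node function so that each call records a span (tracing), emits a log line (logging), measures latency (metrics), and notes which state keys changed (agent state monitoring). It uses only the standard library and is not Datadog's SDK or LangGraph's API; `traced_node`, `SPANS`, and the two node functions are hypothetical names chosen for the example.

```python
import logging
import time
from typing import Any, Callable, Dict, List

logger = logging.getLogger("agent")

State = Dict[str, Any]

# In-memory span store; a real setup would export spans to a tracing backend.
SPANS: List[Dict[str, Any]] = []


def traced_node(name: str, fn: Callable[[State], State]) -> Callable[[State], State]:
    """Wrap a node so each call records latency, status, and changed state keys."""

    def wrapper(state: State) -> State:
        start = time.perf_counter()
        span: Dict[str, Any] = {"node": name, "status": "ok", "changed_keys": []}
        try:
            update = fn(state)
            # Agent state monitoring: record which keys this node added or changed.
            span["changed_keys"] = sorted(
                k for k, v in update.items() if state.get(k) != v
            )
            return {**state, **update}
        except Exception as exc:
            span["status"] = "error"
            span["error"] = repr(exc)
            raise
        finally:
            span["latency_s"] = time.perf_counter() - start
            SPANS.append(span)
            logger.info("node=%s status=%s", name, span["status"])

    return wrapper


# Hypothetical nodes standing in for an LLM call and a tool invocation.
def call_model(state: State) -> State:
    return {"answer": f"echo: {state['question']}"}


def use_tool(state: State) -> State:
    return {"tool_result": len(state["answer"])}


state: State = {"question": "ping"}
for node_name, node_fn in [("llm", call_model), ("tool_use", use_tool)]:
    state = traced_node(node_name, node_fn)(state)
```

After the loop, `SPANS` holds one entry per node in execution order, which is the minimum needed to reconstruct the path an agent took through the graph.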

Designing for LLM Observability

When designing systems with LLM agents, integrate observability from the ground up. This involves instrumenting LLM calls, tool interactions, and state changes to generate rich telemetry. Consider how this data will be aggregated, visualized, and alerted upon to provide actionable insights into the agent's performance and reliability.
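
One way to picture the aggregation and alerting step is to roll per-call telemetry records up into a few health metrics and compare them against thresholds. The sketch below assumes a simple record shape; `CallRecord`, `summarize`, `check_alerts`, and the threshold values are illustrative and do not come from the article or from any Datadog API.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    kind: str          # "llm" or "tool"
    latency_ms: float
    ok: bool
    tokens: int = 0    # only meaningful for LLM calls


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty list."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: List[CallRecord]) -> Dict[str, float]:
    """Aggregate raw call records into the metrics named in the text."""
    llm = [r for r in records if r.kind == "llm"]
    tools = [r for r in records if r.kind == "tool"]
    return {
        "llm_p95_latency_ms": percentile([r.latency_ms for r in llm], 95),
        "tool_success_rate": sum(r.ok for r in tools) / len(tools) if tools else 1.0,
        "total_tokens": float(sum(r.tokens for r in llm)),
    }


# Assumed thresholds; real values depend on the application's requirements.
THRESHOLDS = {"llm_p95_latency_ms": 2000.0, "tool_success_rate": 0.9}


def check_alerts(summary: Dict[str, float]) -> List[str]:
    alerts = []
    if summary["llm_p95_latency_ms"] > THRESHOLDS["llm_p95_latency_ms"]:
        alerts.append("llm p95 latency above threshold")
    if summary["tool_success_rate"] < THRESHOLDS["tool_success_rate"]:
        alerts.append("tool success rate below threshold")
    return alerts


records = [
    CallRecord("llm", 120.0, True, tokens=50),
    CallRecord("llm", 300.0, True, tokens=80),
    CallRecord("llm", 2500.0, True, tokens=400),
    CallRecord("tool", 40.0, True),
    CallRecord("tool", 55.0, False),
]
summary = summarize(records)
alerts = check_alerts(summary)
```

In a hosted monitoring system the thresholds would typically live in monitor definitions rather than in application code; keeping them in one place here just makes the aggregation-to-alert path easy to follow.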

The article demonstrates using Datadog's LLM Observability SDK, which automatically instruments popular LLM frameworks, to collect traces, logs, and metrics. The SDK generates spans for LLM calls and tool executions and attaches prompt and response data to them, so developers can view the entire workflow in a distributed tracing system. Automatic instrumentation reduces the effort of manually adding telemetry to complex agent logic and improves visibility into the system's runtime behavior.

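python

# Assumes the `ddtrace` package, which ships Datadog's LLM Observability SDK
# (there is no `datadog_llm.patch` module).
from ddtrace.llmobs import LLMObs

# "langgraph-agent" is a placeholder application name; credentials and site
# are expected to come from the environment or a local Datadog Agent.
LLMObs.enable(ml_app="langgraph-agent")

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI

# `AgentState`, `call_model`, and `use_tool` are the application's own state
# schema and node functions, defined elsewhere.

# Define the LangGraph agent
graph_builder = StateGraph(AgentState)
graph_builder.add_node("llm", call_model)
graph_builder.add_node("tool_use", use_tool)
# ... more graph definition

# Once LLM Observability is enabled, LLM calls and tool uses are traced automatically.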

Architecture Design

Design this yourself

Design a robust, observable AI agent platform using LangGraph, incorporating comprehensive tracing, logging, and metric collection for LLM interactions, tool usage, and state transitions. Focus on how to instrument the agent for deep visibility and integrate with a monitoring system to provide real-time performance insights and facilitate debugging of complex, non-deterministic workflows.

Other design angles

· Design a system to monitor the cost and token usage of a multi-agent LLM application in real time.
· Design a feedback loop mechanism for an LLM agent system, using observability data to inform model fine-tuning and prompt engineering.
· Design a distributed logging and tracing infrastructure specifically optimized for the unique telemetry patterns of LLM-powered microservices.
