Datadog Blog · May 29, 2026

Observability for LLM Agents: Monitoring LangGraph with Datadog

This article discusses the importance of observability for LangGraph-based LLM agents, focusing on how Datadog's LLM Observability SDK can be used for tracing, logging, and metrics. It highlights key system design considerations for monitoring complex AI workflows, including understanding agent states, tool usage, and overall performance to ensure reliable and efficient operation.

Read original on Datadog Blog

The Need for Observability in LLM Agents

Monitoring LLM-powered applications, especially those built with agent frameworks like LangGraph, raises challenges that traditional microservices do not. LLM agents often involve non-deterministic execution flows, calls to external tools, and multiple decision points, so the same input can take a different path on each run. Observability is therefore needed for debugging, for performance optimization, and for understanding how these systems actually behave in production.

Key Observability Pillars for LangGraph

  • Tracing: Following the execution path of an agent across different nodes (e.g., LLM calls, tool invocations, state transitions) provides a holistic view of its operation.
  • Logging: Capturing detailed events, inputs, and outputs at each step is essential for post-mortem analysis and debugging specific failures.
  • Metrics: Aggregating quantitative data like latency of LLM calls, success rates of tool usage, and token consumption helps in identifying bottlenecks and overall system health.
  • Agent State Monitoring: Understanding the internal state and decision-making process of the agent is vital for diagnosing unexpected behavior and ensuring it adheres to design principles.
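
To make the first three pillars concrete, here is a minimal sketch using only the Python standard library. It is not the Datadog SDK: the `Tracer` and `Span` names, the span kinds, and the simulated tool failure are all assumptions made for illustration. Each span records its parent (tracing), its inputs, output, and any error (logging), and its duration, from which a tool success rate is derived (metrics).

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    name: str
    kind: str                      # "llm", "tool", or "node" (labels chosen for this sketch)
    parent: Optional[str]
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


class Tracer:
    """Collects spans in memory; a real SDK would export them to a backend."""

    def __init__(self) -> None:
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, kind: str, **inputs: Any):
        parent = self._stack[-1].name if self._stack else None
        current = Span(name=name, kind=kind, parent=parent, inputs=inputs)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            current.error = repr(exc)  # keep the failure next to the inputs that caused it
            raise
        finally:
            current.duration_s = time.perf_counter() - start
            self._stack.pop()
            self.spans.append(current)

    def tool_success_rate(self) -> float:
        """Fraction of tool spans that finished without an error."""
        tools = [s for s in self.spans if s.kind == "tool"]
        if not tools:
            return 1.0
        return sum(s.error is None for s in tools) / len(tools)


tracer = Tracer()

with tracer.span("agent_run", "node", question="weather in Paris?"):
    with tracer.span("llm", "llm", prompt="weather in Paris?") as llm_span:
        llm_span.output = "call tool: get_weather"
    try:
        with tracer.span("get_weather", "tool", city="Paris"):
            raise TimeoutError("upstream timed out")  # simulated tool failure
    except TimeoutError:
        pass  # a real agent would retry or fall back here
```

Spans are appended when they finish, so inner spans appear in `tracer.spans` before the enclosing `agent_run` span. The parent field is what lets a tracing UI rebuild the tree.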

Designing for LLM Observability

When designing systems with LLM agents, integrate observability from the ground up. This involves instrumenting LLM calls, tool interactions, and state changes to generate rich telemetry. Consider how this data will be aggregated, visualized, and alerted upon to provide actionable insights into the agent's performance and reliability.
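
As a rough illustration of instrumenting state changes and alerting on the aggregated data, the sketch below wraps a node function in a decorator that records latency, token usage, and which state keys the node changed. Everything here is hypothetical and stdlib-only: `observed_node`, `check_alerts`, the `tokens_used` key, and the thresholds are assumptions for the example, not part of LangGraph or Datadog.

```python
import time
from functools import wraps
from statistics import mean
from typing import Any, Callable

State = dict[str, Any]
records: list[dict[str, Any]] = []   # in-memory stand-in for a telemetry backend


def observed_node(name: str) -> Callable:
    """Wrap a node function (state -> partial update) and record what it changed."""
    def decorator(fn: Callable[[State], State]) -> Callable[[State], State]:
        @wraps(fn)
        def wrapper(state: State) -> State:
            start = time.perf_counter()
            update = fn(state)
            records.append({
                "node": name,
                "latency_s": time.perf_counter() - start,
                # Keys whose new value differs from the incoming state.
                "changed_keys": sorted(k for k, v in update.items() if state.get(k) != v),
                "tokens": update.get("tokens_used", 0),
            })
            return update
        return wrapper
    return decorator


def check_alerts(max_mean_latency_s: float, max_total_tokens: int) -> list[str]:
    """Return alert messages; the thresholds are illustrative, not recommendations."""
    alerts = []
    if records and mean(r["latency_s"] for r in records) > max_mean_latency_s:
        alerts.append("mean node latency above threshold")
    if sum(r["tokens"] for r in records) > max_total_tokens:
        alerts.append("token budget exceeded")
    return alerts


@observed_node("llm")
def call_model(state: State) -> State:
    # Stand-in for a real model call; the token count is made up for the example.
    return {"messages": state["messages"] + ["(model reply)"], "tokens_used": 120}


state: State = {"messages": ["hi"], "tokens_used": 0}
state |= call_model(state)
alerts = check_alerts(max_mean_latency_s=5.0, max_total_tokens=100)
```

Recording only the changed keys rather than the full state keeps the telemetry small while still showing which node was responsible for each transition.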

The article demonstrates Datadog's LLM Observability SDK, which automatically instruments popular LLM frameworks to collect traces, logs, and metrics. It generates spans for LLM calls and tool executions, including prompt and response data, so developers can view the entire workflow in a distributed tracing system. Because the instrumentation is automatic, developers do not have to add telemetry by hand throughout complex agent logic.

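python
from typing import TypedDict

# Assumes the ddtrace package is installed and DD_API_KEY / DD_SITE are configured.
from ddtrace.llmobs import LLMObs
LLMObs.enable(ml_app="langgraph-agent")  # placeholder application name

from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI

class AgentState(TypedDict):
    messages: list

def call_model(state: AgentState) -> dict:
    # Placeholder node: call the model (e.g. ChatOpenAI) and return the state update.
    return {"messages": state["messages"]}

def use_tool(state: AgentState) -> dict:
    # Placeholder node: run a tool and return the state update.
    return {"messages": state["messages"]}

# Define the LangGraph agent
graph_builder = StateGraph(AgentState)
graph_builder.add_node("llm", call_model)
graph_builder.add_node("tool_use", use_tool)
# ... more graph definition

# With LLM Observability enabled, LLM calls and tool uses are traced automatically.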
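
Automatic instrumentation of this kind is commonly implemented by wrapping framework methods at runtime (monkey-patching). The sketch below shows the general technique only. It is not Datadog's implementation, and `FakeChatModel`, `patch_chat_model`, and `captured_spans` are hypothetical names invented for the example.

```python
import time
from functools import wraps

captured_spans = []  # stand-in for an exporter that would ship spans to a backend


class FakeChatModel:
    """Hypothetical stand-in for a framework class such as a chat model client."""

    def invoke(self, prompt: str) -> str:
        return f"echo: {prompt}"


def patch_chat_model() -> None:
    """Replace FakeChatModel.invoke with a traced wrapper; safe to call twice."""
    original = FakeChatModel.invoke
    if getattr(original, "_is_traced", False):
        return  # already patched, so avoid wrapping the wrapper

    @wraps(original)
    def traced_invoke(self, prompt: str) -> str:
        start = time.perf_counter()
        result = original(self, prompt)
        captured_spans.append({
            "name": "FakeChatModel.invoke",
            "input": prompt,
            "output": result,
            "duration_s": time.perf_counter() - start,
        })
        return result

    traced_invoke._is_traced = True
    FakeChatModel.invoke = traced_invoke


patch_chat_model()
patch_chat_model()  # second call is a no-op
reply = FakeChatModel().invoke("hello")
```

Application code keeps calling `invoke` as before. The wrapper is what records the span, which is why such instrumentation generally has to be enabled before the instrumented calls are made.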