Observability for AI Agents: What Traditional APM Tools Miss

The Observability Gap

You’ve deployed an AI agent to production. Users are chatting with it. Sometimes they’re happy. Sometimes they’re not.

How do you know what’s going wrong?

You check your APM dashboard. Response times look fine. Error rates are low. CPU and memory are normal.

But users are complaining. The agent gave wrong information. It called the wrong tool. It hallucinated a policy that doesn’t exist.

Your traditional observability tools didn’t catch any of this. Because they weren’t designed for AI.

Why AI Agents Are Different

Traditional applications have predictable behavior. Given input A, they produce output B. If something goes wrong, you can trace the code path, examine variables, and identify the bug.

AI agents are different:

Non-Deterministic Outputs: The same input can produce different outputs. “Working correctly” isn’t binary — it’s a quality spectrum.

Quality Is Subjective: A response can be technically successful (HTTP 200, no exceptions) but completely wrong for the user’s needs.

Multi-Component Pipelines: A single user message might trigger: prompt assembly -> LLM call -> tool execution -> another LLM call -> response formatting. Issues can occur at any stage.

External Dependencies: LLM APIs are third-party services with their own reliability characteristics, rate limits, and cost implications.

Conversation Context: Problems might only appear in the context of a multi-turn conversation, not individual requests.

Traditional APM tools — Datadog, New Relic, Dynatrace — weren’t designed for this. They can tell you the request succeeded. They can’t tell you the response was helpful.

What AI Observability Requires

1. Conversation-Level Tracing

Individual request tracing isn’t enough. You need traces that span entire conversations:

Conversation: conv_abc123
  +-- Turn 1: "What's my account balance?"
  |   +-- Prompt assembly (2ms)
  |   +-- LLM call (340ms) - 150 tokens in, 45 tokens out
  |   +-- Tool: lookup_account (120ms)
  |   +-- LLM call (280ms) - 200 tokens in, 60 tokens out
  |   +-- Response delivered
  |
  +-- Turn 2: "Transfer $500 to savings"
  |   +-- Prompt assembly (3ms)
  |   +-- LLM call (310ms)
  |   +-- Tool: validate_transfer (80ms)
  |   +-- Tool: execute_transfer (450ms) <-- FAILED
  |   +-- Error response
  |
  +-- Turn 3: "Why did that fail?"
      +-- ...

Turn 3’s question only makes sense alongside the failed transfer in Turn 2. An isolated trace of the Turn 3 request shows a generic question with no record of what “that” refers to.
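As a rough sketch of what conversation-level grouping involves, the snippet below collects flat span records under their conversation and turn, then finds the first turn that contains a failure. The `SpanRecord` type, its fields, and the Turn 3 duration are assumptions made for illustration, not the schema of any particular tracing tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpanRecord:
    # Hypothetical flat span record as it might arrive from instrumentation.
    conversation_id: str
    turn: int
    name: str
    duration_ms: int
    ok: bool = True


def group_by_conversation(spans):
    """Return {conversation_id: {turn: [spans in arrival order]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for span in spans:
        grouped[span.conversation_id][span.turn].append(span)
    return {cid: dict(turns) for cid, turns in grouped.items()}


def first_failed_turn(turns):
    """Return the lowest turn number containing a failed span, or None."""
    failed = [turn for turn, spans in turns.items() if any(not s.ok for s in spans)]
    return min(failed) if failed else None


spans = [
    SpanRecord("conv_abc123", 1, "llm.call", 340),
    SpanRecord("conv_abc123", 1, "tool.lookup_account", 120),
    SpanRecord("conv_abc123", 2, "llm.call", 310),
    SpanRecord("conv_abc123", 2, "tool.execute_transfer", 450, ok=False),
    SpanRecord("conv_abc123", 3, "llm.call", 290),
]
conversations = group_by_conversation(spans)
print(first_failed_turn(conversations["conv_abc123"]))  # 2
```

With the spans grouped this way, anyone debugging Turn 3 can see that Turn 2 ended in a failed tool call.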

2. LLM-Specific Metrics

Beyond standard latency and error rates, AI agents need:

Token Metrics:

Input tokens per request
Output tokens per request
Cached tokens (for prompt caching)
Context window utilization

Cost Metrics:

Cost per LLM call
Cost per conversation
Cost per agent
Cost by provider/model

Quality Signals:

Tool call success/failure rates
Guardrail trigger rates
Response length distribution
Conversation completion rates
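Two of these metrics, cost per LLM call and context window utilization, can be derived directly from token counts. The sketch below assumes a hypothetical pricing table; the model name, prices, and context window size are placeholders, and real values should come from the provider’s current price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    input_usd_per_mtok: float   # USD per million input tokens
    output_usd_per_mtok: float  # USD per million output tokens
    context_window: int         # maximum tokens the model accepts


# Hypothetical values for illustration only.
PRICING = {"example-model": ModelPricing(3.00, 15.00, 200_000)}


def call_cost_usd(model, input_tokens, output_tokens):
    """Cost of one LLM call, derived from its token counts."""
    p = PRICING[model]
    total = input_tokens * p.input_usd_per_mtok + output_tokens * p.output_usd_per_mtok
    return total / 1_000_000


def context_utilization(model, input_tokens):
    """Fraction of the context window consumed by the prompt."""
    return input_tokens / PRICING[model].context_window


print(f"{call_cost_usd('example-model', 450, 120):.5f}")
print(f"{context_utilization('example-model', 450):.2%}")
```

Cost per conversation and cost per agent are then sums of these per-call values under different grouping keys.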

3. Tool Execution Visibility

When an agent calls a tool, you need to see:

Which tool was called
What arguments were passed
What result was returned
How long it took
Whether it succeeded or failed

Tool Execution: lookup_customer
  Arguments: {"customer_id": "cust_12345"}
  Duration: 145ms
  Status: Success
  Result: {"name": "John Doe", "status": "active", ...}

This is crucial for debugging. Was the problem the LLM’s decision to call the tool? The arguments it chose? Or the tool execution itself?
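One way to capture those five facts is to wrap every tool function in a recorder. The decorator below is a minimal sketch: `observed_tool`, the in-memory `TOOL_CALLS` list, and the stand-in `lookup_customer` body are all hypothetical, and a production version would emit spans or logs instead of appending to a list.

```python
import functools
import time

TOOL_CALLS = []  # in-memory sink, used here only for illustration


def observed_tool(func):
    """Record name, arguments, result or error, status, and duration of each call."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        record = {"tool": func.__name__, "arguments": kwargs}
        start = time.perf_counter()
        try:
            result = func(**kwargs)
            record.update(status="success", result=result)
            return result
        except Exception as exc:
            record.update(status="error", error=repr(exc))
            raise
        finally:
            # Runs on both paths, so failed calls are recorded too.
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
            TOOL_CALLS.append(record)
    return wrapper


@observed_tool
def lookup_customer(customer_id):
    # Stand-in for a real lookup.
    if not customer_id.startswith("cust_"):
        raise ValueError("malformed customer_id")
    return {"name": "John Doe", "status": "active"}


lookup_customer(customer_id="cust_12345")
try:
    lookup_customer(customer_id="12345")  # bad argument chosen by the "LLM"
except ValueError:
    pass

for call in TOOL_CALLS:
    print(call["tool"], call["arguments"], call["status"])
```

Because the arguments are recorded alongside the status, a failure caused by the LLM’s choice of arguments can be told apart from a failure inside the tool.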

4. Prompt Versioning and Correlation

When you update prompts, you need to correlate quality changes:

Prompt: support-v2.3.1
  Deployed: 2025-01-15 14:30
  Quality Score: 4.2/5
  Tool Accuracy: 94%
  Average Cost: $0.012/conversation

Prompt: support-v2.4.0
  Deployed: 2025-01-20 09:00
  Quality Score: 3.8/5 <-- REGRESSION
  Tool Accuracy: 87% <-- REGRESSION
  Average Cost: $0.018/conversation <-- +50%

Without this correlation, you’re flying blind after every prompt change.
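The comparison itself can be automated once each prompt version carries its own aggregates. The sketch below flags metrics that moved in the wrong direction by more than a relative tolerance; the `PromptStats` type and the 2% default tolerance are assumptions, and tool accuracy is expressed as a fraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptStats:
    version: str
    quality_score: float   # out of 5
    tool_accuracy: float   # fraction in [0, 1]
    avg_cost_usd: float    # per conversation


def regressions(baseline, candidate, tolerance=0.02):
    """Names of metrics that got worse by more than `tolerance` (relative)."""
    flagged = []
    if candidate.quality_score < baseline.quality_score * (1 - tolerance):
        flagged.append("quality_score")
    if candidate.tool_accuracy < baseline.tool_accuracy * (1 - tolerance):
        flagged.append("tool_accuracy")
    if candidate.avg_cost_usd > baseline.avg_cost_usd * (1 + tolerance):
        flagged.append("avg_cost_usd")
    return flagged


old = PromptStats("support-v2.3.1", 4.2, 0.94, 0.012)
new = PromptStats("support-v2.4.0", 3.8, 0.87, 0.018)
print(regressions(old, new))  # ['quality_score', 'tool_accuracy', 'avg_cost_usd']
```

This only works if every trace and metric is tagged with the prompt version that produced it.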

5. Session Replay

When something goes wrong, you need to see exactly what happened:

The user’s messages
The system prompts at each turn
The LLM’s reasoning (if available)
Tool calls and results
The final response

Session replay lets you reconstruct the conversation and understand failures in context.
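At its simplest, replay means filtering stored events by session and ordering them. The sketch below assumes each event is a JSON line with `session_id`, `turn`, `seq`, `event`, and `detail` fields; `seq` and `detail` are hypothetical names chosen for this example.

```python
import json


def replay(log_lines, session_id):
    """Return one session's events as readable lines, ordered by turn and sequence."""
    events = [json.loads(line) for line in log_lines]
    events = [e for e in events if e["session_id"] == session_id]
    events.sort(key=lambda e: (e["turn"], e["seq"]))
    return [f'turn {e["turn"]} [{e["event"]}] {e["detail"]}' for e in events]


# Events arrive out of order and interleaved with other sessions.
raw_events = [
    {"session_id": "sess_abc123", "turn": 2, "seq": 1,
     "event": "tool_call", "detail": "execute_transfer -> FAILED"},
    {"session_id": "sess_other", "turn": 1, "seq": 0,
     "event": "user_message", "detail": "Hello"},
    {"session_id": "sess_abc123", "turn": 2, "seq": 0,
     "event": "user_message", "detail": "Transfer $500 to savings"},
    {"session_id": "sess_abc123", "turn": 1, "seq": 0,
     "event": "user_message", "detail": "What's my account balance?"},
]
log_lines = [json.dumps(e) for e in raw_events]

for line in replay(log_lines, "sess_abc123"):
    print(line)
```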

The Observability Stack for AI

A complete AI observability stack includes several layers:

Layer 1: Infrastructure Metrics

Standard metrics you already collect:

CPU, memory, network utilization
Request latency (p50, p95, p99)
Error rates
Pod health, replica counts

Tools: Prometheus, Grafana, your existing APM

Layer 2: AI-Specific Metrics

Custom metrics for AI workloads:

# Example Prometheus metrics
omnia_agent_tokens_input_total{agent="support", provider="anthropic"}
omnia_agent_tokens_output_total{agent="support", provider="anthropic"}
omnia_agent_cost_usd_total{agent="support", model="claude-sonnet"}
omnia_agent_tool_calls_total{agent="support", tool="lookup_customer", status="success"}
omnia_agent_conversations_active{agent="support"}

Tools: Custom exporters, AI-specific instrumentation
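To show how labeled counters like these accumulate, here is a dependency-free stand-in written with the standard library only. `LabeledCounter` is a hypothetical class for illustration; a real service would more likely use an existing Prometheus client library than hand-render the text format.

```python
from collections import defaultdict


class LabeledCounter:
    """Tiny stand-in for a Prometheus-style counter with labels."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters only go up")
        # Sort labels so the same label set always maps to the same series.
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self):
        """Render one line per label set as: name{labels} value."""
        lines = []
        for labels, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{self.name}{{{label_text}}} {value:g}")
        return "\n".join(lines)


tokens_in = LabeledCounter("omnia_agent_tokens_input_total")
tokens_in.inc(450, agent="support", provider="anthropic")
tokens_in.inc(120, provider="anthropic", agent="support")  # same series
print(tokens_in.render())
# omnia_agent_tokens_input_total{agent="support",provider="anthropic"} 570
```

Keeping the label set small (agent, provider, model, tool, status) matters, because each distinct combination becomes its own time series.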

Layer 3: Distributed Tracing

OpenTelemetry traces with AI-specific spans:

Span: conversation.turn
  +-- omnia.session_id: "sess_abc123"
  +-- omnia.turn_number: 2
  |
  +-- Span: llm.call
  |   +-- llm.provider: "anthropic"
  |   +-- llm.model: "claude-sonnet-4-20250514"
  |   +-- llm.input_tokens: 450
  |   +-- llm.output_tokens: 120
  |   +-- llm.cost_usd: 0.0034
  |
  +-- Span: tool.lookup_customer
      +-- tool.name: "lookup_customer"
      +-- tool.duration_ms: 145
      +-- tool.is_error: false
      +-- tool.result_size: 1240

Tools: OpenTelemetry, Tempo, Jaeger, Honeycomb
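The nesting shown above falls out naturally when spans are opened as context managers. The following is a standard-library sketch of that idea, not the OpenTelemetry API: the `Tracer` class here is hypothetical and records only name, parent, attributes, and duration.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Minimal span recorder; a real setup would use a tracing SDK instead."""

    def __init__(self):
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name, attributes=None):
        record = {
            "name": name,
            "parent": self._stack[-1]["name"] if self._stack else None,
            "attributes": dict(attributes or {}),
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(record)


tracer = Tracer()
with tracer.span("conversation.turn", {"omnia.session_id": "sess_abc123", "omnia.turn_number": 2}):
    with tracer.span("llm.call", {"llm.provider": "anthropic"}) as llm:
        # Token counts are only known after the call returns.
        llm["attributes"]["llm.input_tokens"] = 450
        llm["attributes"]["llm.output_tokens"] = 120
    with tracer.span("tool.lookup_customer", {"tool.name": "lookup_customer"}) as tool:
        tool["attributes"]["tool.is_error"] = False

for s in tracer.finished:
    print(s["name"], "<-", s["parent"])
```

Child spans finish before their parent, so the turn span appears last in `finished`.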

Layer 4: Log Aggregation

Structured logs that enable conversation reconstruction:

{
  "level": "info",
  "session_id": "sess_abc123",
  "turn": 2,
  "event": "llm_response",
  "model": "claude-sonnet-4-20250514",
  "response_preview": "I can help you transfer $500...",
  "tokens": {"input": 450, "output": 120},
  "duration_ms": 340
}

Tools: Loki, Elasticsearch, your existing log aggregation
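A log line of that shape can be produced with Python’s standard `logging` module and a small JSON formatter. In this sketch the `JsonFormatter` class, the `fields` key passed through `extra`, and the logger name are assumptions; the output is written to an in-memory stream so the example is self-contained.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object, merging in structured fields."""

    def format(self, record):
        payload = {"level": record.levelname.lower(), "event": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent.events")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "llm_response",
    extra={"fields": {
        "session_id": "sess_abc123",
        "turn": 2,
        "tokens": {"input": 450, "output": 120},
        "duration_ms": 340,
    }},
)
print(stream.getvalue().strip())
```

Including `session_id` and `turn` on every event is what makes conversation reconstruction possible later.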

Layer 5: AI Quality Monitoring

Specialized monitoring for AI quality:

Response quality scoring (automated or sampled)
Guardrail violation tracking
Hallucination detection
Conversation outcome tracking

Tools: Langfuse (open source), Arize Phoenix (open source), Weights & Biases, Helicone
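When quality scoring is sampled rather than applied to every conversation, the sample should be stable, so that a session is either always or never selected. One common approach, sketched here with a hypothetical `sampled_for_review` helper, is to hash the session ID into a number between 0 and 1.

```python
import hashlib


def sampled_for_review(session_id, sample_rate):
    """Deterministically select roughly `sample_rate` of sessions by hashing the ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


picked = [sid for sid in (f"sess_{i}" for i in range(10_000))
          if sampled_for_review(sid, 0.10)]
print(len(picked))  # close to 1,000
```

Because the decision depends only on the ID, every component in the pipeline can agree on which sessions are in the sample without coordination.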

The Cost Dimension

LLM costs can spiral quickly. Observability must include cost tracking:

Per-Agent Cost Breakdown

Agent: customer-support
  Provider: Anthropic
  Model: claude-sonnet-4-20250514
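  24h Cost: $847.32
    Input tokens: 124.0M ($372.00)
    Output tokens: 32.5M ($487.50)
    Cache savings: -$12.18
  Requests: 23,450
  Avg cost/request: $0.036
  (Illustrative figures, assuming $3 per million input tokens and $15 per million output tokens.)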

Cost Anomaly Detection

Alert when costs exceed expected bounds:

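alert: HighAgentCost
# increase() returns dollars spent over the window; rate() would return dollars per second
expr: sum by (agent) (increase(omnia_agent_cost_usd_total[1h])) > 100
annotations:
  summary: "Agent {{ $labels.agent }} cost exceeding $100/hour"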
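The check behind such an alert is a comparison between the amount spent in the last hour and a threshold. The sketch below works on raw samples of a cumulative cost counter; the function names and sample values are hypothetical, and it deliberately ignores counter resets and the extrapolation that a real metrics backend performs.

```python
def spend_in_window(samples, now, window_s=3600):
    """samples: (unix_ts, cumulative_cost_usd) pairs sorted by time.

    Returns last minus first value inside the window, i.e. dollars spent.
    """
    in_window = [value for ts, value in samples if now - window_s <= ts <= now]
    if len(in_window) < 2:
        return 0.0
    return in_window[-1] - in_window[0]


def should_alert(samples, now, threshold_usd=100.0):
    return spend_in_window(samples, now) > threshold_usd


samples = [(0, 1000.0), (1800, 1040.0), (3600, 1130.0)]
print(spend_in_window(samples, now=3600))  # 130.0
print(should_alert(samples, now=3600))     # True
```

The same data expressed as a per-second rate is about $0.036 per second, so the threshold has to be stated in the same unit as the expression it is compared with.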

Cost Attribution

Know exactly which agents, users, and use cases drive costs:

Top cost drivers (24h):
1. support-agent-prod: $847.32 (23,450 conversations)
2. sales-agent-prod: $312.18 (8,920 conversations)
3. internal-qa-agent: $156.90 (1,200 conversations) <-- suspicious: ~$0.13 per conversation vs ~$0.04 for the others
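What makes the third entry stand out is cost per conversation rather than total cost. A sketch of that check, using the figures above and a hypothetical rule that flags any agent costing more than twice the median per conversation:

```python
from statistics import median


def cost_per_conversation(usage):
    """usage: {agent: (total_cost_usd, conversations)} -> {agent: usd per conversation}."""
    return {agent: cost / convs for agent, (cost, convs) in usage.items()}


def outliers(usage, factor=2.0):
    """Agents whose per-conversation cost exceeds `factor` times the median."""
    per_conv = cost_per_conversation(usage)
    typical = median(per_conv.values())
    return [agent for agent, cost in per_conv.items() if cost > factor * typical]


usage = {
    "support-agent-prod": (847.32, 23_450),
    "sales-agent-prod": (312.18, 8_920),
    "internal-qa-agent": (156.90, 1_200),
}
for agent, cost in cost_per_conversation(usage).items():
    print(f"{agent}: ${cost:.3f}/conversation")
print(outliers(usage))  # ['internal-qa-agent']
```

The factor of two is arbitrary; the point is that normalizing by volume surfaces a small agent that a ranking by total cost would bury.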

Building the Dashboard

An effective AI observability dashboard shows:

Overview Panel

Active conversations
Total cost (24h, projected monthly)
Token usage breakdown
Error rate trend

Agent Health

Per-agent metrics (latency, errors, cost)
Deployment status
Recent changes (prompt versions, config)

Quality Signals

Tool call success rates
Guardrail triggers
Conversation completion rates
Response quality scores (if available)

Cost Intelligence

Cost by agent, provider, model
Cost trends over time
Anomaly highlights

Session Explorer

Search conversations by ID, user, time range
Replay specific sessions
View tool calls and LLM interactions

The Integration Reality

You probably already have observability infrastructure. The question is how to extend it for AI:

Option 1: Extend Existing Tools

Add AI-specific instrumentation to your current stack:

Custom Prometheus exporters for AI metrics
OpenTelemetry SDK for AI-specific spans
Structured logging for conversation events
Grafana dashboards for AI views

Pros: Unified observability, no new tools
Cons: AI-specific features require custom development

Option 2: AI-Specific Observability Platforms

Use specialized tools designed for LLM applications:

Langfuse (open source)
Arize Phoenix (open source)
Weights & Biases
Helicone

Pros: Purpose-built for AI, faster time-to-value
Cons: Another tool in the stack, potential data duplication

Option 3: Hybrid Approach

Infrastructure metrics in existing tools, AI-specific observability in specialized platforms:

Prometheus/Grafana for infrastructure
Langfuse for conversation tracing and quality
Custom cost tracking

Pros: Best of both worlds
Cons: Integration complexity

The Path Forward

AI observability isn’t optional. Without it, you’re operating blind:

You don’t know when quality degrades
You can’t debug user complaints effectively
You can’t correlate prompt changes with outcomes
You can’t control costs

The good news: the building blocks exist. OpenTelemetry is developing semantic conventions for generative AI workloads. Langfuse and Arize Phoenix provide open-source options. Your existing Prometheus/Grafana stack can be extended.

The question isn’t whether you need AI observability. It’s how quickly you can implement it.

Key Takeaways

Traditional APM tools don’t capture AI-specific issues — quality, cost, conversation context
Conversation-level tracing is essential — individual requests don’t tell the full story
AI-specific metrics include tokens, costs, tool execution, and quality signals
Cost observability is critical — LLM costs can spiral without visibility
Session replay enables effective debugging of AI failures
Hybrid approaches often work best — extend existing tools + add AI-specific platforms