The Observability Gap

You’ve deployed an AI agent to production. Users are chatting with it. Sometimes they’re happy. Sometimes they’re not.

How do you know what’s going wrong?

You check your APM dashboard. Response times look fine. Error rates are low. CPU and memory are normal.

But users are complaining. The agent gave wrong information. It called the wrong tool. It hallucinated a policy that doesn’t exist.

Your traditional observability tools didn’t catch any of this. Because they weren’t designed for AI.

Why AI Agents Are Different

Traditional applications have predictable behavior. Given input A, they produce output B. If something goes wrong, you can trace the code path, examine variables, and identify the bug.

AI agents are different:

Non-Deterministic Outputs: The same input can produce different outputs. “Working correctly” isn’t binary — it’s a quality spectrum.

Quality Is Subjective: A response can be technically successful (HTTP 200, no exceptions) but completely wrong for the user’s needs.

Multi-Component Pipelines: A single user message might trigger: prompt assembly -> LLM call -> tool execution -> another LLM call -> response formatting. Issues can occur at any stage.

External Dependencies: LLM APIs are third-party services with their own reliability characteristics, rate limits, and cost implications.

Conversation Context: Problems might only appear in the context of a multi-turn conversation, not individual requests.

Traditional APM tools — Datadog, New Relic, Dynatrace — weren’t designed for this. They can tell you the request succeeded. They can’t tell you the response was helpful.

What AI Observability Requires

1. Conversation-Level Tracing

Individual request tracing isn’t enough. You need traces that span entire conversations:

Conversation: conv_abc123
  +-- Turn 1: "What's my account balance?"
  |   +-- Prompt assembly (2ms)
  |   +-- LLM call (340ms) - 150 tokens in, 45 tokens out
  |   +-- Tool: lookup_account (120ms)
  |   +-- LLM call (280ms) - 200 tokens in, 60 tokens out
  |   +-- Response delivered
  |
  +-- Turn 2: "Transfer $500 to savings"
  |   +-- Prompt assembly (3ms)
  |   +-- LLM call (310ms)
  |   +-- Tool: validate_transfer (80ms)
  |   +-- Tool: execute_transfer (450ms) <-- FAILED
  |   +-- Error response
  |
  +-- Turn 3: "Why did that fail?"
      +-- ...

The issue in Turn 2 only makes sense in the context of the conversation. Isolated request traces miss this.

2. LLM-Specific Metrics

Beyond standard latency and error rates, AI agents need:

Token Metrics:

  • Input tokens per request
  • Output tokens per request
  • Cached tokens (for prompt caching)
  • Context window utilization

Cost Metrics:

  • Cost per LLM call
  • Cost per conversation
  • Cost per agent
  • Cost by provider/model

Quality Signals:

  • Tool call success/failure rates
  • Guardrail trigger rates
  • Response length distribution
  • Conversation completion rates

3. Tool Execution Visibility

When an agent calls a tool, you need to see:

  • Which tool was called
  • What arguments were passed
  • What result was returned
  • How long it took
  • Whether it succeeded or failed
Tool Execution: lookup_customer
  Arguments: {"customer_id": "cust_12345"}
  Duration: 145ms
  Status: Success
  Result: {"name": "John Doe", "status": "active", ...}

This is crucial for debugging. Was the problem the LLM’s decision to call the tool? The arguments it chose? Or the tool execution itself?

4. Prompt Versioning and Correlation

When you update prompts, you need to correlate quality changes:

Prompt: support-v2.3.1
  Deployed: 2025-01-15 14:30
  Quality Score: 4.2/5
  Tool Accuracy: 94%
  Average Cost: $0.012/conversation

Prompt: support-v2.4.0
  Deployed: 2025-01-20 09:00
  Quality Score: 3.8/5 <-- REGRESSION
  Tool Accuracy: 87% <-- REGRESSION
  Average Cost: $0.018/conversation

Without this correlation, you’re flying blind after every prompt change.

5. Session Replay

When something goes wrong, you need to see exactly what happened:

  • The user’s messages
  • The system prompts at each turn
  • The LLM’s reasoning (if available)
  • Tool calls and results
  • The final response

Session replay lets you reconstruct the conversation and understand failures in context.

The Observability Stack for AI

A complete AI observability stack includes several layers:

Layer 1: Infrastructure Metrics

Standard metrics you already collect:

  • CPU, memory, network utilization
  • Request latency (p50, p95, p99)
  • Error rates
  • Pod health, replica counts

Tools: Prometheus, Grafana, your existing APM

Layer 2: AI-Specific Metrics

Custom metrics for AI workloads:

# Example Prometheus metrics
omnia_agent_tokens_input_total{agent="support", provider="anthropic"}
omnia_agent_tokens_output_total{agent="support", provider="anthropic"}
omnia_agent_cost_usd_total{agent="support", model="claude-sonnet"}
omnia_agent_tool_calls_total{agent="support", tool="lookup_customer", status="success"}
omnia_agent_conversations_active{agent="support"}

Tools: Custom exporters, AI-specific instrumentation

Layer 3: Distributed Tracing

OpenTelemetry traces with AI-specific spans:

Span: conversation.turn
  +-- omnia.session_id: "sess_abc123"
  +-- omnia.turn_number: 2
  |
  +-- Span: llm.call
  |   +-- llm.provider: "anthropic"
  |   +-- llm.model: "claude-sonnet-4-20250514"
  |   +-- llm.input_tokens: 450
  |   +-- llm.output_tokens: 120
  |   +-- llm.cost_usd: 0.0034
  |
  +-- Span: tool.lookup_customer
      +-- tool.name: "lookup_customer"
      +-- tool.duration_ms: 145
      +-- tool.is_error: false
      +-- tool.result_size: 1240

Tools: OpenTelemetry, Tempo, Jaeger, Honeycomb

Layer 4: Log Aggregation

Structured logs that enable conversation reconstruction:

{
  "level": "info",
  "session_id": "sess_abc123",
  "turn": 2,
  "event": "llm_response",
  "model": "claude-sonnet-4-20250514",
  "response_preview": "I can help you transfer $500...",
  "tokens": {"input": 450, "output": 120},
  "duration_ms": 340
}

Tools: Loki, Elasticsearch, your existing log aggregation

Layer 5: AI Quality Monitoring

Specialized monitoring for AI quality:

  • Response quality scoring (automated or sampled)
  • Guardrail violation tracking
  • Hallucination detection
  • Conversation outcome tracking

Tools: Langfuse (open source), Arize Phoenix (open source), Weights & Biases, Helicone

The Cost Dimension

LLM costs can spiral quickly. Observability must include cost tracking:

Per-Agent Cost Breakdown

Agent: customer-support
  Provider: Anthropic
  Model: claude-sonnet-4-20250514
  24h Cost: $847.32
    Input tokens: 12.4M ($4.96)
    Output tokens: 3.2M ($48.00)
    Cache savings: -$12.40
  Requests: 23,450
  Avg cost/request: $0.036

Cost Anomaly Detection

Alert when costs exceed expected bounds:

alert: HighAgentCost
expr: rate(omnia_agent_cost_usd_total[1h]) > 100
annotations:
  summary: "Agent {{ $labels.agent }} cost exceeding $100/hour"

Cost Attribution

Know exactly which agents, users, and use cases drive costs:

Top cost drivers (24h):
1. support-agent-prod: $847.32 (23,450 conversations)
2. sales-agent-prod: $312.18 (8,920 conversations)
3. internal-qa-agent: $156.90 (1,200 conversations) <-- suspicious

Building the Dashboard

An effective AI observability dashboard shows:

Overview Panel

  • Active conversations
  • Total cost (24h, projected monthly)
  • Token usage breakdown
  • Error rate trend

Agent Health

  • Per-agent metrics (latency, errors, cost)
  • Deployment status
  • Recent changes (prompt versions, config)

Quality Signals

  • Tool call success rates
  • Guardrail triggers
  • Conversation completion rates
  • Response quality scores (if available)

Cost Intelligence

  • Cost by agent, provider, model
  • Cost trends over time
  • Anomaly highlights

Session Explorer

  • Search conversations by ID, user, time range
  • Replay specific sessions
  • View tool calls and LLM interactions

The Integration Reality

You probably already have observability infrastructure. The question is how to extend it for AI:

Option 1: Extend Existing Tools

Add AI-specific instrumentation to your current stack:

  • Custom Prometheus exporters for AI metrics
  • OpenTelemetry SDK for AI-specific spans
  • Structured logging for conversation events
  • Grafana dashboards for AI views

Pros: Unified observability, no new tools Cons: AI-specific features require custom development

Option 2: AI-Specific Observability Platforms

Use specialized tools designed for LLM applications:

  • Langfuse (open source)
  • Arize Phoenix (open source)
  • Weights & Biases
  • Helicone

Pros: Purpose-built for AI, faster time-to-value Cons: Another tool in the stack, potential data duplication

Option 3: Hybrid Approach

Infrastructure metrics in existing tools, AI-specific observability in specialized platforms:

  • Prometheus/Grafana for infrastructure
  • Langfuse for conversation tracing and quality
  • Custom cost tracking

Pros: Best of both worlds Cons: Integration complexity

The Path Forward

AI observability isn’t optional. Without it, you’re operating blind:

  • You don’t know when quality degrades
  • You can’t debug user complaints effectively
  • You can’t correlate prompt changes with outcomes
  • You can’t control costs

The good news: the building blocks exist. OpenTelemetry is adding AI semantic conventions. Langfuse and Arize provide open-source options. Your existing Prometheus/Grafana stack can be extended.

The question isn’t whether you need AI observability. It’s how quickly you can implement it.


Key Takeaways

  1. Traditional APM tools don’t capture AI-specific issues — quality, cost, conversation context
  2. Conversation-level tracing is essential — individual requests don’t tell the full story
  3. AI-specific metrics include tokens, costs, tool execution, and quality signals
  4. Cost observability is critical — LLM costs can spiral without visibility
  5. Session replay enables effective debugging of AI failures
  6. Hybrid approaches often work best — extend existing tools + add AI-specific platforms