The Observability Gap
You’ve deployed an AI agent to production. Users are chatting with it. Sometimes they’re happy. Sometimes they’re not.
How do you know what’s going wrong?
You check your APM dashboard. Response times look fine. Error rates are low. CPU and memory are normal.
But users are complaining. The agent gave wrong information. It called the wrong tool. It hallucinated a policy that doesn’t exist.
Your traditional observability tools didn’t catch any of this. Because they weren’t designed for AI.
Why AI Agents Are Different
Traditional applications have predictable behavior. Given input A, they produce output B. If something goes wrong, you can trace the code path, examine variables, and identify the bug.
AI agents are different:
Non-Deterministic Outputs: The same input can produce different outputs. “Working correctly” isn’t binary — it’s a quality spectrum.
Quality Is Subjective: A response can be technically successful (HTTP 200, no exceptions) but completely wrong for the user’s needs.
Multi-Component Pipelines: A single user message might trigger: prompt assembly -> LLM call -> tool execution -> another LLM call -> response formatting. Issues can occur at any stage.
External Dependencies: LLM APIs are third-party services with their own reliability characteristics, rate limits, and cost implications.
Conversation Context: Problems might only appear in the context of a multi-turn conversation, not individual requests.
Traditional APM tools — Datadog, New Relic, Dynatrace — weren’t designed for this. They can tell you the request succeeded. They can’t tell you the response was helpful.
What AI Observability Requires
1. Conversation-Level Tracing
Individual request tracing isn’t enough. You need traces that span entire conversations:
Conversation: conv_abc123
+-- Turn 1: "What's my account balance?"
| +-- Prompt assembly (2ms)
| +-- LLM call (340ms) - 150 tokens in, 45 tokens out
| +-- Tool: lookup_account (120ms)
| +-- LLM call (280ms) - 200 tokens in, 60 tokens out
| +-- Response delivered
|
+-- Turn 2: "Transfer $500 to savings"
| +-- Prompt assembly (3ms)
| +-- LLM call (310ms)
| +-- Tool: validate_transfer (80ms)
| +-- Tool: execute_transfer (450ms) <-- FAILED
| +-- Error response
|
+-- Turn 3: "Why did that fail?"
+-- ...
The issue in Turn 2 only makes sense in the context of the conversation. Isolated request traces miss this.
2. LLM-Specific Metrics
Beyond standard latency and error rates, AI agents need:
Token Metrics:
- Input tokens per request
- Output tokens per request
- Cached tokens (for prompt caching)
- Context window utilization
Cost Metrics:
- Cost per LLM call
- Cost per conversation
- Cost per agent
- Cost by provider/model
Quality Signals:
- Tool call success/failure rates
- Guardrail trigger rates
- Response length distribution
- Conversation completion rates
3. Tool Execution Visibility
When an agent calls a tool, you need to see:
- Which tool was called
- What arguments were passed
- What result was returned
- How long it took
- Whether it succeeded or failed
Tool Execution: lookup_customer
Arguments: {"customer_id": "cust_12345"}
Duration: 145ms
Status: Success
Result: {"name": "John Doe", "status": "active", ...}
This is crucial for debugging. Was the problem the LLM’s decision to call the tool? The arguments it chose? Or the tool execution itself?
4. Prompt Versioning and Correlation
When you update prompts, you need to correlate quality changes:
Prompt: support-v2.3.1
Deployed: 2025-01-15 14:30
Quality Score: 4.2/5
Tool Accuracy: 94%
Average Cost: $0.012/conversation
Prompt: support-v2.4.0
Deployed: 2025-01-20 09:00
Quality Score: 3.8/5 <-- REGRESSION
Tool Accuracy: 87% <-- REGRESSION
Average Cost: $0.018/conversation
Without this correlation, you’re flying blind after every prompt change.
5. Session Replay
When something goes wrong, you need to see exactly what happened:
- The user’s messages
- The system prompts at each turn
- The LLM’s reasoning (if available)
- Tool calls and results
- The final response
Session replay lets you reconstruct the conversation and understand failures in context.
The Observability Stack for AI
A complete AI observability stack includes several layers:
Layer 1: Infrastructure Metrics
Standard metrics you already collect:
- CPU, memory, network utilization
- Request latency (p50, p95, p99)
- Error rates
- Pod health, replica counts
Tools: Prometheus, Grafana, your existing APM
Layer 2: AI-Specific Metrics
Custom metrics for AI workloads:
# Example Prometheus metrics
omnia_agent_tokens_input_total{agent="support", provider="anthropic"}
omnia_agent_tokens_output_total{agent="support", provider="anthropic"}
omnia_agent_cost_usd_total{agent="support", model="claude-sonnet"}
omnia_agent_tool_calls_total{agent="support", tool="lookup_customer", status="success"}
omnia_agent_conversations_active{agent="support"}
Tools: Custom exporters, AI-specific instrumentation
Layer 3: Distributed Tracing
OpenTelemetry traces with AI-specific spans:
Span: conversation.turn
+-- omnia.session_id: "sess_abc123"
+-- omnia.turn_number: 2
|
+-- Span: llm.call
| +-- llm.provider: "anthropic"
| +-- llm.model: "claude-sonnet-4-20250514"
| +-- llm.input_tokens: 450
| +-- llm.output_tokens: 120
| +-- llm.cost_usd: 0.0034
|
+-- Span: tool.lookup_customer
+-- tool.name: "lookup_customer"
+-- tool.duration_ms: 145
+-- tool.is_error: false
+-- tool.result_size: 1240
Tools: OpenTelemetry, Tempo, Jaeger, Honeycomb
Layer 4: Log Aggregation
Structured logs that enable conversation reconstruction:
{
"level": "info",
"session_id": "sess_abc123",
"turn": 2,
"event": "llm_response",
"model": "claude-sonnet-4-20250514",
"response_preview": "I can help you transfer $500...",
"tokens": {"input": 450, "output": 120},
"duration_ms": 340
}
Tools: Loki, Elasticsearch, your existing log aggregation
Layer 5: AI Quality Monitoring
Specialized monitoring for AI quality:
- Response quality scoring (automated or sampled)
- Guardrail violation tracking
- Hallucination detection
- Conversation outcome tracking
Tools: Langfuse (open source), Arize Phoenix (open source), Weights & Biases, Helicone
The Cost Dimension
LLM costs can spiral quickly. Observability must include cost tracking:
Per-Agent Cost Breakdown
Agent: customer-support
Provider: Anthropic
Model: claude-sonnet-4-20250514
24h Cost: $847.32
Input tokens: 12.4M ($4.96)
Output tokens: 3.2M ($48.00)
Cache savings: -$12.40
Requests: 23,450
Avg cost/request: $0.036
Cost Anomaly Detection
Alert when costs exceed expected bounds:
alert: HighAgentCost
expr: rate(omnia_agent_cost_usd_total[1h]) > 100
annotations:
summary: "Agent {{ $labels.agent }} cost exceeding $100/hour"
Cost Attribution
Know exactly which agents, users, and use cases drive costs:
Top cost drivers (24h):
1. support-agent-prod: $847.32 (23,450 conversations)
2. sales-agent-prod: $312.18 (8,920 conversations)
3. internal-qa-agent: $156.90 (1,200 conversations) <-- suspicious
Building the Dashboard
An effective AI observability dashboard shows:
Overview Panel
- Active conversations
- Total cost (24h, projected monthly)
- Token usage breakdown
- Error rate trend
Agent Health
- Per-agent metrics (latency, errors, cost)
- Deployment status
- Recent changes (prompt versions, config)
Quality Signals
- Tool call success rates
- Guardrail triggers
- Conversation completion rates
- Response quality scores (if available)
Cost Intelligence
- Cost by agent, provider, model
- Cost trends over time
- Anomaly highlights
Session Explorer
- Search conversations by ID, user, time range
- Replay specific sessions
- View tool calls and LLM interactions
The Integration Reality
You probably already have observability infrastructure. The question is how to extend it for AI:
Option 1: Extend Existing Tools
Add AI-specific instrumentation to your current stack:
- Custom Prometheus exporters for AI metrics
- OpenTelemetry SDK for AI-specific spans
- Structured logging for conversation events
- Grafana dashboards for AI views
Pros: Unified observability, no new tools Cons: AI-specific features require custom development
Option 2: AI-Specific Observability Platforms
Use specialized tools designed for LLM applications:
- Langfuse (open source)
- Arize Phoenix (open source)
- Weights & Biases
- Helicone
Pros: Purpose-built for AI, faster time-to-value Cons: Another tool in the stack, potential data duplication
Option 3: Hybrid Approach
Infrastructure metrics in existing tools, AI-specific observability in specialized platforms:
- Prometheus/Grafana for infrastructure
- Langfuse for conversation tracing and quality
- Custom cost tracking
Pros: Best of both worlds Cons: Integration complexity
The Path Forward
AI observability isn’t optional. Without it, you’re operating blind:
- You don’t know when quality degrades
- You can’t debug user complaints effectively
- You can’t correlate prompt changes with outcomes
- You can’t control costs
The good news: the building blocks exist. OpenTelemetry is adding AI semantic conventions. Langfuse and Arize provide open-source options. Your existing Prometheus/Grafana stack can be extended.
The question isn’t whether you need AI observability. It’s how quickly you can implement it.
Key Takeaways
- Traditional APM tools don’t capture AI-specific issues — quality, cost, conversation context
- Conversation-level tracing is essential — individual requests don’t tell the full story
- AI-specific metrics include tokens, costs, tool execution, and quality signals
- Cost observability is critical — LLM costs can spiral without visibility
- Session replay enables effective debugging of AI failures
- Hybrid approaches often work best — extend existing tools + add AI-specific platforms