AI Observability: The 4 Signals You Are Not Tracking
You've added latency monitoring. You have error rate alerts. You're using Application Insights.
And your AI system is still silently degrading.
The problem: standard APM tooling was designed for deterministic software. AI systems fail in non-deterministic ways that those tools can't detect. Here are the four signals that will show you what's actually happening.
Signal 1: Token Drift
Token usage is not just a cost metric. It's a quality signal.
When your average prompt token count per request climbs over time and the questions users ask have not grown longer, something changed in your context assembly. Common causes:
- Longer documents being indexed
- Conversation history accumulating more turns than expected
- A bug in context trimming logic
- System prompt expanded by another developer
When token counts drop unexpectedly:
- Context trimming is cutting too aggressively
- A bug is dropping chunks before they reach the prompt
- Retrieval is returning fewer results than expected
What to track:
import logging
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass
class TokenMetrics:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    context_chunks: int
    conversation_turns: int
    timestamp: str

def log_token_metrics(response, context_chunks: int, conversation_turns: int):
    # response is a chat completion whose .usage carries the token counts
    metrics = TokenMetrics(
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        total_tokens=response.usage.total_tokens,
        context_chunks=context_chunks,
        conversation_turns=conversation_turns,
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    logging.info({"event": "token_metrics", **asdict(metrics)})
Alert on:
- 7-day rolling average prompt tokens increases by >20%
- Single request exceeds 80% of your context limit
- Completion tokens consistently near `max_tokens` (the model may be truncating its output)
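The first two alert rules can be sketched in a few lines. This is a minimal illustration, not a production alerting pipeline: it assumes you already aggregate the logged metrics into one average prompt-token value per day, and the function names `prompt_token_drift` and `near_context_limit` are hypothetical.

```python
from statistics import fmean
from typing import Optional, Sequence

def prompt_token_drift(daily_avg_prompt_tokens: Sequence[float],
                       window: int = 7) -> Optional[float]:
    """Relative change of the latest window's average versus the window before it.

    Returns None when there is not enough history to compare two full windows.
    """
    if len(daily_avg_prompt_tokens) < 2 * window:
        return None
    previous = fmean(daily_avg_prompt_tokens[-2 * window:-window])
    current = fmean(daily_avg_prompt_tokens[-window:])
    if previous == 0:
        return None
    return (current - previous) / previous

def near_context_limit(prompt_tokens: int, context_limit: int,
                       fraction: float = 0.8) -> bool:
    """True when a single request uses more than `fraction` of the context limit."""
    return prompt_tokens > fraction * context_limit
```

A positive drift above 0.20 corresponds to the ">20%" rule. A large negative value is worth the same attention, since it points at the unexpected-drop causes listed earlier.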
Signal 2: Latency Percentiles, Not Averages
Your average latency is 900ms. Looks fine.
Your P95 is 8 seconds. P99 is 22 seconds.
Those tail latencies represent real users staring at a spinner for 22 seconds. Average latency hides this completely.
Azure OpenAI latency has a fat tail because:
- Backend model servers under load spike unpredictably
- Large prompts (high prompt token counts) increase time-to-first-token significantly
- Streaming helps perceived latency but not actual generation time
What to track:
import functools
import logging
import time
from collections import defaultdict

# Latency samples in milliseconds, grouped by prompt-size bucket
latency_samples = defaultdict(list)

def traced_call(model: str, prompt_tokens: int):
    """Decorator that records wall-clock latency for the wrapped AI call."""
    def decorator(fn):
        @functools.wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Bucket by thousands of prompt tokens: 2,500 tokens -> "2k_tokens"
            bucket = f"{prompt_tokens // 1000}k_tokens"
            latency_samples[bucket].append(elapsed_ms)
            logging.info({
                "event": "ai_latency",
                "model": model,
                "latency_ms": round(elapsed_ms),
                "prompt_tokens": prompt_tokens,
                "bucket": bucket,
            })
            return result
        return wrapper
    return decorator
In Azure Monitor, query the percentiles directly. This assumes you also emit the latency as a custom metric named ai_latency_ms:
customMetrics
| where name == "ai_latency_ms"
| summarize
    p50=percentile(value, 50),
    p95=percentile(value, 95),
    p99=percentile(value, 99),
    avg=avg(value)
    by bin(timestamp, 1h)
| project timestamp, p50, p95, p99, avg
Alert on P95 > 5s, not on average > 2s. Averages always look good until they don't.
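To see why the average misleads, the same percentiles can be computed in-process from raw samples. This is a small sketch using only the standard library; `latency_percentiles` is a hypothetical helper name, and the sample values in the usage note are made up for illustration.

```python
import statistics
from typing import Sequence

def latency_percentiles(samples_ms: Sequence[float]) -> dict:
    """Return avg, P50, P95, and P99 for a list of latency samples in ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

With 90 requests at 500 ms and 10 requests at 9,000 ms, the average is 1,350 ms while P50 is 500 ms and P95 is 9,000 ms. An alert on "average > 2s" never fires; an alert on "P95 > 5s" does.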
Signal 3: Cache Hit Ratio
Azure OpenAI supports prompt caching: if you send the exact same prefix in multiple requests, the cached portion of the prompt is billed at a reduced rate. More importantly, cached prompts respond faster (by up to 50%).
Most teams don't know their cache hit ratio, so they can't tell whether their prompt design is cache-friendly.
Signs of poor caching:
- System prompt changes frequently (kills cache hits)
- Retrieved chunks are injected at the beginning of the prompt (different chunks = different cache)
- User name or timestamp injected early in the prompt
Design for caching:
# BAD: User context early kills caching
messages = [
    {"role": "system", "content": f"You are a helpful assistant for {user_name} at {datetime.now()}."},
    {"role": "user", "content": f"Context:\n{retrieved_chunks}\n\nQuestion: {query}"}
]

# GOOD: Stable content first, dynamic content last
messages = [
    {"role": "system", "content": "You are a helpful assistant for AzureFixes users. Answer based on the provided context."},
    # Stable retrieved context (from common documents, so more cacheable)
    {"role": "user", "content": f"Context:\n{retrieved_chunks}\n\nUser: {user_name}\nQuestion: {query}"}
]
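One rough way to sanity-check a prompt layout before deploying it is to measure how much leading text two rendered prompts share. The sketch below is only a proxy: caching operates on tokens rather than characters, and `render_prompt` and `shared_prefix_chars` are hypothetical helpers, not part of any SDK.

```python
import os

def render_prompt(messages: list[dict]) -> str:
    """Flatten chat messages into one string, in the order they are sent."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

def shared_prefix_chars(prompt_a: str, prompt_b: str) -> int:
    """Number of leading characters that two rendered prompts have in common."""
    # os.path.commonprefix compares character by character, not by path segment
    return len(os.path.commonprefix([prompt_a, prompt_b]))
```

Rendering the same template for two different users and comparing the results shows how early the prompts diverge. If the shared prefix ends inside the system message, the layout leaves little for a prefix cache to reuse.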
Azure OpenAI marks cached prompt tokens in the usage response:
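response = client.chat.completions.create(...)
usage = response.usage
prompt_tokens_details = usage.prompt_tokens_details
# cached_tokens may be missing when nothing was served from the cache
cached = (prompt_tokens_details.cached_tokens or 0) if prompt_tokens_details else 0
# Guard against division by zero on an empty prompt count
cache_hit_ratio = cached / usage.prompt_tokens if usage.prompt_tokens else 0.0
logging.info({"cache_hit_ratio": cache_hit_ratio, "cached_tokens": cached})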
Track this weekly. If your cache hit ratio is under 30%, your prompt design is costing you money and latency.
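For the weekly number, a token-weighted ratio is usually more informative than the mean of per-request ratios, because short prompts that can never be cached would otherwise drag the figure down. A minimal sketch, assuming you have collected `(prompt_tokens, cached_tokens)` pairs from your logs; the function name is hypothetical:

```python
from typing import Iterable, Tuple

def aggregate_cache_hit_ratio(usages: Iterable[Tuple[int, int]]) -> float:
    """Token-weighted cache hit ratio over many requests.

    Each item is a (prompt_tokens, cached_tokens) pair for one request.
    """
    total_prompt = 0
    total_cached = 0
    for prompt_tokens, cached_tokens in usages:
        total_prompt += prompt_tokens
        total_cached += cached_tokens
    # No prompt tokens at all means there was nothing that could be cached
    return total_cached / total_prompt if total_prompt else 0.0
```

Compare the result against the 30% threshold above.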
Signal 4: Retrieval Quality Drift
This is the most dangerous signal to miss because the failure mode is invisible.
Retrieval quality drift happens when:
- Your document corpus changes (new documents indexed differently)
- Azure AI Search is re-indexed with different chunk sizes
- Your embedding model is updated
- Query patterns change (users ask different types of questions)
The AI app continues to return responses. The responses look reasonable. But the retrieved chunks are gradually less relevant, and the model is silently filling the gap with hallucinations.
How to detect it:
Set up a golden dataset — 50–100 representative questions where you know which documents should be retrieved. Run this as a scheduled evaluation:
from datetime import datetime, timezone

def evaluate_retrieval(golden_set: list[dict], search_fn) -> dict:
    if not golden_set:
        raise ValueError("golden_set must contain at least one item")
    results = []
    for item in golden_set:
        query = item["query"]
        expected_doc_id = item["expected_doc_id"]
        retrieved = search_fn(query, top_k=5)
        top_ids = [r["id"] for r in retrieved]
        # Hit@1: the expected document is ranked first
        hit_at_1 = bool(top_ids) and expected_doc_id == top_ids[0]
        # Hit@5: the expected document appears anywhere in the top 5
        hit_at_5 = expected_doc_id in top_ids
        results.append({"hit_at_1": hit_at_1, "hit_at_5": hit_at_5})
    return {
        "recall_at_1": sum(r["hit_at_1"] for r in results) / len(results),
        "recall_at_5": sum(r["hit_at_5"] for r in results) / len(results),
        "n": len(results),
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
Run this evaluation daily in CI/CD. Alert if Recall@5 drops by more than 5 percentage points week-over-week.
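The week-over-week comparison itself is a few lines. The sketch below assumes you store each run's result dictionary and can load the one from a week earlier; `recall_alerts` is a hypothetical name, and the 75% floor is taken from the dashboard table in the next section.

```python
def recall_alerts(current: dict, previous: dict,
                  floor: float = 0.75, max_drop_pp: float = 5.0) -> list[str]:
    """Compare two evaluation results and return human-readable alert messages."""
    alerts = []
    # Recall values are fractions in [0, 1]; convert the difference to percentage points
    drop_pp = (previous["recall_at_5"] - current["recall_at_5"]) * 100
    if drop_pp > max_drop_pp:
        alerts.append(f"Recall@5 dropped {drop_pp:.1f}pp week-over-week")
    if current["recall_at_5"] < floor:
        alerts.append(f"Recall@5 is {current['recall_at_5']:.0%}, below the {floor:.0%} floor")
    return alerts
```

An empty list means the run passed. Anything else can fail the pipeline or page whoever owns the index.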
The Dashboard You Actually Need
| Signal | Metric | Alert threshold |
|---|---|---|
| Token drift | 7-day avg prompt tokens | >20% week-over-week increase |
| Tail latency | P95 response time | >5 seconds |
| Cache efficiency | Cache hit ratio | <30% for repeated user sessions |
| Retrieval quality | Recall@5 on golden set | <75% or >5pp drop week-over-week |
These four metrics catch 80% of production AI quality issues before users escalate them.
Standard APM catches service health. These signals catch AI health. You need both.