AZUREFIXES
DEBUG FASTER. DEPLOY SMARTER.

AI Observability: When Everything Looks Healthy But AI Is Wrong

Your AI app can fail… even when nothing is technically broken.

The server is healthy. The API is fast. The dashboards are all green.

But your chatbot still gives a completely wrong answer to a customer.

And nobody notices.

That's the weird thing about GenAI systems.

In traditional software, failures are obvious:

  • APIs fail
  • Databases crash
  • Latency spikes

But LLM apps fail differently.

They can be fully online, fast, and technically "working"… and still be wrong.

What is AI Observability?

AI observability helps teams understand why an AI system behaved a certain way.

A patient asks:

Do I need to fast before tomorrow's blood test?

The AI replies:

No fasting required.

But fasting was mandatory.

Nothing crashed. No alerts fired. Everything looked normal.

Yet the answer was still wrong.


A key concept is a Trace.

A trace is the full step-by-step journey of an AI request — from the user question, to retrieval, to model response, to the final answer.

| Step | What happens |
| --- | --- |
| User sends question | Input captured |
| Retrieval runs | Relevant chunks fetched from vector store |
| Prompt assembled | Context + question combined |
| LLM responds | Model generates answer |
| Answer returned | User receives response |

Every step is logged. Every step can be inspected.
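To make the idea concrete, here is a minimal sketch of a trace as a data structure, using only the Python standard library. The `Trace` and `Step` classes and the step names are hypothetical, chosen to mirror the steps above; a real system would more likely rely on a tracing library than on a hand-rolled recorder.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    name: str    # e.g. "retrieval" or "llm_response" (illustrative names)
    detail: Any  # whatever was captured at this step
    timestamp: float = field(default_factory=time.time)


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list[Step] = field(default_factory=list)

    def record(self, name: str, detail: Any) -> None:
        """Append one step so it can be inspected later."""
        self.steps.append(Step(name, detail))

    def get(self, name: str) -> Any:
        """Return the detail of the first step with this name, or None."""
        for step in self.steps:
            if step.name == name:
                return step.detail
        return None


trace = Trace()
trace.record("input", "Do I need to fast before my blood test?")
trace.record("retrieval", ["CBC blood panel - no fasting", "General wellness - no fasting"])
trace.record("prompt", "system + retrieved chunks + user question")
trace.record("llm_response", "No fasting required.")
trace.record("answer", "No fasting required.")

for step in trace.steps:
    print(f"{step.name}: {step.detail}")
```

Because every step is stored under a name, a reviewer can pull out just the retrieval step of a bad answer with `trace.get("retrieval")` instead of guessing what the model saw.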

Quality in AI Means

Unlike traditional software where quality = uptime + latency, AI quality means:

  • Useful answers — does it actually help the user?
  • Grounded responses — is the answer based on retrieved context?
  • Fewer hallucinations — does the model invent facts?
  • Safe outputs — does it avoid harmful content?
  • Consistent behavior — does quality hold after prompt changes?

These cannot be measured with CPU graphs or error rates. They require a completely different observability stack.

The Traditional vs AI Observability Gap

Traditional monitoring tells you — server healthy, API fast, error rate near zero, all dashboards green.

But it cannot tell you this:

Assistant responded: "You don't need to fast before your blood test." — Completely wrong. Fasting was mandatory.

Notice the gap. Traditional monitoring could not catch that failure.

The Building Blocks of AI Observability

Traces

Full journey logs. Every request traced end-to-end.

Trace 2025-12-01T09:32:00
  Input: "Do I need to fast before my blood test?"
  Retrieval: 2 chunks fetched
    chunk_1: "CBC blood panel - no fasting"
    chunk_2: "General wellness - no fasting"  <- Wrong chunk retrieved
  Relevant chunk missed:
    "Glucose test - fasting required"         <- Not retrieved
  Prompt: system + 2 chunks + user question
  Output: "No fasting required."

The trace shows that retrieval fetched the wrong documents and missed the one that mattered. The model then answered faithfully from that wrong context.
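A check like this can be automated once a reviewer has labeled which chunk should have been used. The sketch below is one possible approach, not a standard API: the `retrieval_misses` helper and the `expected` label list are assumptions made for illustration.

```python
def retrieval_misses(retrieved: list[str], expected: list[str]) -> list[str]:
    """Return the expected chunks that retrieval failed to fetch."""
    fetched = set(retrieved)
    return [chunk for chunk in expected if chunk not in fetched]


# Chunks recorded in the trace for this request.
retrieved = [
    "CBC blood panel - no fasting",
    "General wellness - no fasting",
]
# Hypothetical label: the chunk a reviewer says should have been used.
expected = ["Glucose test - fasting required"]

missed = retrieval_misses(retrieved, expected)
if missed:
    print("Retrieval failure, missing:", missed)
```

Exact string matching is the simplest choice here. In practice chunks would more likely be compared by document or chunk ID.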

Evaluators

Automated checkers that score responses.

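# Pseudocode: ai_app, the evaluate_* helpers, and flag_for_review stand in
# for your own app and evaluator functions.
response = ai_app.answer(question)
# retrieved_context: the chunks recorded in this request's trace

groundedness_score = evaluate_groundedness(response, retrieved_context)
# Is the answer based on what was actually retrieved?

relevance_score = evaluate_relevance(response, question)
# Does the answer actually address the question?

if groundedness_score < 0.7:
    flag_for_review(response)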

Common evaluators:

| Evaluator | What it checks |
| --- | --- |
| Groundedness | Is the answer based on retrieved context? |
| Relevance | Does it address the actual question? |
| Toxicity | Is the content harmful? |
| Coherence | Is the response logically structured? |
| Similarity | Does it match expected answers? |
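To show the shape of an evaluator, here is a deliberately crude groundedness scorer based on word overlap. This is a toy stand-in, not how production evaluators work (those typically use a judge model). The function names `tokens` and `overlap_groundedness` are made up for this sketch.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately crude normalisation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_groundedness(response: str, context: list[str]) -> float:
    """Fraction of response tokens that also appear in the retrieved context."""
    response_tokens = tokens(response)
    if not response_tokens:
        return 0.0
    context_tokens: set[str] = set()
    for chunk in context:
        context_tokens |= tokens(chunk)
    return len(response_tokens & context_tokens) / len(response_tokens)


THRESHOLD = 0.7  # same cut-off as the groundedness check earlier in this section

context = ["Glucose test - fasting required"]
for response in ["Glucose test: fasting required.", "Drink coffee beforehand."]:
    score = overlap_groundedness(response, context)
    status = "flag for review" if score < THRESHOLD else "ok"
    print(f"{score:.2f}  {status}  {response}")
```

A lexical score like this only measures whether the answer reuses words from the context. It says nothing about whether the right context was retrieved, which is why evaluators and traces are used together.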

Feedback Loops

Human review and automated scoring that feeds back into the system.

  1. User gives a thumbs down on an AI response.
  2. Feedback is logged to the observability system.
  3. Team investigates and finds that the wrong retrieval chunk was used.
  4. Documents are re-split into better chunks and re-embedded.
  5. Quality improves on similar questions.
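The key design point in a feedback loop is that each rating is stored against the trace that produced the answer, so the team can investigate later. A minimal in-memory sketch, assuming hypothetical names (`Feedback`, `log_feedback`, `root_cause_counts`) and a plain list in place of a real observability backend:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    trace_id: str                     # links the rating back to a full trace
    thumbs_up: bool
    root_cause: Optional[str] = None  # filled in after investigation


feedback_log: list[Feedback] = []


def log_feedback(trace_id: str, thumbs_up: bool) -> Feedback:
    """Store one user rating against the trace that produced the answer."""
    entry = Feedback(trace_id, thumbs_up)
    feedback_log.append(entry)
    return entry


def root_cause_counts(log: list[Feedback]) -> Counter:
    """Count diagnosed root causes among thumbs-down entries."""
    return Counter(f.root_cause for f in log if not f.thumbs_up and f.root_cause)


bad = log_feedback("trace-001", thumbs_up=False)
log_feedback("trace-002", thumbs_up=True)

# After the team investigates the thumbs-down trace:
bad.root_cause = "wrong retrieval chunk"

print(root_cause_counts(feedback_log))
```

Counting root causes over time shows whether most failures come from retrieval, the prompt, or the model, which tells the team where a fix will pay off most.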

Azure AI Foundry Observability

Azure AI Foundry, Microsoft's platform for building AI apps, ships with observability built in.

| Capability | What it does |
| --- | --- |
| Tracing | Full end-to-end request traces |
| Evaluation | Built-in quality evaluators |
| Monitoring | Real-time dashboards for AI quality |
| Risk & Safety | Detects harmful content |
| Prompt Flow | Visual pipeline with trace steps visible |

Setting Up Tracing

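from azure.ai.inference.tracing import AIInferenceInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# ConsoleSpanExporter prints spans to stdout; swap in your own exporter
# (for example an OTLP or Azure Monitor exporter) for production.
exporter = ConsoleSpanExporter()

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(SimpleSpanProcessor(exporter))

# Register the provider globally so the instrumentor's spans are exported.
trace.set_tracer_provider(tracer_provider)

# Record prompt and completion text in spans. This can capture sensitive
# data, so enable it deliberately.
AIInferenceInstrumentor().instrument(enable_content_recording=True)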

Calls made through the azure-ai-inference client are now traced automatically.

Running Evaluations

from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator

# azure_openai_config (judge model settings) and project (Azure AI project
# details) are assumed to be defined elsewhere.
evaluators = {
    "groundedness": GroundednessEvaluator(model_config=azure_openai_config),
    "relevance": RelevanceEvaluator(model_config=azure_openai_config),
}

results = evaluate(
    data="test_dataset.jsonl",
    evaluators=evaluators,
    azure_ai_project=project,
    output_path="./evaluation_results.json"
)

# Aggregate scores are returned under the "metrics" key.
for name, value in results["metrics"].items():
    print(f"{name}: {value}")
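The evaluation run reads its inputs from `test_dataset.jsonl`, one JSON object per line. Here is a minimal sketch of producing such a file with the standard library. The column names `query`, `context`, and `response` reflect my understanding of what the built-in evaluators read by default; treat them as an assumption and check the SDK documentation for your version. The `write_jsonl` helper is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows: list[dict], path: Path) -> Path:
    """Write one JSON object per line, the layout a JSONL dataset uses."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


rows = [
    {
        "query": "Do I need to fast before my blood test?",
        "context": "CBC blood panel - no fasting. General wellness - no fasting.",
        "response": "No fasting required.",
    },
]

# A temporary directory keeps this sketch side-effect free; in practice,
# write the file next to your evaluation script.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = write_jsonl(rows, Path(tmp) / "test_dataset.jsonl")
    lines = dataset_path.read_text(encoding="utf-8").splitlines()
    loaded = [json.loads(line) for line in lines]

print(f"{len(loaded)} row(s) written")
```

Building the dataset from real traces (question, retrieved chunks, final answer) keeps the evaluation tied to what users actually asked.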

The Continuous Improvement Loop

  1. Production AI app serves requests.
  2. Traces and scores are collected automatically.
  3. Team reviews low-scoring traces weekly.
  4. Root cause is identified: retrieval, prompt, or model config?
  5. Fix is deployed, followed by a new eval run.
  6. Before and after scores are compared.
  7. Repeat.

This loop is what separates teams that improve AI quality from teams that just hope it works.
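The "compare before/after scores" step can be as simple as a per-metric difference. A minimal sketch, assuming each eval run has been reduced to a dict of metric name to average score; the `compare_runs` helper and the numbers are illustrative only.

```python
def compare_runs(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Per-metric change (after minus before) for metrics present in both runs."""
    return {
        name: round(after[name] - before[name], 4)
        for name in before
        if name in after
    }


# Illustrative scores, not real results.
before = {"groundedness": 0.78, "relevance": 0.76}
after = {"groundedness": 0.84, "relevance": 0.79}

deltas = compare_runs(before, after)
regressions = [name for name, change in deltas.items() if change < 0]

for name, change in deltas.items():
    print(f"{name}: {change:+.2f}")
if regressions:
    print("Regressed:", regressions)
```

Flagging regressions explicitly matters because a fix for one failure mode (say, retrieval) can quietly lower another metric.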

What Good AI Observability Looks Like

Dashboard you actually want:

AI Quality Dashboard    Last 7 Days
Groundedness Score:    0.84  (+0.06)
Relevance Score:       0.79  (+0.03)
Hallucination Rate:    3.2%  (-1.1%)
User Thumbs Up:        78%   (+5%)
Flagged Responses:     12    (-8)

Low-Score Traces This Week:
  - Query: "fasting before blood test"  groundedness: 0.32
  - Query: "medication interactions"    groundedness: 0.41
  - Query: "post-surgery diet"          groundedness: 0.58

Numbers trending in the right direction. Worst traces surfaced for manual review. Root causes visible.

Quick Reference

| Concept | One-line definition |
| --- | --- |
| Trace | Full log of one AI request, step by step |
| Evaluator | Automated scorer for response quality |
| Groundedness | Measure of how much the answer uses retrieved context |
| Hallucination | When the model invents facts not in the context |
| Feedback loop | System for catching errors and improving over time |
| AI Foundry | Microsoft's platform for building and observing AI apps |

What to Do This Week

  1. If you have an LLM app in production — add tracing now. Even basic OpenTelemetry tracing is better than nothing.

  2. Run a groundedness evaluation on 50 recent queries. You will likely find 5–10 where the answer is not supported by the retrieved context.

  3. Set up a weekly review of the bottom 10% of traces. That review alone will surface the biggest quality issues.
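Picking the traces for that weekly review takes only a few lines. A minimal sketch, assuming traces have already been scored and reduced to `(query, groundedness)` pairs; the `bottom_percent` helper is hypothetical, and all scores here are made up for illustration.

```python
def bottom_percent(scored: list[tuple[str, float]], percent: int = 10) -> list[tuple[str, float]]:
    """Return the lowest-scoring traces, at least one when input is non-empty."""
    if not scored:
        return []
    # Integer ceiling division avoids floating-point rounding surprises.
    count = max(1, (len(scored) * percent + 99) // 100)
    return sorted(scored, key=lambda item: item[1])[:count]


# Hypothetical week of scored traces: (query, groundedness score).
scored_traces = [
    ("opening hours", 0.93),
    ("fasting before blood test", 0.32),
    ("parking directions", 0.88),
    ("medication interactions", 0.41),
    ("appointment rescheduling", 0.90),
    ("post-surgery diet", 0.58),
    ("insurance coverage", 0.81),
    ("lab result timing", 0.86),
    ("vaccination schedule", 0.84),
    ("referral process", 0.79),
]

for query, score in bottom_percent(scored_traces):
    print(f"{score:.2f}  {query}")
```

With ten traces, the bottom 10% is a single trace. Raising `percent` widens the review queue without changing anything else.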

AI observability is not a nice-to-have.

It's the difference between knowing your AI works… and hoping it does.