A full-stack AI-powered incident management system that uses Azure OpenAI, Cosmos DB (11 containers, referenced entity model), and Azure AI Search to automatically analyse infrastructure incidents, surface root causes, and recommend remediation actions — built as an end-to-end instructional guide.
Tech Stack
Python, FastAPI, Next.js, Azure OpenAI, Azure AI Search, Cosmos DB, AKS, Helm, Key Vault, Workload Identity, GitHub Actions, Bicep
When an Azure service degrades at 2 AM, the on-call engineer needs answers faster than a Confluence search can provide them. This project builds an AI-powered incident management system that ingests raw incident data, retrieves relevant runbooks and past resolutions from a vector search index, and produces structured analysis with prioritised remediation steps — all without the engineer having to know which runbook to look at first.
The guide is written for cloud engineers who are comfortable with Azure but new to building AI-driven applications. Every architectural decision is explained with an honest trade-off note. Where simpler alternatives exist, they are called out explicitly.
The high-level architecture
High-level architecture: Front Door → APIM → FastAPI/AKS → Cosmos DB + AI Search + Azure OpenAI → Static Web Apps frontend.
Component decisions at a glance:
| Layer | Technology | Why this, not that |
| --- | --- | --- |
| API | FastAPI on AKS | Async streaming, easy Pydantic validation; App Service would work too for lower ops overhead |
| Database | Cosmos DB (NoSQL API) | Flexible schema for evolving incident fields; SQL DB is a valid alternative if you need joins |
| Vector search | Azure AI Search | Hybrid BM25 + vector in one service; Postgres pgvector is simpler if you already run Postgres |
| LLM | Azure OpenAI GPT-4o | Required for Azure compliance; use OpenAI directly if you're not in a regulated environment |
| Frontend | Next.js on Static Web Apps | Static export + API routes; any React framework on App Service would work |
| Auth | Entra ID + Managed Identity | No secrets in code; client credentials are acceptable for internal tools |
Phase 1: Data flow and analysis pipeline
Before writing code, understand what happens to an incident from the moment it is created to the moment the engineer sees a recommended action.
8-stage data flow: incident submission → Pydantic validation → Cosmos write → embed → hybrid search → prompt assembly → GPT-4o → split write to analysisResults + recommendedActions.
The critical design point is stage 8: analysis results and recommended actions are written to separate Cosmos DB containers, not embedded inside the incident document. The next section explains why.
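The eight stages can be sketched as one orchestration function. This is a minimal sketch under stated assumptions, not the guide's implementation: `Pipeline`, `analyse_incident`, and the injected callables (`embed`, `search`, `complete`, `save`) are hypothetical names standing in for the embedding call, the AI Search query, the GPT-4o call, and the Cosmos write. The JSON shape returned by the model (`rootCauseSummary`, `actions`) is also assumed.

```python
import json
import uuid
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Pipeline:
    """Hypothetical bundle of the external calls the pipeline depends on."""
    embed: Callable[[str], Awaitable[list[float]]]               # stage 4
    search: Callable[[str, list[float]], Awaitable[list[dict]]]  # stage 5
    complete: Callable[[str], Awaitable[str]]                    # stage 7
    save: Callable[[str, dict], Awaitable[None]]                 # stages 3 and 8


async def analyse_incident(p: Pipeline, payload: dict) -> dict:
    # Stages 1-2: accept the submission and validate it (Pydantic in the guide).
    for field in ("title", "description", "severity"):
        if not payload.get(field):
            raise ValueError(f"missing field: {field}")
    incident = {"id": str(uuid.uuid4()), **payload}

    # Stage 3: persist the incident before calling any AI service.
    await p.save("incidents", incident)

    # Stages 4-5: embed the text, then run hybrid (keyword + vector) search.
    text = f"{incident['title']}\n{incident['description']}"
    vector = await p.embed(text)
    runbooks = await p.search(text, vector)

    # Stage 6: assemble the prompt from the incident and retrieved runbooks.
    context = "\n---\n".join(r["content"] for r in runbooks)
    prompt = f"Incident:\n{text}\n\nRunbooks:\n{context}"

    # Stage 7: ask the model for a JSON analysis (assumed response shape).
    raw = await p.complete(prompt)
    parsed = json.loads(raw)

    # Stage 8: split write. The analysis and each action go to separate containers.
    analysis = {
        "id": str(uuid.uuid4()),
        "incidentId": incident["id"],
        "rootCauseSummary": parsed["rootCauseSummary"],
        "rawLlmResponse": raw,
    }
    await p.save("analysisResults", analysis)
    for action in parsed.get("actions", []):
        await p.save("recommendedActions", {
            "id": str(uuid.uuid4()),
            "incidentId": incident["id"],
            "analysisResultId": analysis["id"],
            **action,
        })
    return analysis
```

Injecting the four callables keeps the flow testable with fakes; in the real service they would wrap the Azure SDK clients.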
Phase 2: Prerequisites and environment setup
You need the following Azure resources before writing any code. Provision them in this order — some services depend on others being ready first.
# Set your subscription and resource group
SUBSCRIPTION_ID="your-subscription-id"
RESOURCE_GROUP="rg-incident-troubleshooter"
LOCATION="eastus"

az account set --subscription "$SUBSCRIPTION_ID"
az group create --name "$RESOURCE_GROUP" --location "$LOCATION"

az search service create \
  --name search-incident-prod \
  --resource-group "$RESOURCE_GROUP" \
  --location "$LOCATION" \
  --sku Basic

# Note: Basic SKU supports vector search. Standard gives more replicas/partitions.
# For a dev environment, Free SKU (az search service create --sku Free) works
# but is limited to 3 indexes and no SLA.
The original design embedded an AIAnalysis object inside the Incident document. This works when analysis results are small and stable, but it hits Cosmos DB's 2 MB document limit when incidents accumulate many log entries and recommended actions over time.
The referenced design uses separate containers per entity. Each related entity stores an incidentId field and is retrieved with one query per container when building the composite response (Cosmos DB has no cross-container joins).
# backend/app/models/enums.py
from enum import Enum


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class IncidentStatus(str, Enum):
    NEW = "New"
    IN_PROGRESS = "InProgress"
    RESOLVED = "Resolved"
    CLOSED = "Closed"
# backend/app/models/incident.py
from __future__ import annotations

from datetime import datetime
from typing import Optional
from pydantic import BaseModel, Field
import uuid

from .enums import Severity, IncidentStatus


class User(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    email: str
    displayName: str
    role: str  # "engineer" | "manager" | "readonly"
    createdAt: datetime = Field(default_factory=datetime.utcnow)


class IncidentCreate(BaseModel):
    title: str
    description: str
    severity: Severity
    affectedService: str
    affectedRegion: str
    reportedBy: str  # user id


class Incident(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    title: str
    description: str
    severity: Severity
    status: IncidentStatus = IncidentStatus.NEW
    affectedService: str
    affectedRegion: str
    reportedBy: str
    createdAt: datetime = Field(default_factory=datetime.utcnow)
    updatedAt: datetime = Field(default_factory=datetime.utcnow)
    resolvedAt: Optional[datetime] = None


class IncidentLog(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    incidentId: str
    message: str
    logLevel: str  # "info" | "warning" | "error"
    source: str  # e.g. "AKS node pool", "API Management"
    timestamp: datetime = Field(default_factory=datetime.utcnow)
    createdBy: str  # user id


class AnalysisResult(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    incidentId: str
    rootCauseSummary: str
    confidence: float  # 0.0 – 1.0; GPT-4o is instructed to express uncertainty
    retrievedRunbooks: list[str]  # runbook IDs used as context
    rawLlmResponse: str  # full GPT-4o output for audit trail
    analysedAt: datetime = Field(default_factory=datetime.utcnow)
    modelVersion: str = "gpt-4o-2024-11-20"


class RecommendedAction(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    incidentId: str
    analysisResultId: str
    priority: int  # 1 = highest; GPT-4o assigns 1–5
    title: str
    description: str
    estimatedMinutes: Optional[int] = None
    runbookReference: Optional[str] = None  # runbook ID if applicable
    status: str = "pending"  # "pending" | "in_progress" | "done" | "skipped"


class IncidentComponent(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    incidentId: str
    componentName: str
    componentType: str  # "AKS", "APIM", "CosmosDB", "OpenAI", etc.
    isAffected: bool
    notes: Optional[str] = None


class Runbook(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    title: str
    content: str  # full runbook text, chunked at index time
    service: str  # "AKS" | "CosmosDB" | "OpenAI" | ...
    tags: list[str] = []
    version: str = "1.0"
    createdAt: datetime = Field(default_factory=datetime.utcnow)
    updatedAt: datetime = Field(default_factory=datetime.utcnow)


class IncidentRunbook(BaseModel):
    """Junction entity: which runbooks were linked to which incident."""
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    incidentId: str
    runbookId: str
    linkedAt: datetime = Field(default_factory=datetime.utcnow)
    linkedBy: str  # "ai" or user id


class Feedback(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    incidentId: str
    analysisResultId: str
    rating: int  # 1–5; used to fine-tune prompt engineering over time
    comment: Optional[str] = None
    createdBy: str
    createdAt: datetime = Field(default_factory=datetime.utcnow)


class Tag(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    name: str  # e.g. "networking", "storage", "authentication"
    colour: str  # hex colour for UI badge


class IncidentTag(BaseModel):
    """Junction entity: many-to-many between incidents and tags."""
    id: str = Field(default_factory=lambda: str(uuid.uuid4()))
    incidentId: str
    tagId: str


# Composite response model — assembled from multiple containers
class IncidentDetail(BaseModel):
    incident: Incident
    logs: list[IncidentLog] = []
    analysisResult: Optional[AnalysisResult] = None
    recommendedActions: list[RecommendedAction] = []
    components: list[IncidentComponent] = []
    linkedRunbooks: list[IncidentRunbook] = []
    tags: list[Tag] = []
    feedback: list[Feedback] = []
Honest trade-off: Referenced entities are not idiomatic Cosmos DB. Cosmos DB is optimised for point reads on a single document with a known ID and partition key. Reading related entities from other containers (for example, all AnalysisResult documents for an incidentId) requires a query per container, which is slower and costs more RUs than a point read on an embedded document. The referenced design is justified here because: (1) recommended actions can number in the dozens per incident, (2) incident logs can be unbounded, and (3) feedback and runbook links need to be queryable independently. For incidents that will never exceed ~20 analysed items, the embedded design is simpler and cheaper.
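To judge which side of this trade-off an incident falls on, it can help to measure how large the embedded form would be. The following is a minimal sketch, assuming the 2 MB item limit cited above. The names `embedded_size_bytes` and `should_use_references` and the 50% headroom threshold are illustrative choices, not part of the guide, and the serialised JSON length only approximates the size Cosmos DB accounts for.

```python
import json

COSMOS_ITEM_LIMIT_BYTES = 2 * 1024 * 1024  # 2 MB item limit cited in this guide


def embedded_size_bytes(incident: dict, logs: list[dict], actions: list[dict]) -> int:
    """Approximate size of the incident if logs and actions were embedded in it."""
    doc = {**incident, "logs": logs, "recommendedActions": actions}
    # Compact separators avoid counting whitespace; UTF-8 length gives bytes.
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def should_use_references(size_bytes: int, headroom: float = 0.5) -> bool:
    """True when the embedded document would exceed `headroom` of the item limit."""
    return size_bytes > COSMOS_ITEM_LIMIT_BYTES * headroom
```

Running this over a sample of real incidents gives a rough picture of how many would come close to the limit if embedded.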
Phase 5: Configuration and secrets
# backend/app/config.py
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from pydantic_settings import BaseSettings
import os


class Settings(BaseSettings):
    cosmos_endpoint: str = ""
    cosmos_key: str = ""
    cosmos_database: str = "IncidentDB"
    openai_endpoint: str = ""
    openai_api_key: str = ""
    openai_embedding_deployment: str = "text-embedding-3-small"
    openai_chat_deployment: str = "gpt-4o"
    search_endpoint: str = ""
    search_key: str = ""
    search_index_name: str = "runbooks"
    key_vault_url: str = ""

    class Config:
        env_file = ".env"


def load_from_key_vault(settings: Settings) -> Settings:
    """Load secrets from Key Vault when running in Azure (Managed Identity available)."""
    if not settings.key_vault_url:
        return settings
    try:
        credential = DefaultAzureCredential()
        client = SecretClient(vault_url=settings.key_vault_url, credential=credential)
        secrets = {
            "cosmos-endpoint": "cosmos_endpoint",
            "cosmos-key": "cosmos_key",
            "openai-endpoint": "openai_endpoint",
            "openai-api-key": "openai_api_key",
            "search-endpoint": "search_endpoint",
            "search-key": "search_key",
        }
        for secret_name, attr in secrets.items():
            try:
                value = client.get_secret(secret_name).value
                object.__setattr__(settings, attr, value)
            except Exception:
                pass  # secret not present; environment variable takes precedence
    except Exception:
        # Managed Identity not available (local dev); use .env values
        pass
    return settings


_settings: Settings | None = None


def get_settings() -> Settings:
    global _settings
    if _settings is None:
        _settings = load_from_key_vault(Settings())
    return _settings
Phase 6: Cosmos DB — the logical data model
Logical data model: 11 Cosmos containers, referenced entities. Incident is the central aggregate; AnalysisResult and RecommendedAction are written after GPT-4o analysis; IncidentRunbook and IncidentTag are junction containers for many-to-many relationships.
Create all 11 containers with appropriate partition keys:
Containers where you query "all items for an incident" (incidentLogs, analysisResults, recommendedActions, incidentComponents, incidentRunbooks, feedback, incidentTags) use /incidentId. This means all related documents land in the same logical partition — a single cross-document query within one partition is efficient and predictable.
Containers that are looked up by their own ID (incidents, runbooks, tags, users) use /id. These are point reads: you know the ID, so you pay for a single item read (about 1 RU for a 1 KB item) instead of a query.
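The partition-key rule can be written down as a small table and applied in a loop. This is a sketch under assumptions: the container names mirror the ones listed in this section, `create_containers` is a hypothetical helper, and the 400 RU/s default matches the figure used in this guide. It expects a synchronous `azure.cosmos` database client and imports the SDK lazily so the mapping can be inspected without it.

```python
# Containers queried "by incident" share the /incidentId partition key;
# containers looked up by their own ID use /id.
BY_INCIDENT = [
    "incidentLogs", "analysisResults", "recommendedActions",
    "incidentComponents", "incidentRunbooks", "feedback", "incidentTags",
]
BY_ID = ["incidents", "runbooks", "tags", "users"]

PARTITION_KEY_PATHS: dict[str, str] = {
    **{name: "/incidentId" for name in BY_INCIDENT},
    **{name: "/id" for name in BY_ID},
}


def create_containers(database, throughput: int = 400) -> list[str]:
    """Create every container if it does not already exist; return the names."""
    from azure.cosmos import PartitionKey  # requires the azure-cosmos package

    for name, path in PARTITION_KEY_PATHS.items():
        database.create_container_if_not_exists(
            id=name,
            partition_key=PartitionKey(path=path),
            offer_throughput=throughput,
        )
    return list(PARTITION_KEY_PATHS)
```

On a serverless account the `offer_throughput` argument would need to be dropped, since serverless containers do not take provisioned throughput.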
The only expensive operation in this design is building the IncidentDetail composite: it fires one query per related container (8 queries). At 400 RU/s per container, this is fine for low-volume incident management. At high volume, consider pre-materialising the composite into a read-optimised container using a Change Feed trigger.
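Because these reads are independent, they can be issued concurrently, so the composite costs roughly one round trip of latency even though the RU charge is unchanged. The sketch below assumes a service object exposing `get_by_id` and `list_by_incident` coroutines like the ones in this guide's Cosmos service layer. `build_incident_detail` and `RELATED_CONTAINERS` are hypothetical names, and the function returns plain dicts, not the `IncidentDetail` model. It issues one point read plus seven partition-scoped queries; resolving each incidentTags row to its Tag document is left out.

```python
import asyncio
from typing import Optional

# Field name in the composite -> container that holds the related documents.
RELATED_CONTAINERS = {
    "logs": "incidentLogs",
    "analysisResults": "analysisResults",
    "recommendedActions": "recommendedActions",
    "components": "incidentComponents",
    "linkedRunbooks": "incidentRunbooks",
    "feedback": "feedback",
    "incidentTags": "incidentTags",
}


async def build_incident_detail(service, incident_id: str) -> Optional[dict]:
    """Fetch the incident and all related documents concurrently."""
    incident_read = service.get_by_id("incidents", incident_id, incident_id)
    related_reads = [
        service.list_by_incident(container, incident_id)
        for container in RELATED_CONTAINERS.values()
    ]
    # gather preserves argument order, so results line up with RELATED_CONTAINERS.
    incident, *related = await asyncio.gather(incident_read, *related_reads)
    if incident is None:
        return None
    detail = {"incident": incident}
    detail.update(zip(RELATED_CONTAINERS.keys(), related))
    return detail
```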
Phase 7: Cosmos DB service layer
# backend/app/services/cosmos_service.py
from azure.cosmos.aio import CosmosClient
from azure.cosmos import exceptions as cosmos_exceptions
from typing import Any, Optional

from ..config import get_settings
from ..models.incident import (
    Incident, IncidentLog, AnalysisResult, RecommendedAction,
    IncidentComponent, IncidentRunbook, Feedback, Tag, IncidentTag,
    User, Runbook
)

# Container name → partition key field
CONTAINER_PARTITION_KEYS: dict[str, str] = {
    "incidents": "id",
    "incidentLogs": "incidentId",
    "analysisResults": "incidentId",
    "recommendedActions": "incidentId",
    "incidentComponents": "incidentId",
    "runbooks": "id",
    "incidentRunbooks": "incidentId",
    "feedback": "incidentId",
    "tags": "id",
    "incidentTags": "incidentId",
    "users": "id",
}


class CosmosService:
    def __init__(self):
        settings = get_settings()
        self._client = CosmosClient(settings.cosmos_endpoint, settings.cosmos_key)
        self._database_name = settings.cosmos_database
        self._db = None

    async def _get_container(self, container_name: str):
        if self._db is None:
            self._db = self._client.get_database_client(self._database_name)
        return self._db.get_container_client(container_name)

    async def upsert(self, container_name: str, item: Any) -> dict:
        """Upsert a Pydantic model into the named container."""
        container = await self._get_container(container_name)
        doc = item.model_dump(mode="json")
        return await container.upsert_item(doc)

    async def get_by_id(self, container_name: str, item_id: str, partition_key: str) -> Optional[dict]:
        try:
            container = await self._get_container(container_name)
            return await container.read_item(item=item_id, partition_key=partition_key)
        except cosmos_exceptions.CosmosResourceNotFoundError:
            return None

    async def list_by_incident(self, container_name: str, incident_id: str) -> list[dict]:
        """Return all documents in container_name where incidentId = incident_id."""
        container = await self._get_container(container_name)
        query = "SELECT * FROM c WHERE c.incidentId = @incidentId"
        parameters = [{"name": "@incidentId", "value": incident_id}]
        items = []
        async for item in container.query_items(
            query=query,
            parameters=parameters,
            partition_key=incident_id,
        ):
            items.append(item)
        return items

    async def list_all(self, container_name: str, max_items: int = 100) -> list[dict]:
        """Scan all documents in a container — use sparingly (full partition scan)."""
        container = await self._get_container(container_name)
        items = []
        async for item in container.query_items(
            query="SELECT * FROM c ORDER BY c._ts DESC OFFSET 0 LIMIT @limit",
            parameters=[{"name": "@limit", "value": max_items}],
            enable_cross_partition_query=True,
        ):
            items.append(item)
        return items

    async def upsert_incident(self, incident: Incident) -> dict:
        return await self.upsert("incidents", incident)

    async def upsert_log(self, log: IncidentLog) -> dict:
        return await self.upsert("incidentLogs", log)

    async def upsert_analysis(self, result: AnalysisResult) -> dict:
        return await self.upsert("analysisResults", result)

    async def upsert_actions(self, actions: list[RecommendedAction]) -> list[dict]:
        return [await self.upsert("recommendedActions", a) for a in actions]

    async def upsert_component(self, component: IncidentComponent) -> dict:
        return await self.upsert("incidentComponents", component)

    async def upsert_runbook_link(self, link: IncidentRunbook) -> dict:
        return await self.upsert("incidentRunbooks", link)

    async def upsert_feedback(self, fb: Feedback) -> dict:
        return await self.upsert("feedback", fb)

    async def upsert_tag(self, tag: Tag) -> dict:
        return await self.upsert("tags", tag)

    async def upsert_incident_tag(self, it: IncidentTag) -> dict:
        return await self.upsert("incidentTags", it)

    async def close(self):
        await self._client.close()
Phase 8: Azure AI Search — runbook index
Before the RAG pipeline can retrieve runbooks, you need to create the search index and populate it with chunked runbook content.
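Chunking can be as simple as a sliding window over words. The following is a minimal sketch, not the guide's indexing code: `chunk_runbook`, the 200-word window, the 40-word overlap, and the output field names are illustrative assumptions. Production pipelines often split on headings or token counts instead.

```python
def chunk_runbook(runbook_id: str, content: str,
                  max_words: int = 200, overlap: int = 40) -> list[dict]:
    """Split runbook text into overlapping word windows ready for indexing."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = content.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for n, start in enumerate(range(0, len(words), step)):
        window = words[start:start + max_words]
        chunks.append({
            "id": f"{runbook_id}-{n}",   # deterministic ID per chunk
            "runbookId": runbook_id,     # lets a search hit map back to its runbook
            "chunkIndex": n,
            "content": " ".join(window),
        })
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a remediation step that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some text twice.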
No client secrets in code or environment variables. Key Vault holds all service credentials. Workload Identity lets the pod fetch them via OIDC federation — no COSMOS_KEY in the Kubernetes Secret manifest.
Prompt injection awareness. The system prompt instructs GPT-4o to answer only from retrieved context and to flag uncertainty. For critical-severity incidents, route the incident description through Azure Content Safety before embedding to detect adversarial prompt injections in engineer-submitted text.
Audit trail. The raw GPT-4o response is stored in AnalysisResult.rawLlmResponse. If an AI recommendation causes an incorrect remediation action, you can audit exactly what the model said. Do not delete this field.
Private endpoints. Cosmos DB and AI Search data-plane endpoints are accessible only within the VNet. The AKS pod connects over a private IP. The public endpoint is disabled on both services.
Cosmos DB provisioned throughput (400 RU/s × 11 containers) adds up quickly. For a tool used intermittently, switch all containers to serverless mode — cost scales to zero when idle and is typically 60–80% cheaper for low-volume workloads. The trade-off is no throughput guarantee and no multi-region writes in serverless mode.
Phase 17: Monitoring and alerting
# Create a Log Analytics workspace for AKS diagnostics
az monitor log-analytics workspace create \
  --resource-group "$RESOURCE_GROUP" \
  --workspace-name log-incident-prod \
  --location "$LOCATION"

# Send AKS control-plane logs and metrics to the workspace via diagnostic settings
AKS_ID=$(az aks show --name aks-incident-prod --resource-group "$RESOURCE_GROUP" --query id -o tsv)
LOG_ID=$(az monitor log-analytics workspace show --workspace-name log-incident-prod --resource-group "$RESOURCE_GROUP" --query id -o tsv)

az monitor diagnostic-settings create \
  --resource "$AKS_ID" \
  --name diag-aks-incident \
  --workspace "$LOG_ID" \
  --metrics '[{"category": "AllMetrics", "enabled": true}]' \
  --logs '[{"category": "kube-apiserver", "enabled": true}, {"category": "kube-controller-manager", "enabled": true}]'
Key metrics to alert on:
| Metric | Alert threshold | Action |
| --- | --- | --- |
| GPT-4o error rate (5xx from OpenAI) | > 5% in 5 min | Page on-call; check OpenAI service health |
| Cosmos DB 429 (throttled) | Any in 5 min | Scale up RU/s or switch to serverless |
| AI Search latency (p99) | > 2000 ms | Check index fragmentation; consider Standard tier |
| AKS pod restart count | > 3 in 10 min | Check OOM kills; increase memory limits |
| Analysis failure rate | > 10% in 15 min | Check GPT-4o deployment; review SYSTEM_PROMPT |
What you'd do differently in production
This guide makes several choices that are appropriate for learning but need revisiting before a production hardening review:
Cosmos DB serverless vs. provisioned throughput. The guide provisions 400 RU/s per container for simplicity. In production, audit actual usage patterns first. Most incident tools are bursty — serverless is cheaper unless you have sustained high traffic.
GPT-4o JSON mode reliability. Setting response_format: {"type": "json_object"} reduces but does not eliminate malformed responses. Add a retry with exponential backoff (up to 3 attempts) before surfacing a parse error to the user.
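The retry described above might look like the following. This is a sketch under assumptions: `call_model` is a hypothetical coroutine that returns the raw model text (for example, a thin wrapper around the chat completion call), `complete_json` is an illustrative name, and the default delays of 1 s and 2 s are arbitrary.

```python
import asyncio
import json
from typing import Awaitable, Callable, Optional


async def complete_json(call_model: Callable[[], Awaitable[str]],
                        max_attempts: int = 3,
                        base_delay: float = 1.0) -> dict:
    """Call the model and parse its output as a JSON object, retrying on failure."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        raw = await call_model()
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                return parsed
            last_error = ValueError("model returned JSON that is not an object")
        except json.JSONDecodeError as exc:
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay * 2**attempt (1 s, then 2 s by default).
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Only parse failures are retried here; transport errors raised by `call_model` propagate immediately and would need their own handling.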
The list_all cross-partition scan. The GET /incidents/ endpoint does a full cross-partition scan limited to 100 items. This is fine at low volume. At 10,000+ incidents, add a read-optimised index container (updated via Cosmos Change Feed) with summary fields only, sorted by createdAt.
Runbook versioning. When a runbook is updated, existing IncidentRunbook links still point to the old content. Either version runbooks explicitly (immutable chunks, new IDs on update) or add a runbookVersion field to the junction entity.
Confidence score calibration. GPT-4o's self-reported confidence is not calibrated. A score of 0.8 does not mean 80% accuracy. Treat it as a relative indicator (higher = model found strong runbook matches) rather than an absolute probability. Build a feedback loop using the Feedback entity to measure whether high-confidence recommendations actually resolved incidents.
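One way to start that feedback loop is to bucket past analyses by reported confidence and compare each bucket against the ratings it received. The sketch below assumes each record pairs an `AnalysisResult.confidence` with a `Feedback.rating` and treats a rating of 4 or 5 as "helpful". That threshold, the five buckets, and the `calibration_table` name are assumptions.

```python
from collections import defaultdict


def calibration_table(records: list[tuple[float, int]],
                      buckets: int = 5,
                      helpful_rating: int = 4) -> dict[str, dict]:
    """Group (confidence, rating) pairs into equal-width confidence buckets.

    Returns, per bucket, the number of analyses and the share rated helpful.
    """
    counts = defaultdict(lambda: {"n": 0, "helpful": 0})
    for confidence, rating in records:
        # Clamp so that confidence == 1.0 falls into the top bucket.
        index = min(int(confidence * buckets), buckets - 1)
        low, high = index / buckets, (index + 1) / buckets
        key = f"{low:.1f}-{high:.1f}"
        counts[key]["n"] += 1
        counts[key]["helpful"] += rating >= helpful_rating
    return {
        key: {"n": c["n"], "helpful_rate": c["helpful"] / c["n"]}
        for key, c in sorted(counts.items())
    }
```

If the helpful rate does not rise with the confidence bucket, the score is not even a reliable relative indicator and should probably be hidden from the UI.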