A 20-phase end-to-end guide to building a production-grade Retrieval-Augmented Generation chatbot on Azure: FastAPI backend, Next.js frontend, Azure AI Search vector store, GPT-4o, AKS deployment, Workload Identity, GitHub Actions CI/CD, and Container Insights monitoring. Targets cloud beginners — every command explained, every concept introduced before use.
Tech Stack
Python, FastAPI, Next.js, Azure OpenAI, Azure AI Search, AKS, Helm, Docker, ACR, Key Vault, Blob Storage, Workload Identity, GitHub Actions, Terraform
What You Will Build
By the end of this guide you will have a working chatbot that answers questions using your own documents. Upload a PDF or DOCX file, ask the chatbot a question about it, and GPT-4o will answer using only the content in that file — not its training data.
The stack:
FastAPI backend with two endpoints: /upload (ingest documents) and /chat (RAG retrieval + GPT completion)
Next.js frontend with a chat UI and file upload interface
Azure AI Search as the vector store — stores document chunks + embeddings, returns the most relevant chunks for each query
Azure OpenAI for embeddings (text-embedding-3-small) and chat completion (gpt-4o)
Azure Blob Storage for raw document storage
AKS (Azure Kubernetes Service) for container orchestration
Key Vault for secrets — no passwords in code or environment variables
GitHub Actions for CI/CD — test, scan, build, push, deploy
The Architecture
Full architecture: Next.js + FastAPI on AKS, private VNet, Azure AI Search vector store, Azure OpenAI GPT-4o, Key Vault via Workload Identity.
How RAG works in one paragraph: When a user uploads a document, the backend splits it into chunks (~512 tokens each), sends each chunk to the embeddings API to get a vector representation, and stores both the text and the vector in AI Search. When a user sends a chat message, the backend embeds the question, asks AI Search to return the most similar chunks, then builds a prompt that contains those chunks as context and asks GPT-4o to answer the question using only that context. This reduces hallucination: the model is instructed to answer only from your documents, though that instruction is a strong nudge rather than a hard guarantee.
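The retrieval step described above can be sketched in a few lines of plain Python. This is a toy illustration, not the guide's backend: the three-dimensional vectors, the `TOY_INDEX` contents, and the function names are made up for the example, whereas the real system gets 1536-dimensional vectors from the embeddings API and lets AI Search do the ranking.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float], indexed: list[dict], k: int = 2) -> list[dict]:
    """Return the k indexed chunks whose embeddings are closest to the query."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Put the retrieved chunks in front of the question as context."""
    context = "\n\n".join(c["content"] for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical pre-embedded chunks (real embeddings have 1536 dimensions).
TOY_INDEX = [
    {"content": "Refunds are issued within 14 days.", "embedding": [0.9, 0.1, 0.0]},
    {"content": "Our office is in Berlin.", "embedding": [0.0, 0.2, 0.9]},
    {"content": "Refund requests need an order number.", "embedding": [0.8, 0.3, 0.1]},
]

query_vector = [1.0, 0.2, 0.0]  # stand-in for the embedded question
best = top_k_chunks(query_vector, TOY_INDEX, k=2)
prompt = build_prompt("How do refunds work?", best)
print(prompt)
```

Only the two refund-related chunks reach the prompt, so the model never sees the unrelated office address.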
Prerequisites
Before running any commands, you need:
Requirement | Minimum version / detail
Azure subscription | Owner role, or Contributor plus User Access Administrator (later phases create role assignments, which Contributor alone cannot do)
Azure CLI | 2.60+ (az version)
kubectl | 1.29+ (kubectl version --client)
Helm | 3.14+ (helm version)
Docker | 24+ (docker version)
Python | 3.11+ (python --version)
Node.js | 20 LTS (node --version)
GitHub account | For CI/CD; the free tier is fine
Register Azure resource providers (one-time per subscription):
az provider register --namespace Microsoft.ContainerService      # AKS
az provider register --namespace Microsoft.ContainerRegistry     # ACR
az provider register --namespace Microsoft.KeyVault
az provider register --namespace Microsoft.Storage
az provider register --namespace Microsoft.CognitiveServices     # Azure OpenAI
az provider register --namespace Microsoft.Search                # AI Search
az provider register --namespace Microsoft.Network
az provider register --namespace Microsoft.Insights
az provider register --namespace Microsoft.OperationalInsights
Phase 1 — Resource Group and Variables
Create a resource group to hold everything. Using a single resource group for this project makes cleanup easy — one az group delete removes everything.
# Set these once; every command below references them
LOCATION="eastus"
RG="rg-rag-chatbot"
PREFIX="ragbot"
SUBSCRIPTION=$(az account show --query id -o tsv)

az group create \
  --name $RG \
  --location $LOCATION \
  --tags project=rag-chatbot environment=dev

echo "Resource group created: $RG in $LOCATION"
Why East US? Azure OpenAI GPT-4o and text-embedding-3-small are available in East US. Check Azure OpenAI region availability before choosing a different region — not all models are available everywhere.
Phase 2 — Azure Blob Storage
Blob Storage stores the raw uploaded documents before they are chunked and indexed.
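--allow-blob-public-access false is critical. Without it, a container could later be switched to anonymous access, and anyone who guesses a blob URL could then download your users' uploaded documents. Set it when you create the account: turning it off later cuts off any client that has come to depend on anonymous reads.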
Phase 3 — Azure AI Search
AI Search stores the document chunks and their vector embeddings. When a user asks a question, AI Search finds the most semantically similar chunks using vector search.
SEARCH_NAME="${PREFIX}-search"

az search service create \
  --name $SEARCH_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --sku basic \
  --partition-count 1 \
  --replica-count 1

# Wait for provisioning (~2 minutes)
az search service show \
  --name $SEARCH_NAME \
  --resource-group $RG \
  --query provisioningState

# Get admin key (needed to create the index)
SEARCH_ADMIN_KEY=$(az search admin-key show \
  --service-name $SEARCH_NAME \
  --resource-group $RG \
  --query primaryKey -o tsv)

echo "Search endpoint: https://${SEARCH_NAME}.search.windows.net"
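SKU choice: Basic supports vector search, which this guide's retrieval depends on, and semantic ranking, which is an optional extra. The Free tier has tight storage and index limits and no SLA, so it only suits small experiments; check the current service limits before relying on it. For production use Standard S1 or higher.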
Create the search index schema — this defines the fields that each document chunk will have:
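A minimal sketch of such a schema, inferred from the fields that ingest.py writes later in this guide and expressed as the JSON body accepted by the AI Search REST API (assuming the 2023-11-01 or later API shape). The profile and algorithm names (`hnsw-profile`, `hnsw-config`) are hypothetical, and the per-field attributes are assumptions rather than the guide's exact settings.

```python
import json

EMBEDDING_DIMENSIONS = 1536  # default output size of text-embedding-3-small


def build_index_definition(index_name: str = "documents") -> dict:
    """Return an index definition with one vector field and five plain fields."""
    return {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "source_file", "type": "Edm.String", "filterable": True},
            {"name": "page_number", "type": "Edm.Int32", "filterable": True},
            {"name": "chunk_index", "type": "Edm.Int32", "filterable": True},
            {
                "name": "embedding",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": EMBEDDING_DIMENSIONS,
                "vectorSearchProfile": "hnsw-profile",  # hypothetical name
            },
        ],
        "vectorSearch": {
            # The profile points at the algorithm config by name.
            "algorithms": [{"name": "hnsw-config", "kind": "hnsw"}],
            "profiles": [{"name": "hnsw-profile", "algorithm": "hnsw-config"}],
        },
    }


index_definition = build_index_definition()
print(json.dumps(index_definition, indent=2))
```

The `dimensions` value has to match the embedding model: if you switch models or request shortened embeddings, the index must be recreated with the new size.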
Quota note: The --sku-capacity 10 means 10K tokens per minute. For a single-user dev environment this is plenty. For production, request a quota increase via the Azure portal before deploying.
Phase 5 — Key Vault
All secrets (API keys, connection strings) live in Key Vault. The FastAPI backend never reads secrets from environment variables — it uses the Azure SDK to fetch them at runtime via Managed Identity.
KV_NAME="${PREFIX}-kv-$(openssl rand -hex 3)"

az keyvault create \
  --name $KV_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --sku standard \
  --enable-rbac-authorization true \
  --enable-soft-delete true \
  --soft-delete-retention-days 90

# Store the secrets we have so far.
# With RBAC authorization enabled, your own account needs the
# "Key Vault Secrets Officer" role on the vault to run these commands.
OPENAI_KEY=$(az cognitiveservices account keys list \
  --name $OPENAI_NAME --resource-group $RG --query key1 -o tsv)

az keyvault secret set --vault-name $KV_NAME \
  --name "openai-api-key" --value "$OPENAI_KEY"

az keyvault secret set --vault-name $KV_NAME \
  --name "openai-endpoint" --value "$OPENAI_ENDPOINT"

az keyvault secret set --vault-name $KV_NAME \
  --name "search-endpoint" --value "https://${SEARCH_NAME}.search.windows.net"

az keyvault secret set --vault-name $KV_NAME \
  --name "search-admin-key" --value "$SEARCH_ADMIN_KEY"

STORAGE_CONN=$(az storage account show-connection-string \
  --name $STORAGE_NAME --resource-group $RG --query connectionString -o tsv)

az keyvault secret set --vault-name $KV_NAME \
  --name "storage-connection-string" --value "$STORAGE_CONN"

echo "Key Vault: $KV_NAME (5 secrets stored)"
backend/app/config.py — loads secrets from Key Vault at startup:
import os
from functools import lru_cache
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient


@lru_cache(maxsize=None)
def get_secret(name: str) -> str:
    """Fetch a secret from Key Vault once and cache it for the process lifetime."""
    kv_uri = os.environ["KEY_VAULT_URI"]  # only env var the app needs
    client = SecretClient(vault_url=kv_uri, credential=DefaultAzureCredential())
    return client.get_secret(name).value
backend/app/ingest.py — document ingestion:
import hashlib
import io
from typing import BinaryIO
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import AzureOpenAI
from azure.storage.blob import BlobServiceClient
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
from .config import get_secret


def get_openai_client() -> AzureOpenAI:
    return AzureOpenAI(
        azure_endpoint=get_secret("openai-endpoint"),
        api_key=get_secret("openai-api-key"),
        api_version="2024-10-21",
    )


def embed(texts: list[str]) -> list[list[float]]:
    client = get_openai_client()
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small",
    )
    return [item.embedding for item in response.data]


def get_search_client() -> SearchClient:
    return SearchClient(
        endpoint=get_secret("search-endpoint"),
        index_name="documents",
        credential=AzureKeyCredential(get_secret("search-admin-key")),
    )


def upload_to_blob(filename: str, content: bytes) -> str:
    conn_str = get_secret("storage-connection-string")
    blob_client = BlobServiceClient.from_connection_string(conn_str)
    container = blob_client.get_container_client("docs")
    container.upload_blob(name=filename, data=content, overwrite=True)
    return filename


def extract_text(filename: str, content: bytes) -> str:
    # Compare case-insensitively so "REPORT.PDF" is handled like "report.pdf"
    name = filename.lower()
    if name.endswith(".pdf"):
        import pypdf

        reader = pypdf.PdfReader(io.BytesIO(content))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    elif name.endswith(".docx"):
        import docx

        doc = docx.Document(io.BytesIO(content))
        return "\n".join(para.text for para in doc.paragraphs)
    elif name.endswith(".txt"):
        return content.decode("utf-8", errors="replace")
    else:
        raise ValueError(f"Unsupported file type: {filename}")


def ingest_document(filename: str, content: bytes) -> dict:
    # 1. Store raw file in Blob Storage
    upload_to_blob(filename, content)

    # 2. Extract text
    text = extract_text(filename, content)

    # 3. Chunk into 512-token segments with 64-token overlap.
    #    The plain constructor measures chunk_size in characters, so build the
    #    splitter from a tiktoken encoding (requires the tiktoken package).
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base", chunk_size=512, chunk_overlap=64
    )
    chunks = splitter.split_text(text)

    # 4. Get embeddings for all chunks (batches of 100)
    all_embeddings = []
    for i in range(0, len(chunks), 100):
        batch = chunks[i : i + 100]
        all_embeddings.extend(embed(batch))

    # 5. Build search documents
    documents = []
    file_hash = hashlib.md5(content).hexdigest()[:8]
    for idx, (chunk, embedding) in enumerate(zip(chunks, all_embeddings)):
        documents.append({
            "id": f"{file_hash}-{idx}",
            "content": chunk,
            "source_file": filename,
            "page_number": 0,
            "chunk_index": idx,
            "embedding": embedding,
        })

    # 6. Upload to AI Search index
    search = get_search_client()
    results = search.upload_documents(documents=documents)
    succeeded = sum(1 for r in results if r.succeeded)
    return {
        "filename": filename,
        "chunks": len(chunks),
        "indexed": succeeded,
    }
Document ingestion pipeline:
Document ingestion: file → Blob Storage → chunk → embed → AI Search vector index. Each chunk is stored with its 1536-dim embedding for vector retrieval.
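To make the chunk and batch steps of this pipeline concrete, here is a simplified sliding-window version in plain Python. It is a stand-in for illustration only: the guide's code uses LangChain's RecursiveCharacterTextSplitter, which also prefers to break on paragraph and sentence boundaries, and the function names below are hypothetical.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def batches(items: list, batch_size: int) -> list[list]:
    """Group items into lists of at most batch_size, as done before embedding."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


demo_tokens = list("abcdefghij")  # ten one-letter "tokens"
demo_chunks = chunk_with_overlap(demo_tokens, chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print("".join(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.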
backend/app/chat.py — RAG retrieval and streaming:
from typing import AsyncGenerator
from openai import AzureOpenAI
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.core.credentials import AzureKeyCredential
from .config import get_secret

SYSTEM_PROMPT = """You are a helpful assistant. Answer questions based ONLY on the context provided below.
If the answer is not in the context, say "I don't know based on the provided documents."
Do not use any knowledge outside the provided context. Cite the source file when possible."""


def _openai_client() -> AzureOpenAI:
    return AzureOpenAI(
        azure_endpoint=get_secret("openai-endpoint"),
        api_key=get_secret("openai-api-key"),
        api_version="2024-10-21",
    )


def retrieve_chunks(query: str, top_k: int = 5) -> list[dict]:
    # Embed the user's query
    query_embedding = _openai_client().embeddings.create(
        input=[query],
        model="text-embedding-3-small",
    ).data[0].embedding

    # Hybrid search: vector similarity + BM25 keyword match
    search_client = SearchClient(
        endpoint=get_secret("search-endpoint"),
        index_name="documents",
        credential=AzureKeyCredential(get_secret("search-admin-key")),
    )
    results = search_client.search(
        search_text=query,  # BM25 keyword search
        vector_queries=[
            VectorizedQuery(
                vector=query_embedding,
                k_nearest_neighbors=top_k,
                fields="embedding",
            )
        ],
        top=top_k,
        select=["content", "source_file", "chunk_index"],
    )
    return [{"content": r["content"], "source": r["source_file"]} for r in results]


def build_context(chunks: list[dict]) -> str:
    parts = [f"[Source: {chunk['source']}]\n{chunk['content']}" for chunk in chunks]
    return "\n\n---\n\n".join(parts)


async def stream_chat(question: str, history: list[dict]) -> AsyncGenerator[str, None]:
    chunks = retrieve_chunks(question)
    context = build_context(chunks)

    # Keep earlier user/assistant turns so follow-up questions have context.
    # Other roles are dropped so a client cannot supply its own system message.
    prior_turns = [
        {"role": m["role"], "content": m["content"]}
        for m in history
        if m.get("role") in ("user", "assistant") and isinstance(m.get("content"), str)
    ]
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *prior_turns,
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

    # This client is synchronous, so each call blocks the event loop while it
    # waits on the network; consider AsyncAzureOpenAI under concurrent load.
    stream = _openai_client().chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
        temperature=0,
        max_tokens=1024,
    )
    for event in stream:
        if event.choices and event.choices[0].delta.content:
            yield event.choices[0].delta.content
RAG chat retrieval pipeline:
RAG retrieval: question → embed → AI Search hybrid search → build context prompt → GPT-4o streaming → browser. Temperature=0 for deterministic, grounded answers.
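The hybrid search step merges two separately ranked lists, one from BM25 keyword matching and one from vector similarity. Azure AI Search does this with Reciprocal Rank Fusion (RRF). The sketch below shows the idea in plain Python; the chunk IDs are invented, and the constant `k=60` is the conventional choice, treated here as an assumption rather than a statement about the service's internals.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_ranking = ["chunk-a", "chunk-b", "chunk-c"]  # hypothetical BM25 order
vector_ranking = ["chunk-b", "chunk-d", "chunk-a"]   # hypothetical vector order
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

A chunk that ranks well in both lists (here `chunk-b`) beats one that tops only a single list, which is why hybrid search tends to be more robust than either method alone.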
backend/app/main.py:
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from .ingest import ingest_document
from .chat import stream_chat

app = FastAPI(title="RAG Chatbot API", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend-domain.com"],  # tighten before prod
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)


class ChatRequest(BaseModel):
    question: str
    history: list[dict] = []


@app.get("/health")
def health():
    return {"status": "healthy"}


@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    if file.size and file.size > 50 * 1024 * 1024:  # 50MB limit
        raise HTTPException(status_code=413, detail="File too large (max 50MB)")
    allowed = {".pdf", ".docx", ".txt"}
    suffix = "." + file.filename.rsplit(".", 1)[-1].lower()
    if suffix not in allowed:
        raise HTTPException(status_code=400, detail=f"Unsupported type. Allowed: {allowed}")
    content = await file.read()
    result = ingest_document(file.filename, content)
    return result


@app.post("/chat")
async def chat(req: ChatRequest):
    # Stream tokens to the browser as Server-Sent Events
    async def generate():
        async for token in stream_chat(req.question, req.history):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
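One detail of the /chat stream is worth a closer look. A Server-Sent Event ends at a blank line, so a token that itself contains a newline would split the `data: {token}` frame used above. A common way around this, sketched here as an option rather than what the guide's frontend expects, is to JSON-encode each payload. The helper names and the `{"token": ...}` envelope are hypothetical.

```python
import json


def sse_event(token: str) -> str:
    """Frame one token as an SSE message; JSON escaping keeps newlines inside the payload."""
    return f"data: {json.dumps({'token': token})}\n\n"


def parse_sse_event(frame: str) -> str:
    """Inverse of sse_event, as a client would do after receiving one frame."""
    payload = frame.removeprefix("data: ").rstrip("\n")
    return json.loads(payload)["token"]


frame = sse_event("line one\nline two")
print(repr(frame))
print(repr(parse_sse_event(frame)))
```

If you adopt this framing, the frontend has to `JSON.parse` each `data:` payload instead of appending it as raw text.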
admin-enabled false is intentional. Service principals and Managed Identity pull images — no username/password credentials needed.
Phase 11 — AKS Cluster
AKS_NAME="${PREFIX}-aks"

az aks create \
  --name $AKS_NAME \
  --resource-group $RG \
  --location $LOCATION \
  --node-count 2 \
  --node-vm-size Standard_D2s_v3 \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --enable-managed-identity \
  --attach-acr $ACR_NAME \
  --network-plugin azure \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 5 \
  --generate-ssh-keys

# Get credentials
az aks get-credentials \
  --resource-group $RG \
  --name $AKS_NAME \
  --overwrite-existing

# Verify cluster is up
kubectl get nodes
--enable-oidc-issuer and --enable-workload-identity are required for Workload Identity — the mechanism that lets your pods authenticate to Key Vault using Managed Identity instead of a client secret.
--attach-acr grants the AKS managed identity the AcrPull role on the registry — pods can pull images without image pull secrets.
Phase 12 — Workload Identity for Key Vault
Workload Identity lets your pods authenticate to Azure services using a federated credential. The pod gets an OIDC token from AKS, exchanges it for an Azure AD token, and uses that to call Key Vault. No secrets required.
# Create managed identity for the backend
IDENTITY_NAME="mi-rag-backend"

az identity create \
  --name $IDENTITY_NAME \
  --resource-group $RG \
  --location $LOCATION

IDENTITY_CLIENT_ID=$(az identity show \
  --name $IDENTITY_NAME \
  --resource-group $RG \
  --query clientId -o tsv)

IDENTITY_OBJECT_ID=$(az identity show \
  --name $IDENTITY_NAME \
  --resource-group $RG \
  --query principalId -o tsv)

# Grant Key Vault Secrets User to the managed identity
KV_ID=$(az keyvault show --name $KV_NAME --resource-group $RG --query id -o tsv)

az role assignment create \
  --role "Key Vault Secrets User" \
  --assignee-object-id $IDENTITY_OBJECT_ID \
  --scope $KV_ID

# Create federated credential: links the Kubernetes service account to the MI
AKS_OIDC_ISSUER=$(az aks show \
  --name $AKS_NAME \
  --resource-group $RG \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

az identity federated-credential create \
  --name "rag-backend-fedcred" \
  --identity-name $IDENTITY_NAME \
  --resource-group $RG \
  --issuer "$AKS_OIDC_ISSUER" \
  --subject "system:serviceaccount:default:rag-backend-sa" \
  --audience "api://AzureADTokenExchange"

echo "Managed Identity client ID: $IDENTITY_CLIENT_ID"
Defense in depth: 4 independent security layers. An attacker must breach all 4 to reach the data; failure at any layer stops the attack chain.
A client secret is a string stored somewhere (Key Vault, CI/CD variable, .env file). It can be copied, leaked in logs, or committed to git. Workload Identity eliminates the secret entirely: the Kubernetes service account token is exchanged for an Azure AD token via OIDC federation. Nothing to leak. Nothing to rotate on a schedule.
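To see why there is nothing to leak, it helps to look at what the token exchange actually compares. The service account token is a JWT whose `iss`, `sub`, and `aud` claims must match the federated credential's issuer, subject, and audience. The sketch below mimics only that comparison with an unsigned toy token; the real exchange also verifies the token's signature against the cluster's OIDC keys, and the issuer URL and helper names here are hypothetical.

```python
import base64
import json


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def make_toy_token(claims: dict) -> str:
    """Build an UNSIGNED JWT-shaped string, for illustration only."""
    header = _b64url_encode(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    return f"{header}.{payload}."


def token_claims(token: str) -> dict:
    """Decode the payload segment without verifying any signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(_b64url_decode(payload))


def matches_federated_credential(claims: dict, issuer: str, subject: str, audience: str) -> bool:
    """True when the token's iss, sub and aud line up with the credential."""
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    return claims.get("iss") == issuer and claims.get("sub") == subject and audience in audiences


TOY_ISSUER = "https://oidc.example.invalid/cluster-id/"  # hypothetical issuer URL
SUBJECT = "system:serviceaccount:default:rag-backend-sa"
AUDIENCE = "api://AzureADTokenExchange"

backend_token = make_toy_token({"iss": TOY_ISSUER, "sub": SUBJECT, "aud": [AUDIENCE]})
other_token = make_toy_token(
    {"iss": TOY_ISSUER, "sub": "system:serviceaccount:default:other-sa", "aud": [AUDIENCE]}
)

print(matches_federated_credential(token_claims(backend_token), TOY_ISSUER, SUBJECT, AUDIENCE))
print(matches_federated_credential(token_claims(other_token), TOY_ISSUER, SUBJECT, AUDIENCE))
```

A pod running under any other service account presents a different `sub` and is rejected, so access is tied to the workload's identity rather than to a copyable string.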
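# Verify Workload Identity from a pod that runs as the backend's service account.
# Assumes rag-backend-sa already exists and is annotated with
# azure.workload.identity/client-id=$IDENTITY_CLIENT_ID.
kubectl run debug -i --rm --restart=Never \
  --image=curlimages/curl \
  --labels="azure.workload.identity/use=true" \
  --overrides='{"spec":{"serviceAccountName":"rag-backend-sa"}}' \
  -- env | grep AZURE_
# Should list AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE and AZURE_AUTHORITY_HOST.
# Calling the 169.254.169.254 metadata endpoint instead would exercise the node's
# identity, not Workload Identity.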
Layer 4 detail — rate limiting and prompt injection:
Add rate limiting middleware to FastAPI:
from fastapi import HTTPException, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


@app.post("/chat")
@limiter.limit("100/minute")
async def chat(req: ChatRequest, request: Request):  # slowapi requires the Request argument
    # Input size guard
    if len(req.question) > 50_000:
        raise HTTPException(status_code=400, detail="Question too long")

    # Basic injection detection (expand based on your threat model)
    forbidden = ["ignore previous instructions", "system prompt", "jailbreak"]
    lower_q = req.question.lower()
    if any(phrase in lower_q for phrase in forbidden):
        raise HTTPException(status_code=400, detail="Invalid input")
    ...
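The phrase check above is easy to test when it lives in its own function. The following sketch pulls it out and, as one possible hardening step, collapses whitespace before matching so that extra spaces or line breaks do not slip past the filter. The function and constant names are hypothetical, and a substring blocklist remains a weak defense on its own.

```python
import re
from typing import Optional

FORBIDDEN_PHRASES = ("ignore previous instructions", "system prompt", "jailbreak")
MAX_QUESTION_CHARS = 50_000


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_question(question: str) -> Optional[str]:
    """Return a rejection reason, or None when the question looks acceptable."""
    if len(question) > MAX_QUESTION_CHARS:
        return "Question too long"
    normalized = normalize(question)
    if any(phrase in normalized for phrase in FORBIDDEN_PHRASES):
        return "Invalid input"
    return None


print(check_question("What is the refund policy?"))
print(check_question("Please IGNORE   previous\ninstructions"))
```

In the endpoint you would call `check_question(req.question)` and raise an HTTP 400 with the returned reason when it is not None.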
# Enable Container Insights on the AKS cluster
az aks enable-addons \
  --name $AKS_NAME \
  --resource-group $RG \
  --addons monitoring \
  --workspace-resource-id $(az monitor log-analytics workspace create \
    --resource-group $RG \
    --workspace-name "${PREFIX}-logs" \
    --query id -o tsv)
Useful KQL queries for the RAG chatbot:
// Pod restarts in the last hour
KubePodInventory
| where TimeGenerated > ago(1h)
| where Namespace == "default"
| where ContainerRestartCount > 0
| summarize restarts = sum(ContainerRestartCount) by PodName = Name
| order by restarts desc

// Backend 5xx errors
ContainerLog
| where TimeGenerated > ago(1h)
| where Name contains "rag-backend"
| where LogEntry contains " 5"
| project TimeGenerated, LogEntry
| order by TimeGenerated desc

// Average response latency (from FastAPI logs)
ContainerLog
| where Name contains "rag-backend"
| where LogEntry matches regex @'"duration":\d+'
| extend duration = toint(extract('"duration":(\\d+)', 1, LogEntry))
| summarize avg(duration), percentile(duration, 95) by bin(TimeGenerated, 5m)
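The latency query only works if the backend logs a `"duration":<number>` field with no space after the colon. Python's `json.dumps` inserts a space by default, so the log line has to be written with compact separators. A minimal sketch of such a log line follows; the field names other than `duration`, and the choice of milliseconds as the unit, are assumptions.

```python
import json
import re


def access_log_line(method: str, path: str, status: int,
                    started_at: float, finished_at: float) -> str:
    """Render one request as compact JSON with an integer duration in milliseconds."""
    duration_ms = int(round((finished_at - started_at) * 1000))
    record = {"method": method, "path": path, "status": status, "duration": duration_ms}
    # Compact separators produce "duration":250, which the KQL regex expects.
    return json.dumps(record, separators=(",", ":"))


# Same pattern the KQL query extracts with.
DURATION_PATTERN = re.compile(r'"duration":(\d+)')

line = access_log_line("POST", "/chat", 200, started_at=10.0, finished_at=10.25)
print(line)
print(DURATION_PATTERN.search(line).group(1))
```

In the FastAPI app this would typically be called from an HTTP middleware that records `time.perf_counter()` before and after the request.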
Alert on pod restarts:
az monitor metrics alert create \
  --resource-group $RG \
  --name "rag-backend-pod-restarts" \
  --scopes $(az aks show -g $RG -n $AKS_NAME --query id -o tsv) \
  --condition "avg kube_pod_container_status_restarts_total > 3" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2
Phase 18 — Testing Plan
Functional tests (backend/tests/):
# tests/test_ingest.py
import pytest

from app.ingest import extract_text


def test_extract_pdf():
    with open("tests/fixtures/sample.pdf", "rb") as f:
        text = extract_text("sample.pdf", f.read())
    assert len(text) > 100
    assert isinstance(text, str)


def test_extract_docx():
    with open("tests/fixtures/sample.docx", "rb") as f:
        text = extract_text("sample.docx", f.read())
    assert len(text) > 10


def test_unsupported_type():
    with pytest.raises(ValueError, match="Unsupported file type"):
        extract_text("file.xlsx", b"data")
RAG quality tests — verify the chatbot answers correctly from a known document:
# tests/test_rag_quality.py
import pytest

from app.chat import retrieve_chunks


@pytest.mark.integration
def test_retrieval_finds_relevant_chunk():
    # Assumes sample.pdf has been ingested already
    chunks = retrieve_chunks("What is the main topic of the document?", top_k=3)
    assert len(chunks) == 3
    assert all("content" in c for c in chunks)


@pytest.mark.integration
def test_retrieval_returns_source():
    chunks = retrieve_chunks("any question", top_k=1)
    assert "source" in chunks[0]
    assert chunks[0]["source"].endswith((".pdf", ".docx", ".txt"))
Load test with locust:
# tests/locustfile.py
from locust import HttpUser, task, between


class ChatUser(HttpUser):
    wait_time = between(2, 5)

    @task(3)
    def ask_question(self):
        self.client.post(
            "/chat",
            json={"question": "What are the key points in the document?", "history": []},
            timeout=30,
        )

    @task(1)
    def health_check(self):
        self.client.get("/health")
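OpenAI costs vary widely by usage. For a prototype with 100 questions per day averaging 1,000 output tokens each, that is 100 × 1,000 × 30 = 3M output tokens per month; at the GPT-4o rate of $15 per 1M output tokens used here, that comes to about $45/month, before the input-token charges for each question and its retrieved context. Use Azure OpenAI's built-in quota to cap spend.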
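The estimate above is simple enough to keep as a small function so you can plug in your own traffic and the current price. The function name and the 30-day month are assumptions, and the rate is a parameter because published prices change.

```python
def monthly_output_cost(questions_per_day: int,
                        output_tokens_per_question: int,
                        usd_per_million_output_tokens: float,
                        days_per_month: int = 30) -> float:
    """Estimated monthly spend on output tokens only (input tokens are extra)."""
    tokens = questions_per_day * output_tokens_per_question * days_per_month
    return tokens / 1_000_000 * usd_per_million_output_tokens


prototype_cost = monthly_output_cost(100, 1_000, 15.0)
print(f"${prototype_cost:.2f} per month")
```

Because every question also sends the retrieved chunks as input, a real bill will be higher than this output-only figure.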
Teardown when you're done:
# Delete everything (this is irreversible)
az group delete --name $RG --yes --no-wait

# Optional: purge Key Vault (otherwise soft-deleted for 90 days).
# Run this once the group deletion has finished; until then the vault is not yet in the deleted state.
az keyvault purge --name $KV_NAME --location $LOCATION

echo "All resources deleted. GitHub Actions will fail on next push; update or delete the secrets."
Phase 20 — Portfolio README
Add this to your README.md so recruiters and hiring managers understand what the project demonstrates:
## AI Chatbot with RAG on Azure

A production-grade Retrieval-Augmented Generation chatbot built on Azure.
Upload any PDF, DOCX, or TXT file, then ask questions about it.
GPT-4o answers from the content of your documents rather than from its training data.

### What this project demonstrates

| Skill area | Implementation |
|------------|---------------|
| AI/ML integration | Azure OpenAI GPT-4o (chat) + text-embedding-3-small (embeddings) |
| Vector search | Azure AI Search with HNSW index, hybrid BM25 + vector retrieval |
| Cloud-native deployment | AKS (private cluster, autoscaler, OIDC Workload Identity) |
| Security | Defense in depth: Network (VNet/NSG) + Identity (no secrets) + Data (Key Vault) + App (rate limiting, prompt injection guard) |
| CI/CD | GitHub Actions: lint → test → Docker build → Trivy scan → ACR push → staging → manual gate → prod |
| Observability | Container Insights, KQL queries, pod restart alerts |
| IaC | Azure CLI provisioning scripts, Helm charts for k8s, ready to migrate to Terraform |

### Architecture

[Full architecture diagram on the project page: AzureFixes.com/projects/ai-chatbot-with-rag-on-azure]

### Quick start (local)

```bash
# Backend
cd backend
pip install -r requirements.txt
export KEY_VAULT_URI=https://your-kv.vault.azure.net/
uvicorn app.main:app --reload

# Frontend
cd frontend
npm install
NEXT_PUBLIC_BACKEND_URL=http://localhost:8000 npm run dev
```