Prompt Engineering on Azure: How to Version Your Prompts
Prompts are code.
They determine how your AI system behaves. A one-word change can shift responses from correct to wrong, from safe to problematic, from helpful to verbose. And unlike application code, prompt changes are invisible to most observability systems.
Most teams treat prompts like config files — edit in production, hope for the best. Here's how to manage them properly.
Why Prompt Versioning Breaks Down
The typical path of prompt management:
- Developer hardcodes prompt in Python file during prototyping
- Prompt works, feature ships, prompt stays in code
- Six months later, someone "just tweaks the wording" in a hot patch
- AI quality degrades, nobody knows why because nothing was tracked
- Team spends two weeks debugging before finding the prompt change
The problem isn't that prompts changed. The problem is that nobody can trace which change caused the quality drop, and there's no way to roll back to the version that was working.
Azure App Configuration: The Right Store for Prompts
Azure App Configuration is designed for dynamic configuration at scale. It supports:
- Labels: tag configs as dev, staging, production
- Key Vault references: store API keys as references, not values
- Revision history: every change to a key-value is recorded as a revision, so you can inspect earlier values or restore to a point in time within the store's retention period
- Feature flags: gate prompt variants by user segment or percentage rollout
# Create App Configuration resource
az appconfig create \
--name azurefixes-ai-config \
--resource-group rg-ai \
--location eastus \
--sku Standard
# Store your system prompt with label and content type
az appconfig kv set \
--name azurefixes-ai-config \
--key "prompts:chatbot:system" \
--value "You are a helpful Azure technical assistant. Answer based only on the provided context. If the context does not contain the answer, say so clearly." \
--label production \
--content-type "text/plain"
Reading prompts in Python:
from azure.appconfiguration import AzureAppConfigurationClient, ConfigurationSetting
from azure.identity import DefaultAzureCredential


class PromptStore:
    def __init__(self, endpoint: str):
        self.client = AzureAppConfigurationClient(
            base_url=endpoint,
            credential=DefaultAzureCredential()
        )

    def get(self, key: str, label: str = "production") -> str:
        setting = self.client.get_configuration_setting(key=key, label=label)
        return setting.value

    def set(self, key: str, value: str, label: str = "production") -> None:
        # Create or overwrite the key-value for this label
        self.client.set_configuration_setting(
            ConfigurationSetting(key=key, value=value, label=label)
        )

    def list_versions(self, key: str) -> list[dict]:
        # Use revision history to see all versions
        revisions = self.client.list_revisions(key_filter=key)
        return [
            {"etag": r.etag, "last_modified": r.last_modified, "value": r.value}
            for r in revisions
        ]


store = PromptStore(endpoint="https://azurefixes-ai-config.azconfig.io")
system_prompt = store.get("prompts:chatbot:system")
Semantic Versioning for Prompts
Treat prompt changes the same way you'd treat API changes:
- Patch (1.0.0 → 1.0.1): Typo fix, rephrasing, no behavior change expected
- Minor (1.0.0 → 1.1.0): New instruction added, behavior expected to improve
- Major (1.0.0 → 2.0.0): Structural change, persona change, expected behavior difference
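The bump rules above can be sketched as a small helper. This is a minimal illustration, not part of any Azure SDK; the function name `bump_version` and the change-type strings are assumptions made for this example.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next semantic version for a prompt.

    `change` is "patch", "minor", or "major" (hypothetical labels that
    mirror the three categories in the list above).
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_version("1.0.0", "patch"))  # 1.0.1
print(bump_version("1.0.0", "minor"))  # 1.1.0
print(bump_version("1.0.0", "major"))  # 2.0.0
```

Deciding the change type is still a human judgment call; the helper only keeps the numbering consistent once that call is made.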
Store the version alongside the prompt:
# Prompt metadata stored as separate config keys
store.set("prompts:chatbot:system:version", "1.2.0", label="production")
store.set("prompts:chatbot:system:changed_by", "nvn@azurefixes.com", label="production")
store.set("prompts:chatbot:system:change_reason", "Added instruction to cite source documents", label="production")
Log the prompt version with every request:
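import logging
from uuid import uuid4


def call_ai(query: str, context: list[str]) -> str:
    prompt_version = store.get("prompts:chatbot:system:version")
    system_prompt = store.get("prompts:chatbot:system")
    # `client` is assumed to be a chat-completions client created elsewhere
    response = client.chat.completions.create(...)
    logging.info({
        "event": "ai_request",
        "prompt_version": prompt_version,
        "request_id": str(uuid4()),
        ...
    })
    # Assumes an OpenAI-style response object
    return response.choices[0].message.content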
Now when quality drops, you can correlate the drop with a specific prompt version change.
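As a rough sketch of that correlation, the logged records can be grouped by prompt version and averaged. The `quality_score` field (a 1 to 5 rating here) and the sample records are hypothetical; substitute whatever quality signal your logs actually carry.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(records: list[dict]) -> dict[str, float]:
    """Average a per-request quality score for each prompt version."""
    scores = defaultdict(list)
    for record in records:
        scores[record["prompt_version"]].append(record["quality_score"])
    return {version: mean(values) for version, values in scores.items()}


# Hypothetical log records; quality_score is an illustrative 1-5 rating.
records = [
    {"prompt_version": "1.1.0", "quality_score": 5},
    {"prompt_version": "1.1.0", "quality_score": 4},
    {"prompt_version": "1.2.0", "quality_score": 3},
    {"prompt_version": "1.2.0", "quality_score": 2},
]

for version, score in sorted(mean_score_by_version(records).items()):
    print(version, score)
```

A clear gap between adjacent versions points at the change to inspect first.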
A/B Testing Prompts
Azure App Configuration's feature flags support percentage rollouts — exactly what you need for A/B testing prompt variants.
# Create a feature flag for prompt variant
az appconfig feature set \
--name azurefixes-ai-config \
--feature "new-system-prompt" \
--description "Testing more concise system prompt variant"
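# Enable for 20% of requests (filter parameters are passed as name=value pairs)
az appconfig feature filter add \
--name azurefixes-ai-config \
--feature "new-system-prompt" \
--filter-name Microsoft.Percentage \
--filter-parameters Value=20
# Make sure the flag itself is on; filters are only evaluated for an enabled flag
az appconfig feature enable \
--name azurefixes-ai-config \
--feature "new-system-prompt"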
In your application:
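from azure.appconfiguration.provider import load
from azure.identity import DefaultAzureCredential
from featuremanagement import FeatureManager

config = load(
    endpoint="https://azurefixes-ai-config.azconfig.io",
    credential=DefaultAzureCredential(),
    feature_flag_enabled=True
)
# The provider only loads the flags; FeatureManager evaluates them
feature_manager = FeatureManager(config)


def get_system_prompt(user_id: str) -> tuple[str, str]:
    # user_id is passed as evaluation context. Whether a user stays in the same
    # variant across requests depends on the filter configured on the flag.
    if feature_manager.is_enabled("new-system-prompt", user_id):
        prompt = store.get("prompts:chatbot:system", label="variant-b")
        variant = "b"
    else:
        prompt = store.get("prompts:chatbot:system", label="production")
        variant = "a"
    return prompt, variant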
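If you need each user to see the same variant on every request, one common approach is to hash the user ID into a stable bucket. The sketch below illustrates that idea only; it is not how App Configuration or its feature management library evaluates filters, and the `in_rollout` name and hashing scheme are assumptions for this example.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percentage: float) -> bool:
    """Deterministically place a user inside or outside a percentage rollout."""
    # Hash flag name and user together so different flags split users differently.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") / 2**64 * 100
    return bucket < percentage


# The same user always gets the same answer for a given flag and percentage.
first = in_rollout("user-123", "new-system-prompt", 20)
second = in_rollout("user-123", "new-system-prompt", 20)
print(first == second)
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 20 to 50 keeps the original 20% in the variant and adds new users on top.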
Log the variant with every response. After 1,000 requests, compare:
- Average groundedness score by variant
- User thumbs-up/down rate by variant
- Average completion tokens (shorter = often better, sometimes worse)
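The comparison above could be computed from logged records along these lines. The record fields (`variant`, `thumbs_up`, `completion_tokens`) and the sample data are hypothetical; map them to whatever your logging actually emits.

```python
from collections import defaultdict


def summarize_by_variant(records: list[dict]) -> dict[str, dict[str, float]]:
    """Compute request count, thumbs-up rate, and mean completion tokens per variant."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["variant"]].append(record)

    summary = {}
    for variant, rows in grouped.items():
        count = len(rows)
        summary[variant] = {
            "requests": count,
            "thumbs_up_rate": sum(1 for r in rows if r["thumbs_up"]) / count,
            "mean_completion_tokens": sum(r["completion_tokens"] for r in rows) / count,
        }
    return summary


# Hypothetical logged responses.
records = [
    {"variant": "a", "thumbs_up": True, "completion_tokens": 200},
    {"variant": "a", "thumbs_up": False, "completion_tokens": 100},
    {"variant": "b", "thumbs_up": True, "completion_tokens": 80},
    {"variant": "b", "thumbs_up": True, "completion_tokens": 120},
]

for variant, stats in sorted(summarize_by_variant(records).items()):
    print(variant, stats)
```

With a 20% split, the variant arm receives far fewer requests than the control, so check the per-variant request count before reading much into small differences.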
Rollback Strategy
The most important operational capability: being able to roll back a prompt change in under five minutes.
# List revision history for a prompt key
az appconfig revision list \
--name azurefixes-ai-config \
--key "prompts:chatbot:system" \
--label production
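# Restore the previous revision's value (assumes the list is newest first, so index 0 is the current value and index 1 the one before it; check the output of the command above before relying on the index)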
az appconfig kv set \
--name azurefixes-ai-config \
--key "prompts:chatbot:system" \
--value "$(az appconfig revision list --name azurefixes-ai-config --key prompts:chatbot:system --label production --query '[1].value' -o tsv)" \
--label production
Alternatively, store a prompts:chatbot:system:backup key with the last known-good version. Roll back by copying it to the active key.
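A minimal sketch of the backup-key approach follows. It works against any object that exposes `get(key, label)` and `set(key, value, label)`; the `InMemoryStore` class is a stand-in for local illustration, and the function names `update_with_backup` and `rollback` are assumptions for this example.

```python
class InMemoryStore:
    """Stand-in for a real configuration store, for illustration only."""

    def __init__(self) -> None:
        self._data: dict[tuple[str, str], str] = {}

    def get(self, key: str, label: str = "production") -> str:
        return self._data[(key, label)]

    def set(self, key: str, value: str, label: str = "production") -> None:
        self._data[(key, label)] = value


def update_with_backup(store, key: str, new_value: str, label: str = "production") -> None:
    """Copy the current value to '<key>:backup', then write the new value."""
    store.set(f"{key}:backup", store.get(key, label), label)
    store.set(key, new_value, label)


def rollback(store, key: str, label: str = "production") -> None:
    """Restore the active key from its backup key."""
    store.set(key, store.get(f"{key}:backup", label), label)


demo = InMemoryStore()
demo.set("prompts:chatbot:system", "known-good prompt")
update_with_backup(demo, "prompts:chatbot:system", "experimental prompt")
rollback(demo, "prompts:chatbot:system")
print(demo.get("prompts:chatbot:system"))  # known-good prompt
```

The backup holds only one prior value, so it complements revision history rather than replacing it.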
The Prompt Change Process
Before any prompt change ships to production:
- Write the change and the reason for it
- Run your eval suite on the new prompt (50+ test cases)
- Compare scores — groundedness, relevance, response length
- Deploy to staging with label staging, test with real queries
- Gradual rollout: 5% production traffic for 24 hours, monitor metrics
- Full rollout or rollback based on signal
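The score comparison step in this process can be expressed as a simple gate. This is a sketch under assumptions: the metric names, the sample scores, and the 0.02 tolerance are illustrative, and it only covers metrics where higher is better (response length needs its own rule).

```python
def passes_eval_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_regression: float = 0.02,
) -> bool:
    """Return True if no metric drops more than `max_regression` below baseline."""
    for metric, baseline_score in baseline.items():
        if candidate[metric] < baseline_score - max_regression:
            return False
    return True


# Hypothetical eval-suite averages for the current and proposed prompts.
baseline = {"groundedness": 0.90, "relevance": 0.85}
improved = {"groundedness": 0.92, "relevance": 0.85}
regressed = {"groundedness": 0.80, "relevance": 0.90}

print(passes_eval_gate(baseline, improved))   # True
print(passes_eval_gate(baseline, regressed))  # False
```

Running a gate like this in CI makes the process hard to bypass, since a prompt change that regresses the eval suite never reaches the staging label.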
A prompt that passes this process ships with confidence. A prompt that bypasses it ships with risk.
The teams that ship AI products that get better over time are the teams that treat prompts like code.