How to Diagnose Azure ExpressRoute Gateway Control-Plane Failure
If your ExpressRoute Gateway shows BGP: Connected in the portal but cross-premises traffic is completely down, the gateway may have a control-plane failure. This is one of the hardest failure modes to diagnose because every surface-level health indicator — BGP session state, gateway provisioning state, resource health — shows green while the data plane is dropping all traffic.
This guide covers how to confirm the failure, how to fail over to a secondary region, how to rebuild the gateway, and the HA architecture changes that prevent recurrence.
Architecture Overview
This guide assumes a dual ExpressRoute circuit topology: one primary circuit and one failover circuit, both terminated on the same ExpressRoute Gateway in the primary region, with a standby hub VNet and gateway in a secondary region for regional failover. On-premises runs two CE routers in BGP AS 65010. Azure runs AS 65515.
Normal traffic flow:
- On-premises sites connect to the primary ASR via MPLS
- Primary circuit (West US 2): active, preferred path via BGP MED
- Failover circuit (East US): hot-standby, accepts traffic within 30 seconds if primary fails
- Hub VNet (10.0.0.0/16) peers to spoke VNets via VNet peering
- ExpressRoute Gateway: ErGw1AZ SKU, zone-redundant (recommended; see the High Availability Recommendations section)
Under normal conditions, the gateway processes ~800 BGP prefixes: 620 from on-prem and 180 Azure routes propagated back to the CE routers.
Identifying the Failure
The symptom pattern that indicates a control-plane failure rather than a BGP session or circuit problem:
What the portal shows (all green, all wrong):
| Component | Status shown |
|---|---|
| ExpressRoute Circuit (primary) | Provisioned — ✓ |
| ExpressRoute Circuit (failover) | Provisioned — ✓ |
| ExpressRoute Gateway | Succeeded |
| BGP peer (primary CE) | Connected |
| BGP peer (failover CE) | Connected |
| Gateway health probe | Healthy |
What you actually have: zero data-plane traffic.
The alert that catches this is ExpressRouteGatewayPacketsPerSecond = 0 for more than 2 minutes. If you don't have this alert, add it: it measures the data plane directly, so it fires regardless of what the BGP session state reports.
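The "zero for more than 2 minutes" condition can be sketched offline as a check over metric samples. This is a minimal illustration, not the Azure Monitor implementation: the `sustained_zero` helper, the sample timestamps, and the packet rates are all hypothetical.

```python
from datetime import datetime, timedelta


def sustained_zero(samples, min_duration=timedelta(minutes=2)):
    """Return True if the trailing run of zero-valued samples lasts longer than min_duration.

    samples: list of (timestamp, packets_per_second) tuples, oldest first.
    """
    run_start = None
    for ts, pps in samples:
        if pps == 0:
            if run_start is None:
                run_start = ts  # first zero sample of the current run
        else:
            run_start = None  # any non-zero sample ends the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_duration


# Hypothetical samples at one-minute granularity
t0 = datetime(2024, 1, 1, 12, 0)
healthy = [(t0 + timedelta(minutes=i), 1200) for i in range(5)]
outage = healthy + [(t0 + timedelta(minutes=5 + i), 0) for i in range(4)]

print(sustained_zero(healthy))
print(sustained_zero(outage))
```

A single zero sample does not trip the check, which avoids paging on a one-off gap in metric collection.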
Phase 1 — Confirming the Scope
Before changing anything, confirm what is and isn't working.
From on-premises (CE router):
ASR1001X-primary# show bgp summary
Neighbor        V    AS  MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down  State/PfxRcd
10.0.0.4        4 65515     4821     4819    2841    0     0  2d09h    188
10.0.0.5        4 65515     4820     4818    2841    0     0  2d09h    188
If BGP sessions on the CE router show Up with normal prefix counts, the failure is on the Azure side.
Reachability test from on-prem:
# Ping Azure spoke VM — look for intermittent, not total loss
ping 10.1.4.22 repeat 100 timeout 2
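Intermittent loss points to a routing-plane problem rather than a BGP session being down, which would cause total loss. Once the gateway's route table has emptied completely, expect total loss here even though the sessions stay up.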
From Azure — check effective routes on a spoke VM NIC:
az network nic show-effective-route-table \
--resource-group rg-spoke-manufacturing \
--name nic-integration-vm-01 \
--output table | grep -E '10\.(10|20|30)\.'
If this returns empty — no on-premises routes in the spoke VM's effective route table — that is the confirmation. Azure VMs have no route to on-premises despite the BGP sessions showing healthy.
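The same check can be done on the JSON output with real prefix matching instead of a text match. This is a rough sketch under two assumptions: that the effective route table JSON has a top-level `value` array whose entries carry an `addressPrefix` list, and that the on-premises ranges are the three /16 blocks implied by the grep pattern. `on_prem_routes` and the sample payloads are hypothetical.

```python
import ipaddress
import json

# Assumed on-premises ranges, inferred from the grep pattern; replace with your own.
ON_PREM_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.10.0.0/16", "10.20.0.0/16", "10.30.0.0/16")
]


def on_prem_routes(effective_routes_json):
    """Return the effective-route prefixes that fall inside the on-premises ranges."""
    doc = json.loads(effective_routes_json)
    matches = []
    for route in doc.get("value", []):
        for prefix in route.get("addressPrefix", []):
            net = ipaddress.ip_network(prefix, strict=False)
            if any(net.version == r.version and net.subnet_of(r) for r in ON_PREM_RANGES):
                matches.append(prefix)
    return matches


# Hypothetical payloads: one with no on-premises routes, one with a single on-premises route
broken = json.dumps({"value": [
    {"addressPrefix": ["10.1.0.0/16"], "nextHopType": "VnetLocal"},
    {"addressPrefix": ["0.0.0.0/0"], "nextHopType": "Internet"},
]})
healthy = json.dumps({"value": [
    {"addressPrefix": ["10.1.0.0/16"], "nextHopType": "VnetLocal"},
    {"addressPrefix": ["10.10.4.0/24"], "nextHopType": "VirtualNetworkGateway"},
]})

print(on_prem_routes(broken))
print(on_prem_routes(healthy))
```

An empty list for a spoke NIC corresponds to the "returns empty" confirmation described above.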
Phase 2 — Gateway Diagnostics
Check what the gateway actually knows:
# What routes has the gateway learned from on-prem?
az network vnet-gateway list-learned-routes \
--resource-group rg-hub-westus2 \
--name er-gateway-hub \
--output json | jq '.value | length'
If this returns 0, the gateway has a control-plane failure. A BGP session can be in the Established state (keepalives exchanging, hold timer refreshing) while the route table is completely empty due to a control-plane processing failure.
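A raw count can hide detail, because a gateway may also list its own VNet prefixes among learned routes. The sketch below groups routes by origin so the on-premises (eBGP) count can be read separately. It assumes the JSON has a top-level `value` array and that each route has an `origin` field with values such as `EBgp` and `Network`; treat both as assumptions to verify against your own output. `learned_routes_by_origin` and the sample data are hypothetical.

```python
import json
from collections import Counter


def learned_routes_by_origin(learned_routes_json):
    """Count learned routes per origin value (for example EBgp vs Network)."""
    doc = json.loads(learned_routes_json)
    return Counter(route.get("origin", "unknown") for route in doc.get("value", []))


# Hypothetical payload shaped like the assumed CLI output
sample = json.dumps({"value": [
    {"network": "10.0.0.0/16", "origin": "Network", "sourcePeer": "10.0.0.4"},
    {"network": "10.10.0.0/16", "origin": "EBgp", "sourcePeer": "10.10.1.1"},
    {"network": "10.20.0.0/16", "origin": "EBgp", "sourcePeer": "10.10.1.5"},
]})

counts = learned_routes_by_origin(sample)
print(counts["EBgp"], counts["Network"])
```

In the failure described here, the eBGP count is the number to watch: it should be zero while the sessions still report Connected.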
# Check the BGP peer status from Azure's side
az network vnet-gateway list-bgp-peer-status \
--resource-group rg-hub-westus2 \
--name er-gateway-hub \
--output table
Neighbor    ASN    State      ConnectedDuration  RoutesReceived  MessagesSent  MessagesReceived
----------  -----  ---------  -----------------  --------------  ------------  ----------------
10.10.1.1   65010  Connected  2.09:14:32         0               4819          4821
10.10.1.5   65010  Connected  2.09:13:58         0               4818          4820
RoutesReceived: 0 on both peers while State: Connected is the definitive sign of a control-plane failure. The gateway accepted the TCP sessions and is exchanging keepalives but has not processed a single UPDATE message from either CE router.
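The "Connected but zero routes" signature is easy to encode as a check. The sketch below is illustrative only: it assumes each peer record exposes `neighbor`, `state`, and `routesReceived` keys (the field names suggested by the table columns), and the helper names are hypothetical. If the CLI wraps the peers in a `value` array, pass that array in.

```python
def classify_peers(peers):
    """Label each BGP peer as session-down, connected-no-routes, or healthy."""
    findings = {}
    for peer in peers:
        routes = int(peer["routesReceived"])
        if peer["state"] != "Connected":
            findings[peer["neighbor"]] = "session-down"
        elif routes == 0:
            findings[peer["neighbor"]] = "connected-no-routes"
        else:
            findings[peer["neighbor"]] = "healthy"
    return findings


def control_plane_suspected(peers):
    """True when every peer is Connected yet none has delivered a route."""
    findings = classify_peers(peers)
    return bool(findings) and all(v == "connected-no-routes" for v in findings.values())


# Values taken from the peer status table above
observed = [
    {"neighbor": "10.10.1.1", "state": "Connected", "routesReceived": 0},
    {"neighbor": "10.10.1.5", "state": "Connected", "routesReceived": 0},
]

print(classify_peers(observed))
print(control_plane_suspected(observed))
```

Requiring all peers to match keeps a single flapping session from being misread as a gateway-wide control-plane failure.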
Checking gateway resource health:
az resource show \
--ids "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/rg-hub-westus2/providers/Microsoft.Network/virtualNetworkGateways/er-gateway-hub" \
--query "properties.provisioningState"
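This returns "Succeeded" even when the control plane is broken, because provisioningState records the outcome of the last management operation rather than runtime health. The management-plane health check polls BGP session state and the gateway's VM-level health; neither catches a FIB programming failure.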
Phase 3 — BGP Troubleshooting
Run this diagnostic flowchart to confirm the failure mode before deciding on a remediation path:
KQL — BGP UPDATE processing in gateway diagnostic logs:
AzureDiagnostics
| where ResourceType == "VIRTUALNETWORKGATEWAYS"
| where Category == "GatewayDiagnosticLog"
| where TimeGenerated > ago(2h)
| where Message contains "BGP" or Message contains "route"
| project TimeGenerated, OperationName, Message
| order by TimeGenerated desc
If this query returns keepalive logs but no UPDATE RECEIVED entries for several hours or days, the gateway has stopped processing UPDATE messages while still maintaining the BGP TCP session.
The BGP route flow under normal operation:
In a control-plane failure, steps 1 through 3 complete successfully (BGP TCP session up, keepalives exchanging). The failure is at step 4: the control-plane processor parsing UPDATE message TLVs and installing routes into the Forwarding Information Base (FIB). Azure's management plane health checks poll the BGP session state and the gateway's VM-level health — neither catches a FIB programming failure.
Root Cause Pattern
Control-plane route table corruption following a platform firmware update is the documented root cause for this failure pattern. The gateway's non-zone-redundant SKU (Standard) can receive a firmware update that leaves the BGP route processor in a degraded state: it accepts TCP-level BGP session establishment and keepalive traffic but silently drops UPDATE packets without logging the failure or surfacing it via the resource health API.
The degraded state persists until the gateway's internal route refresh timer fires, at which point it clears the stale route table and attempts to re-learn all routes from scratch. With the UPDATE processor still degraded, the re-learn produces an empty routing table. Effective routes on every spoke VNet NIC go to zero and traffic drops.
Why BGP shows Connected: The BGP TCP session and keepalive functions run on a separate process from the UPDATE handler. The keepalive process is healthy; the UPDATE handler is not.
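A toy model makes the split between session handling and UPDATE handling concrete. This is not how the gateway is implemented and it skips most of the BGP state machine; `ToyBgpSpeaker` and its fields are hypothetical names used only to show how a session can look healthy while the route table stays empty.

```python
class ToyBgpSpeaker:
    """Minimal model: session/keepalive handling is independent of UPDATE handling."""

    def __init__(self, update_handler_healthy=True):
        self.update_handler_healthy = update_handler_healthy
        self.session_state = "Idle"
        self.keepalives_seen = 0
        self.rib = set()  # prefixes installed from UPDATE messages

    def receive(self, msg_type, prefixes=()):
        if msg_type == "OPEN":
            # Simplification: real BGP passes through OpenSent/OpenConfirm first
            self.session_state = "Established"
        elif msg_type == "KEEPALIVE":
            self.keepalives_seen += 1
        elif msg_type == "UPDATE":
            if self.update_handler_healthy:
                self.rib.update(prefixes)
            # Degraded handler: the UPDATE is dropped silently, nothing is logged


degraded = ToyBgpSpeaker(update_handler_healthy=False)
degraded.receive("OPEN")
degraded.receive("KEEPALIVE")
degraded.receive("UPDATE", ["10.10.0.0/16", "10.20.0.0/16"])

print(degraded.session_state, degraded.keepalives_seen, len(degraded.rib))
```

Any monitor that reads only `session_state` sees a healthy peer here, which is why the checks in this guide look at learned routes instead.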
Decision: Reset vs Recreate vs Failover
| Option | Time estimate | Risk |
|---|---|---|
| Gateway reset (soft restart) | ~5 min | May not fix firmware-level failure |
| Gateway delete + recreate | ~45 min | Long outage extension |
| Regional failover to secondary | ~25 min | Low — if failover circuit is provisioned |
| Wait for Azure support | Unknown | Unacceptable for P1 |
If you have a pre-provisioned failover circuit in a secondary region, regional failover is the most dependable way to restore traffic. A gateway reset is quicker to try, so attempt it first if you want a fast fix, but be prepared to proceed to failover if routes have not returned within 5 minutes.
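One possible way to encode this decision for a runbook is shown below. It is only one reading of the table and the paragraph above, and the function name, parameters, and returned labels are hypothetical.

```python
def next_action(learned_routes, reset_attempted, failover_ready):
    """Suggest the next remediation step from the current gateway state."""
    if learned_routes > 0:
        return "monitor"  # gateway is learning routes again
    if not reset_attempted:
        return "reset"  # cheapest attempt, roughly 5 minutes
    if failover_ready:
        return "failover"  # pre-provisioned secondary region
    return "recreate"  # no failover path: rebuild in place


print(next_action(learned_routes=0, reset_attempted=False, failover_ready=True))
print(next_action(learned_routes=0, reset_attempted=True, failover_ready=True))
print(next_action(learned_routes=0, reset_attempted=True, failover_ready=False))
```

After a failover, the primary gateway still needs the rebuild described later; this helper only covers the order in which to restore traffic.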
Failover Execution
Step 1 — Confirm the failover gateway is healthy:
az network vnet-gateway list-bgp-peer-status \
--resource-group rg-hub-eastus \
--name er-gateway-eastus \
--output table
Verify State: Connected and RoutesReceived is non-zero before proceeding.
Step 2 — Prepend AS path on primary CE to drain traffic toward the failover circuit:
ASR1001X-primary# conf t
ASR1001X-primary(config)# route-map AZURE-OUT permit 10
ASR1001X-primary(config-route-map)# set as-path prepend 65010 65010 65010
ASR1001X-primary(config-route-map)# exit
ASR1001X-primary(config)# router bgp 65010
ASR1001X-primary(config-router)# neighbor 10.0.0.4 route-map AZURE-OUT out
ASR1001X-primary(config-router)# neighbor 10.0.0.5 route-map AZURE-OUT out
ASR1001X-primary(config-router)# end
ASR1001X-primary# clear ip bgp 10.0.0.4 soft out
ASR1001X-primary# clear ip bgp 10.0.0.5 soft out
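To see why prepending shifts the preferred path, here is a small sketch of the two relevant BGP tie-breakers: AS path length is compared before MED. It ignores every other step of best-path selection, and the MED values (100 and 200) are illustrative assumptions, not values from this environment.

```python
def prepend(as_path, asn, count):
    """Return a new AS path with asn prepended count times."""
    return [asn] * count + list(as_path)


def preferred_path(paths):
    """paths maps a name to (as_path, med). Shorter AS path wins; lower MED breaks ties."""
    return min(paths, key=lambda name: (len(paths[name][0]), paths[name][1]))


# Before: equal AS path length, so the lower MED on the primary circuit wins
before = {
    "primary": ([65010], 100),
    "failover": ([65010], 200),
}

# After: three prepends on the primary make its AS path four hops long
after = {
    "primary": (prepend([65010], 65010, 3), 100),
    "failover": ([65010], 200),
}

print(preferred_path(before))
print(preferred_path(after))
```

Because AS path length is evaluated first, the prepend overrides the MED preference without touching the MED configuration.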
Step 3 — Update spoke VNet peering to use the secondary region's hub as the gateway transit:
# Disable gateway transit from the broken primary hub
az network vnet peering update \
--resource-group rg-spoke-manufacturing \
--vnet-name vnet-spoke-manufacturing \
--name peer-to-hub-westus2 \
--set useRemoteGateways=false
# Add peering to the secondary hub with gateway transit
az network vnet peering create \
--resource-group rg-spoke-manufacturing \
--vnet-name vnet-spoke-manufacturing \
--name peer-to-hub-eastus-failover \
--remote-vnet /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/rg-hub-eastus/providers/Microsoft.Network/virtualNetworks/vnet-hub-eastus \
--allow-vnet-access true \
--use-remote-gateways true
Verify connectivity restored:
ping 10.1.4.22 repeat 100 timeout 2
# Expected: 100/100 success
Gateway Rebuild
Gateway reset (first attempt):
az network vnet-gateway reset \
--resource-group rg-hub-westus2 \
--name er-gateway-hub
After reset, re-check route learning:
az network vnet-gateway list-learned-routes \
--resource-group rg-hub-westus2 \
--name er-gateway-hub \
--output json | jq '.value | length'
If still 0, proceed to delete and recreate. Take this opportunity to upgrade to a Zone-Redundant SKU.
Delete and recreate with Zone-Redundant SKU:
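# Delete the broken gateway (traffic is on the failover path, no additional outage)
# Note: connection resources that reference the gateway must be deleted first and recreated after the rebuild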
az network vnet-gateway delete \
--resource-group rg-hub-westus2 \
--name er-gateway-hub
# Recreate with Zone-Redundant SKU
az network vnet-gateway create \
--resource-group rg-hub-westus2 \
--name er-gateway-hub \
--location westus2 \
--vnet vnet-hub-westus2 \
--gateway-type ExpressRoute \
--sku ErGw1AZ \
--public-ip-address pip-er-gateway-hub-1 pip-er-gateway-hub-2 \
--no-wait
After recreation (approximately 45 minutes), verify full route learning:
az network vnet-gateway list-learned-routes \
--resource-group rg-hub-westus2 \
--name er-gateway-hub \
--output json | jq '.value | length'
# Expected: same as pre-failure baseline
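The "same as baseline" comparison can be scripted so the check is not done by eye during an incident. This is a minimal sketch: the helper name and the 2% tolerance are assumptions, and the baseline of 621 is the learned-route figure reported in this guide's results table.

```python
def route_learning_restored(learned, baseline, tolerance=0.02):
    """True when the learned route count is within tolerance of the pre-failure baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive route count")
    return learned >= baseline * (1 - tolerance)


BASELINE = 621  # pre-failure learned routes, from the results table

print(route_learning_restored(621, BASELINE))
print(route_learning_restored(0, BASELINE))
```

A small tolerance allows for routes that were legitimately withdrawn during the outage; set it to 0 if an exact match is required before reverting the peering.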
Then revert the spoke VNet peering to use the rebuilt primary gateway and remove the failover peering.
Results After Remediation
| Metric | Before failover | After failover | After rebuild |
|---|---|---|---|
| Cross-prem connectivity | ✗ Down | ✓ Via East US | ✓ Via West US 2 |
| Gateway learned routes | 0 | 621 (East US) | 621 (West US 2) |
| Gateway SKU | Standard | Standard (East US) | ErGw1AZ |
| Zone-redundant | No | No | Yes |
Alerting to Add
Alert 1 — Zero routes learned (catches this exact failure mode):
az monitor metrics alert create \
--resource-group rg-hub-westus2 \
--name "ER-GW-ZeroLearnedRoutes" \
--scopes "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/rg-hub-westus2/providers/Microsoft.Network/virtualNetworkGateways/er-gateway-hub" \
--condition "avg ExpressRouteGatewayCountOfRoutesLearnedFromPeer < 100" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 1 \
--description "ER Gateway learned fewer than 100 BGP routes — possible control-plane failure"
Alert 2 — Log Analytics: BGP UPDATE absence detection:
// Alert if no BGP UPDATE RECEIVED log entry appears in the last 30 minutes.
// This query does not itself verify that the BGP session is Connected;
// pair it with a session-state check.
AzureDiagnostics
| where ResourceType == "VIRTUALNETWORKGATEWAYS"
| where Category == "GatewayDiagnosticLog"
| where TimeGenerated > ago(30m)
| where Message contains "UPDATE RECEIVED"
| summarize UpdateCount = count()
| where UpdateCount == 0
This query, run as a Log Analytics scheduled alert every 30 minutes, catches the silent degradation mode before it causes a full outage.
High Availability Recommendations
Change 1 — Zone-Redundant Gateway SKU (ErGw1AZ)
The Standard SKU runs both gateway instances in the same fault domain. A firmware failure affecting that fault domain takes out both instances simultaneously. ErGw1AZ distributes instances across Availability Zones — a zone failure or zone-scoped firmware issue cannot take out the entire gateway.
Change 2 — BGP route count metric alert
An alert on ExpressRouteGatewayCountOfRoutesLearnedFromPeer < 100 fires within about 5 minutes of the route table dropping to near zero, the length of the alert's evaluation window. That is typically well before application-layer monitoring detects the traffic drop.
Change 3 — Automated BGP UPDATE audit
The Log Analytics scheduled alert shown above catches the silent pre-failure degradation mode. Add it: it costs very little and catches a failure mode that Azure's built-in health APIs do not surface.
Change 4 — Pre-tested failover runbook
Document and test the full failover procedure (the commands in the Failover Execution section above) as an Azure Automation runbook. Target: under 10 minutes via automation. Untested runbooks take 20+ minutes under pressure.
Change 5 — Global VNet Peering pre-provisioned
Keep the Global VNet Peering between the primary and secondary Hub VNets active at all times. Spoke VNet peering changes are the only manual steps needed during failover.
Prevention Checklist
| Check | Action |
|---|---|
| Gateway SKU | Use ErGw1AZ, ErGw2AZ, or ErGw3AZ — never Standard for production |
| BGP route count alert | Alert when ExpressRouteGatewayCountOfRoutesLearnedFromPeer < 50% of expected baseline |
| BGP UPDATE audit | Log Analytics scheduled query every 30 min during business hours |
| Failover circuit | Pre-provisioned hot-standby circuit in a second region |
| Failover runbook | Tested automation that completes failover in < 10 minutes |
| Peering pre-provisioned | Global VNet Peering between Hub VNets in both regions always active |
| Gateway diagnostic logs | Enable GatewayDiagnosticLog on all ExpressRoute Gateways |
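The "50% of expected baseline" threshold in the checklist can be derived from a measured baseline rather than hard-coded. A minimal sketch follows, assuming the 621-route baseline reported earlier in this guide; the helper name and its default fraction are hypothetical.

```python
import math


def route_alert_threshold(baseline_routes, fraction=0.5):
    """Return the route count below which the alert should fire."""
    if baseline_routes <= 0:
        raise ValueError("baseline_routes must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return math.floor(baseline_routes * fraction)


print(route_alert_threshold(621))  # 310
```

Recomputing the threshold whenever the baseline changes keeps the alert meaningful as prefixes are added or retired on-premises.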
Key Takeaways
1. BGP session state is not a health signal for the data plane.
A BGP session can be in the Established state while the route programming pipeline is completely broken. Design your monitoring to check learned routes and effective routes, not just session state.
2. The route count metric is your early warning system.
ExpressRouteGatewayCountOfRoutesLearnedFromPeer dropping to zero is unambiguous. The PacketsPerSecond metric dropping to zero is a consequence: it tells you the outage has started, not that it's about to start. Monitor the cause, not just the effect.
3. Gateway reset may not fix firmware-level failures.
az network vnet-gateway reset is correct for stuck BGP sessions and transient disconnects. It is unlikely to clear firmware-layer corruption, so if learned routes are still zero after the reset, skip to recreate.
4. Zone-redundant SKUs are not optional for production ExpressRoute.
The Standard SKU's single fault domain is exactly the weakness a firmware failure exposes. ErGw1AZ distributes instances across Availability Zones, so a zone failure or zone-scoped firmware issue leaves the instances in the other zones available to carry traffic.
5. Failover circuits must be tested, not just provisioned.
A pre-provisioned failover circuit that has never been tested can take 20+ minutes under pressure. With a documented runbook and one practice run, the same procedure takes under 10 minutes.
6. Silent failures need active detection.
A control-plane UPDATE processing failure is invisible to standard monitoring until the route refresh timer fires and traffic drops. The BGP UPDATE audit query costs almost nothing and catches this failure mode days before it surfaces.