AZUREFIXES
How to Diagnose Azure ExpressRoute Gateway Control-Plane Failure

If your ExpressRoute Gateway shows BGP: Connected in the portal but cross-premises traffic is completely down, the gateway may have a control-plane failure. This is one of the hardest failure modes to diagnose because every surface-level health indicator — BGP session state, gateway provisioning state, resource health — shows green while the data plane is dropping all traffic.

This guide covers how to confirm the failure, how to fail over to a secondary region, how to rebuild the gateway, and the HA architecture changes that prevent recurrence.


Architecture Overview

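This guide assumes a dual ExpressRoute circuit topology: one primary circuit (West US 2) and one failover circuit (East US), both connected to the ExpressRoute Gateway in the primary region. The failover steps later in this guide also rely on a second hub and gateway in East US (vnet-hub-eastus, er-gateway-eastus). On-premises runs two CE routers in BGP AS 65010. Azure runs AS 65515.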

Reference architecture: dual ExpressRoute circuits, active-active BGP sessions, Hub-Spoke VNet layout

Normal traffic flow:

  • On-premises sites connect to the primary ASR via MPLS
  • Primary circuit (West US 2): active, preferred path via BGP MED
  • Failover circuit (East US): hot-standby, accepts traffic within 30 seconds if primary fails
  • Hub VNet (10.0.0.0/16) peers to spoke VNets via VNet peering
  • ExpressRoute Gateway: ErGw1AZ SKU, Zone-Redundant (recommended; see HA section)

Under normal conditions, the gateway processes roughly 800 BGP prefixes: about 620 learned from on-prem and about 180 Azure routes advertised back to the CE routers.


Identifying the Failure

The symptom pattern that indicates a control-plane failure rather than a BGP session or circuit problem:

What the portal shows (all green, all wrong):

Component                         Status shown
--------------------------------  --------------
ExpressRoute Circuit (primary)    Provisioned ✓
ExpressRoute Circuit (failover)   Provisioned ✓
ExpressRoute Gateway              Succeeded
BGP peer (primary CE)             Connected
BGP peer (failover CE)            Connected
Gateway health probe              Healthy

What you actually have: zero data-plane traffic.

The alert that catches this is ExpressRouteGatewayPacketsPerSecond = 0 for more than 2 minutes. If you don't have this alert, add it — it is the only metric that catches a data-plane failure independently of the BGP session state.
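
The "zero for more than 2 minutes" condition can be sketched in Python as follows. This is a minimal illustration of the alert logic, not Azure Monitor's evaluation engine; the function name `sustained_zero` and the `(timestamp, value)` sample format are assumptions made for the example.

```python
from datetime import datetime, timedelta


def sustained_zero(samples, min_duration=timedelta(minutes=2)):
    """Return True if the trailing run of zero samples spans more than min_duration.

    samples: list of (timestamp, packets_per_second) tuples, oldest first.
    """
    run_start = None
    for ts, pps in samples:
        if pps == 0:
            # Remember where the current run of zeros began.
            if run_start is None:
                run_start = ts
        else:
            # Any non-zero sample ends the run.
            run_start = None
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_duration
```

A single zero sample does not trigger the condition; only a run of zeros that is still ongoing at the latest sample and longer than the window does.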


Phase 1 — Confirming the Scope

Before changing anything, confirm what is and isn't working.

From on-premises (CE router):

ASR1001X-primary# show bgp summary
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd
10.0.0.4        4 65515    4821    4819     2841    0    0  2d09h        188
10.0.0.5        4 65515    4820    4818     2841    0    0  2d09h        188

If BGP sessions on the CE router show Up with normal prefix counts, the failure is on the Azure side.
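
If you collect the CE output automatically, the check above can be scripted. The sketch below parses the summary table shown earlier and reports whether every neighbor is up with prefixes received. It assumes the 10-column IOS layout from the sample output; `parse_bgp_summary` and `ce_sessions_healthy` are hypothetical helper names.

```python
import re

IPV4 = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def parse_bgp_summary(text):
    """Map each neighbor IP to its State/PfxRcd column (int prefix count or state string)."""
    peers = {}
    for line in text.splitlines():
        fields = line.split()
        # Neighbor rows start with an IPv4 address and carry at least 10 columns.
        if len(fields) >= 10 and IPV4.fullmatch(fields[0]):
            last = fields[9]
            peers[fields[0]] = int(last) if last.isdigit() else last
    return peers


def ce_sessions_healthy(peers, min_prefixes=1):
    """True when every neighbor shows a numeric prefix count of at least min_prefixes."""
    return bool(peers) and all(
        isinstance(v, int) and v >= min_prefixes for v in peers.values()
    )
```

A neighbor whose last column is a state name such as Idle or Active, rather than a number, is treated as not established.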

Reachability test from on-prem:

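! Ping an Azure spoke VM and record how many of the 100 probes succeed
ping 10.1.4.22 repeat 100 timeout 2

Loss while the CE-side BGP sessions stay Up points to a routing-plane issue on the Azure side rather than a session failure. In the failure mode described here, expect total loss once the on-premises routes disappear from Azure's route tables.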
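
To classify the ping result without reading it by eye, a small parser is enough. This sketch relies on the standard IOS summary line ("Success rate is N percent (received/sent)"); the function name `classify_ping` and its return labels are illustrative assumptions.

```python
import re


def classify_ping(output):
    """Classify IOS ping output as 'no loss', 'partial loss', 'total loss' or 'unknown'."""
    m = re.search(r"Success rate is \d+ percent \((\d+)/(\d+)\)", output)
    if m is None:
        return "unknown"
    received, sent = int(m.group(1)), int(m.group(2))
    if received == sent:
        return "no loss"
    if received == 0:
        return "total loss"
    return "partial loss"
```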

From Azure — check effective routes on a spoke VM NIC:

az network nic show-effective-route-table \
  --resource-group rg-spoke-manufacturing \
  --name nic-integration-vm-01 \
  --output table | grep "10.10\|10.20\|10.30"

If this returns empty — no on-premises routes in the spoke VM's effective route table — that is the confirmation. Azure VMs have no route to on-premises despite the BGP sessions showing healthy.
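
The same check can be done on the JSON output instead of grepping the table. The sketch below assumes the command is run with `--output json` and returns an object with a `value` array whose entries carry `addressPrefix` (a list) and `state`; it also assumes the on-premises ranges are 10.10.0.0/16, 10.20.0.0/16 and 10.30.0.0/16, inferred from the grep pattern above. Adjust both to your environment.

```python
import ipaddress
import json

# Assumed on-premises supernets, inferred from the grep pattern in this guide.
ON_PREM = [
    ipaddress.ip_network(p)
    for p in ("10.10.0.0/16", "10.20.0.0/16", "10.30.0.0/16")
]


def on_prem_routes(route_table_json):
    """Return the active effective-route prefixes that fall inside the on-prem ranges."""
    doc = json.loads(route_table_json)
    found = []
    for route in doc.get("value", []):
        if route.get("state") != "Active":
            continue
        for prefix in route.get("addressPrefix", []):
            net = ipaddress.ip_network(prefix, strict=False)
            # subnet_of() requires matching IP versions, so skip IPv6 prefixes.
            if net.version == 4 and any(net.subnet_of(s) for s in ON_PREM):
                found.append(prefix)
    return found
```

An empty list from `on_prem_routes` is the scripted equivalent of the empty grep result.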


Phase 2 — Gateway Diagnostics

Check what the gateway actually knows:

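# How many routes has the gateway learned from on-prem over BGP?
# The command returns an object with a "value" array, so count its eBGP-learned entries.
az network vnet-gateway list-learned-routes \
  --resource-group rg-hub-westus2 \
  --name er-gateway-hub \
  --output json | jq '[.value[] | select(.origin == "EBgp")] | length'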

If this returns 0, the gateway has a control-plane failure. A BGP session can be in the Established state (keepalives exchanging, hold timer refreshing) while the route table is completely empty due to a control-plane processing failure.

# Check the BGP peer status from Azure's side
az network vnet-gateway list-bgp-peer-status \
  --resource-group rg-hub-westus2 \
  --name er-gateway-hub \
  --output table
Neighbor    ASN    State        ConnectedDuration    RoutesReceived    MessagesSent    MessagesReceived
----------  -----  -----------  -------------------  ----------------  --------------  ----------------
10.10.1.1   65010  Connected    2.09:14:32           0                 4819            4821
10.10.1.5   65010  Connected    2.09:13:58           0                 4818            4820

RoutesReceived: 0 on both peers while State: Connected is the definitive sign of a control-plane failure. The gateway accepted the TCP sessions and is exchanging keepalives but has not processed a single UPDATE message from either CE router.
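
The "Connected but zero routes" signature is easy to test for in code. The following sketch classifies each peer from the JSON form of the peer-status output. It assumes a `value` array with `neighbor`, `state` and `routesReceived` fields; the verdict labels and function names are made up for this example.

```python
import json


def classify_peers(peer_status_json):
    """Map each BGP neighbor to 'session-down', 'connected-no-routes' or 'healthy'."""
    verdicts = {}
    for peer in json.loads(peer_status_json).get("value", []):
        if peer.get("state") != "Connected":
            verdict = "session-down"
        elif int(peer.get("routesReceived", 0)) == 0:
            verdict = "connected-no-routes"
        else:
            verdict = "healthy"
        verdicts[peer["neighbor"]] = verdict
    return verdicts


def control_plane_failure_suspected(verdicts):
    """True when every peer is Connected yet none has delivered a route."""
    return bool(verdicts) and all(
        v == "connected-no-routes" for v in verdicts.values()
    )
```

With the two peers from the table above, both are classified as connected-no-routes and the second function returns True.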

Checking gateway provisioning state (often mistaken for a health check):

az resource show \
  --ids "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/rg-hub-westus2/providers/Microsoft.Network/virtualNetworkGateways/er-gateway-hub" \
  --query "properties.provisioningState"

This will return "Succeeded" even when the control plane is broken. The management plane health check polls BGP session state and the gateway's VM-level health — neither catches a FIB programming failure.


Phase 3 — BGP Troubleshooting

Run this diagnostic flowchart to confirm the failure mode before deciding on a remediation path:

The 5-step diagnostic flowchart. If BGP session state (step 1) passes but route count (step 2) is zero, you have a control-plane failure — proceed to failover.

KQL — BGP UPDATE processing in gateway diagnostic logs:

AzureDiagnostics
// VNet ExpressRoute gateways are Microsoft.Network/virtualNetworkGateways resources
| where ResourceType == "VIRTUALNETWORKGATEWAYS"
| where Category == "GatewayDiagnosticLog"
| where TimeGenerated > ago(2h)
| where Message contains "BGP" or Message contains "route"
| project TimeGenerated, OperationName, Message
| order by TimeGenerated desc

If this query returns keepalive logs but no UPDATE RECEIVED entries for several hours or days, the gateway has stopped processing UPDATE messages while still maintaining the BGP TCP session.

The BGP route flow under normal operation:

BGP route programming flow. A control-plane failure breaks at step 4 — UPDATE processing. Routes are never installed into the FIB despite the session appearing healthy.

In a control-plane failure, steps 1 through 3 complete successfully (BGP TCP session up, keepalives exchanging). The failure is at step 4: the control-plane processor parsing UPDATE message TLVs and installing routes into the Forwarding Information Base (FIB). Azure's management plane health checks poll the BGP session state and the gateway's VM-level health — neither catches a FIB programming failure.


Root Cause Pattern

The root cause behind this failure pattern is control-plane route table corruption following a platform firmware update. On a non-zone-redundant SKU (Standard), the update can leave the BGP route processor in a degraded state: it still establishes BGP sessions over TCP and exchanges keepalives, but it silently drops UPDATE messages without logging the failure or surfacing it through the resource health API.

The degraded state persists until the gateway's internal route refresh timer fires, at which point it clears the stale route table and attempts to re-learn all routes from scratch. With the UPDATE processor still degraded, the re-learn produces an empty routing table. Effective routes on every spoke VNet NIC go to zero and traffic drops.

Why BGP shows Connected: The BGP TCP session and keepalive functions run on a separate process from the UPDATE handler. The keepalive process is healthy; the UPDATE handler is not.
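
The sequence described in this section can be reproduced with a toy model. The class below is purely illustrative and is not how Azure implements the gateway; it only shows why a session can stay Connected while the route table ends up empty once the refresh timer fires. All names are hypothetical.

```python
class ToyGateway:
    """Toy model for illustration only; not Azure's implementation."""

    def __init__(self):
        self.session_state = "Idle"
        self.update_handler_healthy = True
        self.routes = set()

    def on_open(self):
        # Session establishment is handled separately from UPDATE processing.
        self.session_state = "Connected"

    def on_keepalive(self):
        # Keepalives never touch the route table; the session just stays up.
        return self.session_state

    def on_update(self, prefixes):
        if self.update_handler_healthy:
            self.routes.update(prefixes)
        # A degraded handler drops the UPDATE silently: no error, no log entry.

    def on_refresh_timer(self, peer_prefixes):
        # Clear the (possibly stale) table, then try to re-learn everything.
        self.routes.clear()
        self.on_update(peer_prefixes)
```

Setting `update_handler_healthy` to False after routes have been learned leaves the stale routes in place, which is why traffic keeps flowing until `on_refresh_timer` runs.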


Decision: Reset vs Recreate vs Failover

Option                           Time estimate   Risk
-------------------------------  --------------  ----------------------------------------
Gateway reset (soft restart)     ~5 min          May not fix firmware-level failure
Gateway delete + recreate        ~45 min         Long outage extension
Regional failover to secondary   ~25 min         Low, if failover circuit is provisioned
Wait for Azure support           Unknown         Unacceptable for P1

If you have a pre-provisioned failover circuit in a secondary region, regional failover is the most reliable path to restoring traffic. A gateway reset is quicker to try (about 5 minutes), so attempt it first if you want a quick fix, but proceed to failover if learned routes have not returned within 5 minutes.
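
One way to encode this decision for a runbook is a small pure function. This is one possible reading of the guidance above, not a definitive policy; the function name, parameters and returned labels are assumptions for the sketch.

```python
def next_action(routes_learned, reset_attempted, minutes_since_reset, failover_ready):
    """Suggest the next remediation step from the current gateway state."""
    if routes_learned > 0:
        # Routes are back: nothing to remediate, keep watching.
        return "monitor"
    if not reset_attempted:
        # The reset is the cheapest option, so try it once.
        return "reset"
    if minutes_since_reset < 5:
        # Give the reset its 5-minute window before escalating.
        return "wait"
    # Reset did not help: fail over if a circuit is ready, otherwise rebuild.
    return "failover" if failover_ready else "recreate"
```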

Regional failover: traffic reroutes from the failed primary gateway to the healthy secondary via the pre-provisioned failover circuit

Failover Execution

Step 1 — Confirm the failover gateway is healthy:

az network vnet-gateway list-bgp-peer-status \
  --resource-group rg-hub-eastus \
  --name er-gateway-eastus \
  --output table

Verify State: Connected and RoutesReceived is non-zero before proceeding.

Step 2 — Prepend AS path on primary CE to drain traffic toward the failover circuit:

ASR1001X-primary# conf t
ASR1001X-primary(config)# route-map AZURE-OUT permit 10
ASR1001X-primary(config-route-map)# set as-path prepend 65010 65010 65010
ASR1001X-primary(config-route-map)# exit
ASR1001X-primary(config)# router bgp 65010
ASR1001X-primary(config-router)# neighbor 10.0.0.4 route-map AZURE-OUT out
ASR1001X-primary(config-router)# neighbor 10.0.0.5 route-map AZURE-OUT out
ASR1001X-primary(config-router)# end
ASR1001X-primary# clear ip bgp 10.0.0.4 soft out
ASR1001X-primary# clear ip bgp 10.0.0.5 soft out
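
Why prepending drains traffic: BGP best-path selection compares AS-path length before MED, so lengthening the primary's AS path makes the failover path win even though the primary has the better MED. The sketch below models only those two tie-breakers (real BGP evaluates weight and local preference first), and the MED values are hypothetical.

```python
def best_path(paths):
    """Pick the preferred path: shortest AS path first, then lowest MED.

    paths: list of dicts with 'name', 'as_path' (list of ASNs) and 'med'.
    """
    return min(paths, key=lambda p: (len(p["as_path"]), p["med"]))["name"]


# Before prepending: equal AS-path length, so the lower MED (primary) wins.
before = [
    {"name": "primary", "as_path": [65010], "med": 100},
    {"name": "failover", "as_path": [65010], "med": 200},
]

# After 'set as-path prepend 65010 65010 65010' on the primary CE.
after = [
    {"name": "primary", "as_path": [65010, 65010, 65010, 65010], "med": 100},
    {"name": "failover", "as_path": [65010], "med": 200},
]
```

Calling `best_path(before)` selects the primary path, while `best_path(after)` selects the failover path.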

Step 3 — Update spoke VNet peering to use the secondary region's hub as the gateway transit:

# Disable gateway transit from the broken primary hub
az network vnet peering update \
  --resource-group rg-spoke-manufacturing \
  --vnet-name vnet-spoke-manufacturing \
  --name peer-to-hub-westus2 \
  --set useRemoteGateways=false

# Add peering to the secondary hub with gateway transit.
# The hub-side peering (vnet-hub-eastus to this spoke) must exist with
# gateway transit allowed, or --use-remote-gateways is rejected.
az network vnet peering create \
  --resource-group rg-spoke-manufacturing \
  --vnet-name vnet-spoke-manufacturing \
  --name peer-to-hub-eastus-failover \
  --remote-vnet /subscriptions/${SUBSCRIPTION_ID}/resourceGroups/rg-hub-eastus/providers/Microsoft.Network/virtualNetworks/vnet-hub-eastus \
  --allow-vnet-access true \
  --use-remote-gateways true

Verify connectivity restored:

ping 10.1.4.22 repeat 100 timeout 2
# Expected: 100/100 success

Gateway Rebuild

Gateway reset (first attempt):

az network vnet-gateway reset \
  --resource-group rg-hub-westus2 \
  --name er-gateway-hub

After reset, re-check route learning:

az network vnet-gateway list-learned-routes \
  --resource-group rg-hub-westus2 \
  --name er-gateway-hub \
  --output json | jq '[.value[] | select(.origin == "EBgp")] | length'

If still 0, proceed to delete and recreate. Take this opportunity to upgrade to a Zone-Redundant SKU.

Delete and recreate with Zone-Redundant SKU:

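# Delete the broken gateway (traffic is on the failover path, so no additional outage).
# Azure refuses to delete a gateway that still has connections: remove the
# circuit connections first and recreate them after the rebuild.
az network vnet-gateway delete \
  --resource-group rg-hub-westus2 \
  --name er-gateway-hub

# Recreate with a Zone-Redundant SKU.
# An ExpressRoute gateway takes a single public IP (two are for active-active
# VPN gateways); AZ SKUs need a Standard-SKU public IP.
az network vnet-gateway create \
  --resource-group rg-hub-westus2 \
  --name er-gateway-hub \
  --location westus2 \
  --vnet vnet-hub-westus2 \
  --gateway-type ExpressRoute \
  --sku ErGw1AZ \
  --public-ip-address pip-er-gateway-hub-1 \
  --no-wait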

After recreation (approximately 45 minutes), verify full route learning:

az network vnet-gateway list-learned-routes \
  --resource-group rg-hub-westus2 \
  --name er-gateway-hub \
  --output json | jq '[.value[] | select(.origin == "EBgp")] | length'
# Expected: same as pre-failure baseline
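
Rather than re-running the command by hand, you can poll until the route count returns to baseline. The sketch below keeps the Azure call out of the function: `fetch_count` is a hypothetical callable you would implement around the CLI command above. The timeout and interval defaults are arbitrary.

```python
import time


def wait_for_routes(fetch_count, baseline, timeout_s=300, interval_s=30, sleep=time.sleep):
    """Poll fetch_count() until it reaches baseline or timeout_s elapses.

    fetch_count: zero-argument callable returning the current learned-route count.
    Returns True once the baseline is reached, False on timeout.
    """
    waited = 0
    while True:
        if fetch_count() >= baseline:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter makes the loop testable without real delays.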

Then revert the spoke VNet peering to use the rebuilt primary gateway and remove the failover peering.


Results After Remediation

Metric                    Before failover   After failover       After rebuild
------------------------  ----------------  -------------------  ------------------
Cross-prem connectivity   ✗ Down            ✓ Via East US        ✓ Via West US 2
Gateway learned routes    0                 621 (East US)        621 (West US 2)
Gateway SKU               Standard          Standard (East US)   ErGw1AZ
Zone-redundant            No                No                   Yes

Alerting to Add

Alert 1 — Zero routes learned (catches this exact failure mode):

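# Confirm the exact metric name your gateway exposes before creating the alert:
#   az monitor metrics list-definitions --resource <gateway-resource-id>
# (Azure Monitor lists a learned-route metric named
#  ExpressRouteGatewayCountOfRoutesLearnedFromPeer; substitute it if that is
#  what list-definitions returns.)
az monitor metrics alert create \
  --resource-group rg-hub-westus2 \
  --name "ER-GW-ZeroLearnedRoutes" \
  --scopes "/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/rg-hub-westus2/providers/Microsoft.Network/virtualNetworkGateways/er-gateway-hub" \
  --condition "avg ExpressRouteBgpPeerRouteCount < 100" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1 \
  --description "ER Gateway learned fewer than 100 BGP routes: possible control-plane failure"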

Alert 2 — Log Analytics: BGP UPDATE absence detection:

// Alert if no BGP UPDATE RECEIVED log entry appears in the last 30 minutes.
// This query does not itself check that the BGP session is Connected;
// pair it with the peer-status check before treating a hit as a failure.
AzureDiagnostics
| where ResourceType == "VIRTUALNETWORKGATEWAYS"
| where Category == "GatewayDiagnosticLog"
| where TimeGenerated > ago(30m)
| where Message contains "UPDATE RECEIVED"
| summarize UpdateCount = count()
| where UpdateCount == 0

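This query, run as a Log Analytics scheduled alert every 30 minutes, catches the silent degradation mode before it causes a full outage. BGP sends UPDATE messages only when routes change, so on a very stable network widen the window to avoid false positives.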
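
The absence check, including the "session is Connected" precondition, can be expressed in a few lines of Python. This is a sketch of the logic only; the `(timestamp, message)` log format, the message text and the function name `update_silence` are assumptions.

```python
from datetime import datetime, timedelta


def update_silence(log_entries, now, session_connected, window=timedelta(minutes=30)):
    """True when the session is Connected but no UPDATE was logged inside the window.

    log_entries: iterable of (timestamp, message) tuples.
    """
    if not session_connected:
        # A down session is a different failure with its own alert.
        return False
    cutoff = now - window
    return not any(
        ts > cutoff and "UPDATE RECEIVED" in msg for ts, msg in log_entries
    )
```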


High Availability Recommendations

Post-incident HA reference architecture: zone-redundant gateways in two regions, dual circuits per region, automatic BGP health monitoring

Change 1 — Zone-Redundant Gateway SKU (ErGw1AZ)

The Standard SKU runs both gateway instances in the same fault domain. A firmware failure affecting that fault domain takes out both instances simultaneously. ErGw1AZ distributes instances across Availability Zones — a zone failure or zone-scoped firmware issue cannot take out the entire gateway.

Change 2 — BGP route count metric alert

ExpressRouteBgpPeerRouteCount < 100 fires within about 5 minutes (the alert window) of the route table dropping to near zero, typically before application-layer monitoring detects the traffic drop.

Change 3 — Automated BGP UPDATE audit

The Log Analytics scheduled alert shown above catches the silent pre-failure degradation mode. Add it: it costs very little and catches a failure mode that Azure's built-in health APIs do not surface.

Change 4 — Pre-tested failover runbook

Document and test the full failover procedure (the commands in the Failover Execution section above) as an Azure Automation runbook. Target: under 10 minutes via automation. Untested runbooks tend to take 20+ minutes under pressure.
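
When rehearsing the runbook, it helps to time each step against the 10-minute target. The harness below is a minimal sketch: each step is a hypothetical callable that returns True on success, and the run stops at the first failure. The names and the result format are assumptions.

```python
import time


def run_runbook(steps, target_s=600, clock=time.monotonic):
    """Run (name, action) steps in order, timing each; stop at the first failure."""
    start = clock()
    timings = []
    for name, action in steps:
        t0 = clock()
        ok = action()
        timings.append((name, clock() - t0, ok))
        if not ok:
            break
    total = clock() - start
    return {
        "timings": timings,
        "total_s": total,
        # Within target only if every executed step succeeded and we beat the clock.
        "within_target": all(ok for _, _, ok in timings) and total <= target_s,
    }
```

The per-step timings show where a rehearsal loses time, which is usually more useful than the total alone.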

Change 5 — Global VNet Peering pre-provisioned

Keep the Global VNet Peering between the primary and secondary Hub VNets active at all times. Spoke VNet peering changes are the only manual steps needed during failover.


Prevention Checklist

Check                     Action
------------------------  ---------------------------------------------------------------------
Gateway SKU               Use ErGw1AZ, ErGw2AZ, or ErGw3AZ; never Standard for production
BGP route count alert     Alert when ExpressRouteBgpPeerRouteCount < 50% of expected baseline
BGP UPDATE audit          Log Analytics scheduled query every 30 min during business hours
Failover circuit          Pre-provisioned hot-standby circuit in a second region
Failover runbook          Tested automation that completes failover in < 10 minutes
Peering pre-provisioned   Global VNet Peering between Hub VNets in both regions always active
Gateway diagnostic logs   Enable GatewayDiagnosticLog on all ExpressRoute Gateways
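
The checklist recommends a threshold relative to the baseline rather than a fixed number. The arithmetic is simple, and the sketch below shows how a 50% threshold behaves compared with a fixed threshold of 100 for a baseline of about 620 routes. The function names are hypothetical and the averaging only approximates how a metric alert evaluates a window.

```python
def alert_threshold(baseline_routes, fraction=0.5):
    """Route-count threshold as a fraction of the expected baseline."""
    return int(baseline_routes * fraction)


def below_threshold(samples, threshold):
    """True when the window average of route-count samples is under the threshold."""
    if not samples:
        # No data is a separate condition; do not treat it as a low count here.
        return False
    return sum(samples) / len(samples) < threshold


# With a baseline of 620 routes the 50% threshold is 310.
threshold = alert_threshold(620)
```

A drop from 620 to 150 routes trips the 310 threshold but would not trip a fixed threshold of 100, which is the case the relative threshold is meant to catch.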

Key Takeaways

1. BGP session state is not a health signal for the data plane.

A BGP session can be in the Established state while the route programming pipeline is completely broken. Design your monitoring to check learned routes and effective routes, not just session state.

2. The route count metric is your early warning system.

ExpressRouteBgpPeerRouteCount dropping to zero is unambiguous. The PacketsPerSecond metric dropping to zero is a consequence — it tells you the outage has started, not that it's about to start. Monitor the cause, not just the effect.

3. Gateway reset does not fix firmware-level failures.

az network vnet-gateway reset is correct for stuck BGP sessions and transient disconnects. It does not fix firmware-layer corruption. Know when to skip to recreate.

4. Zone-redundant SKUs are not optional for production ExpressRoute.

The Standard SKU's single fault domain is exactly the weakness a firmware failure exposes. ErGw1AZ distributes instances across Availability Zones, so a zone-scoped failure leaves a healthy instance in another zone to carry traffic.

5. Failover circuits must be tested, not just provisioned.

A pre-provisioned failover circuit that has never been tested can take 20+ minutes under pressure. With a documented runbook and one practice run, the same procedure takes under 10 minutes.

6. Silent failures need active detection.

A control-plane UPDATE processing failure is invisible to standard monitoring until the route refresh timer fires and traffic drops. The BGP UPDATE audit query costs almost nothing and catches this failure mode days before it surfaces.