Fixing Azure VPN Gateway CPU Spikes: A Route Explosion RCA
At 02:47 UTC on a Tuesday, our on-call alert fired: VPN Gateway CPU on both active-active instances above 90% for 15 minutes. BGP sessions were flapping. Two site-to-site tunnels had already dropped. Workloads that depended on cross-premises connectivity were timing out.
This is the RCA, the fix, and the architectural change we made to prevent it recurring.
The Symptom
Alert: AzureVPNGateway CPU Utilization > 85% for 15 minutes
What we saw in the portal:
- Both gateway instances: 92–97% CPU
- BGP peer state: Connecting (flapping every 3–5 minutes)
- Active tunnels: dropped from 6 to 4
- Traffic: ~340 Mbps total across tunnels (well under the VpnGw1 throughput limit of 650 Mbps)
The traffic volume ruled out a throughput problem immediately. CPU at 95% with only 340 Mbps of traffic meant the gateway was spending cycles on something other than forwarding packets.
BGP session drops in the Activity Log:
2026-06-17T02:41:08Z BGPPeerSessionDown peer=10.10.1.1 gateway=vpn-hub-eastus
2026-06-17T02:44:12Z BGPPeerSessionDown peer=10.10.1.1 gateway=vpn-hub-eastus
2026-06-17T02:47:39Z BGPPeerSessionDown peer=10.10.1.1 gateway=vpn-hub-eastus
The BGP session to the primary on-premises CE router was resetting on a cycle. Each reset triggers a full route table exchange on reconnect, which is exactly when CPU spikes — creating a feedback loop where high CPU causes a BGP drop, which triggers a reconvergence, which causes high CPU.
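The reset cadence can be read straight off the timestamps. The following is a minimal Python sketch that uses only the three log lines quoted above; the `flap_intervals` helper is illustrative and not part of any Azure tooling.

```python
from datetime import datetime

# The three BGPPeerSessionDown entries quoted from the Activity Log above.
LOG_LINES = [
    "2026-06-17T02:41:08Z BGPPeerSessionDown peer=10.10.1.1 gateway=vpn-hub-eastus",
    "2026-06-17T02:44:12Z BGPPeerSessionDown peer=10.10.1.1 gateway=vpn-hub-eastus",
    "2026-06-17T02:47:39Z BGPPeerSessionDown peer=10.10.1.1 gateway=vpn-hub-eastus",
]


def flap_intervals(lines):
    """Return the gaps, in seconds, between consecutive session-down events."""
    stamps = [
        datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%SZ") for line in lines
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


intervals = flap_intervals(LOG_LINES)
print(intervals)  # [184, 207]
```

Gaps of a little over three minutes are consistent with the 3–5 minute flapping seen in the portal.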
Investigation
Step 1 — Check Route Count
# Show BGP peer state and per-peer route counts
az network vnet-gateway list-bgp-peer-status \
--resource-group rg-hub-eastus \
--name vpn-hub-eastus \
--output table
# Get learned routes (all peers combined)
az network vnet-gateway list-learned-routes \
--resource-group rg-hub-eastus \
--name vpn-hub-eastus \
--output json | jq '.value | length'
Output: 2,412
That was the number. 2,412 BGP routes being processed, stored, and programmed into every NIC across every VM in every peered VNet. VpnGw1 is not designed to handle that volume of BGP state.
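Because the count above combines all peers, a per-peer breakdown helps confirm which neighbor is responsible. This is a rough Python sketch, assuming each route object in the CLI's JSON output carries a `sourcePeer` field; the three-route payload is made up for illustration.

```python
import json
from collections import Counter


def routes_per_peer(raw_json):
    """Count learned routes per BGP peer from list-learned-routes JSON output."""
    data = json.loads(raw_json)
    # Accept either a bare list or an object wrapping the list under "value".
    routes = data["value"] if isinstance(data, dict) else data
    return Counter(route["sourcePeer"] for route in routes)


# Hypothetical payload, for illustration only.
sample = json.dumps({
    "value": [
        {"network": "10.10.1.0/30", "sourcePeer": "10.10.1.1"},
        {"network": "10.10.1.4/30", "sourcePeer": "10.10.1.1"},
        {"network": "10.0.0.0/16", "sourcePeer": "10.0.0.4"},
    ]
})

counts = routes_per_peer(sample)
print(counts["10.10.1.1"], counts["10.0.0.4"])  # 2 1
```

In practice the input would be the saved output of `az network vnet-gateway list-learned-routes ... --output json`.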
Step 2 — Examine the Routes
# Sample the first 20 routes to see the pattern
az network vnet-gateway list-learned-routes \
--resource-group rg-hub-eastus \
--name vpn-hub-eastus \
--output json | jq '[.value[].network] | sort_by(split("/")[0] | split(".") | map(tonumber)) | .[0:20]'
[
"10.10.1.0/30",
"10.10.1.4/30",
"10.10.1.8/30",
"10.10.1.12/30",
"10.10.1.16/30",
"10.10.1.20/30",
"10.10.1.24/30",
"10.10.1.28/30",
"10.10.2.0/30",
"10.10.2.4/30",
"10.10.2.8/30",
"10.10.2.12/30",
...
]
Every /30 point-to-point link and /32 loopback in the on-premises network was being individually advertised into BGP. The network team had migrated from a static routing setup six months earlier and simply exported the full on-prem IP plan into BGP without any aggregation.
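To see how much of this is pure fragmentation, the sampled prefixes can be collapsed with Python's standard `ipaddress` module. This is only a sketch over the twelve routes shown above, not the full table, and `summarize` is an illustrative helper name.

```python
import ipaddress

# The twelve prefixes visible in the sample output above.
SAMPLE = [
    "10.10.1.0/30", "10.10.1.4/30", "10.10.1.8/30", "10.10.1.12/30",
    "10.10.1.16/30", "10.10.1.20/30", "10.10.1.24/30", "10.10.1.28/30",
    "10.10.2.0/30", "10.10.2.4/30", "10.10.2.8/30", "10.10.2.12/30",
]


def summarize(prefixes):
    """Merge adjacent and overlapping prefixes into the fewest covering networks."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


print(summarize(SAMPLE))  # ['10.10.1.0/27', '10.10.2.0/28']
```

Twelve advertisements shrink to two even with exact merging. `collapse_addresses` only joins prefixes that are exactly contiguous, so it is more conservative than a router-side aggregate, which covers the whole supernet regardless of gaps.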
Step 3 — Confirm the Gateway SKU Limit
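Azure documents a limit of 4,000 prefixes per BGP peer, beyond which the session is dropped, so our route count was nominally within limits. In practice, VpnGw1 showed stress far sooner: we now treat roughly 100 prefixes per peer as a safe working ceiling, and we were at about 2,400. The behavior we observed at this scale is not documented: the gateway keeps processing BGP updates, but route programming to Azure VMs slows down, BGP keepalive processing gets deprioritized, and sessions eventually time out.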
Step 4 — Check for Other Contributors
# Custom IPsec/IKE policies per connection (an empty result means Azure defaults);
# rekey storms can also spike CPU
az network vpn-connection list \
--resource-group rg-hub-eastus \
--query '[].{name:name, ipsecPolicies:ipsecPolicies}' \
--output json
# Connection count
az network vpn-connection list \
--resource-group rg-hub-eastus \
--output table
Six active site-to-site connections, all with default IKE settings (rekey interval: 8 hours). No IKE rekey storm — the six connections had staggered rekey times from different establishment dates. The BGP route count was the sole root cause.
The Fix
Fix 1 — Route Summarization on the On-Premises CE Router
The on-prem team had two Cisco IOS-XE CE routers (one per site). We configured route aggregation on both to suppress the specific /30 and /32 routes and only advertise supernet summaries.
Cisco IOS-XE — aggregate address configuration:
router bgp 65000
address-family ipv4 unicast
aggregate-address 10.10.0.0 255.255.0.0 summary-only
aggregate-address 10.20.0.0 255.255.0.0 summary-only
aggregate-address 10.30.0.0 255.255.0.0 summary-only
aggregate-address 172.16.0.0 255.252.0.0 summary-only
aggregate-address 192.168.0.0 255.255.252.0 summary-only
no auto-summary
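summary-only suppresses the more-specific routes, so only the aggregates are advertised to BGP peers. An aggregate is originated only while at least one more-specific route inside it exists in the BGP table. This reduced the advertised prefix count from 2,412 to 12.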
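Before applying aggregates like these, it is worth checking that every prefix you expect to suppress actually falls inside one of them. Below is a small Python sketch using the five aggregates from the configuration above; `covering_aggregate` is an illustrative helper and `10.40.0.0/30` is a hypothetical uncovered prefix.

```python
import ipaddress

# Aggregates from the aggregate-address lines above (address + netmask).
AGGREGATES = [
    ipaddress.ip_network(f"{addr}/{mask}")
    for addr, mask in [
        ("10.10.0.0", "255.255.0.0"),
        ("10.20.0.0", "255.255.0.0"),
        ("10.30.0.0", "255.255.0.0"),
        ("172.16.0.0", "255.252.0.0"),
        ("192.168.0.0", "255.255.252.0"),
    ]
]


def covering_aggregate(prefix):
    """Return the aggregate containing `prefix`, or None if none covers it."""
    net = ipaddress.ip_network(prefix)
    for agg in AGGREGATES:
        if net.subnet_of(agg):
            return agg
    return None


print([str(a) for a in AGGREGATES])
print(covering_aggregate("10.10.1.4/30"))  # 10.10.0.0/16
print(covering_aggregate("10.40.0.0/30"))  # None (hypothetical uncovered prefix)
```

Any prefix that returns `None` would still be advertised individually, because `summary-only` only suppresses routes that sit inside a configured aggregate.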
Verify on the Azure side after applying:
az network vnet-gateway list-learned-routes \
--resource-group rg-hub-eastus \
--name vpn-hub-eastus \
--output json | jq 'length'
# Expected: 12
Fix 2 — Add a BGP Route Filter on Azure (Defense in Depth)
Even with summarization on the on-prem side, we added a second layer of prefix control as a safety net, so that a future on-prem misconfiguration cannot flood the gateway again. How this is done depends on whether the connection uses BGP.
For connections that do not use BGP (static or policy-based routing), pin the on-premises address space on the Local Network Gateway:
# Statically define the on-premises prefixes for this site
az network local-gateway update \
--resource-group rg-hub-eastus \
--name lng-onprem-primary \
--local-address-prefixes "10.10.0.0/16" "10.20.0.0/16" "10.30.0.0/16" \
"172.16.0.0/14" "192.168.0.0/22"
For route-based connections with BGP, filter what the on-prem router advertises to Azure using a prefix-list and an outbound route-map:
ip prefix-list AZURE-OUT seq 10 permit 0.0.0.0/0 le 22
ip prefix-list AZURE-OUT seq 20 deny 0.0.0.0/0 le 32
route-map LIMIT-AZURE-BGP permit 10
match ip address prefix-list AZURE-OUT
router bgp 65000
address-family ipv4 unicast
neighbor 10.0.0.4 route-map LIMIT-AZURE-BGP out
This stops the router from advertising any prefix longer than /22 to the Azure peer, so a fat-fingered change cannot leak host routes into the gateway.
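The `le 22` match in the prefix-list is just a prefix-length comparison. Here is a minimal Python sketch of that rule; `permitted_by_le` is an illustrative name, and it models only the length check, not a full prefix-list evaluation.

```python
import ipaddress


def permitted_by_le(prefix, max_len=22):
    """Emulate `permit 0.0.0.0/0 le 22`: allow any prefix of length <= max_len."""
    return ipaddress.ip_network(prefix).prefixlen <= max_len


for candidate in ["10.10.0.0/16", "192.168.0.0/22", "10.10.1.0/30", "10.10.1.1/32"]:
    print(candidate, permitted_by_le(candidate))
```

The two summaries pass, while the /30 link and the /32 loopback are rejected, which is the behavior the filter is meant to enforce.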
Fix 3 — Upgrade the Gateway SKU
With summarization in place, VpnGw1 would now be within specification. However, with six active tunnels and traffic trending upward, we took the opportunity to upgrade to VpnGw2 during the next maintenance window.
# Resize the gateway (causes ~30 seconds of downtime per instance in active-standby;
# active-active: brief failover to the other instance)
az network vnet-gateway update \
--resource-group rg-hub-eastus \
--name vpn-hub-eastus \
--sku VpnGw2
VpnGw2 provides:
- 1 Gbps aggregate throughput (vs 650 Mbps on VpnGw1)
- Higher BGP route table capacity
- More IKE SA processing headroom
The resize took 18 minutes. BGP sessions reconverged within 90 seconds of the resize completing.
Results
| Metric | Before | After |
|---|---|---|
| BGP prefixes learned | 2,412 | 12 |
| VPN GW CPU (avg) | 92–97% | 35–42% |
| BGP session flaps (per hour) | 4–6 | 0 |
| Active tunnels | 4 of 6 | 6 of 6 |
| VPN GW SKU | VpnGw1 | VpnGw2 |
Alerting We Added After the Incident
Alert 1 — BGP route count threshold:
az monitor metrics alert create \
--resource-group rg-hub-eastus \
--name "VNG-BGP-RouteCount-High" \
--scopes "/subscriptions/.../resourceGroups/rg-hub-eastus/providers/Microsoft.Network/virtualNetworkGateways/vpn-hub-eastus" \
--condition "avg BgpRoutesLearned > 200" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--action-group "/subscriptions/.../resourceGroups/rg-hub-eastus/providers/Microsoft.Insights/actionGroups/ag-network"
Alert 2 — VPN Gateway CPU:
az monitor metrics alert create \
--resource-group rg-hub-eastus \
--name "VNG-CPU-High" \
--scopes "/subscriptions/.../resourceGroups/rg-hub-eastus/providers/Microsoft.Network/virtualNetworkGateways/vpn-hub-eastus" \
--condition "avg TunnelAverageBandwidth < 1000" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 1 \
--action-group "/subscriptions/.../resourceGroups/rg-hub-eastus/providers/Microsoft.Insights/actionGroups/ag-network"
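The native bandwidth metrics (such as TunnelAverageBandwidth, used above) capture tunnel throughput but not CPU, so Alert 2 is only a proxy: it fires when throughput collapses, which is what a saturated gateway looks like from the outside. For CPU itself, query the GatewayDiagnosticLog category in Log Analytics. The query below assumes the CPU percentage arrives in Message under the GatewayPerformanceMonitor operation; confirm that against the records in your own workspace before alerting on it: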
AzureDiagnostics
| where ResourceType == "VIRTUALNETWORKGATEWAYS"
| where Category == "GatewayDiagnosticLog"
| where OperationName == "GatewayPerformanceMonitor"
| extend cpuPct = toint(Message)
| summarize maxCPU = max(cpuPct) by bin(TimeGenerated, 5m)
| where maxCPU > 80
| order by TimeGenerated desc
Prevention Checklist
Before connecting a new on-premises network to Azure VPN Gateway:
| Check | Command |
|---|---|
| Audit BGP prefix count from the CE router | show ip bgp summary → count prefixes |
| Verify summarization is configured | show ip bgp neighbors x.x.x.x advertised-routes → read "Total number of prefixes", target <50 |
| Set prefix limits on CE router | neighbor x.x.x.x maximum-prefix 100 warning-only (caps prefixes received from the peer) |
| Choose correct VNG SKU for scale | VpnGw2+ if >4 tunnels or >650 Mbps or >100 BGP routes |
| Enable BGP diagnostics in Azure | Portal → VPN GW → Diagnostics → GatewayDiagnosticLog → enable |
| Set CPU alert before go-live | Metric: TunnelAverageBandwidth (proxy) + GatewayDiagnosticLog query |
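The SKU row in the checklist is a simple rule of thumb and can be encoded as a pre-flight check. This is a sketch of our own thresholds from the table, not an Azure sizing formula, and `needs_vpngw2_or_higher` is an illustrative name.

```python
def needs_vpngw2_or_higher(tunnels, throughput_mbps, bgp_routes):
    """Apply the checklist rule: >4 tunnels, >650 Mbps, or >100 BGP routes."""
    return tunnels > 4 or throughput_mbps > 650 or bgp_routes > 100


# Values from the incident: 6 tunnels, ~340 Mbps, 2,412 routes.
print(needs_vpngw2_or_higher(6, 340, 2412))  # True
# After summarization (12 routes) the tunnel count alone still trips the rule.
print(needs_vpngw2_or_higher(6, 340, 12))    # True
# Hypothetical small site, for contrast.
print(needs_vpngw2_or_higher(3, 200, 40))    # False
```

By this rule our gateway qualified for VpnGw2 on tunnel count alone, independent of the route explosion.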
Why VpnGw1 Specifically Struggles With Route Bloat
The VPN Gateway appliance processes BGP in-line on the same compute resources that handle IKE negotiations and IPSec encryption. On VpnGw1, these resources are limited enough that a large BGP routing table causes the BGP process to consume CPU cycles that should be handling keepalives and data-path encryption.
When the BGP process can't get keepalives out within the hold time (Azure VPN Gateway uses a 60-second keepalive and a 180-second hold timer), the remote peer declares the session dead and resets it. That matches the roughly three-minute reset cadence in the Activity Log. The reset triggers a full BGP table exchange on reconnect. That exchange is even more CPU-intensive than steady-state route maintenance, because the entire table is uploaded in one batch, which is why the feedback loop is so destructive.
The fix isn't really about making BGP "faster" on the gateway. It's about keeping the route table small enough that BGP processing is trivial compared to the gateway's available CPU budget.
Lessons Learned
Route summarization is not optional in BGP-based VPN topologies. Any on-premises migration from static routing to BGP must include an aggregation design review before cutover.
The BGP route limit for VpnGw1 is effectively ~100 prefixes per peer, not the documented maximum. At scale, the gateway starts showing stress well before hitting a hard ceiling.
Active-active gateways don't protect against CPU-based failures. Both instances run the same BGP process against the same route table. If the table is too large, both instances spike simultaneously.
Add a BGP route count alert before connecting new sites. The alert should fire at 100–200 prefixes, well before the 2,000+ range where we saw instability.
A resize to VpnGw2 is low-risk and fast. 18 minutes of maintenance window, ~30 seconds of actual downtime in active-active mode, and a meaningful increase in headroom for the cost of roughly $100/month more than VpnGw1.