Fixing Calico BGP Peer 'Connection Refused': Why Your BGP Session Won't Establish and How to Resolve It
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Calico's BIRD BGP daemon cannot reach the peer IP on TCP/179 — the remote end is actively refusing the connection, meaning BGP sessions are down and inter-node pod routing is black-holing traffic.
- How to fix it: Verify the peer IP, AS numbers, node-to-node firewall rules on port 179, and whether bgp is enabled in the target node's Calico config. Confirm BIRD is running on the peer.
- Use our Client-Side Sandbox below to auto-refactor this: paste your BGPPeer YAML or calicoctl node status output and get a corrected config without sending your topology to a third-party AI.
The Incident (What does the error mean?)
Raw output from calicoctl node status:
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+----------+-------------+
| 10.0.1.45 | node-to-node mesh | start | 04:12:33 | Connect |
| 192.168.10.1 | global | start | 04:12:33 | Active |
+---------------+-------------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
Or from BIRD logs (kubectl logs -n kube-system <calico-node-pod> -c calico-node):
2024-01-15 04:13:01.456 [ERROR] Connection to BGP neighbor 10.0.1.45 failed: Connection refused
2024-01-15 04:13:01.457 [INFO] BGP session with 10.0.1.45 moved to ACTIVE state
Immediate consequence: BGP is stuck in Connect or Active state and never reaches Established. Calico is not exchanging routes. No pod on the affected node can reach pods on the peer node. Cross-node calls made from kubectl exec sessions hang. Services backed by pods on the unreachable node return connection timeouts. This is a full network partition for affected node pairs.
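To pull the failing peers out of that table in a script, a small awk filter is usually enough. The sketch below assumes the five-column layout shown in the sample output above; unestablished_peers is a hypothetical helper name, not a calicoctl feature.

```shell
# unestablished_peers: hypothetical helper, not part of calicoctl.
# Reads `calicoctl node status` output on stdin and prints
# "<peer> <state> <info>" for every peer whose INFO is not Established.
unestablished_peers() {
  awk -F'|' '
    # Data rows split into 7 fields; border lines (+---+) contain no pipes.
    NF >= 6 && $2 !~ /PEER ADDRESS/ {
      peer = $2; state = $4; info = $6
      gsub(/^ +| +$/, "", peer)
      gsub(/^ +| +$/, "", state)
      gsub(/^ +| +$/, "", info)
      if (info !~ /^Established/) print peer, state, info
    }'
}
```

Usage, on a node where calicoctl can run: calicoctl node status | unestablished_peers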
The Attack Vector / Blast Radius
This is not a flap — Connection refused means TCP/179 is being actively rejected by the remote host's kernel or firewall. This is distinct from a timeout (firewall drop) or a BGP NOTIFICATION (protocol-level rejection).
Cascading failure chain:
- BIRD on the local node attempts TCP handshake to peer:179.
- Remote kernel sends RST — port closed or firewall REJECT rule hit.
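- BIRD backs off and retries on its connect retry timer. The interval comes from the BIRD config that Calico renders, so check it there rather than assuming a fixed value.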
- No BGP routes are exchanged, so BIRD learns no remote workload routes and installs none in the kernel routing table. Felix only programs routes for local workloads.
- All cross-node pod traffic is unroutable. In vxlan mode this is masked slightly longer; in native BGP mode it's instant.
- If this is a ToR/upstream router peer (global BGP peer for on-prem), your entire cluster loses external reachability, not just cross-node pod traffic.
In multi-tenant clusters, this silently breaks network policies that depend on cross-node enforcement, creating a false sense of isolation while traffic is simply dropped rather than policy-denied.
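The missing-routes step in the chain above can be checked on the affected node by looking for kernel routes tagged proto bird that point at the peer. This is a minimal sketch, assuming native BGP or IP-in-IP mode where BIRD installs the remote pod routes; routes_via_peer is a hypothetical helper name.

```shell
# routes_via_peer PEER_IP: hypothetical helper.
# Reads `ip route` output on stdin and prints the destination of every
# route that BIRD installed (proto bird) with PEER_IP as the next hop.
routes_via_peer() {
  awk -v peer="$1" '
    / proto bird/ {
      for (i = 1; i < NF; i++)
        if ($i == "via" && $(i + 1) == peer) { print $1; break }
    }'
}
```

Usage: ip route | routes_via_peer 10.0.1.45. Empty output while the peer hosts pods is consistent with the session never reaching Established.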
How to Fix It
Step 0: Confirm the exact failure mode
# From the affected calico-node pod
kubectl exec -n kube-system <calico-node-pod> -- nc -zv <PEER_IP> 179
# "Connection refused" = port closed or REJECT rule
# Timeout = DROP rule or wrong IP
# Success = BIRD config issue, not network
# Check BIRD daemon is alive on the PEER node
kubectl exec -n kube-system <calico-node-pod-on-peer> -- pgrep -a bird
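When the probe runs across many nodes, it can help to map the nc message to the three cases above automatically. The sketch below is one way to do that; classify_probe is a hypothetical helper, and the matched phrases are assumptions because different nc builds word their messages differently.

```shell
# classify_probe MESSAGE: hypothetical helper.
# Maps the text printed by `nc -zv <PEER_IP> 179` to the Step 0 cases.
# The phrases matched here are assumptions; adjust them to your nc build.
classify_probe() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *refused*)                      echo "refused: port closed or REJECT rule" ;;
    *"timed out"*|*timeout*)        echo "timeout: DROP rule or wrong IP" ;;
    *succeeded*|*open*|*connected*) echo "open: BIRD config issue, not network" ;;
    *)                              echo "unknown: inspect the raw output" ;;
  esac
}
```

Usage: classify_probe "$(kubectl exec -n kube-system <calico-node-pod> -- nc -zv <PEER_IP> 179 2>&1)"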
Basic Fix — Firewall / Security Group on TCP/179
The most common cause in cloud environments (AWS, GCP, Azure) is missing inbound rules for BGP.
# AWS Security Group (Terraform)
resource "aws_security_group_rule" "node_bgp" {
type = "ingress"
- # Missing BGP rule — nodes can't peer
+ from_port = 179
+ to_port = 179
+ protocol = "tcp"
+ self = true
security_group_id = aws_security_group.nodes.id
}
For iptables-based hosts:
- # No rule permitting TCP/179
+ iptables -I INPUT -p tcp --dport 179 -s <PEER_CIDR> -j ACCEPT
+ iptables -I OUTPUT -p tcp --sport 179 -d <PEER_CIDR> -j ACCEPT
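Before inserting a new rule, it is worth checking whether an existing rule already decides the fate of TCP/179. The sketch below reads iptables -S INPUT output and reports the target of the first rule that names destination port 179. first_bgp_verdict is a hypothetical helper, and it deliberately ignores multiport matches and catch-all REJECT or DROP rules, so treat an empty result as "no explicit rule", not "allowed".

```shell
# first_bgp_verdict: hypothetical helper.
# Reads `iptables -S INPUT` output on stdin and prints the target (-j ...)
# of the first rule that matches destination port 179 explicitly.
first_bgp_verdict() {
  awk '
    / --dport 179( |$)/ {
      for (i = 1; i < NF; i++)
        if ($i == "-j") { print $(i + 1); exit }
    }'
}
```

Usage, as root on the peer host: iptables -S INPUT | first_bgp_verdict. A REJECT result matches the "Connection refused" symptom; DROP would show up as a timeout instead.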
Basic Fix — Mismatched AS Numbers
# BGPPeer YAML
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
name: peer-to-tor-switch
spec:
peerIP: 192.168.10.1
- asNumber: 64512
+ asNumber: 65001 # Must match the AS configured on the ToR switch
node: worker-node-01
Basic Fix — BGP disabled on node or wrong node selector
# BGPConfiguration
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
name: default
spec:
- nodeToNodeMeshEnabled: false
+ nodeToNodeMeshEnabled: true
asNumber: 64512
Or if using node-specific peer with wrong selector:
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
name: rack1-peer
spec:
peerIP: 10.0.1.45
asNumber: 65000
- nodeSelector: "rack == 'rack2'" # Wrong rack label, peer node excluded
+ nodeSelector: "rack == 'rack1'"
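A quick way to see which nodes a selector such as rack == 'rack1' would pick up is to list the nodes carrying that label. The sketch below assumes the Kubernetes datastore, where Calico node labels mirror Kubernetes node labels; nodes_with_label is a hypothetical helper and only handles simple key == 'value' selectors.

```shell
# nodes_with_label KEY VALUE: hypothetical helper.
# Reads `kubectl get nodes --show-labels --no-headers` output on stdin and
# prints the name of every node whose label list contains KEY=VALUE.
nodes_with_label() {
  awk -v want="$1=$2" '
    {
      n = split($NF, labels, ",")   # last column is the comma-separated label list
      for (i = 1; i <= n; i++)
        if (labels[i] == want) { print $1; break }
    }'
}
```

Usage: kubectl get nodes --show-labels --no-headers | nodes_with_label rack rack1. If the node that should peer with 10.0.1.45 is missing from the output, the selector or the node label needs fixing.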
Enterprise Best Practice — Structured BGP peer config with password auth and explicit selectors
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
name: upstream-tor-rack1
spec:
peerIP: 10.100.0.1
asNumber: 65000
+ password:
+ secretKeyRef:
+ name: bgp-peer-secret
+ key: password # MD5 TCP session auth — prevents session hijack
+ keepOriginalNextHop: false
+ maxRestartTime: 60s
- nodeSelector: all()
+ # Scope to rack nodes only. The comment must stay outside the selector string.
+ nodeSelector: "kubernetes.io/hostname in {'worker-01', 'worker-02'}"
Verify after applying:
calicoctl node status
# Target: STATE = up, INFO = Established
Prevention in CI/CD
1. Validate BGPPeer manifests with OPA/Conftest before apply
# policy/bgp_peer.rego
package calico.bgppeer
deny[msg] {
input.kind == "BGPPeer"
not input.spec.asNumber
msg := "BGPPeer must specify asNumber explicitly"
}
deny[msg] {
input.kind == "BGPPeer"
input.spec.nodeSelector == "all()"
msg := "BGPPeer nodeSelector 'all()' is too broad for production — scope to specific nodes"
}
conftest test bgppeer.yaml --policy policy/
2. Smoke-test BGP state in post-deploy pipeline
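#!/bin/bash
# ci/verify-bgp.sh: run after Calico config changes (on a cluster node, as root)
# Fail loudly if calicoctl itself errors, so an empty result is not read as "all peers up".
if ! STATUS=$(calicoctl node status 2>&1); then
  echo "FATAL: calicoctl node status failed: $STATUS"
  exit 1
fi
# Count peer rows that are still in a non-Established state.
UNESTABLISHED=$(echo "$STATUS" | grep -v Established | grep -c 'start\|Active\|Connect')
if [ "$UNESTABLISHED" -gt 0 ]; then
  echo "FATAL: $UNESTABLISHED BGP peer(s) not established after deploy"
  exit 1
fi
echo "All BGP peers established."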
3. Terraform — enforce security group BGP rule presence
# checkov ignore is NOT acceptable here — enforce with policy
# Add to your Checkov custom checks:
# CKV_CUSTOM_BGP_01: "Ensure node security group allows TCP/179 intra-group"
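Run Checkov in CI, loading the custom check from its directory (./checkov_checks is an assumed path; use wherever your custom checks live):
checkov -d ./terraform --external-checks-dir ./checkov_checks --check CKV_CUSTOM_BGP_01 --compact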
4. Alert on BGP session flaps via Prometheus
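The rule below assumes your metrics pipeline exposes felix_bgp_num_peers and felix_bgp_num_established_peers. Confirm both names against the metrics endpoint of your Calico version before relying on them, since BGP session metrics are not available in every Calico distribution. Alert when the two values diverge: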
# prometheus-rule.yaml
groups:
- name: calico-bgp
rules:
- alert: CalicoBGPPeerNotEstablished
expr: felix_bgp_num_peers - felix_bgp_num_established_peers > 0
for: 2m
labels:
severity: critical
annotations:
summary: "BGP peer down on {{ $labels.instance }} — pod routing degraded"