
Fixing Calico BGP Peer 'Connection Refused': Why Your BGP Session Won't Establish and How to Resolve It

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: Calico's BIRD BGP daemon cannot reach the peer IP on TCP/179 — the remote end is actively refusing the connection, meaning BGP sessions are down and inter-node pod routing is black-holing traffic.
  • How to fix it: Verify the peer IP and AS numbers, the node-to-node firewall rules for TCP/179, and whether BGP is enabled in the target node's Calico config. Confirm BIRD is running on the peer.

The Incident (What does the error mean?)

Raw output from calicoctl node status:

Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 10.0.1.45     | node-to-node mesh | start | 04:12:33 | Connect     |
| 192.168.10.1  | global            | start | 04:12:33 | Active      |
+---------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

Or from BIRD logs (kubectl logs -n kube-system <calico-node-pod> -c calico-node; operator-based installs use the calico-system namespace instead):

2024-01-15 04:13:01.456 [ERROR] Connection to BGP neighbor 10.0.1.45 failed: Connection refused
2024-01-15 04:13:01.457 [INFO]  BGP session with 10.0.1.45 moved to ACTIVE state

Immediate consequence: the BGP session sits in Connect or Active state and never reaches Established, so Calico is not exchanging routes. No pod on the affected node can reach pods on the peer node. Cross-node pod-to-pod connections hang, and Services backed by pods on the unreachable node return connection timeouts. For the affected node pairs, this is a full network partition.
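
To pull the unhealthy peers out of that table in a script, a small filter over the calicoctl node status output may be enough. The sketch below assumes the column layout shown in the raw output above; list_down_peers is a hypothetical helper name, not part of calicoctl.

```shell
#!/usr/bin/env bash
# Print "<peer address> <state> <info>" for every peer whose INFO column
# is not "Established". Reads `calicoctl node status` output on stdin.
list_down_peers() {
  awk -F'|' '
    # Data rows split into 7 fields on "|": "", addr, type, state, since, info, "".
    # Border lines (+----+) contain no "|" and the header row is skipped by name.
    NF >= 6 && $2 !~ /PEER ADDRESS/ {
      addr = $2; state = $4; info = $6
      gsub(/^[ \t]+|[ \t]+$/, "", addr)
      gsub(/^[ \t]+|[ \t]+$/, "", state)
      gsub(/^[ \t]+|[ \t]+$/, "", info)
      if (info !~ /^Established/) print addr, state, info
    }'
}

# Usage (on the node itself): calicoctl node status | list_down_peers
```

An empty result means every listed peer reports Established; it does not prove that any peers are configured at all.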


The Attack Vector / Blast Radius

This is not a flap — Connection refused means TCP/179 is being actively rejected by the remote host's kernel or firewall. This is distinct from a timeout (firewall drop) or a BGP NOTIFICATION (protocol-level rejection).

Cascading failure chain:

  1. BIRD on the local node attempts TCP handshake to peer:179.
  2. Remote kernel sends RST — port closed or firewall REJECT rule hit.
  3. BIRD backs off, retries on exponential timer (up to 120s intervals).
  4. No BGP routes are exchanged, so BIRD learns no remote workload routes and installs none into the kernel routing table.
  5. All cross-node pod traffic that depends on BGP-learned routes becomes unroutable. In VXLAN mode Felix programs the remote routes itself, so pod-to-pod traffic is largely unaffected; in unencapsulated or IP-in-IP mode the impact is immediate.
  6. If this is a ToR/upstream router peer (global BGP peer for on-prem), your entire cluster loses external reachability — not just cross-node pod traffic.
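
Step 4 of the chain can be checked directly on a node: routes installed by BIRD are tagged proto bird in ip route output. The following is a minimal sketch under that assumption; count_remote_bird_routes is a hypothetical helper name, and it skips the blackhole entries Calico adds for the node's own address blocks.

```shell
#!/usr/bin/env bash
# Count kernel routes learned from remote peers via BIRD.
# Reads `ip route` output on stdin.
count_remote_bird_routes() {
  awk '
    # Keep only BIRD-installed routes that point at another node;
    # local block routes appear as "blackhole ... proto bird".
    /proto bird/ && $1 != "blackhole" { n++ }
    END { print n + 0 }
  '
}

# Usage (on the affected node): ip route | count_remote_bird_routes
```

A count of 0 on a node that should have BGP peers is consistent with the session never reaching Established.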

In multi-tenant clusters, this silently breaks network policies that depend on cross-node enforcement, creating a false sense of isolation while traffic is simply dropped rather than policy-denied.


How to Fix It

Step 0: Confirm the exact failure mode

# From the affected calico-node pod
kubectl exec -n kube-system <calico-node-pod> -- nc -zv <PEER_IP> 179
# "Connection refused" = port closed or REJECT rule
# Timeout = DROP rule or wrong IP
# Success = BIRD config issue, not network

# Check BIRD daemon is alive on the PEER node
kubectl exec -n kube-system <calico-node-pod-on-peer> -- pgrep -a bird

Basic Fix — Firewall / Security Group on TCP/179

The most common cause in cloud environments (AWS, GCP, Azure) is missing inbound rules for BGP.

# AWS Security Group (Terraform)
 resource "aws_security_group_rule" "node_bgp" {
   type              = "ingress"
-  # Missing BGP rule — nodes can't peer
+  from_port         = 179
+  to_port           = 179
+  protocol          = "tcp"
+  self              = true
   security_group_id = aws_security_group.nodes.id
 }

For iptables-based hosts:

- # No rule permitting TCP/179
+ iptables -I INPUT -p tcp --dport 179 -s <PEER_CIDR> -j ACCEPT
+ iptables -I OUTPUT -p tcp --sport 179 -d <PEER_CIDR> -j ACCEPT

Basic Fix — Mismatched AS Numbers

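# BGPPeer YAML. A wrong AS usually surfaces as a BGP NOTIFICATION after TCP connects, so check it once port 179 is reachable.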
 apiVersion: projectcalico.org/v3
 kind: BGPPeer
 metadata:
   name: peer-to-tor-switch
 spec:
   peerIP: 192.168.10.1
-  asNumber: 64512
+  asNumber: 65001  # Must match the AS configured on the ToR switch
   node: worker-node-01

Basic Fix — BGP disabled on node or wrong node selector

# BGPConfiguration
 apiVersion: projectcalico.org/v3
 kind: BGPConfiguration
 metadata:
   name: default
 spec:
-  nodeToNodeMeshEnabled: false
+  nodeToNodeMeshEnabled: true
   asNumber: 64512

Or if using node-specific peer with wrong selector:

 apiVersion: projectcalico.org/v3
 kind: BGPPeer
 metadata:
   name: rack1-peer
 spec:
   peerIP: 10.0.1.45
   asNumber: 65000
-  nodeSelector: "rack == 'rack2'"  # Wrong rack label, peer node excluded
+  nodeSelector: "rack == 'rack1'"

Enterprise Best Practice — Structured BGP peer config with password auth and explicit selectors

 apiVersion: projectcalico.org/v3
 kind: BGPPeer
 metadata:
   name: upstream-tor-rack1
 spec:
   peerIP: 10.100.0.1
   asNumber: 65000
+  password:
+    secretKeyRef:
+      name: bgp-peer-secret
+      key: password          # MD5 TCP session auth — prevents session hijack
+  keepOriginalNextHop: false
+  maxRestartTime: 60s
+  # Scope to rack nodes only (comment kept outside the folded scalar, where "#" would become part of the selector)
   nodeSelector: >-
-    all()
+    kubernetes.io/hostname in {'worker-01', 'worker-02'}
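
The secretKeyRef above only works if the Secret exists and calico-node is allowed to read it. The sketch below shows one way to set that up with kubectl, assuming the Secret lives in the same namespace as calico-node and that the service account is named calico-node. The helper create_bgp_secret, the role name bgp-secret-access, and the KUBECTL override are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Create the Secret referenced by the BGPPeer and grant calico-node read access.
# Set KUBECTL=echo to preview the commands without touching the cluster.
create_bgp_secret() {
  local ns=$1 secret=$2 password=$3
  local kubectl=${KUBECTL:-kubectl}

  # The key name "password" must match spec.password.secretKeyRef.key.
  "$kubectl" create secret generic "$secret" -n "$ns" \
    --from-literal=password="$password" &&
  # Restrict the role to this one Secret rather than all Secrets in the namespace.
  "$kubectl" create role bgp-secret-access -n "$ns" \
    --verb=get,list,watch --resource=secrets --resource-name="$secret" &&
  "$kubectl" create rolebinding bgp-secret-access -n "$ns" \
    --role=bgp-secret-access --serviceaccount="$ns:calico-node"
}

# Usage: create_bgp_secret kube-system bgp-peer-secret '<shared-password>'
```

Passing the password as an argument exposes it to the local process list and shell history, so for real deployments a file-based or sealed-secret workflow is likely preferable.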

Verify after applying:

calicoctl node status
# Target for every peer row: STATE = up, INFO = Established



Prevention in CI/CD

1. Validate BGPPeer manifests with OPA/Conftest before apply

# policy/bgp_peer.rego
package calico.bgppeer

deny[msg] {
  input.kind == "BGPPeer"
  not input.spec.asNumber
  msg := "BGPPeer must specify asNumber explicitly"
}

deny[msg] {
  input.kind == "BGPPeer"
  input.spec.nodeSelector == "all()"
  msg := "BGPPeer nodeSelector 'all()' is too broad for production — scope to specific nodes"
}
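
Run it against the manifest. The --namespace flag is needed because the policy's package is not conftest's default namespace, main:

conftest test bgppeer.yaml --policy policy/ --namespace calico.bgppeer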

2. Smoke-test BGP state in post-deploy pipeline

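#!/bin/bash
# ci/verify-bgp.sh: run on a cluster node after Calico config changes
# (calicoctl node status only reports the BGP sessions of the local node)
if ! STATUS=$(calicoctl node status 2>/dev/null); then
  # Without this check a failed calicoctl call would count zero peers and pass
  echo "FATAL: calicoctl node status failed; cannot verify BGP state"
  exit 1
fi
# Count peer rows that are still in a non-established state
UNESTABLISHED=$(echo "$STATUS" | grep -v Established | grep -Ec 'start|Active|Connect')
if [ "$UNESTABLISHED" -gt 0 ]; then
  echo "FATAL: $UNESTABLISHED BGP peer(s) not established after deploy"
  exit 1
fi
echo "All BGP peers established."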
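
BGP sessions can take several seconds to come up after a config change, so a single check right after deploy may fail spuriously. A polling variant is sketched below, assuming the table layout that calicoctl node status prints earlier in this article. The names wait_for_bgp, all_peers_established, and the STATUS_CMD override are hypothetical and introduced only for illustration.

```shell
#!/usr/bin/env bash
# Succeed only if stdin holds at least one peer row and every row's INFO
# column reads Established.
all_peers_established() {
  awk -F'|' '
    NF >= 6 && $2 !~ /PEER ADDRESS/ { n++; if ($6 !~ /Established/) bad++ }
    END { exit !(n > 0 && bad == 0) }
  '
}

# Poll the status command until all peers are established or attempts run out.
# Arguments: number of attempts (default 12), delay in seconds (default 5).
wait_for_bgp() {
  local attempts=${1:-12} delay_s=${2:-5} i
  for ((i = 1; i <= attempts; i++)); do
    # STATUS_CMD lets a pipeline swap in a wrapper; a failed or empty
    # status call yields no rows and therefore counts as "not ready".
    if ${STATUS_CMD:-calicoctl node status} 2>/dev/null | all_peers_established; then
      return 0
    fi
    sleep "$delay_s"
  done
  return 1
}

# Usage: wait_for_bgp 12 5 || { echo "BGP did not converge"; exit 1; }
```

Treating "no peer rows" as a failure is a deliberate choice here: it keeps a broken calicoctl invocation from passing the gate silently.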

3. Terraform — enforce security group BGP rule presence

# checkov ignore is NOT acceptable here — enforce with policy
# Add to your Checkov custom checks:
# CKV_CUSTOM_BGP_01: "Ensure node security group allows TCP/179 intra-group"

Run Checkov in CI:

checkov -d ./terraform --check CKV_CUSTOM_BGP_01 --compact

4. Alert on BGP session flaps via Prometheus

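If your Calico build exports per-node BGP peer counts such as felix_bgp_num_peers and felix_bgp_num_established_peers, alert when they diverge. Metric names differ between Calico versions and distributions, so confirm them against the calico-node metrics endpoint first: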

# prometheus-rule.yaml
groups:
- name: calico-bgp
  rules:
  - alert: CalicoBGPPeerNotEstablished
    expr: felix_bgp_num_peers - felix_bgp_num_established_peers > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "BGP peer down on {{ $labels.instance }} — pod routing degraded"
