Fixing Canal CNI 'calico-node Not Ready': Bird BGP Peering Down Troubleshooting Guide
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: calico-node pods are NotReady because the embedded bird/bird6 BGP daemon cannot establish peering sessions between nodes, causing cross-node pod traffic to drop silently.
- How to fix it: Unblock TCP/179, correct IP autodetection, fix MTU, or repair the IPPool encapsulation mode, whichever calicoctl node status fingers as the fault.
- Shortcut: Use our Client-Side Sandbox above: drop in your calicoctl node status output and FelixConfiguration YAML to auto-generate the corrected config without leaking your cluster topology.
The Incident (What Does the Error Mean?)
Raw signals you'll see in production:
# kubectl get pods -n kube-system | grep calico-node
calico-node-4xk9p 0/1 Running 4 22m
calico-node-r7mnt 0/1 Running 2 22m
# kubectl logs -n kube-system calico-node-4xk9p -c calico-node
2024-01-15 03:42:17.831 [ERROR][8] felix/int_dataplane.go 1394: Failed to connect to BIRD
birdclient: failed to connect to bird socket: dial unix /var/run/calico/bird.ctl: connect: no such file or directory
# calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+----------+-------------+
| 10.0.1.45 | node-to-node mesh | start | 03:40:11 | Active |
| 10.0.1.67 | node-to-node mesh | start | 03:40:11 | Active |
+---------------+-------------------+-------+----------+-------------+
start / Active state in bird means the TCP session to port 179 never completed. The node is trying to initiate BGP but getting no response. Every pod scheduled on these nodes that needs to talk to a pod on a different node is silently black-holed. No ICMP unreachable. Traffic just disappears.
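To spot the stuck peers without reading the table by eye, the status output can be filtered. This is a minimal sketch that assumes the table layout shown above; the function name unestablished_peers is made up for illustration and is not part of calicoctl.

```shell
#!/bin/sh
# Print "<peer-ip> <state> <info>" for every BGP peer whose INFO column is
# not "Established". Reads `calicoctl node status` output on stdin.
# (unestablished_peers is a hypothetical helper name.)
unestablished_peers() {
  awk -F'|' '
    # Data rows split into 7 pipe-separated fields; skip the header row.
    NF >= 6 && $2 !~ /PEER ADDRESS/ {
      for (i = 2; i <= 6; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      if ($6 != "Established") print $2, $4, $6
    }'
}

# Usage (on a node): sudo calicoctl node status | unestablished_peers
```

An empty result means every listed peer reports Established. Any line printed is a peer worth probing on TCP/179.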
The Attack Vector / Blast Radius
This isn't a security exploit — it's a split-brain networking failure with a blast radius proportional to your cluster size.
Cascading failure chain:
- bird BGP peering fails → no cross-node routes programmed in the kernel routing table.
- kube-proxy or eBPF dataplane still rewrites destination IPs via NAT, but the routed path doesn't exist.
- Service traffic that lands on a backend pod on a remote node times out with no error surfaced to the application. Health checks pass locally, fail remotely.
- In clusters with node-to-node mesh (default Canal config), one broken node poisons routes for all nodes it peers with, which is every other node.
- Horizontal pod autoscaling kicks in because latency spikes → more pods scheduled → more nodes affected → full cluster meltdown.
Root causes in order of frequency:
| Cause | Signal |
|---|---|
| TCP/179 blocked by cloud security group / iptables | STATE: start forever in calicoctl node status |
| Node IP autodetection picks wrong interface (e.g., picks docker0 or a VPN interface) | Bird tries to peer on unreachable IP |
| MTU mismatch between IPPool and node NIC (IP-in-IP overhead not accounted for) | Peering establishes but routes flap; large packets drop |
| IP-in-IP or VXLAN encap disabled in IPPool but required by cloud | Routes installed but traffic never arrives |
| Duplicate AS numbers in custom BGP config | BGP OPEN rejected |
How to Fix It
Step 1: Confirm the actual fault
# On the affected node directly:
sudo calicoctl node status
# Check bird logs:
kubectl logs -n kube-system <calico-node-pod> -c calico-node | grep -i bird
# Verify port 179 reachability between nodes:
nc -zv <peer-node-ip> 179
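# Check IP autodetection result (the address Calico selected is recorded as a node annotation):
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.projectcalico\.org/IPv4Address}'
# Compare with the addresses Kubernetes itself reports for the node:
kubectl get node <node-name> -o jsonpath='{.status.addresses}'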
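With more than a couple of peers, the port check is easier as a loop. The sketch below assumes an nc build that supports -z and -w (most do, but BusyBox builds vary); probe_bgp_port is a hypothetical helper name.

```shell
#!/bin/sh
# probe_bgp_port PEER_IP...
# Tries a TCP connect to port 179 on each peer and prints one line per peer.
# Returns non-zero if any peer could not be reached.
# (probe_bgp_port is a hypothetical helper name.)
probe_bgp_port() {
  failed=0
  for peer in "$@"; do
    # -z: connect without sending data; -w 3: give up after 3 seconds
    if nc -z -w 3 "$peer" 179 >/dev/null 2>&1; then
      echo "$peer reachable"
    else
      echo "$peer BLOCKED"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: probe_bgp_port 10.0.1.45 10.0.1.67
```

Run it from the affected node. A BLOCKED line for a peer that is otherwise pingable points at a security group or iptables rule rather than at bird itself.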
Fix A: TCP/179 Blocked (Most Common in AWS/GCP/Azure)
Basic Fix — Cloud Security Group: Add an inbound rule on your node security group allowing TCP/179 from the node security group itself (self-referencing rule).
# AWS Security Group rule (Terraform)
- # No rule for BGP
+ ingress {
+   from_port   = 179
+   to_port     = 179
+   protocol    = "tcp"
+   self        = true
+   description = "Canal BGP peering between nodes"
+ }
Also check iptables on the node itself:
iptables -L INPUT -n -v | grep 179
# If blocked, add:
iptables -I INPUT -p tcp --dport 179 -j ACCEPT
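Running the insert repeatedly stacks duplicate rules. A small guard using iptables -C (check) keeps it idempotent. This is a sketch, not a hardened script: ensure_bgp_accept_rule is a hypothetical name, it needs root, and the rule is not persisted across reboots unless you save it with your distribution's tooling.

```shell
#!/bin/sh
# Insert the TCP/179 ACCEPT rule only if an identical rule is not already there.
# (ensure_bgp_accept_rule is a hypothetical helper name; run as root.)
ensure_bgp_accept_rule() {
  # -C exits 0 when a matching rule already exists in the chain
  if iptables -C INPUT -p tcp --dport 179 -j ACCEPT 2>/dev/null; then
    echo "rule already present"
  else
    iptables -I INPUT -p tcp --dport 179 -j ACCEPT && echo "rule inserted"
  fi
}

# Usage: sudo sh -c '. ./bgp-rule.sh && ensure_bgp_accept_rule'
```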
Fix B: Wrong Node IP Autodetection
# canal-config ConfigMap consumed by the Canal DaemonSet
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: canal-config
    namespace: kube-system
  data:
-   ip_autodetection_method: "first-found"
+   ip_autodetection_method: "interface=eth0"
+   # OR use CIDR-based detection if interface names vary:
+   # ip_autodetection_method: "cidr=10.0.0.0/8"
After patching the ConfigMap, restart the DaemonSet:
kubectl rollout restart daemonset/canal -n kube-system
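Before hard-coding interface=eth0, it helps to see which interfaces the node actually has. The sketch below filters the output of ip -o -4 addr show; the skip list of virtual interface prefixes is illustrative and the function name candidate_interfaces is hypothetical.

```shell
#!/bin/sh
# Reads `ip -o -4 addr show` output on stdin and prints
# "<interface> <address/prefix>" for interfaces that look like real node NICs.
# (candidate_interfaces is a hypothetical helper name.)
candidate_interfaces() {
  awk '
    { iface = $2; addr = $4 }
    # Skip loopback and common virtual interfaces (illustrative list)
    iface == "lo" || iface ~ /^(docker|cali|flannel|tunl|veth|cni)/ { next }
    { print iface, addr }'
}

# Usage (on a node): ip -o -4 addr show | candidate_interfaces
```

If more than one interface survives the filter, the cidr= form is usually the safer choice because it does not depend on interface naming.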
Fix C: MTU Misconfiguration (IP-in-IP overhead)
  apiVersion: projectcalico.org/v3
  kind: FelixConfiguration
  metadata:
    name: default
  spec:
-   # mtuIfacePattern not set; Felix auto-detects, often wrong in cloud
+   mtuIfacePattern: "^eth0"
    ipv6Support: false
  ---
  apiVersion: projectcalico.org/v3
  kind: IPPool
  metadata:
    name: default-ipv4-ippool
  spec:
    cidr: 192.168.0.0/16
-   ipipMode: Never
+   ipipMode: Always # Required for most cloud providers without BGP support
-   vxlanMode: Never
    natOutgoing: true
+   # The pod MTU is not an IPPool field; set it in FelixConfiguration (ipipMTU).
+   # Set it 20 bytes less than the node NIC MTU to account for the IP-in-IP header:
+   # Node NIC MTU 9001 (jumbo) → set pod MTU to 8981
+   # Node NIC MTU 1500 (standard) → set pod MTU to 1480
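The arithmetic is simple enough to script. The sketch below uses the 20-byte IP-in-IP overhead stated above; the 50-byte VXLAN and 60-byte WireGuard figures are assumptions inferred from the 1450 and 1440 values used in this guide for a 1500-byte NIC. pod_mtu is a hypothetical helper name.

```shell
#!/bin/sh
# pod_mtu NIC_MTU ENCAP
# Prints the largest pod MTU that still fits inside the node NIC MTU.
# (pod_mtu is a hypothetical helper name.)
pod_mtu() {
  nic_mtu=$1
  case "$2" in
    ipip)      overhead=20 ;;  # one extra IPv4 header
    vxlan)     overhead=50 ;;  # assumed: outer IPv4 + UDP + VXLAN + inner Ethernet
    wireguard) overhead=60 ;;  # assumed from the 1440 value used for a 1500 NIC
    none)      overhead=0 ;;
    *) echo "unknown encapsulation: $2" >&2; return 1 ;;
  esac
  echo $((nic_mtu - overhead))
}

# Usage: pod_mtu "$(cat /sys/class/net/eth0/mtu)" ipip
```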
Enterprise Best Practice — enforce MTU via FelixConfiguration:
  apiVersion: projectcalico.org/v3
  kind: FelixConfiguration
  metadata:
    name: default
  spec:
+   # Values below assume a 1500-byte node NIC MTU
+   ipipMTU: 1480
+   vxlanMTU: 1450
+   wireguardMTU: 1440
    ipipEnabled: true
+   logSeverityScreen: Info
+   reportingInterval: 30s
Fix D: Verify BGP Peering After Changes
# Should show 'Established' not 'start'/'Active'
calicoctl node status
# Verify routes are now programmed:
ip route show | grep bird
# Test cross-node pod connectivity:
kubectl run test-pod --image=busybox --rm -it -- ping <pod-ip-on-different-node>
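Sessions can take a little while to come up after a restart, so a single status check may report a false failure. A polling loop is one way to handle that. This sketch assumes the table layout shown earlier and that calicoctl is on the PATH; wait_for_bgp and its default of 10 checks, 6 seconds apart, are illustrative choices.

```shell
#!/bin/sh
# wait_for_bgp [ATTEMPTS] [DELAY_SECONDS]
# Polls `calicoctl node status` until every peer row reports Established.
# (wait_for_bgp is a hypothetical helper name.)
wait_for_bgp() {
  attempts=${1:-10}
  delay=${2:-6}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # awk exits 0 only if there is at least one peer row and all are Established
    if calicoctl node status 2>/dev/null | awk -F'|' '
         NF >= 6 && $2 !~ /PEER ADDRESS/ { total++; if ($6 ~ /Established/) up++ }
         END { exit !(total > 0 && total == up) }'; then
      echo "all peers Established (check $i)"
      return 0
    fi
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "peers still not Established after $attempts check(s)" >&2
  return 1
}

# Usage: wait_for_bgp 10 6 && ip route show | grep bird
```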
Prevention in CI/CD
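1. Pre-flight network validation with a Job:
# Add to your cluster bootstrap pipeline
apiVersion: batch/v1
kind: Job
metadata:
  name: bgp-preflight-check
spec:
  template:
    spec:
      hostNetwork: true
      restartPolicy: Never   # Jobs require Never or OnFailure
      containers:
      - name: check
        # Assumes the image ships calicoctl and nc; swap in a tooling image if it does not
        image: calico/node:v3.26.0
        env:
        - name: NODE_PEER_IP   # supplied by the pipeline
          value: "<peer-node-ip>"
        command: ["sh", "-c", "calicoctl node status && nc -zv $NODE_PEER_IP 179"]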
2. OPA/Gatekeeper policy — block IPPool with ipipMode: Never on cloud providers:
package canal.bgp
deny[msg] {
    input.kind == "IPPool"
    input.spec.ipipMode == "Never"
    input.spec.vxlanMode == "Never"
    msg := "IPPool must have ipipMode or vxlanMode enabled for cloud deployments. BGP peering without encapsulation requires direct L2 adjacency."
}
3. Checkov / kube-linter in your Helm pipeline:
# Add to CI step before helm upgrade:
checkov -d ./canal-helm-values/ --check CKV_K8S_28
kube-linter lint canal-manifests/ --config .kube-linter.yaml
4. Alerting — don't wait for users to report it:
# Prometheus alert
- alert: CalicoNodeBGPPeeringDown
  expr: felix_cluster_num_hosts_running - felix_cluster_num_hosts_with_workload > 0
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "calico-node BGP peering degraded: cross-node routing may be broken"
    runbook: "https://your-wiki/canal-bgp-runbook"
5. Lock your Canal/Calico version in CI — never use latest:
- image: calico/node:latest
+ image: calico/node:v3.26.4 # Pin. Test upgrades in staging. BGP config schema changes between minors.
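A pin only helps if CI enforces it. One lightweight option is a grep gate that fails the build when a manifest still references :latest. This sketch only catches an explicit :latest tag, not images with no tag at all; check_images_pinned is a hypothetical helper name.

```shell
#!/bin/sh
# check_images_pinned FILE...
# Fails (non-zero) and lists the offending lines if any manifest uses :latest.
# (check_images_pinned is a hypothetical helper name.)
check_images_pinned() {
  # /dev/null makes grep print the file name even when given a single file
  if grep -nE 'image:[[:space:]]*[^[:space:]]+:latest([[:space:]]|$)' "$@" /dev/null; then
    echo "error: floating :latest tag found" >&2
    return 1
  fi
  echo "all image tags pinned"
}

# Usage in CI: check_images_pinned canal-manifests/*.yaml
```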