Why does calico-node show 0/1 Ready even though the pod is Running?

The readiness probe on calico-node checks that Felix has successfully programmed the dataplane AND that bird's BGP sessions are established. If bird cannot connect to its peers on TCP/179, the bird half of the check fails and the pod reports 0/1 Ready. The container stays Running because the process is alive; it just can't do its job.

Can I run Canal CNI without BGP (bird) entirely?

Yes. Switch the IPPool to vxlanMode: Always and set ipipMode: Never. This makes Canal use VXLAN UDP encapsulation (port 4789) instead of BGP-distributed routes. Bird is still present in the container but becomes a no-op. The trade-off is slightly higher CPU overhead in exchange for no dependency on TCP/179 being open between nodes. Encapsulation without BGP is a common choice in managed Kubernetes environments (EKS, GKE, AKS), where running BGP between nodes is often not practical.

How do I tell if the BGP peering issue is causing actual packet loss right now?

Run `ip route show table all | grep -c bird` on the affected node. If the count is 0, or significantly lower than the number of other nodes in the cluster, routes are missing. Cross-reference with `kubectl get pods -o wide --all-namespaces | grep <node-name>` to identify which pods are isolated. Then `ping` a pod IP on a healthy node from a pod on the broken node. Packet loss confirms black-holing.
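
If you would rather script the route count than eyeball it, the check above can be sketched in Python. This is a minimal sketch, assuming you have captured the output of `ip route show table all` as text; the function names and the sample routes are illustrative, not part of any Calico tooling. It counts only routes with a next hop (`via`), so the local blackhole route that bird installs for the node's own block is not mistaken for a remote route.

```python
def count_remote_bird_routes(ip_route_output: str) -> int:
    """Count routes installed by bird that point at another node."""
    return sum(
        1
        for line in ip_route_output.splitlines()
        if "proto bird" in line and " via " in line
    )


def missing_routes(ip_route_output: str, other_nodes: int) -> int:
    """Rough number of peers we have no route for (never negative)."""
    return max(other_nodes - count_remote_bird_routes(ip_route_output), 0)


# Made-up sample output for a node in a three-node cluster (two other nodes).
SAMPLE_ROUTES = """\
default via 10.0.1.1 dev eth0
10.0.1.0/24 dev eth0 proto kernel scope link src 10.0.1.12
192.168.34.0/26 via 10.0.1.45 dev tunl0 proto bird onlink
blackhole 192.168.77.0/26 proto bird
"""

print("remote bird routes:", count_remote_bird_routes(SAMPLE_ROUTES))
print("missing routes:", missing_routes(SAMPLE_ROUTES, other_nodes=2))
```

A node can hold more than one address block, so treat a non-zero result as a prompt to look closer rather than as proof.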

Fixing Canal CNI 'calico-node Not Ready': Bird BGP Peering Down Troubleshooting Guide

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

What broke: calico-node pods are NotReady because the embedded bird/bird6 BGP daemon cannot establish peering sessions between nodes, causing cross-node pod traffic to drop silently.
How to fix it: Unblock TCP/179, correct IP autodetection, fix MTU, or repair the IPPool encapsulation mode, depending on which fault calicoctl node status points to.

The Incident (What Does the Error Mean?)

Raw signals you'll see in production:

# kubectl get pods -n kube-system | grep calico-node
calico-node-4xk9p   0/1   Running   4   22m
calico-node-r7mnt   0/1   Running   2   22m

# kubectl logs -n kube-system calico-node-4xk9p -c calico-node
2024-01-15 03:42:17.831 [ERROR][8] felix/int_dataplane.go 1394: Failed to connect to BIRD
birdclient: failed to connect to bird socket: dial unix /var/run/calico/bird.ctl: connect: no such file or directory

# calicoctl node status
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |   PEER TYPE       | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 10.0.1.45     | node-to-node mesh | start | 03:40:11 | Active      |
| 10.0.1.67     | node-to-node mesh | start | 03:40:11 | Active      |
+---------------+-------------------+-------+----------+-------------+

start / Active state in bird means the TCP session to port 179 never completed. The node is trying to initiate BGP but getting no response. Every pod scheduled on these nodes that needs to talk to a pod on a different node is silently black-holed. No ICMP unreachable. Traffic just disappears.
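
For automated checks, the peer table can be parsed rather than read by eye. The following is a minimal sketch, assuming the table layout shown above (five pipe-separated columns); the function name is hypothetical, and the third row of the sample is an invented healthy peer added for contrast.

```python
def unestablished_peers(status_output: str) -> list[tuple[str, str, str]]:
    """Return (peer address, state, info) for every peer that is not Established."""
    peers = []
    for line in status_output.splitlines():
        if not line.startswith("|"):
            continue  # skip borders, headings and free text
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "PEER ADDRESS":
            continue  # skip the header row and anything malformed
        address, _peer_type, state, _since, info = cells
        if info != "Established":
            peers.append((address, state, info))
    return peers


SAMPLE_STATUS = """\
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |   PEER TYPE       | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 10.0.1.45     | node-to-node mesh | start | 03:40:11 | Active      |
| 10.0.1.67     | node-to-node mesh | start | 03:40:11 | Active      |
| 10.0.1.99     | node-to-node mesh | up    | 03:40:11 | Established |
+---------------+-------------------+-------+----------+-------------+
"""

for address, state, info in unestablished_peers(SAMPLE_STATUS):
    print(f"peer {address} is {state}/{info}")
```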

The Attack Vector / Blast Radius

This isn't a security exploit — it's a split-brain networking failure with a blast radius proportional to your cluster size.

Cascading failure chain:

1. bird BGP peering fails, so no cross-node routes are programmed in the kernel routing table.
2. kube-proxy or the eBPF dataplane still rewrites destination IPs via NAT, but the routed path doesn't exist.
3. Service traffic that lands on a backend pod on a remote node times out with no error surfaced to the application. Health checks pass locally, fail remotely.
4. In clusters with node-to-node mesh (default Canal config), every node peers with every other node, so a broken node loses its routes to the whole cluster and every other node loses its route to that node's pods.
5. If autoscaling reacts to the resulting latency spikes, more pods are scheduled, some of them onto the broken nodes, and the failure spreads across the cluster.

Root causes in order of frequency:

| Cause | Signal |
|---|---|
| TCP/179 blocked by cloud security group / iptables | `STATE: start` forever in `calicoctl node status` |
| Node IP autodetection picks wrong interface (e.g., picks `docker0` or a VPN interface) | Bird tries to peer on unreachable IP |
| MTU mismatch between IPPool and node NIC (IP-in-IP overhead not accounted for) | Peering establishes but routes flap; large packets drop |
| IP-in-IP or VXLAN encap disabled in IPPool but required by cloud | Routes installed but traffic never arrives |
| Mismatched AS numbers in custom BGP config (mesh peers must share one AS) | BGP OPEN rejected |

How to Fix It

Step 1: Confirm the actual fault

# On the affected node directly:
sudo calicoctl node status

# Check bird logs:
kubectl logs -n kube-system <calico-node-pod> -c calico-node | grep -i bird

# Verify port 179 reachability between nodes:
nc -zv <peer-node-ip> 179

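# Check IP autodetection result (spec.bgp.ipv4Address is the address bird peers on):
calicoctl get node <node-name> -o yaml
# Compare with the addresses the kubelet reports:
kubectl get node <node-name> -o jsonpath='{.status.addresses}'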
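
Where `nc` is not installed on the node, the same reachability test can be done with the Python standard library. This is a minimal sketch; `tcp_port_open` is a hypothetical helper name, and the three-second timeout is an arbitrary choice.

```python
import socket


def tcp_port_open(host: str, port: int = 179, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Call it as `tcp_port_open("<peer-node-ip>")` from the affected node. A `False` result alongside a peer stuck in `start` points at a firewall or security group rather than at bird itself.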

Fix A: TCP/179 Blocked (Most Common in AWS/GCP/Azure)

Basic Fix — Cloud Security Group: Add an inbound rule on your node security group allowing TCP/179 from the node security group itself (self-referencing rule).

# AWS Security Group rule (Terraform)
- # No rule for BGP
+ ingress {
+   from_port       = 179
+   to_port         = 179
+   protocol        = "tcp"
+   self            = true
+   description     = "Canal BGP peering between nodes"
+ }

Also check iptables on the node itself:

iptables -L INPUT -n -v | grep 179
# If blocked, add:
iptables -I INPUT -p tcp --dport 179 -j ACCEPT

Fix B: Wrong Node IP Autodetection

# canal-config ConfigMap. Takes effect only if your Canal DaemonSet maps this key to the calico-node IP_AUTODETECTION_METHOD environment variable
 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: canal-config
   namespace: kube-system
 data:
-  ip_autodetection_method: "first-found"
+  ip_autodetection_method: "interface=eth0"
+  # OR use CIDR-based detection if interface names vary:
+  # ip_autodetection_method: "cidr=10.0.0.0/8"

After patching the ConfigMap, restart the DaemonSet:

kubectl rollout restart daemonset/canal -n kube-system
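
To see why the choice of method matters, here is a simplified model of the three autodetection methods from Fix B. It is an illustration only, not Calico's implementation: real autodetection applies extra filtering that this sketch omits, and the interface list and function name are invented for the example.

```python
import ipaddress
import re


def pick_node_ip(interfaces, method="first-found"):
    """Pick (name, ip) from an ordered list of interfaces using the given method."""
    if method == "first-found":
        candidates = list(interfaces)
    elif method.startswith("interface="):
        pattern = re.compile(method.split("=", 1)[1])
        candidates = [(name, ip) for name, ip in interfaces if pattern.fullmatch(name)]
    elif method.startswith("cidr="):
        network = ipaddress.ip_network(method.split("=", 1)[1])
        candidates = [
            (name, ip) for name, ip in interfaces
            if ipaddress.ip_address(ip) in network
        ]
    else:
        raise ValueError(f"unsupported method: {method}")
    return candidates[0] if candidates else None


# Hypothetical host with a Docker bridge, the real NIC and a VPN tunnel.
HOST_INTERFACES = [
    ("docker0", "172.17.0.1"),
    ("eth0", "10.0.1.12"),
    ("tun0", "10.8.0.5"),
]

for method in ("first-found", "interface=eth0", "cidr=10.0.1.0/24"):
    print(method, "->", pick_node_ip(HOST_INTERFACES, method))
```

Note that a wide CIDR such as 10.0.0.0/8 would also match the VPN address in this sample, so prefer the narrowest range that covers your node subnet.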

Fix C: MTU Misconfiguration (IP-in-IP overhead)

 apiVersion: projectcalico.org/v3
 kind: FelixConfiguration
 metadata:
   name: default
 spec:
-  # mtuIfacePattern not set — Felix auto-detects, often wrong in cloud
+  mtuIfacePattern: "^eth0"
   ipv6Support: false

---
 apiVersion: projectcalico.org/v3
 kind: IPPool
 metadata:
   name: default-ipv4-ippool
 spec:
   cidr: 192.168.0.0/16
-  ipipMode: Never
+  ipipMode: Always        # Required for most cloud providers without BGP support
-  vxlanMode: Never
   natOutgoing: true
+  # Set MTU 20 bytes less than node NIC MTU to account for IP-in-IP header
+  # Node NIC MTU 9001 (jumbo) → set pod MTU to 8981
+  # Node NIC MTU 1500 (standard) → set pod MTU to 1480

Enterprise Best Practice — enforce MTU via FelixConfiguration:

 apiVersion: projectcalico.org/v3
 kind: FelixConfiguration
 metadata:
   name: default
 spec:
+  vxlanMTU: 1450
+  wireguardMTU: 1440
   ipipEnabled: true
+  ipipMTU: 1480           # 1500-byte NIC minus the 20-byte IP-in-IP header
+  logSeverityScreen: Info
+  reportingInterval: 30s
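
The MTU values in Fix C all come from subtracting the encapsulation overhead from the node NIC MTU. A minimal sketch of that arithmetic follows. The 20-byte IP-in-IP figure is stated above; the 50-byte VXLAN and 60-byte WireGuard figures are inferred from the 1450 and 1440 values on a 1500-byte NIC, so treat them as assumptions for IPv4 and check them against your own environment.

```python
# Per-packet overhead in bytes for each encapsulation (IPv4 assumed).
ENCAP_OVERHEAD_BYTES = {
    "ipip": 20,
    "vxlan": 50,
    "wireguard": 60,
}


def pod_mtu(nic_mtu: int, encap: str) -> int:
    """MTU to configure for pods so that encapsulated packets fit the NIC MTU."""
    return nic_mtu - ENCAP_OVERHEAD_BYTES[encap]


for nic_mtu in (1500, 9001):
    for encap in ENCAP_OVERHEAD_BYTES:
        print(f"NIC MTU {nic_mtu}, {encap}: pod MTU {pod_mtu(nic_mtu, encap)}")
```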


Fix D: Verify BGP Peering After Changes

# Should show 'Established' not 'start'/'Active'
calicoctl node status

# Verify routes are now programmed:
ip route show | grep bird

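# Test cross-node pod connectivity (the scheduler picks the node, so confirm test-pod is not on the same node as the target):
kubectl run test-pod --image=busybox --rm -it --restart=Never -- ping -c 3 <pod-ip-on-different-node>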
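
If the connectivity test runs in a pipeline, the ping summary line can be parsed instead of read by eye. This is a minimal sketch, assuming the usual "N% packet loss" summary printed by busybox and iputils ping; the helper name is hypothetical.

```python
import re
from typing import Optional

_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output: str) -> Optional[float]:
    """Extract the packet loss percentage from ping output, or None if absent."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None
```

Anything above 0 on a cross-node ping after the fix is worth investigating; `None` means the summary line was never printed, for example because the pod failed to start.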

Prevention in CI/CD

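1. Pre-flight network validation with a Job (a single Job checks one node; run one per node, or convert it to a DaemonSet, to cover the cluster):

# Add to your cluster bootstrap pipeline
apiVersion: batch/v1
kind: Job
metadata:
  name: bgp-preflight-check
spec:
  template:
    spec:
      hostNetwork: true
      restartPolicy: Never   # Jobs require Never or OnFailure
      containers:
      - name: check
        # Assumes this image provides calicoctl and nc; substitute your own tooling image if it does not
        image: calico/node:v3.26.0
        env:
        - name: NODE_PEER_IP
          value: "<peer-node-ip>"   # placeholder: IP of the node to test against
        command: ["sh", "-c", "calicoctl node status && nc -zv $NODE_PEER_IP 179"]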

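2. OPA policy (Conftest-style deny rule; Gatekeeper needs a violation rule over input.review.object instead) to block IPPools with both ipipMode: Never and vxlanMode: Never on cloud providers: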

package canal.bgp

deny[msg] {
  input.kind == "IPPool"
  input.spec.ipipMode == "Never"
  input.spec.vxlanMode == "Never"
  msg := "IPPool must have ipipMode or vxlanMode enabled for cloud deployments. BGP peering without encapsulation requires direct L2 adjacency."
}
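
To sanity-check the rule's logic before wiring it into a pipeline, the same condition can be mirrored in a few lines of Python. This is a minimal sketch for experimentation, not a replacement for the Rego policy; it assumes the manifest has already been loaded into a dict, and the function name is hypothetical.

```python
def ippool_violations(manifest: dict) -> list[str]:
    """Return violation messages for an IPPool with all encapsulation disabled."""
    if manifest.get("kind") != "IPPool":
        return []
    spec = manifest.get("spec", {})
    # Like the Rego rule, only an explicit "Never" on both fields is a violation.
    if spec.get("ipipMode") == "Never" and spec.get("vxlanMode") == "Never":
        return [
            "IPPool must have ipipMode or vxlanMode enabled for cloud deployments."
        ]
    return []


BAD_POOL = {
    "kind": "IPPool",
    "spec": {"cidr": "192.168.0.0/16", "ipipMode": "Never", "vxlanMode": "Never"},
}
GOOD_POOL = {
    "kind": "IPPool",
    "spec": {"cidr": "192.168.0.0/16", "ipipMode": "Always", "vxlanMode": "Never"},
}

print(ippool_violations(BAD_POOL))
print(ippool_violations(GOOD_POOL))
```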

3. Checkov / kube-linter in your Helm pipeline:

# Add to CI step before helm upgrade:
checkov -d ./canal-helm-values/ --check CKV_K8S_28
kube-linter lint canal-manifests/ --config .kube-linter.yaml

4. Alerting — don't wait for users to report it:

# Prometheus alert
- alert: CalicoNodeBGPPeeringDown
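  # Alerts on calico-node readiness, which the bird readiness check drives; assumes kube-state-metrics is deployed
  expr: kube_pod_container_status_ready{namespace="kube-system", container="calico-node"} == 0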
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "calico-node BGP peering degraded — cross-node routing may be broken"
    runbook: "https://your-wiki/canal-bgp-runbook"

5. Lock your Canal/Calico version in CI — never use latest:

- image: calico/node:latest
+ image: calico/node:v3.26.4  # Pin. Test upgrades in staging. BGP config schema changes between minors.
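
A pinning rule like this is easy to enforce with a small check in CI. The following is a minimal sketch, assuming "pinned" means either a digest or a full three-part version tag; the function name and that definition are assumptions, so adjust the pattern to your own tagging scheme.

```python
import re

_PINNED_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def is_pinned(image: str) -> bool:
    """Return True if the image reference uses a digest or a full version tag."""
    if "@sha256:" in image:
        return True
    _name, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No tag at all, or the only colon belongs to a registry port.
        return False
    return bool(_PINNED_TAG.match(tag))


for image in ("calico/node:latest", "calico/node:v3.26.4", "calico/node"):
    print(image, "pinned" if is_pinned(image) else "NOT pinned")
```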