Initializing Enclave...

Fixing Cilium 'BPF Map Creation Failed' Error: Kernel Version Compatibility Guide

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins


TL;DR

  • What broke: Cilium's agent cannot create eBPF maps (LRU_HASH, SOCKMAP, or PERCPU_ARRAY) because the host kernel predates the syscall support for those map types — pod networking is completely down on affected nodes.
  • How to fix it: Upgrade the node kernel to ≥ 5.10 LTS (4.19 absolute minimum for basic Cilium, 5.10+ for full feature parity), or disable the specific Cilium features requiring newer map types via Helm values.
  • Shortcut: Use our Client-Side Sandbox below to auto-refactor your Cilium Helm values.yaml or DaemonSet spec against your kernel version — no config leaves your browser.

The Incident (What does the error mean?)

Raw error from kubectl logs -n kube-system cilium-<pod> -c cilium-agent:

level=fatal msg="Failed to create BPF map" error="map create: operation not supported" mapName="cilium_lb4_services_v2" type=LRU_HASH
level=fatal msg="BPF map creation failed, cannot continue" subsys=maps

Or the kernel-level variant:

FATAL: kernel too old to run Cilium (4.14.0 < 4.19.57)

Immediate consequence: The cilium-agent DaemonSet pod on that node enters CrashLoopBackOff. The CNI binary is present but non-functional. Every pod scheduled to that node gets stuck in ContainerCreating with networkPlugin cni failed to set up pod network. The node is effectively a networking black hole.


The Attack Vector / Blast Radius

This is not a gradual degradation — it is a hard split-brain failure:

  1. Node-level CNI death: All pods on the affected node lose network setup. Existing pods with established network namespaces may retain connectivity temporarily, but any restart kills them permanently.
  2. Cluster autoscaler amplification: If your node pool is autoscaling and the base AMI/image has an old kernel, every new node the autoscaler provisions will immediately become a dead zone. You scale into the outage.
  3. Silent kube-proxy bypass failure: Cilium in kube-proxy replacement mode owns iptables bypass via eBPF. If cilium-agent never initializes, kube-proxy is also absent (it was disabled). Service VIP resolution fails completely — not just for new pods, but for the entire node's routing table.
  4. Hubble observability blackout: No agent means no flow logs, no policy audit trail. You are blind during the incident.
  5. Cascading pod eviction: The node condition NetworkUnavailable=True is set. The scheduler stops placing pods. If this hits multiple nodes simultaneously (e.g., a rolling node pool upgrade to a bad AMI), you lose quorum on stateful workloads.

The kernel version matrix that kills you:

Cilium Feature Minimum Kernel
Basic pod networking 4.19.57
LRU_HASH maps 4.10
SOCKMAP / socket-level LB 5.7
Host-reachable services 5.10
BPF NodePort (full) 5.10
Wireguard encryption 5.6
Bandwidth Manager (EDT) 5.1

How to Fix It

Step 0: Confirm the kernel version on the failing node

# Get kernel version per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'

# Or exec into the cilium pod before it crashes
kubectl exec -n kube-system cilium-<pod> -- uname -r

Basic Fix — Disable Features Requiring Newer Kernels

If you cannot immediately upgrade the kernel, surgically disable the offending features in your Cilium Helm values:

# cilium/values.yaml

- kubeProxyReplacement: strict
+ kubeProxyReplacement: disabled

- hostServices:
-   enabled: true
+ hostServices:
+   enabled: false

- nodePort:
-   enabled: true
+ nodePort:
+   enabled: false

- bandwidthManager:
-   enabled: true
+ bandwidthManager:
+   enabled: false

- hubble:
-   enabled: true
+ hubble:
+   enabled: false

Then roll the DaemonSet:

helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  -f values-degraded.yaml

kubectl rollout restart daemonset/cilium -n kube-system

Enterprise Best Practice — Node Affinity + Kernel Enforcement

Do not let Cilium schedule on nodes that cannot support it. Enforce at the DaemonSet level and at the node pool level.

1. Label compliant nodes during bootstrap (cloud-init / userdata):

kubectl label node <node-name> cilium.io/kernel-compatible=true

2. Patch the Cilium DaemonSet with node affinity:

# cilium DaemonSet spec.template.spec

+ affinity:
+   nodeAffinity:
+     requiredDuringSchedulingIgnoredDuringExecution:
+       nodeSelectorTerms:
+       - matchExpressions:
+         - key: cilium.io/kernel-compatible
+           operator: In
+           values:
+           - "true"

3. Enforce minimum kernel in Helm values (Cilium 1.13+):

# values.yaml

+ k8s:
+   requireIPv4PodCIDR: true

+ affinity:
+   nodeAffinity:
+     requiredDuringSchedulingIgnoredDuringExecution:
+       nodeSelectorTerms:
+         - matchExpressions:
+           - key: kubernetes.io/os
+             operator: In
+             values:
+               - linux
+           - key: cilium.io/kernel-compatible
+             operator: In
+             values:
+               - "true"

4. For EKS/GKE/AKS — pin the node pool image to a kernel-guaranteed AMI:

# Terraform EKS node group example

 resource "aws_eks_node_group" "workers" {
   ...
-  ami_type = "AL2_x86_64"
+  ami_type = "AL2_x86_64"  # Ensure AMI >= al2-kernel-5.10
+  # Pin to a specific AMI ID validated at ≥5.10:
+  launch_template {
+    id      = aws_launch_template.cilium_compatible.id
+    version = "$Latest"
+  }
 }

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Pre-flight kernel check in node bootstrap script

#!/bin/bash
# Add to EC2 userdata / cloud-init before kubelet starts
MIN_KERNEL="5.10"
CURRENT=$(uname -r | cut -d'-' -f1)

if ! printf '%s\n%s' "$MIN_KERNEL" "$CURRENT" | sort -V -C; then
  echo "FATAL: Kernel $CURRENT < required $MIN_KERNEL for Cilium. Halting."
  systemctl stop kubelet
  exit 1
fi

2. OPA/Gatekeeper policy — block Cilium DaemonSet on unlabeled nodes

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredNodeAffinity
metadata:
  name: cilium-kernel-affinity
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["DaemonSet"]
    namespaces: ["kube-system"]
  parameters:
    requiredAffinityKey: "cilium.io/kernel-compatible"
    requiredAffinityValue: "true"

3. Checkov / Trivy in your Helm pipeline

# .github/workflows/cilium-lint.yaml
- name: Render Cilium Helm chart
  run: helm template cilium cilium/cilium -f values.yaml > rendered.yaml

- name: Run Checkov on rendered manifest
  uses: bridgecrewio/checkov-action@master
  with:
    file: rendered.yaml
    framework: kubernetes

- name: Validate kernel label affinity exists
  run: |
    grep -q 'cilium.io/kernel-compatible' rendered.yaml || \
      (echo 'Missing kernel affinity in Cilium DaemonSet' && exit 1)

4. Cilium CLI pre-flight (run before every cluster upgrade)

cilium install --dry-run 2>&1 | grep -i "kernel"
cilium preflight install
cilium status --wait

The cilium preflight command explicitly validates kernel version compatibility before committing any changes. Run it in your upgrade pipeline as a blocking gate.

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →