Fixing 'kubelet failed to initialize network' on Talos Linux with a Custom CNI Plugin

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins


TL;DR

  • What broke: kubelet started before the custom CNI binary/config landed in /etc/cni/net.d/ or /opt/cni/bin/, causing a hard network init failure that blocks all pod scheduling.
  • How to fix it: Either let Talos stage the CNI itself through cluster.network.cni or machine.files (so the config exists before kubelet starts), or make sure your CNI DaemonSet can run on a NotReady node (hostNetwork: true plus a toleration for the not-ready taint) so its init container can install the binaries and config.

The Incident (What does the error mean?)

Raw kubelet log output from talosctl logs kubelet:

E0612 03:14:22.918311    1423 kubelet.go:2419] "Failed to initialize network plugin"
err="network plugin is not ready: cni config uninitialized"
W0612 03:14:32.101847    1423 cni.go:239] Unable to update cni config: no networks found in /etc/cni/net.d/
E0612 03:14:32.102001    1423 kubelet.go:2419] "Failed to initialize network plugin"
err="network plugin is not ready: cni config uninitialized"

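Immediate consequence: kubelet keeps reporting the network plugin as not ready. Every pod that needs pod networking stays in ContainerCreating (pods with hostNetwork: true are not blocked by a missing CNI), and the node registers in the API server but reports NotReady. On a control plane node, etcd quorum may still hold and the host-network control plane components can start, but CoreDNS and ordinary workloads cannot, so a full cluster bootstrap stalls until a CNI is installed.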

On Talos specifically, the root filesystem is read-only, and /etc/cni/net.d/ and /opt/cni/bin/ hold only what Talos or a privileged installer pod places there. A CNI DaemonSet that writes binaries post-boot therefore races against kubelet's startup sequence. kubelet starts first, finds no CNI config, and keeps the node NotReady until a valid config and its binaries appear.
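To triage many nodes quickly, the log signature above can be matched mechanically. The following is a minimal Python sketch, not an official tool: the function names are hypothetical, and the patterns are taken only from the log excerpt in this article, so other kubelet or container runtime versions may word the messages differently.

```python
import re

# Substrings taken from the log excerpt in this article; other kubelet or
# container runtime versions may phrase these messages differently.
CNI_FAILURE_PATTERNS = (
    re.compile(r"cni config uninitialized"),
    re.compile(r"no networks found in (?P<dir>\S+)"),
)


def find_cni_failures(log_text: str) -> list[str]:
    """Return the log lines that match a known CNI-not-ready signature."""
    hits = []
    for line in log_text.splitlines():
        if any(p.search(line) for p in CNI_FAILURE_PATTERNS):
            hits.append(line.strip())
    return hits


def missing_config_dirs(log_text: str) -> set[str]:
    """Collect the directories reported as containing no CNI networks."""
    pattern = CNI_FAILURE_PATTERNS[1]
    return {m.group("dir") for m in pattern.finditer(log_text)}
```

One way to use it is to save the output of talosctl logs kubelet to a file, read it as text, and pass it to find_cni_failures. An empty result means this particular signature is absent, not that networking is healthy.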


The Attack Vector / Blast Radius

This is a bootstrap deadlock, not a transient error. The blast radius:

  1. New node joins cluster → kubelet starts → no CNI config found → kubelet marks network plugin unready.
  2. CNI DaemonSet pod (e.g., Cilium, Calico, Flannel) is scheduled to this node to install the CNI. If that pod does not use host networking or does not tolerate the node's not-ready taint, it cannot run because the node is NotReady due to missing CNI. Classic chicken-and-egg.
  3. On a single-node or 3-node control plane, this means zero ordinary workloads schedule. If you're doing a rolling upgrade and hit this on two nodes simultaneously, you lose most of your schedulable capacity at once.
  4. Talos-specific amplifier: Talos mounts the root filesystem read-only. A CNI installer that tries to cp binaries via a hostPath volume to a location Talos does not expose as writable will fail with a read-only filesystem or permission error, sometimes visible only in the init container's logs. Keep installers to the standard /opt/cni/bin/ and /etc/cni/net.d/ paths, or deliver the binaries through a Talos system extension.

The secondary risk: operators often "fix" this by weakening Talos's defaults (allowSchedulingOnControlPlanes: true with no taints, manual chroot hacks), introducing real security regressions to paper over a config problem.


How to Fix It

Basic Fix — Use Talos Inline CNI

Talos has a first-class cluster.network.cni stanza. When name is set to none and nothing else stages a CNI config, kubelet gets nothing. The fix is to either use the built-in flannel option, or set name: custom with urls pointing at the CNI manifests Talos should apply at cluster bootstrap.

# machine.yaml (Talos MachineConfig)
 cluster:
   network:
-    cni:
-      name: none
+    cni:
+      name: custom
+      urls:
+        - https://raw.githubusercontent.com/your-org/cni-manifests/main/cilium-install.yaml

For air-gapped or fully controlled deployments, use the inline approach with explicit CNI config written to disk by Talos itself:

# machine.yaml — machine.files stanza
 machine:
   files:
+    - path: /etc/cni/net.d/10-cilium.conflist
+      permissions: 0644
+      op: create
+      content: |
+        {
+          "name": "cilium",
+          "cniVersion": "0.3.1",
+          "plugins": [
+            {"type": "cilium-cni", "enable-debug": false},
+            {"type": "portmap", "capabilities": {"portMappings": true}}
+          ]
+        }

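Talos writes this file during its boot sequence, before kubelet's first start, so kubelet never sees an empty /etc/cni/net.d/. The config alone is not enough, though: the plugin binaries it names (cilium-cni and portmap here) must also be present in /opt/cni/bin/ before pods can get a network. Depending on the Talos version, machine.files may refuse to create files outside /var, so test this on a disposable node first.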
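Before embedding a conflist in machine.files, it can help to lint it offline. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the structural checks cover only the fields used in the example above (name, cniVersion, plugins[].type), and bin_dir is whatever directory you want to compare against, such as an unpacked copy of the node's /opt/cni/bin.

```python
import json
import os


def conflist_problems(text: str) -> list[str]:
    """Return human-readable problems found in a CNI .conflist document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    for key in ("name", "cniVersion"):
        if not isinstance(doc.get(key), str) or not doc[key]:
            problems.append(f"missing or empty '{key}'")
    plugins = doc.get("plugins")
    if not isinstance(plugins, list) or not plugins:
        problems.append("'plugins' must be a non-empty list")
        return problems
    for i, plugin in enumerate(plugins):
        if not isinstance(plugin, dict) or not plugin.get("type"):
            problems.append(f"plugins[{i}] has no 'type'")
    return problems


def missing_binaries(text: str, bin_dir: str) -> list[str]:
    """List plugin types from the conflist with no executable in bin_dir."""
    doc = json.loads(text)
    missing = []
    for plugin in doc.get("plugins", []):
        path = os.path.join(bin_dir, plugin["type"])
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(plugin["type"])
    return missing
```

Run conflist_problems first and call missing_binaries only when it returns an empty list, since the second function assumes a well-formed document.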


Enterprise Best Practice — Talos System Extensions + Helm CNI Bootstrap

For production clusters, do not rely on DaemonSet-based CNI installers writing to the host filesystem. Use Talos System Extensions to bake CNI binaries into the node image at build time.

Step 1: Build a custom Talos image with CNI binaries included

# ImageFactory schematic.yaml
 customization:
   systemExtensions:
     officialExtensions:
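-      # No CNI extension: relying on DaemonSet hostPath writes
+      - siderolabs/cilium

Generate the image via https://factory.talos.dev with your schematic. The extension name shown here is illustrative; confirm it against the Image Factory's list of official extensions before building. A system extension places its files in the immutable image at build time, so they are present before kubelet starts.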

Step 2: Disable the CNI installer init containers in your Helm values (they're now redundant and will conflict)

# cilium/values.yaml
 cni:
-  install: true
+  install: false
   binPath: /opt/cni/bin
   confPath: /etc/cni/net.d

Step 3: Bootstrap order in Talos — apply machine config before talosctl bootstrap

# Correct order: config applied first, binaries present before kubelet
talosctl apply-config --insecure --nodes "$NODE_IP" --file controlplane.yaml
# Wait until the node reports the config as applied, e.g. with: talosctl get machinestatus
talosctl bootstrap --nodes "$NODE_IP"

If you run Cilium in kube-proxy replacement mode, also disable the Talos-managed kube-proxy:

# machine.yaml
 cluster:
   proxy:
-    disabled: false
+    disabled: true   # Required for Cilium kube-proxy replacement mode



Prevention in CI/CD

1. Validate Talos MachineConfig in Pipeline Before Apply

Use talosctl validate as a pre-flight gate:

# In your GitHub Actions / GitLab CI
talosctl validate --mode cloud --config controlplane.yaml
talosctl validate --mode cloud --config worker.yaml

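This catches malformed machine.files entries and invalid API versions before they hit a real node. It does not know which CNI your cluster expects, so pair it with the policy check in the next step to catch a missing or empty cni stanza.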

2. OPA/Conftest Policy — Enforce CNI Config Presence

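# policy/talos_cni.rego
package talos.cni

import rego.v1

# True when machine.files stages at least one file under /etc/cni/net.d/.
# Undefined (so "not has_cni_file" holds) when machine.files is absent or empty.
has_cni_file if {
  some file in input.machine.files
  startswith(file.path, "/etc/cni/net.d/")
}

deny contains msg if {
  input.cluster.network.cni.name == "none"
  not has_cni_file
  msg := "CNI set to 'none' but no machine.files CNI config provided. kubelet will fail to initialize network."
}

deny contains msg if {
  not input.cluster.network.cni
  msg := "cluster.network.cni stanza is missing entirely. Talos will fall back to its default CNI (Flannel)."
}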

Run in CI:

# conftest only evaluates the "main" package unless a namespace is given
conftest test controlplane.yaml --policy policy/ --namespace talos.cni
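For teams that prefer to unit-test the rule logic without an OPA toolchain, the same two checks can be mirrored in plain Python. This is a sketch under assumptions: cni_violations is a hypothetical helper, it expects the machine config already parsed into a dict (for example with a YAML library of your choice), and it counts only files staged under /etc/cni/net.d/, treating an absent machine.files key the same as an empty list.

```python
CNI_CONF_DIR = "/etc/cni/net.d/"


def cni_violations(config: dict) -> list[str]:
    """Return policy violations for a parsed Talos machine config."""
    cluster = config.get("cluster") or {}
    cni = (cluster.get("network") or {}).get("cni")
    if not cni:
        return ["cluster.network.cni stanza is missing"]

    # An absent machine.files key is treated the same as an empty list.
    files = (config.get("machine") or {}).get("files") or []
    stages_cni_config = any(
        str(f.get("path", "")).startswith(CNI_CONF_DIR) for f in files
    )
    if cni.get("name") == "none" and not stages_cni_config:
        return ["CNI set to 'none' but no CNI config is staged via machine.files"]
    return []
```

An empty list means both checks passed. Like the policy, it says nothing about whether the binaries referenced by the staged config actually exist on the node.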

3. Talos Upgrade Pipeline — Verify Node Ready Before Proceeding

# After talosctl upgrade, gate on kubelet network ready — not just node registered
kubectl wait node/$NODE --for=condition=Ready --timeout=300s
# Also verify CNI pods are running before proceeding to next node
kubectl rollout status daemonset/cilium -n kube-system --timeout=120s

Never use a simple sleep 60 between node upgrades. The CNI DaemonSet rollout is your real readiness signal.
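If the upgrade pipeline is driven from a script rather than raw shell steps, the same gate can be wrapped in a small helper. The sketch below is one possible shape, not a prescribed implementation: gate_commands and node_is_ready are hypothetical names, the DaemonSet name and namespace default to the cilium / kube-system values used above, and the run parameter exists only so the logic can be tested without a cluster.

```python
import subprocess


def gate_commands(node: str, daemonset: str = "cilium",
                  namespace: str = "kube-system") -> list[list[str]]:
    """Build the kubectl readiness checks to run after upgrading one node."""
    return [
        ["kubectl", "wait", f"node/{node}",
         "--for=condition=Ready", "--timeout=300s"],
        ["kubectl", "rollout", "status", f"daemonset/{daemonset}",
         "-n", namespace, "--timeout=120s"],
    ]


def node_is_ready(node: str, run=subprocess.run) -> bool:
    """Run each check in order and stop at the first one that fails."""
    for argv in gate_commands(node):
        if run(argv).returncode != 0:
            return False
    return True
```

A pipeline would call node_is_ready after each talosctl upgrade and abort the rollout on False instead of moving on to the next node.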
