Fixing 'kubelet failed to initialize network' on Talos Linux with a Custom CNI Plugin
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: kubelet started before the custom CNI binary/config landed in /etc/cni/net.d/ or /opt/cni/bin/, causing a hard network init failure that blocks all pod scheduling.
- How to fix it: Either switch to Talos's native inline CNI configuration (which stages the config before kubelet starts) or ensure your CNI DaemonSet uses an init container that writes binaries before kubelet's first reconcile loop.
- Fast path: Use our Client-Side Sandbox below to auto-refactor your Talos machine.yaml; it detects missing CNI stanzas and generates the corrected inline config locally without sending your config off-device.
The Incident (What does the error mean?)
Raw kubelet log output from talosctl logs kubelet:
E0612 03:14:22.918311 1423 kubelet.go:2419] "Failed to initialize network plugin"
err="network plugin is not ready: cni config uninitialized"
W0612 03:14:32.101847 1423 cni.go:239] Unable to update cni config: no networks found in /etc/cni/net.d/
E0612 03:14:32.102001 1423 kubelet.go:2419] "Failed to initialize network plugin"
err="network plugin is not ready: cni config uninitialized"
Immediate consequence: kubelet enters a retry loop. Every pod on the node that needs pod networking stays in ContainerCreating. The node registers in the API server but reports NotReady. If this is a control plane node, etcd quorum may still hold, but pods that depend on the pod network (CoreDNS, for example) never start, and a full cluster bootstrap stalls until a valid CNI config appears.
On Talos specifically, /etc/cni/net.d/ and /opt/cni/bin/ are ephemeral and read-only by default. A CNI DaemonSet that tries to write binaries post-boot races against kubelet's startup sequence — and kubelet wins the race, initializes with no CNI, and does not re-probe cleanly.
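To confirm that a node is hitting this specific failure rather than some other kubelet error, the log signature shown above can be matched mechanically. The following is a minimal sketch, not an official Talos tool: `has_cni_init_failure` is a hypothetical helper name, and the two patterns are taken only from the sample log lines in this article.

```shell
# has_cni_init_failure: reads kubelet log text on stdin.
# Exits 0 if either CNI-uninitialized message from the sample log is present,
# and 1 otherwise. The patterns are assumptions based on the log excerpt above.
has_cni_init_failure() {
  grep -Eq 'cni config uninitialized|no networks found in /etc/cni/net\.d'
}

# Example usage (requires talosctl and a reachable node; NODE_IP is a placeholder):
#   talosctl logs kubelet --nodes "$NODE_IP" | has_cni_init_failure \
#     && echo "kubelet is waiting for a CNI config"
```

Because the helper only reads stdin, it can also be run against a saved log file when the node is no longer reachable.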
The Attack Vector / Blast Radius
This is a bootstrap deadlock, not a transient error. The blast radius:
- New node joins cluster → kubelet starts → no CNI config found → kubelet marks network plugin unready.
- CNI DaemonSet pod (e.g., Cilium, Calico, Flannel) is scheduled to this node to install the CNI, but it cannot run because the node is NotReady due to missing CNI. Classic chicken-and-egg.
- On a single-node or 3-node control plane, this means zero workloads schedule. If you're doing a rolling upgrade and hit this on two nodes simultaneously, you lose quorum.
- Talos-specific amplifier: Talos mounts the root filesystem read-only. Any CNI installer that tries to cp binaries to /opt/cni/bin/ via a hostPath volume will silently fail or hit a permission error unless the path is explicitly listed in Talos's machine.files or the CNI uses the correct Talos extension mechanism.
The secondary risk: operators often "fix" this by disabling the Talos immutability model (allowSchedulingOnControlPlanes: true with no taints, manual chroot hacks) — introducing real security regressions to paper over a config problem.
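Before reaching for any workaround, it helps to measure how many nodes are actually stuck. The sketch below filters `kubectl get nodes --no-headers` output for nodes whose status begins with NotReady. `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout (NAME first, STATUS second).

```shell
# not_ready_nodes: reads `kubectl get nodes --no-headers` output on stdin and
# prints the name of every node whose STATUS column starts with "NotReady"
# (this also matches combined states such as "NotReady,SchedulingDisabled").
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Example usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

If the list includes two or more control plane nodes, treat it as the quorum-loss scenario described above and stop any in-flight rolling upgrade.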
How to Fix It
Basic Fix — Use Talos Inline CNI
Talos has a first-class cluster.network.cni stanza. When set to none with no inline config, kubelet gets nothing. The fix is to either use a supported CNI name or provide a full inline config that Talos stages before kubelet starts.
# machine.yaml (Talos MachineConfig)
 cluster:
   network:
-    cni:
-      name: none
+    cni:
+      name: custom
+      urls:
+        - https://raw.githubusercontent.com/your-org/cni-manifests/main/cilium-install.yaml
For air-gapped or fully controlled deployments, use the inline approach with explicit CNI config written to disk by Talos itself:
# machine.yaml: machine.files stanza
 machine:
   files:
+    - path: /etc/cni/net.d/10-cilium.conflist
+      permissions: 0644
+      op: create
+      content: |
+        {
+          "name": "cilium",
+          "cniVersion": "0.3.1",
+          "plugins": [
+            {"type": "cilium-cni", "enable-debug": false},
+            {"type": "portmap", "capabilities": {"portMappings": true}}
+          ]
+        }
This file is written during Talos's install phase, before kubelet's first start. Race condition eliminated.
Enterprise Best Practice — Talos System Extensions + Helm CNI Bootstrap
For production clusters, do not rely on DaemonSet-based CNI installers writing to the host filesystem. Use Talos System Extensions to bake CNI binaries into the node image at build time.
Step 1: Build a custom Talos image with CNI binaries included
# ImageFactory schematic.yaml
 customization:
   systemExtensions:
     officialExtensions:
-      # No CNI extension, relying on DaemonSet hostPath writes
+      - siderolabs/cilium
Generate via https://factory.talos.dev or talosctl image with your schematic. The extension places binaries in the correct immutable paths at image build time.
Step 2: Disable the CNI installer init containers in your Helm values (they're now redundant and will conflict)
# cilium/values.yaml
 cni:
-  install: true
+  install: false
   binPath: /opt/cni/bin
   confPath: /etc/cni/net.d
Step 3: Bootstrap order in Talos — apply machine config before talosctl bootstrap
# Correct order — config applied first, binaries present before kubelet
talosctl apply-config --insecure --nodes $NODE_IP --file controlplane.yaml
# Wait for staged: talosctl get machinestatus
talosctl bootstrap --nodes $NODE_IP
# If using Talos with Cilium kube-proxy replacement, ensure this is set:
 cluster:
   proxy:
-    disabled: false
+    disabled: true # Required for Cilium kube-proxy replacement mode
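The apply-then-bootstrap ordering from Step 3 can be wrapped so that a pipeline cannot run the steps out of order. This is a sketch under stated assumptions: `wait_until` and `bootstrap_node` are hypothetical helper names, and treating a successful `talosctl get machinestatus` as "ready to bootstrap" is an assumption, so tighten that check for your environment.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; gives up once TIMEOUT_SECONDS have elapsed.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# bootstrap_node NODE_IP CONFIG_FILE
# Applies the machine config first, waits for the node to answer, then bootstraps.
bootstrap_node() {
  node_ip=$1
  config_file=$2
  talosctl apply-config --insecure --nodes "$node_ip" --file "$config_file" || return 1
  # Assumption: the node answering this query means the applied config is active.
  wait_until 300 5 talosctl get machinestatus --nodes "$node_ip" >/dev/null 2>&1 || return 1
  talosctl bootstrap --nodes "$node_ip"
}

# Example usage:
#   bootstrap_node "$NODE_IP" controlplane.yaml
```

The wrapper returns non-zero at the first failed step, so `talosctl bootstrap` never runs against a node that rejected its config.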
Prevention in CI/CD
1. Validate Talos MachineConfig in Pipeline Before Apply
Use talosctl validate as a pre-flight gate:
# In your GitHub Actions / GitLab CI
talosctl validate --mode cloud --config controlplane.yaml
talosctl validate --mode cloud --config worker.yaml
This catches missing cni stanzas, malformed machine.files entries, and invalid API versions before they hit a real node.
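When a repository holds several machine configs, the pre-flight gate is easier to maintain as one function that checks every file and reports all failures in a single run. A minimal sketch follows; `validate_configs` is a hypothetical helper name, and it uses only the `talosctl validate --mode ... --config ...` invocation shown above.

```shell
# validate_configs MODE FILE [FILE...]
# Runs `talosctl validate` on each file. Keeps going after a failure so CI
# shows every broken config at once, then returns non-zero if any file failed.
validate_configs() {
  mode=$1
  shift
  failed=0
  for config in "$@"; do
    if ! talosctl validate --mode "$mode" --config "$config"; then
      echo "validation failed: $config" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example usage in a CI step:
#   validate_configs cloud controlplane.yaml worker.yaml
```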
2. OPA/Conftest Policy — Enforce CNI Config Presence
# policy/talos_cni.rego
package talos.cni

deny[msg] {
    input.cluster.network.cni.name == "none"
    count(input.machine.files) == 0
    msg := "CNI set to 'none' but no machine.files CNI config provided. kubelet will fail to initialize network."
}

deny[msg] {
    not input.cluster.network.cni
    msg := "cluster.network.cni stanza is missing entirely. Talos will default to no CNI."
}
Run in CI:
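# conftest evaluates only the "main" package by default, so name the policy's package explicitly
conftest test controlplane.yaml --policy policy/ --namespace talos.cni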
3. Talos Upgrade Pipeline — Verify Node Ready Before Proceeding
# After talosctl upgrade, gate on kubelet network ready — not just node registered
kubectl wait node/$NODE --for=condition=Ready --timeout=300s
# Also verify CNI pods are running before proceeding to next node
kubectl rollout status daemonset/cilium -n kube-system --timeout=120s
Never use a simple sleep 60 between node upgrades. The CNI DaemonSet rollout is your real readiness signal.
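The two readiness checks above can be combined into a single gate that an upgrade loop calls after each node. This is a sketch, not a complete upgrade script: `gate_node` is a hypothetical helper name, and the `cilium` DaemonSet in `kube-system` is only the default taken from the example above.

```shell
# gate_node NODE_NAME [DAEMONSET] [NAMESPACE]
# Blocks until the node reports Ready and the CNI DaemonSet rollout completes.
# Returns non-zero if either check times out, so the caller can halt the rollout.
gate_node() {
  node=$1
  daemonset=${2:-cilium}
  namespace=${3:-kube-system}
  kubectl wait "node/$node" --for=condition=Ready --timeout=300s || return 1
  kubectl rollout status "daemonset/$daemonset" -n "$namespace" --timeout=120s
}

# Example usage, after upgrading one node and before touching the next:
#   gate_node "$NODE" || { echo "halting rollout at $NODE" >&2; exit 1; }
```

Halting on the first failed gate keeps a bad CNI config from spreading to a second node, which is the quorum-loss case described earlier.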