How to Fix Kubelet 'Failed to Run: dial tcp 127.0.0.1:10250 Connection Refused' Error
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10–30 mins
TL;DR
- What broke: Kubelet cannot bind or connect to its own HTTPS serving port 10250, so the node never reaches Ready state and the API server cannot scrape metrics or exec into pods.
- How to fix it: Identify whether another process owns port 10250, whether the --port/--healthz-port flags are misconfigured, or whether the container runtime socket path is wrong, then correct the kubelet config or systemd unit and restart.
- Shortcut: Use our Client-Side Sandbox above to paste your kubelet flags or KubeletConfiguration manifest and auto-generate the corrected config.
The Incident (What Does the Error Mean?)
Failed to run kubelet: validate service connection:
dial tcp 127.0.0.1:10250: connect: connection refused
Kubelet starts, attempts to validate its own internal service connection on 127.0.0.1:10250 (the authenticated HTTPS port that serves exec, logs, and metrics; the unauthenticated read-only port is 10255), and gets an immediate connection refused. This means the Kubelet HTTPS server never came up. The node registers as NotReady. Every pod scheduled to this node stays in Pending or Unknown. kubectl exec, kubectl logs, and Prometheus node scraping all fail simultaneously.
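Immediate blast: Zero workloads run on this node. If this is a control-plane node on a kubeadm-style cluster, the kubelet also manages the static control-plane pods, so the local kube-apiserver and etcd member can go down with it, and etcd may lose quorum depending on your HA topology.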
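To confirm the symptom from a machine with cluster access, a minimal sketch like the following filters node listings for anything that is not Ready. The helper name `not_ready_nodes` is hypothetical, and it assumes the default `kubectl get nodes` column order (NAME first, STATUS second).

```shell
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the NAME of every node whose STATUS column does not start with "Ready".
# (Hypothetical helper; assumes the default kubectl column order.)
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Cordoned but healthy nodes (STATUS `Ready,SchedulingDisabled`) are deliberately not reported, since they are not affected by this error.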
The Attack Vector / Blast Radius
Port 10250 is the Kubelet API — it serves:
- POST /exec (kubectl exec)
- POST /run (direct command execution)
- GET /metrics
- GET /pods
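When Kubelet fails to bind 10250, the node is dead. But the secondary risk is misconfiguration that causes Kubelet to bind 10250 on 0.0.0.0 with anonymous auth enabled: anyone who can reach the port can then call /exec and /run, which is a critical RCE vector. (CVE-2018-1002105 is a related but distinct API-server flaw that let attackers reach the same kubelet endpoints through proxied connections.) Fixing this error is the moment to also audit the auth config.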
Cascading failure chain:
- Node stays NotReady → DaemonSet pods (including CNI) cannot start on it → the node's network plane never comes up.
- In autoscaling clusters, the broken node triggers scale-out, spawning more broken nodes if the AMI/image carries the same misconfiguration.
- The kube-controller-manager node lifecycle controller marks the node unhealthy once node-monitor-grace-period (default 40s) passes without a heartbeat, then evicts its pods when their not-ready tolerations expire (300s by default). The replacements land on the remaining nodes, spiking load.
How to Fix It
Step 1 — Confirm the port is not already bound
# On the affected node:
sudo ss -tlnp | grep 10250
sudo lsof -i :10250
If another process owns 10250, stop it or reconfigure it. A stale kubelet process left over from a failed restart is a common culprit.
sudo systemctl stop kubelet
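sudo pkill -x kubelet # -x matches the exact process name; -f would also kill kube-apiserver, whose command line contains --kubelet-* flags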
sudo ss -tlnp | grep 10250 # must be empty now
sudo systemctl start kubelet
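The port check in Step 1 can be scripted so a bootstrap job can report which process is holding the port. This is a minimal sketch: the helper name `port_owner` is hypothetical, and it assumes the usual `ss -tlnp` layout (State first, Local Address:Port fourth, a trailing `users:(("name",pid=...,fd=...))` column).

```shell
# port_owner PORT: read `ss -tlnp` output on stdin and print the name of each
# process listening on PORT. (Hypothetical helper; assumes the standard ss
# column layout with the local address in the fourth field.)
port_owner() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      if (match($0, /users:\(\("[^"]+"/)) {
        name = substr($0, RSTART, RLENGTH)
        sub(/^users:\(\("/, "", name)   # drop the leading users:(("
        sub(/"$/, "", name)             # drop the closing quote
        print name
      }
    }'
}

# Usage on the affected node:
#   sudo ss -tlnp | port_owner 10250
```

If the output is anything other than `kubelet` (or empty while the kubelet is stopped), that process is the one to reconfigure.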
Step 2 — Validate the KubeletConfiguration port binding
Bad config (anonymous auth + wrong port flag):
- apiVersion: kubelet.config.k8s.io/v1beta1
- kind: KubeletConfiguration
- port: 0
- readOnlyPort: 10255
- authentication:
-   anonymous:
-     enabled: true
-   webhook:
-     enabled: false
- authorization:
-   mode: AlwaysAllow
Good config (authenticated, correct port, webhook authz):
+ apiVersion: kubelet.config.k8s.io/v1beta1
+ kind: KubeletConfiguration
+ port: 10250
+ readOnlyPort: 0
+ authentication:
+   anonymous:
+     enabled: false
+   webhook:
+     enabled: true
+   x509:
+     clientCAFile: /etc/kubernetes/pki/ca.crt
+ authorization:
+   mode: Webhook
+ tlsCertFile: /var/lib/kubelet/pki/kubelet.crt
+ tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
port: 0 is not a usable serving port (valid values are 1 to 65535, and the default is 10250), so set it explicitly. In this failure mode Kubelet starts, tries to validate the connection, and immediately fails. This is the silent killer.
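The differences between the two configs can be checked mechanically before restarting the kubelet. The sketch below is line-based and assumes the simple block-style YAML layout shown in this article (top-level `port:` and `readOnlyPort:`, `enabled:` on the line directly after `anonymous:`); it is not a general YAML parser, and the helper name `audit_kubelet_config` is hypothetical.

```shell
# audit_kubelet_config FILE: print one finding per line and return non-zero
# if FILE deviates from the hardened settings in the good config.
# (Hypothetical helper; line-based matching, not a real YAML parser.)
audit_kubelet_config() {
  local file="$1" rc=0 anon
  grep -Eq '^port:[[:space:]]*10250[[:space:]]*$' "$file" \
    || { echo "port is not explicitly 10250"; rc=1; }
  grep -Eq '^readOnlyPort:[[:space:]]*0[[:space:]]*$' "$file" \
    || { echo "readOnlyPort is not 0"; rc=1; }
  # Value of `enabled:` on the line immediately after `anonymous:`.
  anon=$(awk '/^[ \t]+anonymous:/ { getline; print $2; exit }' "$file")
  [ "$anon" = "false" ] || { echo "anonymous auth is not disabled"; rc=1; }
  grep -Eq '^[[:space:]]+mode:[[:space:]]*Webhook[[:space:]]*$' "$file" \
    || { echo "authorization mode is not Webhook"; rc=1; }
  return "$rc"
}

# Usage:
#   audit_kubelet_config /etc/kubernetes/kubelet-config.yaml || echo "fix before restart"
```

A silent exit with status 0 means all four settings match; each deviation is reported on its own line so the output can be pasted into an incident ticket.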
Step 3 — Check the container runtime socket
A missing or wrong CRI socket causes Kubelet to abort before the HTTP server initializes:
- --container-runtime-endpoint=unix:///var/run/dockershim.sock
+ --container-runtime-endpoint=unix:///run/containerd/containerd.sock
Verify the socket exists:
ls -la /run/containerd/containerd.sock
sudo systemctl status containerd
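The socket check can also be driven from the same endpoint string the kubelet receives, which avoids verifying one path while the kubelet is configured with another. A minimal sketch, with hypothetical helper names `cri_socket_path` and `cri_socket_ok`:

```shell
# cri_socket_path ENDPOINT: strip the unix:// scheme from a CRI endpoint
# and print the filesystem path that remains.
cri_socket_path() {
  printf '%s\n' "${1#unix://}"
}

# cri_socket_ok ENDPOINT: succeed only if the endpoint resolves to an
# existing socket file on this node.
cri_socket_ok() {
  [ -S "$(cri_socket_path "$1")" ]
}

# Usage:
#   cri_socket_ok unix:///run/containerd/containerd.sock || echo "CRI socket missing"
```

Note that an existing socket only shows the runtime created it at some point; `systemctl status containerd` is still needed to confirm the daemon is running.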
Step 4 — Firewall / iptables check
sudo iptables -L INPUT -n -v | grep 10250
# If a DROP rule exists for 10250 on loopback:
sudo iptables -D INPUT -p tcp --dport 10250 -j DROP
On cloud nodes (AWS/GCP/Azure), also verify the security group / firewall rule allows inbound 10250 from the control plane CIDR.
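Grepping the table view can miss rules, so a slightly more precise sketch works on `iptables -S` output, which prints each rule in the same form used to add it. The helper name `blocking_rules` is hypothetical. As a rule of thumb, a REJECT rule tends to surface as "connection refused" while a DROP rule surfaces as a timeout, so both are listed.

```shell
# blocking_rules PORT: read `iptables -S` output on stdin and print every rule
# that matches PORT as a destination port and jumps to DROP or REJECT.
# (Hypothetical helper; exit status is non-zero when nothing matches.)
blocking_rules() {
  grep -E -- "--dport $1( |\$)" | grep -E -- '-j (DROP|REJECT)( |$)'
}

# Usage on the affected node:
#   sudo iptables -S INPUT | blocking_rules 10250
```

Each printed rule can be removed by re-running it with `-D` in place of `-A`.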
Enterprise Best Practice — Systemd Drop-in Hardening
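# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
# systemd does not expand $(...) in ExecStart, so NODE_IP is read from an environment
# file written at boot, e.g. from http://169.254.169.254/latest/meta-data/local-ipv4
# (the file path below is an assumption).
[Service]
+ EnvironmentFile=/etc/kubernetes/kubelet-node-ip.env
- ExecStart=/usr/bin/kubelet
+ ExecStart=
+ ExecStart=/usr/bin/kubelet \
+ --config=/etc/kubernetes/kubelet-config.yaml \
+ --container-runtime-endpoint=unix:///run/containerd/containerd.sock \
+ --node-ip=${NODE_IP} \
+ --rotate-certificates=true \
+ --rotate-server-certificates=true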
Always pass --config pointing to a versioned KubeletConfiguration file tracked in Git. Never rely on bare CLI flags in production — they are invisible to policy engines.
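One caveat when passing a dynamic node IP: systemd does not perform shell command substitution in ExecStart, so the address has to be resolved before the unit starts. A common pattern is a boot-time script that writes the value to an environment file, which the unit then loads via EnvironmentFile= and references as ${NODE_IP}. The sketch below assumes that pattern; the helper name `write_node_ip_env` and the file path in the usage note are hypothetical.

```shell
# write_node_ip_env IP FILE: validate that IP looks like a dotted-quad IPv4
# address, then write NODE_IP=<ip> to FILE. Refuses empty or malformed input
# so a failed metadata lookup never produces a broken unit environment.
# (Hypothetical helper; the check is syntactic only, octet ranges are not verified.)
write_node_ip_env() {
  local ip="$1" out="$2"
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' \
    || { echo "invalid IPv4 address: '$ip'" >&2; return 1; }
  printf 'NODE_IP=%s\n' "$ip" > "$out"
}

# Usage at boot, before kubelet starts (assumed file path):
#   ip=$(curl -sf http://169.254.169.254/latest/meta-data/local-ipv4)
#   write_node_ip_env "$ip" /etc/kubernetes/kubelet-node-ip.env
```

Failing loudly on an empty address is a deliberate choice here: a kubelet that refuses to start is easier to diagnose than one that registers with the wrong IP.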
Prevention in CI/CD
1. Validate KubeletConfiguration in your node bootstrap pipeline
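# The kubelet binary has no dry-run flag that we know of, so lint the file directly.
# Fail the pipeline unless the serving port is explicitly 10250:
grep -Eq '^port:[[:space:]]*10250[[:space:]]*$' /etc/kubernetes/kubelet-config.yaml \
  || { echo "KUBELET_PORT_INVALID"; exit 1; }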
2. Checkov policy for KubeletConfiguration
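# checkov custom policy: block port: 0 (anonymous auth is covered by the OPA policy in the next item)
# Laid out in Checkov's YAML custom-policy schema; whether your Checkov version
# scans KubeletConfiguration files is an assumption to verify.
metadata:
  name: "Kubelet serving port must not be 0"
  id: "CKV_KUBELET_PORT_ENABLED"
  category: "KUBERNETES"
definition:
  cond_type: "attribute"
  resource_types:
    - "KubeletConfiguration"
  attribute: "port"
  operator: "not_equals"
  value: 0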
3. OPA/Gatekeeper ConstraintTemplate — enforce webhook authz
package kubelet.auth

violation[{"msg": msg}] {
  input.authorization.mode != "Webhook"
  msg := "Kubelet authorization mode must be Webhook, not AlwaysAllow"
}

violation[{"msg": msg}] {
  input.authentication.anonymous.enabled == true
  msg := "Kubelet anonymous authentication must be disabled"
}
4. Node readiness smoke test in your AMI bake pipeline
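#!/bin/bash
# Run post-boot in Packer or EC2 user-data test phase. Any HTTP status proves 10250 is
# serving; an unauthenticated probe gets 401 (not "ok") once anonymous auth is disabled.
for _ in $(seq 1 15); do
  [ "$(curl -sk -o /dev/null -w '%{http_code}' https://127.0.0.1:10250/healthz)" != "000" ] && exit 0
  sleep 2
done
echo "KUBELET_HEALTHZ_FAIL"; exit 1
Fail the AMI build if 10250 doesn't answer within 30 seconds. This catches port misconfiguration before the image ever reaches production.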