How to Fix Kubelet PLEG Not Healthy: Node NotReady Debugging Guide
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: Kubelet's PLEG loop stopped syncing container runtime state. The node flipped to NotReady, evicting or orphaning all pods scheduled on it.
- How to fix it: Identify whether the root cause is a container runtime deadlock (containerd/dockerd hung), inotify watch/instance exhaustion, or disk/memory pressure. Then restart the offending service or tune kernel parameters.
The Incident (What Does the Error Mean?)
Raw error — as it appears in journalctl -u kubelet -f:
E0610 03:42:17.123456 1234 pleg.go:418] PLEG is not healthy: pleg was last seen active 3m0.123456789s ago; threshold is 3m0s
W0610 03:42:17.123456 1234 node_lifecycle_controller.go:1501] Node <node-name> status is now: NodeNotReady
PLEG (Pod Lifecycle Event Generator) is the kubelet subsystem that relists containers from the container runtime (via CRI) every second and generates lifecycle events (started, died, etc.) for each container. Both the 1s relist period (plegRelistPeriod) and the 3-minute health threshold are constants compiled into kubelet. If a relist has not completed within 3 minutes, the PLEG health check fails. Kubelet then reports the node's Ready condition as False, with the PLEG error in the condition message, and the node shows as NotReady.
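The threshold check is simple arithmetic on the log message. A log-scraping job can therefore run the same check before the node flips. The sketch below pulls the "last seen active" age out of a line like the sample above and compares it with the 180 s threshold. It assumes the message wording matches that sample; adjust the sed pattern if your kubelet version phrases it differently. The function names (go_duration_to_seconds, pleg_last_seen_seconds, pleg_threshold_exceeded) are hypothetical helpers, not kubelet tooling.

```bash
# Convert a Go-style duration such as "3m0.123456789s" or "1h2m3s" to seconds.
# Only h/m/s units are handled; anything else (e.g. "500ms") is rejected.
go_duration_to_seconds() {
  printf '%s\n' "$1" | awk '{
    s = $0; total = 0
    while (match(s, /^[0-9.]+[hms]/)) {
      num = substr(s, 1, RLENGTH - 1)
      unit = substr(s, RLENGTH, 1)
      if (unit == "h") total += num * 3600
      else if (unit == "m") total += num * 60
      else total += num
      s = substr(s, RLENGTH + 1)
    }
    if (s != "" || $0 == "") exit 1   # leftover text means an unsupported unit
    printf "%.3f\n", total
  }'
}

# Print the "last seen active" age in seconds from a kubelet PLEG log line.
# Assumption: the line contains "pleg was last seen active <duration> ago".
pleg_last_seen_seconds() {
  local dur
  dur=$(printf '%s\n' "$1" | sed -n 's/.*pleg was last seen active \([0-9.hms]*\) ago.*/\1/p')
  [ -n "$dur" ] || return 1
  go_duration_to_seconds "$dur"
}

# Exit 0 if the age (seconds) is at or past the limit (default: 180 s, i.e. 3m).
pleg_threshold_exceeded() {
  awk -v age="$1" -v limit="${2:-180}" 'BEGIN { exit !(age + 0 >= limit + 0) }'
}
```

Example: `age=$(pleg_last_seen_seconds "$(journalctl -u kubelet --since "10 minutes ago" | grep 'PLEG is not healthy' | tail -n 1)") && pleg_threshold_exceeded "$age" && echo "PLEG stalled for ${age}s"`. Passing a lower limit, such as 60, turns the same helper into an early warning.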
Immediate consequence: All pods on this node lose their Ready condition. PodDisruptionBudgets covering those pods can drop to zero allowed disruptions, which blocks voluntary evictions (such as node drains) elsewhere in the cluster. Stateful workloads (databases, Kafka brokers) that were pinned to this node are now down.
The Attack Vector / Blast Radius
This is not a soft warning — this is a full node outage event.
Cascading failure chain:
- PLEG stalls → kubelet cannot reconcile container state → pod status stops updating and pods are marked NotReady
- The node controller taints the node with node.kubernetes.io/not-ready:NoExecute → pods tolerate that taint for 300s by default (the same 5 minutes as the older pod-eviction-timeout) → then they are evicted
- DaemonSet pods are not evicted: the DaemonSet controller adds tolerations for the not-ready and unreachable NoExecute taints automatically
- If this node hosts a control-plane component (etcd, kube-apiserver in self-managed clusters) → cluster quorum risk
- Persistent Volumes with ReadWriteOnce remain attached to the dead node → new pods on replacement nodes cannot mount the PVC (the infamous "Multi-Attach" error follows)
Most common root causes ranked by frequency in production:
| Root Cause | Signal |
|---|---|
| containerd or dockerd goroutine deadlock | crictl ps hangs; many processes stuck in D state (uninterruptible sleep) |
| inotify limits exhausted | inotify calls fail with "no space left on device" (watches) or "too many open files" (instances) in kubelet/containerd logs |
| Disk pressure (imagefs or nodefs full) | kubectl describe node shows DiskPressure=True |
| Zombie/unkillable containers | crictl ps -a shows containers stuck in Unknown state |
| Kernel version bug with cgroups v2 | Specific kernel + containerd version matrix |
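The "D-state processes" signal in the first row can be checked mechanically. The sketch below filters `ps -eo stat,pid,comm` output for processes in uninterruptible sleep. It is a hedged heuristic rather than a diagnosis: a non-zero count that persists across several samples, especially for containerd, containerd-shim or runc, points at a hung runtime or stuck storage. The function names are hypothetical.

```bash
# List PID and command of processes whose state starts with "D" (uninterruptible sleep).
# Expects `ps -eo stat,pid,comm` output on stdin; the header line is skipped.
dstate_processes() {
  awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}

# Count D-state processes (tr strips the padding some wc implementations add).
dstate_count() {
  dstate_processes | wc -l | tr -d ' '
}
```

Example: `ps -eo stat,pid,comm | dstate_processes`. Run it a few times a couple of seconds apart. Brief D states during normal I/O are expected; the same PIDs staying in D are the warning sign.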
How to Fix It (The Solution)
Step 1: Triage in 60 Seconds
# Is the container runtime responding? (timeout stops a hung runtime from hanging your shell)
timeout 10 crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
# Check node conditions, including DiskPressure
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
# Check inotify limits
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
# Find containers that are not running (exited, created, or unknown)
timeout 10 crictl ps -a | grep -v Running
# Pull kubelet logs from the last 10 minutes (use --since/--until to bracket an older event)
journalctl -u kubelet --since "10 minutes ago" | grep -E 'PLEG|relist|runtime'
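The limits alone do not show how close the node is to exhausting them. One common way to measure actual usage is to count the /proc/<pid>/fd symlinks that point at "anon_inode:inotify". The count covers every user, while max_user_instances is a per-user limit, so treat the result as an upper bound for root-owned daemons such as kubelet and containerd. Run it as root so all fd directories are readable. The optional proc-root argument and both function names are hypothetical, added only so the helper can be tested against a fake directory.

```bash
# Count open inotify instances by scanning fd symlinks under /proc (or a given root).
count_inotify_instances() {
  local proc_root=${1:-/proc}
  find "$proc_root"/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l | tr -d ' '
}

# Integer percentage of a limit that is in use (bash arithmetic truncates).
inotify_usage_percent() {
  echo $(( $1 * 100 / $2 ))
}
```

Example: `used=$(count_inotify_instances); limit=$(cat /proc/sys/fs/inotify/max_user_instances); echo "inotify instances: $used/$limit ($(inotify_usage_percent "$used" "$limit")%)"`.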
Basic Fix — Restart the Container Runtime
If crictl ps hangs or times out, the runtime is most likely deadlocked. This is the most common cause.
# For containerd
systemctl restart containerd
systemctl restart kubelet
# For docker (legacy)
systemctl restart docker
systemctl restart kubelet
# Verify node recovers
kubectl get node <node-name> --watch
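⚠️ Restarting containerd normally leaves running containers up, because its shims outlive the daemon (the stock containerd.service uses KillMode=process). CRI calls fail until it is back. Restarting dockerd stops containers unless live-restore is enabled. On a node already in NotReady, this is acceptable. Do NOT do this on a healthy node without draining first.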
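Restarting kubelet while containerd is still starting can leave kubelet talking to a socket that is not serving yet. A safer sequence waits until crictl actually answers before kubelet is restarted. The sketch below assumes containerd with the default socket path used in the triage commands. retry_until and restart_runtime_then_kubelet are hypothetical helper names.

```bash
# Run a command until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <sleep-seconds> <command> [args...]
retry_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Restart containerd, wait up to ~60 s for CRI to respond, then restart kubelet.
# Assumption: containerd listens on /run/containerd/containerd.sock.
restart_runtime_then_kubelet() {
  systemctl restart containerd || return 1
  retry_until 30 2 timeout 5 crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps >/dev/null || return 1
  systemctl restart kubelet
}
```

If retry_until gives up, containerd came back but is still not serving CRI. Restarting kubelet at that point would not clear PLEG, so look at `journalctl -u containerd` instead.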
Fix — inotify Exhaustion (Common on High-Density Nodes)
# /etc/sysctl.d/99-kubelet-inotify.conf
- fs.inotify.max_user_watches = 8192
- fs.inotify.max_user_instances = 128
+ fs.inotify.max_user_watches = 1048576
+ fs.inotify.max_user_instances = 512
sysctl -p /etc/sysctl.d/99-kubelet-inotify.conf
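Running `sysctl -p` only shows what was written. Reading the values back from /proc/sys confirms that the running kernel picked them up. The optional root argument in this sketch exists only so it can be pointed at a test directory. sysctl_at_least is a hypothetical helper: it returns 1 when a value is below the minimum and 2 when the key cannot be read.

```bash
# Check that a sysctl key is at least a minimum value.
# Usage: sysctl_at_least <key> <minimum> [sysctl-root]
sysctl_at_least() {
  local key=$1 min=$2 root=${3:-/proc/sys} path value
  path=$(printf '%s' "$key" | tr . /)   # fs.inotify.x -> fs/inotify/x
  value=$(cat "$root/$path" 2>/dev/null) || return 2
  if [ "$value" -ge "$min" ]; then
    echo "OK  $key=$value"
  else
    echo "LOW $key=$value (want >= $min)"
    return 1
  fi
}
```

Example: `sysctl_at_least fs.inotify.max_user_watches 1048576 && sysctl_at_least fs.inotify.max_user_instances 512`.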
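Enterprise Best Practice: Tune CRI Timeouts and Eviction Thresholds
The 1s relist period is aggressive on nodes running 100+ containers. However, it is a constant in the kubelet source (plegRelistPeriod), not a KubeletConfiguration field, so it cannot be tuned directly. What you can tune are the CRI request timeout and the eviction thresholds. The timeout bounds how long runtime calls may take, and the thresholds evict pods before disk pressure slows the runtime down.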
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
- # No runtimeRequestTimeout set (kubelet default: 2m)
+ # Timeout for CRI calls other than long-running ones (image pull, logs, exec, attach)
+ runtimeRequestTimeout: 15m
evictionHard:
- nodefs.available: "100Mi"
- imagefs.available: "1Gi"
+ nodefs.available: "500Mi"
+ imagefs.available: "5Gi"
+ nodefs.inodesFree: "5%"
# Node-level containerd config: /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri"]
- max_concurrent_downloads = 3
+ max_concurrent_downloads = 6
[plugins."io.containerd.grpc.v1.cri".containerd]
- # default snapshotter
+ snapshotter = "overlayfs"
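The eviction thresholds only help if the node is actually above them. This sketch compares free space on the nodefs and imagefs filesystems with the absolute evictionHard values shown above. It assumes the default data directories /var/lib/kubelet (nodefs) and /var/lib/containerd (imagefs); adjust them if yours differ. free_mib and under_threshold_mib are hypothetical helper names.

```bash
# Available space in MiB for the filesystem holding a path (POSIX df, 1M blocks).
free_mib() {
  df -Pm "$1" | awk 'NR == 2 { print $4 }'
}

# Exit 0 when the path has less free space than the threshold (MiB); 2 if df fails.
under_threshold_mib() {
  local free
  free=$(free_mib "$1")
  [ -n "$free" ] || return 2
  [ "$free" -lt "$2" ]
}
```

Example: `under_threshold_mib /var/lib/kubelet 500 && echo "nodefs below 500Mi"` and `under_threshold_mib /var/lib/containerd 5120 && echo "imagefs below 5Gi"` (5Gi = 5120 MiB).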
Prevention in CI/CD
1. Node Problem Detector (NPD) — Deploy It If You Haven't
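# node-problem-detector DaemonSet (abridged: upstream also ships RBAC, monitor config, log mounts)
# NPD reports kernel/runtime problems (e.g. KernelDeadlock, ReadonlyFilesystem) as Node
# Conditions and Events, often before kubelet's PLEG health check gives up
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14
        env:
        - name: NODE_NAME  # lets NPD find its own Node object
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true
        volumeMounts:
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
      volumes:
      - name: kmsg
        hostPath:
          path: /dev/kmsg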
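NPD's conditions land next to the kubelet's own conditions on the Node object, so one filter over all of them shows what is wrong. This sketch reads "Type=Status" lines, such as those printed by the kubectl jsonpath query in the triage step. It prints Ready when it is not True and any other condition (DiskPressure, KernelDeadlock, ...) when it is True. problem_conditions is a hypothetical helper name.

```bash
# Print node conditions that indicate trouble; input is "Type=Status" per line.
problem_conditions() {
  awk -F= '
    $1 == "Ready" && $2 != "True" { print; next }
    $1 != "Ready" && $2 == "True" { print }
  '
}
```

Example: `kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | problem_conditions`.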
2. Alert Before the 3-Minute Threshold
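# Prometheus alerting rules (kubelet and kube-state-metrics must be scraped)
groups:
- name: kubelet-pleg
  rules:
  - alert: KubeletPLEGDurationHigh
    # kubelet_pleg_relist_duration_seconds is a histogram; take the p99 per kubelet
    expr: histogram_quantile(0.99, sum by (instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "PLEG relist p99 > 10s on {{ $labels.instance }}: PLEG stall imminent"
  - alert: KubeletPLEGNotHealthy
    # Fires on any NotReady node (PLEG or otherwise); check kubelet logs for the reason
    expr: kube_node_status_condition{condition="Ready",status="false"} == 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} is NotReady"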
3. Enforce Node Capacity Limits via OPA/Gatekeeper
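Prevent over-scheduling that causes container runtime saturation. The kubelet's own maxPods setting (KubeletConfiguration, default 110) is the hard per-node cap. The Gatekeeper Constraint below is an extra policy layer. It assumes a custom K8sMaxPodsPerNode ConstraintTemplate that you write and install, because Gatekeeper does not ship that kind.
# Gatekeeper Constraint (requires the custom K8sMaxPodsPerNode ConstraintTemplate)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sMaxPodsPerNode
metadata:
  name: max-pods-per-node
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    maxPods: 80  # below the kubelet default of 110; leaves headroom for system pods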
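Before enforcing a cap, it helps to see which nodes already exceed it. This sketch counts pods per node from one node name per line and reports the nodes over a limit. Unscheduled pods produce empty lines and are skipped. nodes_over_limit is a hypothetical helper name.

```bash
# Report "node count" for nodes whose pod count exceeds the limit.
# Input: one node name per line (one line per pod); empty lines are ignored.
nodes_over_limit() {
  awk -v limit="$1" 'NF { count[$1]++ }
    END { for (n in count) if (count[n] > limit + 0) print n, count[n] }' | sort
}
```

Example: `kubectl get pods -A --field-selector=status.phase=Running -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | nodes_over_limit 80`. The phase filter leaves out completed pods, which still carry a nodeName.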
4. Add to Your Node Provisioning Pipeline (Terraform/Ansible)
# In your node bootstrap userdata / cloud-init
+ cat > /etc/sysctl.d/99-k8s-node.conf <<EOF
+ fs.inotify.max_user_watches=1048576
+ fs.inotify.max_user_instances=512
+ kernel.pid_max=4194304
+ EOF
+ sysctl -p /etc/sysctl.d/99-k8s-node.conf
Bake this into your golden AMI / node image. A node that boots without these settings on a high-density cluster is a PLEG incident waiting to happen.