
How to Fix Kubelet PLEG Not Healthy: Node NotReady Debugging Guide

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

  • What broke: Kubelet's PLEG loop stopped syncing container runtime state. The node flipped to NotReady, evicting or orphaning all pods scheduled on it.
  • How to fix it: Identify whether the root cause is a container runtime deadlock (containerd/dockerd hung), inotify descriptor exhaustion, or disk/memory pressure — then restart the offending service or tune kernel parameters.

The Incident (What Does the Error Mean?)

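Raw error, as it appears in journalctl -u kubelet -f (the NodeNotReady line is emitted by the node lifecycle controller in kube-controller-manager, not by kubelet):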

E0610 03:42:17.123456    1234 pleg.go:418] PLEG is not healthy: pleg was last seen active 3m0.123456789s ago; threshold is 3m0s
W0610 03:42:17.123456    1234 node_lifecycle_controller.go:1501] Node <node-name> status is now: NodeNotReady

PLEG (Pod Lifecycle Event Generator) is the kubelet subsystem that relists containers from the container runtime (via CRI) roughly once per second and generates lifecycle events (started, died, etc.) for every container. The relist interval is a compiled-in constant in the generic PLEG, not a KubeletConfiguration field. The health check has a hard threshold of 3 minutes: if the last relist has not completed within that window, kubelet reports the node's Ready condition as False and the node lifecycle controller reacts by tainting the node.
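
To see how far past the 3-minute threshold a node has drifted, the stall duration can be pulled out of the kubelet log line itself. This is a minimal sketch that assumes the message format shown in the raw error above; pleg_last_seen is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: print the "last seen active" duration from each
# "PLEG is not healthy" line read on stdin. Other lines are ignored.
pleg_last_seen() {
  sed -n 's/.*PLEG is not healthy: pleg was last seen active \([^ ]*\) ago.*/\1/p'
}

# Usage on a node (assumes journalctl access):
#   journalctl -u kubelet --since "1 hour ago" | pleg_last_seen
```

Durations that keep growing between lines suggest a relist that is still stuck rather than a one-off slow response.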

Immediate consequence: All pods on this node lose their Ready condition. Deployments with PodDisruptionBudgets may block rollouts cluster-wide. Stateful workloads (databases, Kafka brokers) that were pinned to this node are now down.


The Attack Vector / Blast Radius

This is not a soft warning — this is a full node outage event.

Cascading failure chain:

  1. PLEG stalls → kubelet cannot reconcile container state → pods stuck in Unknown
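  2. Node controller applies the node.kubernetes.io/not-ready:NoExecute taint → pods are evicted once their toleration expires (default tolerationSeconds: 300; older clusters used --pod-eviction-timeout, default: 5 min)
  3. DaemonSet pods tolerate that NoExecute taint without a timeout, so they are not evicted and stay bound to the broken node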
  4. If this node hosts a control-plane component (etcd, kube-apiserver in self-managed clusters) → cluster quorum risk
  5. Persistent Volumes with ReadWriteOnce remain attached to the dead node → new pods on replacement nodes cannot mount the PVC (the infamous "multi-attach" error follows)
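
The last step in the chain can be checked directly by listing which volumes are still attached to the failed node. The sketch below is one way to do that under stated assumptions: attachments_on_node is a hypothetical helper, and the input is expected as whitespace-separated "NAME NODE PV" rows such as the kubectl command in the usage comment produces.

```shell
# Hypothetical helper: from "NAME NODE PV" rows on stdin, print the
# attachment name and PV for rows whose NODE column equals $1.
attachments_on_node() {
  awk -v node="$1" '$2 == node { print $1, $3 }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get volumeattachments --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName \
#     | attachments_on_node <node-name>
```

Any row printed for the NotReady node is a candidate for the multi-attach error when a replacement pod starts elsewhere.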

Most common root causes ranked by frequency in production:

Root Cause                                  Signal
------------------------------------------  -------------------------------------------------
containerd or dockerd goroutine deadlock    crictl ps hangs; high D-state processes
inotify watch limit exhausted               dmesg shows inotify_add_watch: No space left
Disk pressure (imagefs or nodefs full)      kubectl describe node shows DiskPressure=True
Zombie/unkillable containers                crictl ps shows containers stuck in UNKNOWN state
Kernel version bug with cgroups v2          Specific kernel + containerd version matrix
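
The "D-state processes" signal in the table can be checked with a small filter over ps output. This is a sketch, assuming procps-style ps with the stat, pid, and comm columns; dstate_procs is a hypothetical helper name, and command names containing spaces would be cut at the first space.

```shell
# Hypothetical helper: from "STAT PID COMM" rows on stdin, print the PID
# and command of processes in uninterruptible sleep (STAT starts with D).
dstate_procs() {
  awk '$1 ~ /^D/ { print $2, $3 }'
}

# Usage on a node:
#   ps -eo stat=,pid=,comm= | dstate_procs
```

Runtime or shim processes that stay in this list across several runs usually point at blocked I/O rather than a busy CPU.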

How to Fix It (The Solution)

Step 1: Triage in 60 Seconds

# Is the container runtime responding?
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps

# Check for disk pressure
kubectl describe node <node-name> | grep -A5 Conditions

# Check inotify limits
cat /proc/sys/fs/inotify/max_user_watches
sysctl fs.inotify.max_user_instances

# Find zombie containers
crictl ps -a | grep -v Running

# Pull kubelet logs from the last 10 minutes
journalctl -u kubelet --since "10 minutes ago" | grep -E 'PLEG|relist|runtime'
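
When the runtime is deadlocked, a bare crictl call can block the terminal indefinitely, so it helps to wrap the probe in a time limit. The sketch below assumes coreutils timeout is installed (it exits with status 124 when the limit is reached); runtime_probe is a hypothetical helper name.

```shell
# Hypothetical helper: run a command with a time limit and classify the result.
#   $1   = time limit in seconds
#   rest = command to run
runtime_probe() {
  limit="$1"; shift
  rc=0
  timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0)   echo "ok" ;;            # command returned successfully
    124) echo "hung" ;;          # timeout killed it: runtime likely stuck
    *)   echo "error ($rc)" ;;   # command failed on its own
  esac
}

# Usage on a node:
#   runtime_probe 10 crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
```

A "hung" result is the case the restart procedure below is meant for; an "error" result is worth reading in full before restarting anything.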

Basic Fix — Restart the Container Runtime

If crictl ps hangs or returns errors, the runtime is deadlocked. This is the most common cause.

# For containerd
systemctl restart containerd
systemctl restart kubelet

# For docker (legacy)
systemctl restart docker
systemctl restart kubelet

# Verify node recovers
kubectl get node <node-name> --watch

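⚠️ Restarting containerd briefly interrupts container management; running containers normally keep running under their shims, while restarting dockerd stops them unless live-restore is enabled. On a node already in NotReady, this is acceptable. Do NOT do this on a healthy node without draining first.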


Fix — inotify Exhaustion (Common on High-Density Nodes)

# /etc/sysctl.d/99-kubelet-inotify.conf

- fs.inotify.max_user_watches = 8192
- fs.inotify.max_user_instances = 128
+ fs.inotify.max_user_watches = 1048576
+ fs.inotify.max_user_instances = 512

# Apply the new limits without a reboot
sysctl -p /etc/sysctl.d/99-kubelet-inotify.conf
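
Before raising the limits, it can be useful to confirm which processes actually hold the inotify instances. This is a sketch, assuming GNU find and root access to /proc; inotify_instances_by_pid is a hypothetical helper name, and it counts inotify instances (open file descriptors), not individual watches.

```shell
# Hypothetical helper: count inotify file descriptors per PID.
#   $1 = proc root to scan (defaults to /proc)
# Prints "COUNT PID" rows, highest count first.
inotify_instances_by_pid() {
  proc_root="${1:-/proc}"
  find "$proc_root"/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null |
    awk -F/ '{ pid = $(NF-2); count[pid]++ }
             END { for (p in count) print count[p], p }' |
    sort -rn
}

# Usage on a node (run as root so every process's fd directory is readable):
#   inotify_instances_by_pid | head
```

If one PID dominates the list, fixing or restarting that workload may matter more than the sysctl change.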

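Enterprise Best Practice — Tune Kubelet Timeouts, Eviction Thresholds, and containerd Settings

The generic PLEG relists about once per second, which is aggressive on nodes running 100+ containers. That interval is a compiled-in constant rather than a KubeletConfiguration field, so the practical levers are the CRI request timeout, eviction thresholds that keep nodefs and imagefs from filling up, and containerd's own settings.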

# /var/lib/kubelet/config.yaml (KubeletConfiguration)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
- # No relist tuning — using defaults
+ # Raise the CRI request timeout (default: 2m); it does not apply to long-running requests such as image pulls, logs, exec, and attach
+ runtimeRequestTimeout: 15m
evictionHard:
-  nodefs.available: "100Mi"
-  imagefs.available: "1Gi"
+  nodefs.available: "500Mi"
+  imagefs.available: "5Gi"
+  nodefs.inodesFree: "5%"

# Node-level containerd config: /etc/containerd/config.toml

[plugins."io.containerd.grpc.v1.cri"]
-  max_concurrent_downloads = 3
+  max_concurrent_downloads = 6

[plugins."io.containerd.grpc.v1.cri".containerd]
-  # default snapshotter
+  snapshotter = "overlayfs"



Prevention in CI/CD

1. Node Problem Detector (NPD) — Deploy It If You Haven't

# Deploy node-problem-detector as a DaemonSet
# It surfaces PLEG and kernel issues as Node Conditions before kubelet gives up
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14
        securityContext:
          privileged: true

2. Alert Before the 3-Minute Threshold

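# Prometheus alerting rules
- alert: KubeletPLEGDurationHigh
  # kubelet exports relist latency as a histogram, so p99 is derived from the buckets.
  # The largest finite bucket is 10s, so the comparison is >= 10 rather than > 10.
  expr: histogram_quantile(0.99, sum by (instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))) >= 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "PLEG relist p99 >= 10s on {{ $labels.instance }}; PLEG stall imminent"

- alert: KubeletPLEGNotHealthy
  # Requires kube-state-metrics; fires on any NotReady node, not only PLEG stalls
  expr: kube_node_status_condition{condition="Ready",status="false"} == 1
  for: 1m
  labels:
    severity: critical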
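
For a quick spot check without Prometheus, the same histogram can be read straight from a node's kubelet metrics. The sketch below assumes the Prometheus text format with a le="10" bucket for kubelet_pleg_relist_duration_seconds; slow_relists is a hypothetical helper name.

```shell
# Hypothetical helper: from kubelet metrics text on stdin, print how many
# relists took longer than 10s (total count minus the le="10" bucket).
slow_relists() {
  awk '
    /^kubelet_pleg_relist_duration_seconds_bucket[{].*le="10"/ { le10 = $NF }
    /^kubelet_pleg_relist_duration_seconds_count/              { total = $NF }
    END { print total - le10 }
  '
}

# Usage (assumes kubectl access and permission to proxy to the node):
#   kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | slow_relists
```

The counters are cumulative since the last kubelet restart, so compare two readings taken a few minutes apart rather than trusting a single value.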

3. Enforce Node Capacity Limits via OPA/Gatekeeper

Prevent over-scheduling that causes container runtime saturation:

# Gatekeeper Constraint to cap pods-per-node
# (assumes a ConstraintTemplate defining kind K8sMaxPodsPerNode is already installed)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sMaxPodsPerNode
metadata:
  name: max-pods-per-node
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    maxPods: 80  # Below kubelet default of 110 — leave headroom for system pods
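
To see which nodes already exceed the chosen cap, pod counts per node can be tallied from kubectl output. This is a sketch under stated assumptions: nodes_over_limit is a hypothetical helper, the input is one node name per line, and unscheduled pods (shown as <none>) are skipped. It counts every pod object, including completed ones.

```shell
# Hypothetical helper: from one node name per line on stdin, print
# "NODE COUNT" for every node with more pods than the limit in $1.
nodes_over_limit() {
  limit="$1"
  grep -v '^<none>$' | sort | uniq -c |
    awk -v limit="$limit" '$1 > limit { print $2, $1 }'
}

# Usage (assumes kubectl access):
#   kubectl get pods -A --no-headers -o custom-columns=NODE:.spec.nodeName \
#     | nodes_over_limit 80
```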

4. Add to Your Node Provisioning Pipeline (Terraform/Ansible)

# In your node bootstrap userdata / cloud-init

+ cat > /etc/sysctl.d/99-k8s-node.conf <<EOF
+ fs.inotify.max_user_watches=1048576
+ fs.inotify.max_user_instances=512
+ kernel.pid_max=4194304
+ EOF
+ sysctl -p /etc/sysctl.d/99-k8s-node.conf

Bake this into your golden AMI / node image. A node that boots without these settings on a high-density cluster is a PLEG incident waiting to happen.
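
A boot-time check can confirm that the image really carries these settings. The sketch below reads values from /proc/sys and compares them against a minimum; check_sysctl_min is a hypothetical helper name, and the optional third argument exists only so the check can be pointed at a different root directory.

```shell
# Hypothetical helper: verify a sysctl key is at least a minimum value.
#   $1 = key (e.g. fs.inotify.max_user_watches)
#   $2 = minimum acceptable value
#   $3 = sysctl root directory (defaults to /proc/sys)
# Returns 0 if OK, 1 if too low, 2 if the key cannot be read.
check_sysctl_min() {
  key="$1"; min="$2"; root="${3:-/proc/sys}"
  path="$root/$(printf '%s' "$key" | tr . /)"
  value=$(cat "$path" 2>/dev/null) || { echo "MISSING $key"; return 2; }
  if [ "$value" -ge "$min" ]; then
    echo "OK $key=$value"
  else
    echo "LOW $key=$value (want >= $min)"
    return 1
  fi
}

# Usage in a bootstrap or image validation step:
#   check_sysctl_min fs.inotify.max_user_watches 1048576
#   check_sysctl_min fs.inotify.max_user_instances 512
#   check_sysctl_min kernel.pid_max 4194304
```

Failing the image build on a non-zero return keeps an untuned node from ever joining the cluster.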
