Will restarting containerd kill my running pods?

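Usually not. With containerd, each container runs under its own containerd-shim process, so with the stock systemd unit (KillMode=process) containers keep running while the daemon restarts; kubelet simply cannot see or manage them until the CRI socket is back. Restarting dockerd is different: unless live-restore is enabled, its containers stop with the daemon. Since the node is already in NotReady state when PLEG has stalled, those pods are effectively dead to the control plane anyway, and restarting the runtime is the fastest path to recovery. If the node is still healthy and you are doing proactive maintenance, always drain it first.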

Why does PLEG stall more often on nodes running 80+ pods?

PLEG's relist loop calls the CRI (container runtime interface) to list all containers on every cycle. On high-density nodes, this CRI ListContainers call can take several seconds if containerd is under I/O pressure or if overlay filesystem operations are slow. If relist does not complete for longer than the 3-minute threshold, kubelet trips the watchdog. Reducing max pods per node, using faster storage for the imageFs, and increasing runtimeRequestTimeout are the primary mitigations.

How do I tell if this is a containerd bug vs. a kernel/OS issue?

Run 'crictl ps' directly on the node. If it hangs indefinitely, the issue is in containerd or its shim layer — restart containerd. If crictl responds normally but kubelet still reports PLEG unhealthy, check 'dmesg | tail -50' for kernel OOM kills, inotify errors, or cgroup errors. Also check 'journalctl -k' for NFS/storage timeouts if your container images or volumes are network-backed, as slow storage I/O is a common hidden cause of PLEG stalls.

How to Fix Kubelet PLEG Not Healthy: Node NotReady Debugging Guide

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

What broke: Kubelet's PLEG loop stopped syncing container runtime state. The node flipped to NotReady, evicting or orphaning all pods scheduled on it.
How to fix it: Identify whether the root cause is a container runtime deadlock (containerd/dockerd hung), inotify descriptor exhaustion, or disk/memory pressure — then restart the offending service or tune kernel parameters.

The Incident (What Does the Error Mean?)

Raw error, as it appears in journalctl -u kubelet -f (the second line is logged by kube-controller-manager, not kubelet):

E0610 03:42:17.123456    1234 pleg.go:418] PLEG is not healthy: pleg was last seen active 3m0.123456789s ago; threshold is 3m0s
W0610 03:42:17.123456    1234 node_lifecycle_controller.go:1501] Node <node-name> status is now: NodeNotReady

PLEG (Pod Lifecycle Event Generator) is the kubelet subsystem that polls the container runtime (via CRI) every plegRelistPeriod (1s; a constant in the kubelet source, not a configuration field) and generates lifecycle events (started, died, etc.) for every container. It has a hard watchdog threshold of 3 minutes. If the relist loop hasn't completed within that window, kubelet marks itself unhealthy and the node controller marks the node NotReady.
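
As a rough illustration of reading that message, the sketch below pulls the elapsed time and the threshold out of kubelet log lines. The function name pleg_stall_fields is hypothetical, and the pattern assumes the exact message wording shown in the raw error above; other kubelet versions may phrase it differently.

```shell
# Hypothetical helper: print "<elapsed> <threshold>" for each PLEG health
# message read on stdin. Assumes the message format quoted in the raw error.
pleg_stall_fields() {
  sed -n 's/.*pleg was last seen active \([^ ]*\) ago; threshold is \([^ ]*\).*/\1 \2/p'
}

# Example usage on a node (not run here):
#   journalctl -u kubelet --since "10 minutes ago" | pleg_stall_fields
```

Lines that do not contain the PLEG message are silently skipped, so the function can be fed the whole kubelet journal.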

Immediate consequence: All pods on this node lose their Ready condition. Deployments with PodDisruptionBudgets may block rollouts cluster-wide. Stateful workloads (databases, Kafka brokers) that were pinned to this node are now down.

The Attack Vector / Blast Radius

This is not a soft warning — this is a full node outage event.

Cascading failure chain:

PLEG stalls → kubelet cannot reconcile container state → pod status goes stale and the node reports NotReady
Node controller taints the node node.kubernetes.io/not-ready:NoExecute → pods without a matching toleration are evicted once the default tolerationSeconds of 300 (5 min) expires; older clusters used pod-eviction-timeout (default: 5 min) for the same purpose
DaemonSet pods tolerate that NoExecute taint indefinitely, so they are not evicted and stay bound to the unhealthy node
If this node hosts a control-plane component (etcd, kube-apiserver in self-managed clusters) → cluster quorum risk
Persistent Volumes with ReadWriteOnce remain attached to the dead node → new pods on replacement nodes cannot mount the PVC (the infamous "multi-attach" error follows)
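
To see which volumes are at risk of the multi-attach error in the last step, one option is to filter VolumeAttachment objects by node. This is a minimal sketch: attached_pvs_on_node is a hypothetical helper, and it assumes the four-column layout produced by the custom-columns expression in the usage comment.

```shell
# Hypothetical helper: given VolumeAttachment rows on stdin
# (columns: NAME NODE PV ATTACHED, with a header row), print the PVs that
# are still reported as attached to the given node.
attached_pvs_on_node() {
  node=$1
  awk -v node="$node" 'NR > 1 && $2 == node && $4 == "true" { print $3 }'
}

# Example usage (not run here):
#   kubectl get volumeattachments \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached \
#     | attached_pvs_on_node <node-name>
```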

Most common root causes ranked by frequency in production:

Root Cause	Signal
`containerd` or `dockerd` goroutine deadlock	`crictl ps` hangs; high `D`-state processes
inotify watch limit exhausted	kubelet or workload logs show inotify watch creation failing with `no space left on device`
Disk pressure (imagefs or nodefs full)	`kubectl describe node` shows `DiskPressure=True`
Zombie/unkillable containers	`crictl ps` shows containers stuck in `UNKNOWN` state
Kernel version bug with cgroups v2	Specific kernel + containerd version matrix

How to Fix It (The Solution)

Step 1: Triage in 60 Seconds

# Is the container runtime responding?
crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps

# Check for disk pressure
kubectl describe node <node-name> | grep -A5 Conditions

# Check inotify limits
cat /proc/sys/fs/inotify/max_user_watches
sysctl fs.inotify.max_user_instances

# Find zombie containers
crictl ps -a | grep -v Running

# Pull kubelet logs for the 10 minutes before the event
journalctl -u kubelet --since "10 minutes ago" | grep -E 'PLEG|relist|runtime'
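
Because a hung runtime can make crictl block indefinitely, it may help to wrap the first triage check in a time limit. A minimal sketch, assuming GNU coreutils timeout is installed; probe_runtime is a hypothetical helper name, not part of crictl.

```shell
# Hypothetical helper: run a command with a time limit and classify the result.
# Prints "responsive", "hung", or "error".
probe_runtime() {
  limit=$1
  shift
  rc=0
  timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
  case $rc in
    0)   echo responsive ;;
    124) echo hung ;;      # GNU timeout exits with 124 when the limit is hit
    *)   echo error ;;
  esac
}

# Example usage on a node (not run here):
#   probe_runtime 10 crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps
```

A "hung" result points at the runtime or its shims; "error" means the command failed quickly, which is a different problem (for example a wrong socket path).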

Basic Fix — Restart the Container Runtime

If crictl ps hangs, the runtime is most likely deadlocked; if it returns errors, the runtime is unhealthy in some other way. Either case is the most common cause of this incident.

# For containerd
systemctl restart containerd
systemctl restart kubelet

# For docker (legacy)
systemctl restart docker
systemctl restart kubelet

# Verify node recovers
kubectl get node <node-name> --watch

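⚠️ Restarting containerd normally leaves running containers alive under their shim processes, but kubelet loses contact with them until the runtime is back; restarting dockerd without live-restore stops its containers. On a node already in NotReady, this is acceptable. Do NOT do this on a healthy node without draining first.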

Fix — inotify Exhaustion (Common on High-Density Nodes)

# /etc/sysctl.d/99-kubelet-inotify.conf

- fs.inotify.max_user_watches = 8192
- fs.inotify.max_user_instances = 128
+ fs.inotify.max_user_watches = 1048576
+ fs.inotify.max_user_instances = 512

sysctl -p /etc/sysctl.d/99-kubelet-inotify.conf
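
Before raising the limits, it can be useful to see how many watches are actually registered. The sketch below counts "inotify wd:" entries in each process's fdinfo files. The helper name inotify_watches_in_use is hypothetical, the proc root is a parameter only so the function can be tried against a copy, and reading other processes' fdinfo generally requires root.

```shell
# Hypothetical helper: count inotify watches registered by all processes by
# counting "inotify wd:" lines under <proc_root>/<pid>/fdinfo/.
inotify_watches_in_use() {
  proc_root=${1:-/proc}
  # grep -c still prints 0 when nothing matches; "|| true" hides its exit code
  cat "$proc_root"/[0-9]*/fdinfo/* 2>/dev/null | grep -c '^inotify wd:' || true
}

# Example usage on a node (not run here):
#   echo "in use: $(inotify_watches_in_use) / limit: $(cat /proc/sys/fs/inotify/max_user_watches)"
```

Note that max_user_watches is enforced per user, so the node-wide total is only an indicator of how close any one user is to the limit.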

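Enterprise Best Practice: Tune Runtime Timeouts and Eviction Thresholds

The 1s relist period is aggressive on nodes running 100+ containers, but it is a constant in the kubelet source and cannot be set through KubeletConfiguration. What you can tune is how long kubelet waits on CRI calls (runtimeRequestTimeout, default 2m) and how early it evicts under disk pressure, both aimed at reducing PLEG stalls caused by slow CRI responses under load.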

# /var/lib/kubelet/config.yaml (KubeletConfiguration)

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
- # No relist tuning — using defaults
+ runtimeRequestTimeout: 15m
+ # Increase image pull / CRI operation timeout for high-density nodes
evictionHard:
-  nodefs.available: "100Mi"
-  imagefs.available: "1Gi"
+  nodefs.available: "500Mi"
+  imagefs.available: "5Gi"
+  nodefs.inodesFree: "5%"

# Node-level containerd config: /etc/containerd/config.toml

[plugins."io.containerd.grpc.v1.cri"]
-  max_concurrent_downloads = 3
+  max_concurrent_downloads = 6

[plugins."io.containerd.grpc.v1.cri".containerd]
-  # default snapshotter
+  snapshotter = "overlayfs"

Prevention in CI/CD

1. Node Problem Detector (NPD) — Deploy It If You Haven't

# Deploy node-problem-detector as a DaemonSet
# It surfaces PLEG and kernel issues as Node Conditions before kubelet gives up
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.14
        securityContext:
          privileged: true

2. Alert Before the 3-Minute Threshold

# Prometheus alerting rule
- alert: KubeletPLEGDurationHigh
  expr: histogram_quantile(0.99, sum by (instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m]))) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "PLEG relist p99 > 10s on {{ $labels.instance }}: PLEG stall imminent"

- alert: KubeletPLEGNotHealthy
  expr: kube_node_status_condition{condition="Ready",status="false"} == 1
  for: 1m
  labels:
    severity: critical
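
For a quick spot check without Prometheus, the same histogram can be read straight from a node's kubelet metrics. This is a sketch under assumptions: pleg_relist_mean is a hypothetical helper, and it reports the mean relist duration since kubelet started (sum divided by count), which is a coarser signal than the p99 used in the alert.

```shell
# Hypothetical helper: compute the mean PLEG relist duration in seconds from
# Prometheus text-format metrics on stdin, using the histogram _sum and _count.
pleg_relist_mean() {
  awk '
    $1 ~ /^kubelet_pleg_relist_duration_seconds_sum/   { sum = $2 }
    $1 ~ /^kubelet_pleg_relist_duration_seconds_count/ { count = $2 }
    END { if (count > 0) printf "%.3f\n", sum / count; else print "n/a" }
  '
}

# Example usage (not run here):
#   kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | pleg_relist_mean
```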

3. Enforce Node Capacity Limits via OPA/Gatekeeper

Prevent over-scheduling that causes container runtime saturation:

# Gatekeeper Constraint to cap pods-per-node
# (assumes a custom K8sMaxPodsPerNode ConstraintTemplate is already installed)
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sMaxPodsPerNode
metadata:
  name: max-pods-per-node
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    maxPods: 80  # Below kubelet default of 110 — leave headroom for system pods

4. Add to Your Node Provisioning Pipeline (Terraform/Ansible)

# In your node bootstrap userdata / cloud-init

+ cat > /etc/sysctl.d/99-k8s-node.conf <<EOF
+ fs.inotify.max_user_watches=1048576
+ fs.inotify.max_user_instances=512
+ kernel.pid_max=4194304
+ EOF
+ sysctl -p /etc/sysctl.d/99-k8s-node.conf

Bake this into your golden AMI / node image. A node that boots without these settings on a high-density cluster is a PLEG incident waiting to happen.
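
One way to catch an image that shipped without these settings is a small check in the image build or node bootstrap pipeline. The sketch below is an assumption about how such a check could look, not an established tool: check_sysctl_min is a hypothetical helper, and the optional third argument exists only so the function can be pointed at a test directory instead of /proc/sys.

```shell
# Hypothetical helper: verify that a sysctl is at least a minimum value.
# Reads <sys_root>/<key with dots replaced by slashes>; defaults to /proc/sys.
check_sysctl_min() {
  key=$1
  min=$2
  sys_root=${3:-/proc/sys}
  path="$sys_root/$(printf '%s' "$key" | tr . /)"
  value=$(cat "$path" 2>/dev/null) || { echo "MISSING $key"; return 1; }
  if [ "$value" -ge "$min" ]; then
    echo "OK $key=$value"
  else
    echo "LOW $key=$value (want >= $min)"
    return 1
  fi
}

# Example usage in a bootstrap or image validation step (not run here):
#   check_sysctl_min fs.inotify.max_user_watches 1048576
#   check_sysctl_min fs.inotify.max_user_instances 512
#   check_sysctl_min kernel.pid_max 4194304
```

The function returns non-zero on a low or missing value, so a pipeline step can fail the build when a node image does not meet the minimums.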