How to Fix Kubelet Garbage Collection Failures Caused by Device or Resource Busy Locks

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: Kubelet's garbage collection loop cannot remove stopped/dead containers because the overlay or devicemapper mount is still held open by a leaked process, a stale bind mount, or a zombie containerd/docker shim.
  • How to fix it: Identify the blocking PID with lsof or fuser, kill or unmount it, then force-trigger GC or restart the container runtime.

The Incident (What Does the Error Mean?)

Raw kubelet log output:

E0612 03:14:22.871042    1423 image_gc_manager.go:305] Failed to garbage collect required amount of images. Wanted to free 5368709120 bytes, but only found 0 bytes eligible to free.
E0612 03:14:22.871198    1423 container_gc.go:85]  Failed to garbage collect containers: device or resource busy
E0612 03:14:22.871301    1423 kubelet.go:1302] Image garbage collection failed multiple times in a row: failed to garbage collect required amount of images

Immediate consequence: Kubelet's GC loop is stuck. Dead containers accumulate, and their overlay mounts pile up under /var/lib/containerd or /var/lib/docker/overlay2. The node runs out of disk space or inodes, and pods on it start failing with FailedMount or Evicted events. In severe cases the node goes NotReady.
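
To confirm that a node is in this failure mode rather than under ordinary disk pressure, it can help to count the signatures quoted above in the kubelet log. The following is a minimal sketch, not an official tool: the function name gc_busy_summary is hypothetical, and it only matches the exact message text shown in this article.

```shell
# gc_busy_summary (hypothetical helper): reads kubelet log lines on stdin and
# prints how many match each failure signature quoted in this article.
gc_busy_summary() {
  awk '
    /Failed to garbage collect containers/ && /device or resource busy/ { busy++ }
    /Failed to garbage collect required amount of images/               { image++ }
    END { printf "container_gc_busy=%d image_gc_failed=%d\n", busy, image }
  '
}
```

Assuming kubelet runs as a systemd unit named kubelet, a typical call would be: journalctl -u kubelet --since "1 hour ago" | gc_busy_summary. A non-zero container_gc_busy count points at the EBUSY problem described here.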


The Attack Vector / Blast Radius

This is a cascading disk exhaustion failure, not a one-pod problem.

  1. Overlay FS leak: Each stopped container leaves an overlay mount. If a process (log forwarder, sidecar shim, nsenter session left open by an operator) holds a file descriptor into that overlay directory, the kernel returns EBUSY on umount2(). The runtime cannot remove the container, so kubelet's GC pass fails and retries on the next cycle.

  2. Blast radius escalates fast:

    • Disk fills → new image pulls fail → ImagePullBackOff cluster-wide on that node.
    • Inode exhaustion hits before raw disk in environments with many small files (log-heavy workloads). df -i will show 100% inode usage while df -h shows free space — a notoriously confusing failure mode.
    • Kubelet's eviction manager cannot evict pods fast enough if GC is blocked, triggering a hard node pressure condition.
    • If this node is a control plane node, etcd or API server pods may fail to restart after eviction.

  3. Common culprits:

    • Fluent Bit / Fluentd tailing logs inside a dead container's overlay directory.
    • A kubectl exec or kubectl debug session left open by an operator.
    • Stale containerd-shim or docker-containerd-shim processes not reaped after container exit.
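    • auditd or a security agent (Falco, Sysdig) holding open file descriptors on the overlay mount, for example while scanning files. A plain inotify watch alone does not normally keep a mount busy.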
    • NFS or CSI volume not unmounted cleanly before container termination.
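
The inode case above is easy to miss because df -h looks healthy. As a rough sketch, the helper below filters df -Pi output for filesystems whose inode usage is at or above a threshold. The name inode_pressure and the default threshold of 90 are assumptions made for illustration, and mount points containing spaces are not handled.

```shell
# inode_pressure (hypothetical helper): reads `df -Pi` output on stdin and
# prints "<mountpoint> <IUse%>" for every filesystem whose inode usage is at
# or above the threshold (default 90). Rows without a numeric IUse% are skipped.
inode_pressure() {
  local threshold="${1:-90}"
  awk -v t="$threshold" '
    NR > 1 && $5 ~ /^[0-9]+%$/ {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= t + 0) print $6, $5
    }
  '
}
```

Usage: df -Pi | inode_pressure 90. Any line it prints is a filesystem worth checking before trusting the free-space figure from df -h.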

How to Fix It

Step 1: Identify the blocking mount and PID

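# List every overlay mount on the node (running and dead containers alike);
# match on the filesystem type column so unrelated paths containing "overlay" are skipped
awk '$3 == "overlay" {print $2}' /proc/mounts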

# Find what is holding each busy mount
# Replace <mount_path> with the path from above
sudo lsof +D <mount_path> 2>/dev/null
sudo fuser -vm <mount_path> 2>/dev/null

# For containerd: list dead containers
sudo crictl ps -a --state Exited
sudo crictl ps -a --state Created

# For docker runtime:
sudo docker ps -a --filter status=exited --filter status=dead
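
lsof and fuser report open files and working directories. A mount can also stay busy because it is still listed in another process's mount namespace, which those tools may not make obvious. The sketch below scans /proc/<pid>/mountinfo for that case. The function name pids_pinning_mount is hypothetical, and the suffix match is an assumption: another namespace may see the same mount under a prefix such as /host.

```shell
# pids_pinning_mount (hypothetical helper): prints the PIDs whose mount table
# still lists the given mount point. Field 5 of /proc/<pid>/mountinfo is the
# mount point as that process sees it; it is compared as a suffix because a
# different mount namespace may see the path under a prefix such as /host.
# The optional second argument (proc root, default /proc) exists only so the
# function can be exercised against a fake directory tree.
pids_pinning_mount() {
  local target="$1" proc_root="${2:-/proc}" f pid
  for f in "$proc_root"/[0-9]*/mountinfo; do
    [ -r "$f" ] || continue
    if awk -v t="$target" '
         { n = length($5) - length(t)
           if (n >= 0 && substr($5, n + 1) == t) found = 1 }
         END { exit !found }
       ' "$f" 2>/dev/null; then
      pid="${f%/mountinfo}"
      echo "${pid##*/}"
    fi
  done
}
```

Run it as root so every process's mountinfo is readable, for example: pids_pinning_mount <mount_path>. PIDs that appear here but not in the lsof output are candidates for a leaked mount namespace rather than an open file.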

Step 2: Kill the blocking process or force-unmount

# Kill the offending PID (replace <PID>). Try SIGTERM first;
# fall back to `sudo kill -9 <PID>` only if the process ignores it.
sudo kill <PID>

# If the process is gone but mount is still busy (kernel ref count):
sudo umount -l <mount_path>   # lazy unmount — detaches from namespace immediately

# Remove exited containers via crictl
# (xargs -r skips the call when the list is empty, so crictl rm never runs without arguments)
sudo crictl ps -a --state Exited -q | xargs -r sudo crictl rm
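
Escalating from SIGTERM to SIGKILL gives the blocking process a chance to close its file descriptors cleanly before it is forced out. A minimal sketch of that escalation follows. The name term_then_kill and the 10-second default grace period are assumptions, not values taken from kubelet or containerd.

```shell
# term_then_kill (hypothetical helper): sends SIGTERM, waits up to <grace>
# seconds for the process to exit, and only then falls back to SIGKILL.
# Usage: term_then_kill <PID> [grace_seconds]
term_then_kill() {
  local pid="$1" grace="${2:-10}" waited=0
  # If the signal cannot be delivered, the process is already gone
  # (or belongs to another user); there is nothing more to do here.
  kill -TERM "$pid" 2>/dev/null || return 0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

The function has to run with enough privilege to signal the target, so on a node it would typically be sourced into a root shell rather than prefixed with sudo.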

Basic Fix — Restart the container runtime to reap all stale shims

# containerd
sudo systemctl restart containerd
sudo systemctl restart kubelet

# docker (only on Kubernetes older than 1.24; dockershim was removed in 1.24, so migrate off it)
sudo systemctl restart docker
sudo systemctl restart kubelet

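⚠️ Restarting containerd and kubelet briefly interrupts pod management on the node: running containers normally keep running under their shims, but nothing can be started, stopped, or exec'd into until both services are back. Cordon first in production: kubectl cordon <node>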
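
The cordon, restart, and uncordon sequence can be wrapped so that the node is never uncordoned after a failed restart. This is a sketch under stated assumptions: the helper names run and restart_runtime_cordoned are hypothetical, the runtime is containerd, and kubectl and systemctl are both usable from the same shell, which is often not the case on locked-down nodes.

```shell
# run (hypothetical helper): executes its arguments, or only prints them
# when DRY_RUN is set to a non-empty value.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# restart_runtime_cordoned (hypothetical helper): cordon the node, restart
# containerd and kubelet, then uncordon. The && chain stops at the first
# failing step, so a failed restart leaves the node cordoned.
restart_runtime_cordoned() {
  local node="$1"
  run kubectl cordon "$node" &&
    run sudo systemctl restart containerd &&
    run sudo systemctl restart kubelet &&
    run kubectl uncordon "$node"
}
```

Running DRY_RUN=1 restart_runtime_cordoned <node> prints the four commands without executing them, which is a cheap way to review the sequence before touching a production node.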

Enterprise Best Practice — Tune kubelet GC thresholds and add eviction safeguards

The default kubelet GC config is too permissive for high-churn workloads. Patch your kubelet config:

# /var/lib/kubelet/config.yaml
  imageGCHighThresholdPercent: 85
- imageGCLowThresholdPercent: 80
+ imageGCLowThresholdPercent: 70

- containerLogMaxSize: "10Mi"
+ containerLogMaxSize: "5Mi"
- containerLogMaxFiles: 5
+ containerLogMaxFiles: 3

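+ evictionHard:
+   # Setting evictionHard can drop the defaults for signals that are not listed,
+   # so the default memory threshold is restated here.
+   memory.available: "100Mi"
+   nodefs.available: "10%"
+   nodefs.inodesFree: "5%"
+   imagefs.available: "15%"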
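
Kubelet rejects a config in which the low image GC threshold is not below the high one, so it is worth checking the two values before restarting kubelet. The sketch below is a line-oriented scan, not a YAML parser. The function name check_image_gc_thresholds is hypothetical, and it assumes each key sits on its own line as "key: value".

```shell
# check_image_gc_thresholds (hypothetical helper): reads a kubelet config file
# and verifies imageGCLowThresholdPercent < imageGCHighThresholdPercent.
# Keys that are absent fall back to the kubelet defaults (high 85, low 80).
# Prints the effective values; exit status is 0 only when low < high.
check_image_gc_thresholds() {
  awk '
    BEGIN { high = 85; low = 80 }
    $1 == "imageGCHighThresholdPercent:" { high = $2 + 0 }
    $1 == "imageGCLowThresholdPercent:"  { low  = $2 + 0 }
    END {
      printf "high=%d low=%d\n", high, low
      exit !(low < high)
    }
  ' "$1"
}
```

Usage: check_image_gc_thresholds /var/lib/kubelet/config.yaml && sudo systemctl restart kubelet. The wider the gap between the two values, the more image data each GC pass tries to reclaim.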

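For containerd, ensure shims are not disabled. These keys belong to the legacy io.containerd.runtime.v1.linux runtime, which was deprecated in containerd 1.4 and removed in 2.0, so they only matter on nodes that still use it: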

# /etc/containerd/config.toml
  [plugins."io.containerd.runtime.v1.linux"]
-   shim_debug = true
+   shim_debug = false
+   no_shim = false

For Fluent Bit (common culprit) — ensure it does not tail into overlay paths directly:

  [INPUT]
      Name              tail
-     Path              /var/lib/containerd/io.containerd.runtime.v2.task/*/log
+     Path              /var/log/containers/*.log
+     DB                /var/log/flb_kube.db
+     Rotate_Wait       5
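
A quick audit of an existing Fluent Bit config can catch this before it reaches a node. The helper below is a rough sketch for the classic (INI-style) config format only. The name risky_tail_paths is hypothetical, and the two directory prefixes it flags are the runtime data directories named in this article, not an exhaustive list.

```shell
# risky_tail_paths (hypothetical helper): prints every Path value in a
# Fluent Bit classic-format config that points under /var/lib/containerd/
# or /var/lib/docker/. Comma-separated Path lists are matched as a whole.
# Exit status is 0 when no such path is found, 1 otherwise.
risky_tail_paths() {
  awk '
    tolower($1) == "path" && $2 ~ /(^|,)\/var\/lib\/(containerd|docker)\// {
      print $2
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Usage: risky_tail_paths <fluent-bit.conf> || echo "tail input reads from runtime storage". The non-zero exit status makes it usable as a CI gate on the config file.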


Prevention in CI/CD

1. Node-level monitoring — alert before disk fills

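# Prometheus alerting rules. The first rule assumes /var/lib/containerd is a
# separate mount; if it lives on the root filesystem, match mountpoint="/" instead.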
- alert: NodeDiskPressureImminent
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/var/lib/containerd"} /
     node_filesystem_size_bytes{mountpoint="/var/lib/containerd"}) < 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} containerd disk < 15%"

- alert: NodeInodeExhaustion
  expr: |
    node_filesystem_files_free{mountpoint="/"} /
    node_filesystem_files{mountpoint="/"} < 0.05
  for: 2m
  labels:
    severity: critical

2. OPA/Gatekeeper: require an ephemeral-storage limit on every container

package k8srequiredloglimits

violation[{"msg": msg}] {
  container := input.review.object.spec.containers[_]
  not container.resources.limits["ephemeral-storage"]
  msg := sprintf("Container '%v' must set ephemeral-storage limit", [container.name])
}

3. Checkov / kube-linter in CI pipeline

# Add to your GitHub Actions / GitLab CI
kube-linter lint ./k8s-manifests/ \
  --include unset-cpu-requirements,unset-memory-requirements,no-read-only-root-fs

checkov -d ./k8s-manifests \
  --check CKV_K8S_11,CKV_K8S_13  # CPU/memory limits enforced

4. Periodic node hygiene CronJob

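# NOTE: a CronJob creates one pod per run on whichever node the scheduler picks,
# so each run cleans a single node. Pin it with nodeName/nodeSelector, or run the
# same script from a DaemonSet, to cover specific nodes.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: containerd-gc-assist
  namespace: kube-system
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true
          tolerations:
          - operator: Exists
          containers:
          - name: gc
            # Stock Alpine does not ship crictl; substitute an image that includes it.
            image: alpine:3.19
            securityContext:
              privileged: true
            env:
            # Assumes the default containerd socket path on the host.
            - name: CONTAINER_RUNTIME_ENDPOINT
              value: unix:///run/containerd/containerd.sock
            volumeMounts:
            - name: containerd-sock
              mountPath: /run/containerd/containerd.sock
            command:
            - /bin/sh
            - -c
            - |
              # xargs -r skips the crictl call when there is nothing to remove
              crictl pods --state NotReady -q | xargs -r crictl rmp --force || true
              crictl ps -a --state Exited -q | xargs -r crictl rm || true
          restartPolicy: OnFailure
          nodeSelector:
            kubernetes.io/os: linux
          volumes:
          - name: containerd-sock
            hostPath:
              path: /run/containerd/containerd.sock
              type: Socket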

Long-term: If this node runs high-churn batch workloads, dedicate a separate imagefs disk mounted at /var/lib/containerd to isolate container storage from root filesystem pressure entirely.
