
How to Fix 'No Space Left on Device' During Docker Build on overlay2 (90% Disk Usage)

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–30 mins


TL;DR

  • What broke: docker build --no-cache exhausted the host filesystem. Under overlay2, /var/lib/docker accumulates unreferenced layer directories, dangling images, and BuildKit's content-addressable cache. --no-cache makes it worse by bypassing layer reuse and writing fresh layers on every run.
  • How to fix it: Run docker system prune -af --volumes, purge the BuildKit cache with docker builder prune -af, then reclaim /var/lib/docker/overlay2 orphans manually if needed.

The Incident (What Does the Error Mean?)

Raw error during docker build --no-cache -t myapp:latest .:

Step 14/22 : RUN npm ci
no space left on device
error building image: error building stage: failed to solve with frontend dockerfile.v0:
failed to read dockerfile: failed to create temp dir: no space left on device

or from BuildKit:

 => ERROR [build 7/9] RUN apt-get update && apt-get install -y build-essential
------
 > [build 7/9] RUN apt-get update && apt-get install -y build-essential:
------
executor failed running [/bin/sh -c apt-get update ...]: exit code: 1
Error: error from sender: open /proc/self/fd/...: no space left on device

Immediate consequence: The build daemon cannot allocate new overlay2 upper-dir layers. The filesystem backing /var/lib/docker (the root filesystem or a dedicated mount) has run out of free blocks or free inodes. Every subsequent docker command that writes to disk, including docker pull, will fail. If this is a CI runner, the agent itself may crash.


The Attack Vector / Blast Radius

This is not a one-time spike. It is a compounding failure mode:

  1. --no-cache forces full layer rewrite on every build. Each run writes a complete new set of overlay2 layers. Without cache reuse, old intermediate layers are not referenced but are not immediately GC'd.
  2. BuildKit's content-addressable store (/var/lib/docker/buildkit/) is not reliably kept in check. Unless builder garbage collection is enabled and sized for the host, it accumulates blobs, snapshots, and metadata across every build invocation.
  3. Dangling images (<none>:<none>) from prior failed or replaced builds consume gigabytes silently. docker images -f dangling=true routinely shows 10–40 GB on active CI hosts.
  4. Container log files under /var/lib/docker/containers/<id>/ are unbounded by default. A single chatty container can consume the remaining free space.
  5. Cascading failure: Once the disk hits 100%, containerd and dockerd themselves can crash, taking down all running containers on the host — including unrelated production workloads if this is a shared node.

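On Kubernetes nodes using overlay2, the kubelet reports the DiskPressure condition once its eviction thresholds are crossed. The node is tainted and the kubelet starts evicting pods from that node.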


How to Fix It

Step 1 — Diagnose First (30 seconds)

# Where is space going? Check blocks and inodes; running out of either produces this error.
df -h /var/lib/docker
df -i /var/lib/docker
docker system df -v

# Find the biggest offenders
du -sh /var/lib/docker/overlay2
du -sh /var/lib/docker/buildkit
du -sh /var/lib/docker/containers

# Dangling images, largest first (size and image ID)
docker images -f dangling=true --format '{{.Size}} {{.ID}}' | sort -rh | head
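
To put a single number on the dangling-image total, the per-image sizes can be summed. The following is a minimal sketch, assuming docker prints decimal units (B, kB, MB, GB, TB) with no space between number and unit; sum_sizes is a hypothetical helper name, not a docker command.

```shell
#!/bin/sh
# sum_sizes: read one human-readable size per line on stdin
# (e.g. "1.2GB", "500MB", "12kB", "300B") and print the total in GB.
# Assumption: decimal units, 1 kB = 1000 B.
sum_sizes() {
  awk '
    {
      n = $1 + 0                    # awk keeps the leading numeric part
      unit = $1
      sub(/^[0-9.]+/, "", unit)     # what remains is the unit suffix
      u = toupper(unit)
      mult = 1                      # plain bytes or unknown suffix
      if (u == "KB") mult = 1000
      else if (u == "MB") mult = 1000 ^ 2
      else if (u == "GB") mult = 1000 ^ 3
      else if (u == "TB") mult = 1000 ^ 4
      total += n * mult
    }
    END { printf "%.2f GB\n", total / (1000 ^ 3) }
  '
}

# Usage on a Docker host:
#   docker images -f dangling=true --format '{{.Size}}' | sum_sizes
```

Only the first field of each line is read, so the output of a format string that also prints the image ID after the size can be piped in unchanged.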

Basic Fix — Emergency Reclaim

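# Nuclear option: removes stopped containers, every image not used by a container,
# unused networks, build cache, and unused volumes (anonymous volumes only on Docker 23.0+).
# Volume data is deleted permanently.
docker system prune -af --volumes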

# Purge BuildKit cache specifically (often 5–20 GB)
docker builder prune -af

# Truncate all container logs in place (needs root, and the glob must expand as root too)
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'

# Verify
df -h /var/lib/docker
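
For the log step, it can help to see how much each log held before it is emptied. The following is a rough sketch, assuming the default json-file layout under the Docker root; truncate_container_logs is a hypothetical helper name, and it has to run as root on a real host.

```shell
#!/bin/sh
# truncate_container_logs [DOCKER_ROOT]
# Empties every *-json.log under DOCKER_ROOT/containers and prints
# "<bytes reclaimed> <path>" for each file it touched.
truncate_container_logs() {
  root=${1:-/var/lib/docker}
  for log in "$root"/containers/*/*-json.log; do
    [ -f "$log" ] || continue       # glob matched nothing
    bytes=$(wc -c < "$log")
    : > "$log"                      # truncate in place, same inode
    echo "$((bytes)) $log"
  done
}

# Usage (as root):
#   truncate_container_logs /var/lib/docker
```

The files are truncated rather than deleted because the daemon keeps them open: removing an open log would not free its blocks until the container stops.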

Enterprise Best Practice — Dockerfile Refactor + Daemon Config

Problem: Fat single-stage Dockerfile bloating layer count

- FROM node:20
- COPY . .
- RUN npm install
- RUN npm run build
- RUN apt-get update && apt-get install -y curl git vim

+ FROM node:20-alpine AS builder
+ WORKDIR /app
+ COPY package*.json ./
+ RUN npm ci
+ COPY . .
+ # Build with devDependencies present, then drop them before the runtime copy
+ RUN npm run build && npm prune --omit=dev
+
+ FROM node:20-alpine AS runtime
+ WORKDIR /app
+ COPY --from=builder /app/dist ./dist
+ COPY --from=builder /app/node_modules ./node_modules
+ # No dev tools, no source, no cache debris in final image

Problem: No log rotation or BuildKit GC limits configured in daemon.json

# /etc/docker/daemon.json
 {
-  "storage-driver": "overlay2"
+  "storage-driver": "overlay2",
+  "log-driver": "json-file",
+  "log-opts": {
+    "max-size": "10m",
+    "max-file": "3"
+  },
+  "builder": {
+    "gc": {
+      "enabled": true,
+      "defaultKeepStorage": "10GB",
+      "policy": [
+        { "keepStorage": "10GB", "filter": ["unused-for=48h"] },
+        { "keepStorage": "5GB",  "all": true }
+      ]
+    }
+  }
 }

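Apply with: systemctl restart docker. The restart stops running containers unless live-restore is enabled, and the new log-opts apply only to containers created afterwards; existing containers keep their old logging settings until they are recreated.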
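
With max-size 10m and max-file 3, each container's json-file logs are capped at roughly 30 MB. A quick way to size the worst case for a host is to multiply that out. This is a small arithmetic sketch; max_log_mb is a hypothetical helper name and the container counts are illustrative.

```shell
#!/bin/sh
# max_log_mb CONTAINERS MAX_SIZE_MB MAX_FILE
# Worst-case json-file log footprint in MB once rotation limits apply.
max_log_mb() {
  echo $(( $1 * $2 * $3 ))
}

max_log_mb 1 10 3    # one container: 30
max_log_mb 50 10 3   # fifty containers: 1500
```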

Problem: CI pipeline runs --no-cache unconditionally

- docker build --no-cache -t myapp:$CI_COMMIT_SHA .

+ docker build \
+   --cache-from myapp:latest \
+   --build-arg BUILDKIT_INLINE_CACHE=1 \
+   -t myapp:$CI_COMMIT_SHA \
+   -t myapp:latest .
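
If a clean rebuild is still wanted occasionally (for example on a nightly schedule), --no-cache can be made opt-in instead of unconditional. The following is a sketch under assumptions: FORCE_CLEAN and CACHE_IMAGE are hypothetical variable names for this example, not Docker or CI built-ins.

```shell
#!/bin/sh
# build_flags: print the cache-related flags for `docker build`.
# FORCE_CLEAN=1 requests a full rebuild; any other value reuses the cache.
# CACHE_IMAGE defaults to myapp:latest (both names are assumptions).
build_flags() {
  if [ "${FORCE_CLEAN:-0}" = "1" ]; then
    echo "--no-cache"
  else
    echo "--cache-from ${CACHE_IMAGE:-myapp:latest} --build-arg BUILDKIT_INLINE_CACHE=1"
  fi
}

# Usage in a CI job (word splitting of the flags is intentional):
#   docker build $(build_flags) -t "myapp:$CI_COMMIT_SHA" -t myapp:latest .
```

Regular pipelines then reuse layers, and only jobs that set FORCE_CLEAN=1 pay the disk cost of a full rewrite.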



Prevention in CI/CD

1. Add a pre-build disk check to your pipeline

# .gitlab-ci.yml (adapt the same script for a GitHub Actions step)
before_script:
  - |
    USAGE=$(df -P /var/lib/docker | awk 'NR==2 {print $5}' | tr -d '%')
    if [ "$USAGE" -gt 80 ]; then
      echo "[WARN] Docker disk at ${USAGE}%. Running prune."
      docker system prune -f
      docker builder prune -f --keep-storage 5GB
    fi
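
The same check can be split into small functions so the parsing is testable outside CI. This is a minimal sketch, assuming POSIX-format df output (df -P), which keeps each filesystem on a single line even when the device name is long; usage_pct and needs_prune are hypothetical helper names.

```shell
#!/bin/sh
# usage_pct: read `df -P` output on stdin and print the use percentage
# of the first listed filesystem as a bare integer.
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# needs_prune PCT THRESHOLD: succeed when PCT is strictly above THRESHOLD.
needs_prune() {
  [ "$1" -gt "$2" ]
}

# Usage on a Docker host:
#   pct=$(df -P /var/lib/docker | usage_pct)
#   if needs_prune "$pct" 80; then docker system prune -f; fi
```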

2. Catch layer-bloating anti-patterns with linters

# Hadolint: lints the Dockerfile for layer-bloating anti-patterns before they hit CI
docker run --rm -i hadolint/hadolint < Dockerfile

# Dockle: CIS benchmark and best-practice checks on the built image
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  goodwithtech/dockle myapp:latest

3. OPA/Conftest policy — block builds without multi-stage

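# policy/dockerfile.rego
# Run with: conftest test --namespace dockerfile Dockerfile
# (conftest evaluates package main unless a namespace is given)
package dockerfile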

deny[msg] {
  input[i].Cmd == "from"
  count([s | input[s].Cmd == "from"]) == 1
  input[j].Cmd == "run"
  contains(input[j].Value[0], "apt-get install")
  msg := "Single-stage build with apt installs detected. Use multi-stage to reduce layer bloat."
}

4. Mount /var/lib/docker on a dedicated volume

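On cloud VMs (EC2, GCP, Azure), do not let Docker share the root volume. Mount a dedicated data volume of 100 GB or more at /var/lib/docker (on EC2, for example, a gp3 EBS volume). Set CloudWatch/Prometheus alerts at 70% and 85% utilization rather than 95%, so there is time to prune before builds fail.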

# Alert threshold in Prometheus
- alert: DockerDiskPressure
  expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/docker"} / node_filesystem_size_bytes{mountpoint="/var/lib/docker"}) < 0.20
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Docker storage below 20% free on {{ $labels.instance }}"
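
The rule above compares the free-space ratio against 0.20, which corresponds to 80% used. To alert at the 70% and 85% utilization levels instead, the used percentage has to be converted into the free ratio that goes on the right of the "<". A small worked sketch; free_threshold is a hypothetical helper name.

```shell
#!/bin/sh
# free_threshold USED_PCT
# Print the free-space ratio equivalent to a "percent used" alert level.
free_threshold() {
  awk -v used="$1" 'BEGIN { printf "%.2f\n", (100 - used) / 100 }'
}

free_threshold 70   # prints 0.30 (warning tier)
free_threshold 85   # prints 0.15 (critical tier)
```

One way to use this is two rules that differ only in the threshold and the severity label.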
