How to Fix 'dockerd failed to start: pid file /var/run/docker.pid exists' on Linux

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 3 mins

TL;DR

  • What broke: dockerd found a leftover /var/run/docker.pid from a previous crash or SIGKILL and refused to start, to avoid double-daemon corruption.
  • How to fix it: Verify no live dockerd process owns that PID, delete the stale file, and restart the service.

The Incident (What Does the Error Mean?)

Raw error surface — typically visible in systemctl status docker or journalctl -xe:

Failed to start Docker Application Container Engine.
dockerd failed to start: pid file found, ensure docker is not running or delete /var/run/docker.pid
-- or --
Error starting daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid

Immediate consequence: The Docker daemon is down. Every docker CLI call returns Cannot connect to the Docker daemon at unix:///var/run/docker.sock, so containers on the host can no longer be started, stopped, or inspected. On a production host this means zero new container starts, failed health checks, and, on a Swarm node, the node being marked Down (a manager node additionally stops counting toward the Raft quorum).


The Attack Vector / Blast Radius

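This is a split-brain protection mechanism gone wrong. dockerd writes its PID to /var/run/docker.pid on startup and removes it on clean shutdown. If the host OOM-killed dockerd or a sysadmin ran kill -9, the file survives. After a power loss or reboot it survives only where /var/run is on persistent storage; on most current distributions /var/run points at the tmpfs /run, which is emptied at boot.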

Cascading failure chain:

  1. dockerd crashes hard (or the host reboots with /var/run on persistent storage) → PID file not cleaned up.
  2. systemd attempts auto-restart → blocked immediately by PID file check.
  3. Restart= retries hit StartLimitBurst → systemd marks the unit failed, stops retrying.
  4. Any orchestration layer (Swarm, Nomad, custom health scripts) waiting on /var/run/docker.sock times out.
  5. Dependent services (log shippers, monitoring agents running as containers) go dark silently.
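
To confirm that step 3 is what happened on a given host, the journal can be checked for systemd's rate-limit message. This is a minimal sketch, assuming the unit is named docker.service and that your systemd version logs one of the two strings below; the function name hit_start_limit is illustrative, not part of any tool.

```shell
#!/bin/sh
# hit_start_limit: reads journal text on stdin.
# Returns 0 if it contains a message indicating systemd stopped retrying
# because the start rate limit was reached, non-zero otherwise.
# Assumption: the wording below matches your systemd version.
hit_start_limit() {
  grep -Eq 'Start request repeated too quickly|start-limit-hit'
}

# Usage on a live host (requires systemd and journal access):
#   journalctl -u docker.service -b --no-pager | hit_start_limit \
#     && echo "docker.service hit its start limit"
```

Once the stale PID file has been dealt with, sudo systemctl reset-failed docker.service clears the failed state and the rate-limit counter so the unit can be started again.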

The dangerous assumption: Engineers sometimes see this and blindly rm the PID file without checking whether a hung or orphaned dockerd process is still running under that PID. Starting a second daemon instance against the same storage graph (/var/lib/docker) can cause overlay filesystem corruption.


How to Fix It (The Solution)

Step 1 — Verify No Live dockerd Process Exists

# Check if the PID in the file maps to a real, running dockerd
cat /var/run/docker.pid
# e.g., outputs: 3821

ps -p 3821 -o comm=
# If output is blank or returns a non-docker process, the file is stale.
# If it returns 'dockerd', a daemon IS running — do NOT delete the file.
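
The three outcomes above can be wrapped in one helper so the check is repeatable. This is a sketch under stated assumptions: it reads /proc/<pid>/comm instead of calling ps (so it assumes Linux procfs), and the function name classify_pidfile and its output labels are hypothetical.

```shell
#!/bin/sh
# classify_pidfile FILE [NAME]
# Prints one of:
#   missing  - FILE does not exist
#   invalid  - FILE is empty or does not contain a plain number
#   stale    - no process currently has the recorded PID
#   foreign  - the PID belongs to a process that is not NAME (PID reuse)
#   live     - a process called NAME owns the PID: do NOT delete FILE
classify_pidfile() {
  file=$1
  name=${2:-dockerd}
  [ -e "$file" ] || { echo missing; return 0; }
  pid=$(tr -d '[:space:]' < "$file" 2>/dev/null)
  case $pid in
    ''|*[!0-9]*) echo invalid; return 0 ;;
  esac
  # /proc/<pid>/comm holds the executable name of a running process.
  comm=$(cat "/proc/$pid/comm" 2>/dev/null)
  if [ -z "$comm" ]; then
    echo stale
  elif [ "$comm" = "$name" ]; then
    echo live
  else
    echo foreign
  fi
}

# Usage:
#   classify_pidfile /var/run/docker.pid
```

Only the stale and foreign results indicate a file that is safe to remove; live means a daemon is running and needs investigating first.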

Basic Fix — Remove Stale PID File

- # Blindly restarting without checking:
- sudo systemctl restart docker

+ # Safe remediation sequence:
+ PID=$(cat /var/run/docker.pid 2>/dev/null)
+ if [ -z "$PID" ]; then
+   # Missing or empty PID file: the start failure has a different cause.
+   echo "No usable PID file found. Check journalctl -u docker instead."
+ elif ps -p "$PID" -o comm= | grep -qx dockerd; then
+   echo "Live dockerd process detected (PID $PID). Investigate before proceeding."
+ else
+   echo "Stale PID file confirmed. Removing."
+   sudo rm -f /var/run/docker.pid
+   sudo systemctl start docker
+ fi

Enterprise Best Practice — systemd ExecStartPre Cleanup

Do not rely on manual intervention. Patch the systemd unit via a drop-in override so the daemon self-heals on every restart:

# File: /etc/systemd/system/docker.service.d/pid-cleanup.conf

- # No override — default unit has no PID cleanup logic

+ [Service]
+ # systemd substitutes $VAR on Exec lines itself, so shell variables are escaped as $$
+ ExecStartPre=-/bin/sh -c 'PID=$$(cat /var/run/docker.pid 2>/dev/null); \
+   [ -n "$$PID" ] && ! ps -p "$$PID" -o comm= | grep -qx dockerd && rm -f /var/run/docker.pid || true'

Apply it:

sudo mkdir -p /etc/systemd/system/docker.service.d
# Write the drop-in above, then:
sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl status docker

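The - prefix on the command (written as ExecStartPre=-) tells systemd to treat a non-zero exit from the cleanup command as non-fatal, so a failed cleanup never blocks the daemon from starting. What protects a legitimately running daemon is the PID check inside the command: the file is removed only when no dockerd process owns the recorded PID.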
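
Before shipping the drop-in fleet-wide, the same keep-or-remove decision can be exercised against a scratch file rather than the real /var/run/docker.pid. The sketch below is an illustration only: cleanup_stale_pidfile is a hypothetical helper, it checks /proc/<pid>/comm rather than ps, and it assumes a Linux host.

```shell
#!/bin/sh
# cleanup_stale_pidfile FILE [NAME]
# Removes FILE only when the PID recorded in it is NOT owned by a process
# called NAME (default: dockerd). Leaves FILE alone if it is missing,
# empty, non-numeric, or owned by a live NAME process.
cleanup_stale_pidfile() {
  file=$1
  name=${2:-dockerd}
  pid=$(cat "$file" 2>/dev/null)
  case $pid in
    ''|*[!0-9]*) return 0 ;;   # nothing usable to check
  esac
  if [ "$(cat "/proc/$pid/comm" 2>/dev/null)" = "$name" ]; then
    return 0                   # live daemon: keep the file
  fi
  rm -f "$file"
}

# Usage against a scratch copy (PID 999999 is assumed not to exist):
#   echo 999999 > /tmp/docker.pid.test
#   cleanup_stale_pidfile /tmp/docker.pid.test && ls /tmp/docker.pid.test
```

Running it once with a dead PID and once with the PID of a live process makes both branches visible before systemd ever executes the real command.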


Prevention in CI/CD

1. Packer / AMI bake pipelines: If you bake Docker into a golden AMI, ensure your post-install script explicitly removes the PID file before snapshotting:

sudo systemctl stop docker
sudo rm -f /var/run/docker.pid /var/run/docker.sock
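
A bake pipeline can also fail fast if those files are still present right before the snapshot. This is a sketch: assert_clean_run_dir is a hypothetical helper, and the directory argument defaults to /var/run as used above.

```shell
#!/bin/sh
# assert_clean_run_dir [DIR]
# Returns 0 if neither docker.pid nor docker.sock exists in DIR,
# returns 1 (and names the leftovers on stderr) otherwise.
assert_clean_run_dir() {
  dir=${1:-/var/run}
  leftovers=0
  for f in docker.pid docker.sock; do
    if [ -e "$dir/$f" ] || [ -L "$dir/$f" ]; then
      echo "leftover runtime file: $dir/$f" >&2
      leftovers=1
    fi
  done
  return "$leftovers"
}

# Usage as the last provisioning step:
#   assert_clean_run_dir /var/run || exit 1
```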

2. Ansible / Chef convergence runs: Add an idempotent task before the docker service resource:

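# Ansible example
# service_facts populates ansible_facts.services, keyed by unit name (docker.service)
- name: Gather service state
  ansible.builtin.service_facts:

- name: Remove stale Docker PID file if the Docker service is not running
  ansible.builtin.file:
    path: /var/run/docker.pid
    state: absent
  when: ansible_facts.services.get('docker.service', {}).get('state') != 'running'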

3. Kubernetes node bootstrap (kubeadm / eksctl): On nodes that still use Docker Engine as the container runtime, bootstrap scripts should include the systemd drop-in above as a cloud-init step. A node whose Docker daemon fails to start never becomes Ready, and without the drop-in the cause is not obvious from the cluster side: the node simply stays NotReady.
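
A bootstrap step that installs the drop-in could look like the sketch below. It assumes the drop-in path from the earlier section; install_pid_cleanup_dropin is a hypothetical function name. Shell variables are written as $$ because systemd applies its own variable substitution to Exec lines and turns $$ into a literal $.

```shell
#!/bin/sh
# install_pid_cleanup_dropin [DIR]
# Writes pid-cleanup.conf into DIR (default: the docker.service drop-in
# directory). Safe to run repeatedly: the file is simply rewritten.
install_pid_cleanup_dropin() {
  dir=${1:-/etc/systemd/system/docker.service.d}
  mkdir -p "$dir" || return 1
  # Quoted heredoc delimiter: nothing below is expanded by this shell.
  cat > "$dir/pid-cleanup.conf" <<'EOF'
[Service]
ExecStartPre=-/bin/sh -c 'PID=$$(cat /var/run/docker.pid 2>/dev/null); [ -n "$$PID" ] && ! ps -p "$$PID" -o comm= | grep -qx dockerd && rm -f /var/run/docker.pid || true'
EOF
}

# Usage in a bootstrap script (run as root), followed by a reload:
#   install_pid_cleanup_dropin && systemctl daemon-reload
```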

4. Checkov / Terraform for launch templates: If your EC2 launch template user-data starts Docker, add a Checkov custom check or a null_resource local-exec that validates the drop-in is present in the AMI before rolling an ASG update.

5. Monitoring: Alert on systemd unit state, not just process count:

# Prometheus node_exporter systemd collector — alert rule
- alert: DockerDaemonFailed
  expr: node_systemd_unit_state{name="docker.service",state="failed"} == 1
  for: 1m
  labels:
    severity: critical
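
The rule only fires if node_exporter actually exports that series; the systemd collector is off by default and is enabled with --collector.systemd. A quick smoke test is sketched below, assuming node_exporter listens on its default port 9100; docker_unit_failed is a hypothetical helper that parses the metrics text from stdin.

```shell
#!/bin/sh
# docker_unit_failed: reads Prometheus text-format metrics on stdin.
# Exit status:
#   0 - docker.service is reported with state="failed" and value 1
#   1 - the series exists but the unit is not failed
#   2 - the series is absent (systemd collector likely not enabled)
docker_unit_failed() {
  awk '
    index($0, "node_systemd_unit_state{") == 1 &&
    index($0, "name=\"docker.service\"") > 0 &&
    index($0, "state=\"failed\"") > 0 {
      seen = 1
      if ($NF + 0 == 1) failed = 1
    }
    END {
      if (!seen) exit 2
      exit (failed ? 0 : 1)
    }'
}

# Usage:
#   curl -s http://localhost:9100/metrics | docker_unit_failed
#   echo "exit status: $?"
```

An exit status of 2 is the case worth catching early, since it means the alert can never fire on that host.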
