How to Fix 'dockerd failed to start: pid file /var/run/docker.pid exists' on Linux
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 3 mins
TL;DR
- What broke: dockerd found a leftover /var/run/docker.pid from a previous crash or SIGKILL and refused to start to avoid double-daemon corruption.
- How to fix it: Verify no live dockerd process owns that PID, delete the stale file, and restart the service.
- Use our Client-Side Sandbox above to paste your journalctl output and auto-generate the exact remediation commands for your distro.
The Incident (What Does the Error Mean?)
Raw error surface — typically visible in systemctl status docker or journalctl -xe:
Failed to start Docker Application Container Engine.
dockerd failed to start: pid file found, ensure docker is not running or delete /var/run/docker.pid
-- or --
Error starting daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid
Immediate consequence: The Docker daemon is completely down. Every docker CLI call returns Cannot connect to the Docker daemon at unix:///var/run/docker.sock. No container can be started, stopped, or inspected. On a production host this means zero new container starts and failed health checks. A Swarm manager in this state stops counting toward the Raft quorum, and a Swarm worker is marked Down.
The Attack Vector / Blast Radius
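This is a split-brain protection mechanism gone wrong. dockerd writes its PID to /var/run/docker.pid on startup and removes it on clean shutdown. If the kernel OOM-killed dockerd or a sysadmin ran kill -9, the file survives. It also survives a power loss or hard reboot on hosts where /var/run is a persistent directory; on most systemd distributions /var/run is a symlink to /run, a tmpfs that is emptied at boot, so there the file is gone after a reboot.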
Cascading failure chain:
- Host reboots or dockerd crashes hard → PID file not cleaned up.
- systemd attempts auto-restart → blocked immediately by the PID file check.
- The unit's Restart= loop hits StartLimitBurst → systemd marks the unit failed and stops retrying.
- Any orchestration layer (Swarm, Nomad, custom health scripts) waiting on /var/run/docker.sock times out.
- Dependent services (log shippers, monitoring agents running as containers) go dark silently.
The dangerous assumption: Engineers sometimes see this and blindly rm the PID file without checking whether a lingering dockerd process still owns it. Starting a second daemon instance against the same storage graph (/var/lib/docker) can corrupt the overlay filesystem state.
How to Fix It (The Solution)
Step 1 — Verify No Live dockerd Process Exists
# Check if the PID in the file maps to a real, running dockerd
cat /var/run/docker.pid
# e.g., outputs: 3821
ps -p 3821 -o comm=
# If output is blank or returns a non-docker process, the file is stale.
# If it returns 'dockerd', a daemon IS running — do NOT delete the file.
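The manual check above can be wrapped in a small function so it behaves the same way every time. This is a minimal sketch, not part of Docker: pidfile_state is a hypothetical helper name, and the optional second argument (the expected command name) is an assumption added so the function can be tried against any process.

```shell
#!/bin/sh
# Classify a PID file as "missing", "stale", or "live".
# pidfile_state is a hypothetical helper name, not a Docker command.
pidfile_state() {
    pidfile="${1:-/var/run/docker.pid}"
    expected="${2:-dockerd}"   # command name a live owner should have

    # No file at all: nothing is blocking the daemon.
    [ -e "$pidfile" ] || { echo missing; return 0; }

    pid=$(cat "$pidfile" 2>/dev/null | tr -d '[:space:]')

    # Empty or non-numeric content cannot refer to a live process.
    case "$pid" in
        ''|*[!0-9]*) echo stale; return 0 ;;
    esac

    # ps prints a command name only if that PID currently exists.
    if [ "$(ps -p "$pid" -o comm= 2>/dev/null)" = "$expected" ]; then
        echo live
    else
        echo stale
    fi
}
```

Calling pidfile_state with no arguments checks /var/run/docker.pid against dockerd. The function only inspects the PID recorded in the file, so it is worth also running pgrep -x dockerd to catch a daemon that is running under a different PID than the one on disk.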
Basic Fix — Remove Stale PID File
- # Blindly restarting without checking:
- sudo systemctl restart docker
+ # Safe remediation sequence:
+ PID=$(cat /var/run/docker.pid 2>/dev/null)
+ if [ -n "$PID" ] && ps -p "$PID" -o comm= | grep -q dockerd; then
+   echo "Live dockerd process detected (PID $PID). Investigate before proceeding."
+ else
+   # Covers a dead PID as well as an empty or already-missing PID file.
+   echo "No live dockerd owns the PID file. Removing it."
+   sudo rm -f /var/run/docker.pid
+   # Flush the restart rate counter in case the unit hit its start limit.
+   sudo systemctl reset-failed docker
+   sudo systemctl start docker
+ fi
Enterprise Best Practice — systemd ExecStartPre Cleanup
Do not rely on manual intervention. Patch the systemd unit via a drop-in override so the daemon self-heals on every restart:
# File: /etc/systemd/system/docker.service.d/pid-cleanup.conf
- # No override — default unit has no PID cleanup logic
+ [Service]
+ ExecStartPre=-/bin/sh -c 'PID=$(cat /var/run/docker.pid 2>/dev/null); \
+ [ -n "$PID" ] && ! ps -p "$PID" -o comm= | grep -q dockerd && rm -f /var/run/docker.pid || true'
Apply it:
sudo mkdir -p /etc/systemd/system/docker.service.d
# Write the drop-in above, then:
sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl status docker
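The - prefix on ExecStartPre tells systemd to treat a non-zero exit from the cleanup command as non-fatal, so a failed check never blocks the daemon from starting. The protection for a legitimately running daemon comes from the ps/grep test inside the command, which deletes the PID file only when no dockerd process owns the recorded PID.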
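The "write the drop-in" step can be scripted so it is repeatable across hosts. The sketch below is one way to do it, under the assumption that the drop-in content is exactly the two-line [Service] stanza shown earlier; write_pid_cleanup_dropin is a hypothetical helper name, and the directory argument is there only so the function can be tried against a scratch path first.

```shell
#!/bin/sh
# Write the PID-cleanup drop-in into a systemd drop-in directory.
# write_pid_cleanup_dropin is a hypothetical helper name.
write_pid_cleanup_dropin() {
    dir="${1:-/etc/systemd/system/docker.service.d}"
    mkdir -p "$dir" || return 1
    # Quoted heredoc delimiter: the calling shell must not expand
    # $PID or $(...) while writing the file.
    cat > "$dir/pid-cleanup.conf" <<'EOF'
[Service]
ExecStartPre=-/bin/sh -c 'PID=$(cat /var/run/docker.pid 2>/dev/null); \
  [ -n "$PID" ] && ! ps -p "$PID" -o comm= | grep -q dockerd && rm -f /var/run/docker.pid || true'
EOF
}
```

Run the function as root with no argument to target the real drop-in directory, then run systemctl daemon-reload. systemctl cat docker.service prints the unit together with its drop-ins, which is a quick way to confirm the override was picked up.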
Prevention in CI/CD
1. Packer / AMI bake pipelines: If you bake Docker into a golden AMI, ensure your post-install script explicitly removes the PID file before snapshotting:
sudo systemctl stop docker
sudo rm -f /var/run/docker.pid /var/run/docker.sock
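A bake pipeline can also verify the cleanup before it snapshots, so a leftover file fails the build instead of shipping in the image. This is a rough sketch under assumptions: check_no_docker_runtime_files is a hypothetical helper name, and the run directory is a parameter purely so the check can be exercised outside a real image build.

```shell
#!/bin/sh
# Return non-zero if Docker runtime files are still present.
# check_no_docker_runtime_files is a hypothetical helper name.
check_no_docker_runtime_files() {
    rundir="${1:-/var/run}"
    status=0
    for f in docker.pid docker.sock; do
        # -e follows symlinks; -L also catches a dangling symlink.
        if [ -e "$rundir/$f" ] || [ -L "$rundir/$f" ]; then
            echo "leftover: $rundir/$f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Called with no argument as the last provisioner step, a non-zero exit status stops the build before the snapshot is taken.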
2. Ansible / Chef convergence runs: Add an idempotent task before the docker service resource:
# Ansible example
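# ansible_facts.services lists service units, not processes, so probe for the process directly.
- name: Check for a running dockerd process
  ansible.builtin.command: pgrep -x dockerd
  register: dockerd_proc
  changed_when: false
  failed_when: false
- name: Remove stale Docker PID file if process is absent
  ansible.builtin.file:
    path: /var/run/docker.pid
    state: absent
  when: dockerd_proc.rc != 0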
3. Kubernetes node bootstrap (kubeadm / eksctl): On nodes that use Docker as the container runtime, bootstrap scripts should include the systemd drop-in above as a cloud-init step. A node whose Docker daemon fails to start never becomes Ready, and nothing at the cluster level points to the PID file as the cause. The node simply stays NotReady until someone reads the Docker unit's journal on the host.
4. Checkov / Terraform for launch templates: If your EC2 launch template user-data starts Docker, add a Checkov custom check or a null_resource local-exec that validates the drop-in is present in the AMI before rolling an ASG update.
5. Monitoring: Alert on systemd unit state, not just process count:
# Prometheus node_exporter systemd collector — alert rule
- alert: DockerDaemonFailed
  expr: node_systemd_unit_state{name="docker.service",state="failed"} == 1
  for: 1m
  labels:
    severity: critical