Is it safe to delete /var/run/docker.pid?

Only if no live dockerd process owns that PID. Run `ps -p $(cat /var/run/docker.pid) -o comm=` first. If the output is blank or shows a non-Docker process, the file is stale and safe to remove. If it shows 'dockerd', a daemon is actually running and deleting the file will not help — you have a different problem (likely a socket permission issue).

Why does this happen after a server reboot?

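On most modern distros /var/run is a symlink to /run, a tmpfs that lives in RAM and is recreated empty at every boot, so a PID file cannot survive a reboot there. If the error appears right after a reboot, the usual causes are that /var/run is a real on-disk directory (some older distros, minimal images, or a machine image snapshotted with the file baked in), or that dockerd was started during boot and then killed hard (kernel OOM kill, SIGKILL) before the failing start attempt. In both cases dockerd never got the chance to remove its own PID file, which it only does on a clean shutdown.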
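Whether /var/run is memory-backed on a given host is easy to check. The sketch below assumes GNU coreutils `stat` and `readlink`; `fs_type` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Print the filesystem type backing a path (GNU coreutils `stat`).
# `fs_type` is a hypothetical helper name used only for this sketch.
fs_type() {
  # Resolve symlinks first: /var/run is usually a symlink to /run.
  target=$(readlink -f "$1") || return 1
  stat -f -c %T "$target"
}

# "tmpfs" means the directory is recreated empty at every boot.
fs_type /var/run || echo "could not inspect /var/run"
```

If this prints something other than tmpfs, a PID file written before the reboot can still be on disk afterwards.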

Will this cause data loss or container state corruption?

Removing a confirmed stale PID file and restarting dockerd cleanly will not corrupt container state: the PID file is only a lock marker, and Docker's storage graph in /var/lib/docker is separate. However, containers that were running at the time of the crash may be left with inconsistent overlay mounts. Run `docker ps -a` after the restart and inspect any containers in 'Exited' or 'Dead' state. Removing the PID file does not touch volumes, although data an application was writing when the host went down can still be incomplete.
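To narrow `docker ps -a` down to the containers worth inspecting, the status filter can be repeated. This is a minimal sketch; `list_suspect_containers` is a hypothetical helper name, and the output columns are just one reasonable choice.

```shell
#!/bin/sh
# List containers left in Exited or Dead state after the daemon restart.
# `list_suspect_containers` is a hypothetical helper name.
list_suspect_containers() {
  # Repeating the same filter key matches either value (logical OR).
  docker ps -a \
    --filter status=exited \
    --filter status=dead \
    --format '{{.ID}}\t{{.Names}}\t{{.Status}}'
}
```

For any container it lists, `docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}} {{.State.Error}}' <container>` shows how it ended.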

How to Fix 'dockerd failed to start: pid file /var/run/docker.pid exists' on Linux

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 3 mins

TL;DR

What broke: dockerd found a leftover /var/run/docker.pid from a previous crash or SIGKILL and refused to start to avoid double-daemon corruption.
How to fix it: Verify no live dockerd process owns that PID, delete the stale file, and restart the service.

The Incident (What Does the Error Mean?)

Raw error surface — typically visible in systemctl status docker or journalctl -xe:

Failed to start Docker Application Container Engine.
dockerd failed to start: pid file found, ensure docker is not running or delete /var/run/docker.pid
-- or --
Error starting daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid

Immediate consequence: The Docker daemon is completely down. Every docker CLI call returns Cannot connect to the Docker daemon at unix:///var/run/docker.sock, so no container on the host can be started, stopped, or inspected. On a production host this means zero new container starts and failed health checks. A Swarm node in this state is reported as Down, and if it is a manager it stops counting toward the cluster quorum.

The Attack Vector / Blast Radius

This is a split-brain protection mechanism gone wrong. dockerd writes its PID to /var/run/docker.pid on startup and removes it on clean shutdown. If the host OOM-killed dockerd, lost power, or a sysadmin ran kill -9, the file survives.

Cascading failure chain:

Host reboots or dockerd crashes hard → PID file not cleaned up.
systemd attempts auto-restart → blocked immediately by PID file check.
Repeated Restart= attempts hit StartLimitBurst → systemd marks the unit failed and stops retrying.
Any orchestration layer (Swarm, Nomad, custom health scripts) waiting on /var/run/docker.sock times out.
Dependent services (log shippers, monitoring agents running as containers) go dark silently.
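One consequence of the chain above: once the unit has been marked failed by the start rate limit, a manual start can be rejected for the same reason even after the PID file is gone. A minimal sketch of clearing that state first, assuming the stock unit name; `recover_unit` is a hypothetical helper name.

```shell
#!/bin/sh
# Clear a unit's failed state (which also resets its start rate limit
# counter) and then start it. `recover_unit` is a hypothetical helper.
recover_unit() {
  unit=$1
  # is-failed exits 0 only when the unit is currently in the failed state.
  if systemctl is-failed --quiet "$unit"; then
    systemctl reset-failed "$unit"
  fi
  systemctl start "$unit"
}
```

Run it as root, for example `recover_unit docker.service`, and only after the PID file has been verified as stale and removed.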

The dangerous assumption: Engineers sometimes see this and blindly rm the PID file without checking whether a lingering dockerd process still owns it. Starting a second daemon instance against the same storage graph (/var/lib/docker) can corrupt the overlay filesystem.

How to Fix It (The Solution)

Step 1 — Verify No Live dockerd Process Exists

# Check if the PID in the file maps to a real, running dockerd
cat /var/run/docker.pid
# e.g., outputs: 3821

ps -p 3821 -o comm=
# If output is blank or returns a non-docker process, the file is stale.
# If it returns 'dockerd', a daemon IS running — do NOT delete the file.
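The manual check can also be wrapped in a small function that names each outcome explicitly. This is a sketch, not Docker's own logic: `pidfile_state` is a hypothetical helper, and it reads /proc/<pid>/comm instead of calling ps, which yields the same command name on Linux.

```shell
#!/bin/sh
# Classify a PID file as: missing | empty | stale | live.
# `pidfile_state` is a hypothetical helper; the optional second argument
# is the expected process name and defaults to dockerd.
pidfile_state() {
  file=$1
  want=${2:-dockerd}
  [ -e "$file" ] || { echo missing; return; }
  # Keep digits only, so stray whitespace or newlines do not matter.
  pid=$(tr -cd '0-9' < "$file")
  [ -n "$pid" ] || { echo empty; return; }
  # /proc/<pid>/comm holds the command name; it is absent if the PID is gone.
  name=$(cat "/proc/$pid/comm" 2>/dev/null)
  if [ "$name" = "$want" ]; then echo live; else echo stale; fi
}

pidfile_state /var/run/docker.pid
```

Only the `live` result means the file must be left alone. A `stale` result also covers the case where the PID has been reused by an unrelated process.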

Basic Fix — Remove Stale PID File

- # Blindly restarting without checking:
- sudo systemctl restart docker

+ # Safe remediation sequence:
+ PID=$(cat /var/run/docker.pid 2>/dev/null)
+ # Empty $PID: the file is missing or empty, so there is nothing to protect.
+ if [ -z "$PID" ] || ! ps -p "$PID" -o comm= | grep -qx dockerd; then
+   echo "No live dockerd owns the PID file. Removing it."
+   sudo rm -f /var/run/docker.pid
+   sudo systemctl start docker
+ else
+   echo "Live dockerd process detected (PID $PID). Investigate before proceeding."
+ fi

Enterprise Best Practice — systemd ExecStartPre Cleanup

Do not rely on manual intervention. Patch the systemd unit via a drop-in override so the daemon self-heals on every restart:

# File: /etc/systemd/system/docker.service.d/pid-cleanup.conf

- # No override — default unit has no PID cleanup logic

+ [Service]
+ # $$ is systemd's escape for a literal $, so the shell (not systemd) expands PID.
+ ExecStartPre=-/bin/sh -c 'PID=$$(cat /var/run/docker.pid 2>/dev/null); \
+   [ -n "$$PID" ] && ! ps -p "$$PID" -o comm= | grep -qx dockerd && rm -f /var/run/docker.pid || true'

Apply it:

sudo mkdir -p /etc/systemd/system/docker.service.d
# Write the drop-in above, then:
sudo systemctl daemon-reload
sudo systemctl restart docker
sudo systemctl status docker

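The - prefix on ExecStartPre means systemd treats a non-zero exit from the cleanup command as non-fatal, so a failed cleanup never blocks the daemon from starting. The protection for a legitimately running daemon comes from the ps check inside the command: the PID file is removed only when its PID does not belong to dockerd.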
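Writing the drop-in from a script has one quoting trap: both the calling shell and systemd want to expand dollar signs. A minimal sketch that avoids both, assuming the drop-in path used above; `install_pid_cleanup_dropin` is a hypothetical helper, and the directory is a parameter so the output can be inspected before touching /etc.

```shell
#!/bin/sh
# Write the PID cleanup drop-in into a unit's drop-in directory.
# `install_pid_cleanup_dropin` is a hypothetical helper name.
install_pid_cleanup_dropin() {
  dir=${1:-/etc/systemd/system/docker.service.d}
  mkdir -p "$dir" || return 1
  # The quoted 'EOF' stops this shell from expanding $ in the body.
  # Inside a unit file, $$ is systemd's escape for a literal dollar sign.
  cat > "$dir/pid-cleanup.conf" <<'EOF'
[Service]
ExecStartPre=-/bin/sh -c 'PID=$$(cat /var/run/docker.pid 2>/dev/null); [ -n "$$PID" ] && ! ps -p "$$PID" -o comm= | grep -qx dockerd && rm -f /var/run/docker.pid || true'
EOF
}
```

After running it as root and reloading with `systemctl daemon-reload`, `systemctl cat docker.service` should list pid-cleanup.conf below the main unit file.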

Prevention in CI/CD

1. Packer / AMI bake pipelines: If you bake Docker into a golden AMI, ensure your post-install script explicitly removes the PID file before snapshotting:

sudo systemctl stop docker
sudo rm -f /var/run/docker.pid /var/run/docker.sock
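A bake pipeline can also fail loudly if those files are still present just before the snapshot. The sketch below is one way to do that; `assert_no_docker_runtime_files` is a hypothetical helper, and the root-directory parameter exists mainly so the check can be exercised against a scratch directory.

```shell
#!/bin/sh
# Return non-zero if Docker runtime files are still present under a root.
# `assert_no_docker_runtime_files` is a hypothetical helper name.
assert_no_docker_runtime_files() {
  root=${1:-/}
  status=0
  for f in var/run/docker.pid var/run/docker.sock; do
    # ${root%/} strips a trailing slash so "/" does not produce "//var/...".
    if [ -e "${root%/}/$f" ]; then
      echo "leftover runtime file: ${root%/}/$f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Calling `assert_no_docker_runtime_files || exit 1` as the last provisioning step aborts the build instead of shipping an image with a baked-in PID file.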

2. Ansible / Chef convergence runs: Add an idempotent task before the docker service resource:

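# Ansible example
- name: Check whether a dockerd process is running
  command: pgrep -x dockerd
  register: dockerd_proc
  changed_when: false
  # pgrep exits 1 when nothing matches; only other codes are real failures
  failed_when: dockerd_proc.rc not in [0, 1]
- name: Remove stale Docker PID file if process is absent
  file:
    path: /var/run/docker.pid
    state: absent
  when: dockerd_proc.rc == 1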

3. Kubernetes node bootstrap (kubeadm / eksctl): On nodes that still use Docker as the container runtime, bootstrap scripts should include the systemd drop-in above as a cloud-init step. A node whose Docker daemon fails to start never becomes Ready, and nothing at the cluster level explains why: the node simply stays NotReady indefinitely.

4. Checkov / Terraform for launch templates: If your EC2 launch template user-data starts Docker, add a Checkov custom check or a null_resource local-exec that validates the drop-in is present in the AMI before rolling an ASG update.

5. Monitoring: Alert on systemd unit state, not just process count:

# Prometheus node_exporter systemd collector — alert rule
- alert: DockerDaemonFailed
  expr: node_systemd_unit_state{name="docker.service",state="failed"} == 1
  for: 1m
  labels:
    severity: critical