How to Fix 'Container Is Not Running' Error After Unexpected Docker Stop

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–15 mins


TL;DR

  • What broke: The container process exited (crash, OOM kill, or bad entrypoint) — Docker preserves the dead container in Exited state, making docker exec impossible.
  • How to fix it: Identify the exit code via docker inspect, fix the root cause, then restart with docker start <container> or rebuild with a hardened entrypoint.

The Incident (What Does the Error Mean?)

Raw error:

$ docker exec -it my_app sh
Error response from daemon: Container 3f2a1b9c0d4e is not running

docker exec requires a running PID 1 inside the container. When the container's main process exits for any reason, Docker transitions the container to Exited state. The container filesystem is preserved but no process is alive to attach to. docker exec is dead in the water until PID 1 is alive again.

Confirm the state immediately:

docker ps -a --filter "name=my_app" --format "table {{.Names}}\t{{.Status}}\t{{.ID}}"

The Status column shows Exited (1) or Exited (137). docker ps has no separate exit-code field, so the number in parentheses is your primary diagnostic signal.

Exit code cheat sheet:

Exit Code   Meaning
0           Clean exit: your CMD finished normally
1           Application error / unhandled exception
137         SIGKILL: OOM killer or docker kill
139         Segfault
143         SIGTERM: graceful stop requested
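
The codes above 128 in this table follow the usual shell convention of 128 plus the signal number. As a minimal sketch of that arithmetic, the helper below (describe_exit_code is a hypothetical name, not a Docker command) turns a code into a hint. It assumes a POSIX shell whose kill -l N prints the bare signal name.

```shell
#!/bin/sh
# Sketch: translate a container exit code into a human-readable hint.
# Assumption: codes above 128 mean "killed by signal (code - 128)".
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l N prints the name of signal N without the SIG prefix
    name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
    echo "killed by signal $sig (SIG$name)"
  else
    echo "application error (exit $code)"
  fi
}

# Usage against a real container (requires Docker):
#   describe_exit_code "$(docker inspect --format '{{.State.ExitCode}}' my_app)"
```

For 137 this reports signal 9 (SIGKILL); it cannot tell an OOM kill from a manual docker kill, so the OOMKilled flag from docker inspect is still needed.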

The Attack Vector / Blast Radius

This is not just an inconvenience. A container in Exited state in production means:

  • No exec access — you cannot shell in to gather diagnostics while the incident is live. If you didn't configure logging drivers or a sidecar, you are blind.
  • Cascading dependency failures — any service depending on this container (via Docker network DNS) starts throwing connection refused. In a compose stack, this fans out fast.
  • OOMKill loop risk — if restart: always is set without fixing the memory leak, Docker respawns the container into a crash loop, consuming host CPU and masking the real problem in a wall of restart logs.
  • Data corruption risk — if the container was mid-write to a bind-mounted volume when it was killed (exit 137), you may have partially written files. Check your application's write-ahead log or journal.
  • Silent failure without supervision: on a standalone Docker host with no restart policy (Swarm services reschedule failed tasks by default, plain containers do not), the container sits dead and no alert fires unless an external monitor watches container state or health checks and feeds your alerting stack.

Forensics first — always run this before restarting:

# Get the full state, OOMKilled flag, and exit code
docker inspect my_app | jq '.[0].State'

# Get the last 200 lines of logs from the dead container
docker logs --tail 200 my_app

# Check host-level OOM events
dmesg | grep -iE 'oom|killed process' | tail -20
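
Because a later docker rm or a compose recreate discards the dead container's logs, it can help to save the evidence to disk first. The sketch below wraps the inspect and logs commands above; capture_forensics and the ./forensics default directory are illustrative names, not part of Docker.

```shell
#!/bin/sh
# Sketch: persist state and recent logs of a stopped container before touching it.
# capture_forensics is a hypothetical helper; adjust paths to your environment.
capture_forensics() {
  name=$1
  outdir=${2:-./forensics}
  mkdir -p "$outdir" || return 1

  # Full State object (exit code, OOMKilled flag, timestamps) as JSON
  docker inspect --format '{{json .State}}' "$name" \
    > "$outdir/$name.state.json" || return 1

  # Last 200 log lines; container stderr arrives on stderr, so merge both streams
  docker logs --tail 200 "$name" > "$outdir/$name.log" 2>&1 || return 1

  echo "saved forensics for $name in $outdir"
}

# Usage (requires Docker):
#   capture_forensics my_app /var/tmp/incident-001
```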

How to Fix It (The Solution)

Basic Fix — Restart the Container

Once you've pulled logs and identified the exit code:

# Restart the existing stopped container
docker start my_app

# Verify it's running
docker ps --filter "name=my_app"

# Attach exec once confirmed running
docker exec -it my_app sh

⚠️ Do not skip the forensics step above. Restarting without root cause analysis puts you back in the same crash loop in minutes.
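
A single docker ps right after the restart can miss a container that dies a few seconds later. One way to catch that, sketched here under the assumption that polling once per second is acceptable, is to watch State.Running for a short window. stays_running is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: confirm a container stays up for N seconds after docker start.
# Returns 0 if it was running at every check, 1 as soon as it is not.
stays_running() {
  name=$1
  duration=${2:-30}
  elapsed=0
  while [ "$elapsed" -lt "$duration" ]; do
    running=$(docker inspect --format '{{.State.Running}}' "$name" 2>/dev/null)
    if [ "$running" != "true" ]; then
      echo "$name stopped after ${elapsed}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Usage (requires Docker):
#   docker start my_app && stays_running my_app 60 && docker exec -it my_app sh
```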


Enterprise Best Practice — Harden the Container Config

Problem: No restart policy, no memory limit, no health check.

# docker-compose.yml

 services:
   app:
     image: my_app:latest
-    restart: "no"
+    restart: unless-stopped
+    mem_limit: 512m
+    memswap_limit: 512m
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
+      interval: 15s
+      timeout: 5s
+      retries: 3
+      start_period: 10s
     environment:
       - APP_ENV=production
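
Once a health check like the one above is deployed, its result is exposed under State.Health in docker inspect. The sketch below reads it from the command line; health_status and is_healthy are hypothetical helper names, and the template guards against containers that define no health check at all.

```shell
#!/bin/sh
# Sketch: read the health check result Docker records for a container.
# Prints healthy, unhealthy, starting, or no-healthcheck.
health_status() {
  docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}}' \
    "$1"
}

# Succeeds only when the container reports "healthy".
is_healthy() {
  [ "$(health_status "$1")" = "healthy" ]
}

# Usage (requires Docker):
#   is_healthy my_app || echo "my_app is not healthy: $(health_status my_app)"
```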

Problem: Dockerfile entrypoint exits immediately (classic script-exits-zero issue).

# Dockerfile

- CMD ["./start.sh"]
+ # Ensure PID 1 is a long-running process, not a fire-and-forget script
+ ENTRYPOINT ["./docker-entrypoint.sh"]
+ CMD ["./app-server", "--port", "8080"]

# docker-entrypoint.sh

  #!/bin/sh
  set -e

+ # Trap SIGTERM during the init phase only; after exec the app handles signals itself
+ trap 'echo "SIGTERM received during init, aborting"; exit 143' TERM
+
  # Run migrations or init tasks
  ./migrate.sh

- ./app-server &
+ # exec replaces shell with app — PID 1 is now the app, not sh
+ exec "$@"

Using exec "$@" is non-negotiable. Without it, your shell script is PID 1, the app is a child process, and SIGTERM never reaches the app — Docker force-kills after the stop timeout, producing exit 137 every time.
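
The PID behaviour described here can be observed without Docker. The following sketch uses a nested sh as a stand-in for the app: each function prints the wrapper shell's PID, then the PID of the "app" it launches. With exec the two match; without it they differ. The function names are illustrative.

```shell
#!/bin/sh
# Sketch: exec keeps the PID, a plain call forks a child.

# Wrapper shell prints its PID, then execs the stand-in app, which prints its own.
with_exec() {
  sh -c 'echo "$$"; exec sh -c "echo \$\$"'
}

# Same, but the stand-in app runs as an ordinary child process.
# The trailing ":" prevents shells from auto-exec'ing the last command of -c.
without_exec() {
  sh -c 'echo "$$"; sh -c "echo \$\$"; :'
}

# Usage:
#   with_exec      # two identical PIDs
#   without_exec   # two different PIDs
```

Inside a container the wrapper shell is PID 1, so the first case is what makes the app itself PID 1 and the direct recipient of SIGTERM.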



Prevention in CI/CD

1. Lint Dockerfiles in your pipeline with Hadolint:

# .github/workflows/docker-lint.yml
- name: Lint Dockerfile
  uses: hadolint/hadolint-action@v3.1.0
  with:
    dockerfile: Dockerfile
    failure-threshold: warning

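Hadolint rule DL3025 flags CMD and ENTRYPOINT written in shell form instead of exec (JSON) form. Rule DL3057 flags a missing HEALTHCHECK instruction, but it is ignored by default and has to be enabled explicitly; DL3006 only checks that the base image is tagged.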

2. Enforce health checks with OPA/Conftest:

# policy/docker_compose.rego
package docker_compose

deny[msg] {
  service := input.services[name]
  not service.healthcheck
  msg := sprintf("Service '%v' is missing a healthcheck definition", [name])
}

deny[msg] {
  service := input.services[name]
  not service.restart
  msg := sprintf("Service '%v' has no restart policy", [name])
}
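
# Run in CI before deploy (Conftest evaluates only the "main" package unless a namespace is given)
conftest test docker-compose.yml --policy policy/ --namespace docker_compose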

3. Monitor container state with a dead-simple Prometheus alert:

# alerting rules
- alert: ContainerNotRunning
  expr: time() - container_last_seen{name=~"my_app.*"} > 60
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Container {{ $labels.name }} has been down for over 60s"
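
On hosts without a Prometheus stack, a rough stand-in is to subscribe to the Docker event stream and react to die events. This is a sketch, not a replacement for real alerting: alert_line and watch_container_deaths are hypothetical names, and the printf is where a webhook or pager call would go.

```shell
#!/bin/sh
# Sketch: emit one alert line per container "die" event.

# Format a single alert; swap the printf for your notification command.
alert_line() {
  printf 'ALERT container=%s exit_code=%s\n' "$1" "$2"
}

# Blocks and streams events until interrupted.
watch_container_deaths() {
  docker events --filter 'event=die' \
    --format '{{.Actor.Attributes.name}} {{.Actor.Attributes.exitCode}}' |
  while read -r name code; do
    alert_line "$name" "$code"
  done
}

# Usage (requires Docker; runs until Ctrl-C):
#   watch_container_deaths
```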

4. Set explicit stop_grace_period in compose to avoid SIGKILL races:

services:
  app:
    stop_grace_period: 30s

This gives your app 30 seconds to handle SIGTERM cleanly before Docker escalates to SIGKILL (exit 137). Default is 10 seconds — often not enough for apps draining in-flight requests.
