Initializing Enclave...

How to Fix Docker HEALTHCHECK Command Failed Exit 1: Diagnosing Unhealthy Containers

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–15 mins

TL;DR

  • What broke: The HEALTHCHECK command in your Dockerfile exits with code 1, so Docker marks the container unhealthy, triggering restart loops or orchestrator evictions (ECS task replacement, K8s pod kill).
  • How to fix it: Verify the health probe binary exists inside the image, the target port/path is correct, and --start-period/--interval/--timeout are tuned to your app's actual startup time.
  • Fast path: Use our Client-Side Sandbox below to auto-refactor this — paste your Dockerfile and get a corrected HEALTHCHECK instruction without sending your config to a third-party server.

The Incident (What Does the Error Mean?)

Raw output from docker inspect or docker events:

Status: unhealthy
Log:
  ExitCode: 1
  Output: "curl: (7) Failed to connect to localhost port 8080: Connection refused"
  ExitCode: 1
  Output: "/bin/sh: curl: not found"

Docker executes the HEALTHCHECK command inside the running container at each --interval. An exit code of 1 is the explicit unhealthy signal. An exit code of 2 means reserved/failed probe. Exit 0 is the only healthy state.

Immediate consequence: In standalone Docker, the container stays running but is flagged unhealthy. In ECS, Docker Swarm, or Kubernetes (via liveness probes mapped to healthcheck), the orchestrator will drain and replace the task/pod after --retries consecutive failures — causing a real outage if your entire service rolls over simultaneously.


The Attack Vector / Blast Radius

This is a reliability and availability failure, not a direct CVE — but the blast radius is severe:

  1. Restart storm: If --retries 3 and --interval 10s, your container is declared unhealthy in 30 seconds. ECS replaces it. The new container also fails. You get a thundering-herd restart loop that exhausts task placement capacity.
  2. Silent misconfiguration masking a real crash: If your app process died but the healthcheck was already broken/misconfigured, Docker reports unhealthy for the wrong reason. You chase the healthcheck red herring while the actual application panic goes unnoticed in logs.
  3. Load balancer target deregistration: ALB/NLB target groups tied to ECS health status will deregister all targets if the healthcheck is universally broken across a deployment, taking the service to zero capacity.
  4. Distroless/minimal image trap: Engineers copy a curl-based healthcheck from a Ubuntu-based image into a gcr.io/distroless or alpine image where curl doesn't exist. The probe fails 100% of the time from first start.

How to Fix It

Root Cause Checklist

Before touching the Dockerfile, run this inside the container:

# Get a shell into the unhealthy container
docker exec -it <container_id> /bin/sh

# Manually run your healthcheck command
curl -f http://localhost:8080/health
# or
wget -qO- http://localhost:8080/health

# Check if the binary exists
which curl || which wget

# Check if the port is actually listening
ss -tlnp | grep 8080
# or
netstat -tlnp

Basic Fix — curl not found in minimal image

  FROM node:18-alpine

- RUN npm ci --omit=dev
+ RUN npm ci --omit=dev && apk add --no-cache curl

  EXPOSE 8080

- HEALTHCHECK --interval=10s --timeout=3s \
-   CMD curl -f http://localhost:8080/health || exit 1
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
+   CMD curl -f http://localhost:8080/health || exit 1

Why --start-period matters: Without it, Docker starts counting healthcheck failures from container start. A Node/JVM app that takes 10s to boot will fail 1–2 probes before it's even ready, burning through retries before the app is healthy.


Enterprise Best Practice — Use a native process check, no external binary dependency

For distroless, scratch, or hardened images where you cannot install curl:

  FROM gcr.io/distroless/nodejs18-debian11

- HEALTHCHECK --interval=10s --timeout=3s \
-   CMD curl -f http://localhost:8080/health || exit 1
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
+   CMD ["node", "-e", \
+     "require('http').get('http://localhost:8080/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"]

For Go binaries, compile a dedicated /healthcheck binary into the image:

  FROM scratch
  COPY --from=builder /app/server /server
+ COPY --from=builder /app/healthcheck /healthcheck

- HEALTHCHECK CMD ["/bin/sh", "-c", "curl localhost:8080"]
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+   CMD ["/healthcheck"]

The healthcheck binary is a minimal Go HTTP client compiled into the image — zero shell, zero curl dependency, works in scratch or distroless.


💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Hadolint — Dockerfile linter catches bad HEALTHCHECK patterns

# .github/workflows/lint.yml
- name: Lint Dockerfile
  uses: hadolint/[email protected]
  with:
    dockerfile: Dockerfile
    failure-threshold: warning

Hadolint rule DL3059 flags missing --start-period. Rule DL4006 flags shell form healthchecks that can silently swallow errors.

2. Checkov — IaC policy scan for ECS task definitions

checkov -d . --check CKV_DOCKER_2
# CKV_DOCKER_2: Ensure that HEALTHCHECK instructions have been added to container images

3. OPA/Conftest policy — enforce start-period minimum

# policy/healthcheck.rego
package docker

deny[msg] {
  input.Healthcheck.Test
  not input.Healthcheck.StartPeriod
  msg := "HEALTHCHECK must define --start-period to prevent false failures during app startup"
}

deny[msg] {
  input.Healthcheck.Interval < 15000000000  # 15s in nanoseconds
  msg := "HEALTHCHECK --interval must be >= 15s to avoid restart storms"
}
conftest test docker-image-config.json --policy policy/

4. Integration test — assert container reaches healthy state in CI

#!/bin/bash
docker build -t myapp:test .
docker run -d --name test_container myapp:test

# Poll for healthy status, fail pipeline if it never reaches healthy
for i in $(seq 1 12); do
  STATUS=$(docker inspect --format='{{.State.Health.Status}}' test_container)
  [ "$STATUS" = "healthy" ] && echo "Container healthy" && exit 0
  echo "Attempt $i: status=$STATUS"
  sleep 5
done

echo "FATAL: Container never reached healthy state"
docker logs test_container
docker inspect test_container | jq '.State.Health'
exit 1

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →