How to Fix Docker HEALTHCHECK Command Failed Exit 1: Diagnosing Unhealthy Containers
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–15 mins
TL;DR
- What broke: The
HEALTHCHECKcommand in your Dockerfile exits with code1, so Docker marks the containerunhealthy, triggering restart loops or orchestrator evictions (ECS task replacement, K8s pod kill). - How to fix it: Verify the health probe binary exists inside the image, the target port/path is correct, and
--start-period/--interval/--timeoutare tuned to your app's actual startup time. - Fast path: Use our Client-Side Sandbox below to auto-refactor this — paste your Dockerfile and get a corrected
HEALTHCHECKinstruction without sending your config to a third-party server.
The Incident (What Does the Error Mean?)
Raw output from docker inspect or docker events:
Status: unhealthy
Log:
ExitCode: 1
Output: "curl: (7) Failed to connect to localhost port 8080: Connection refused"
ExitCode: 1
Output: "/bin/sh: curl: not found"
Docker executes the HEALTHCHECK command inside the running container at each --interval. An exit code of 1 is the explicit unhealthy signal. An exit code of 2 means reserved/failed probe. Exit 0 is the only healthy state.
Immediate consequence: In standalone Docker, the container stays running but is flagged unhealthy. In ECS, Docker Swarm, or Kubernetes (via liveness probes mapped to healthcheck), the orchestrator will drain and replace the task/pod after --retries consecutive failures — causing a real outage if your entire service rolls over simultaneously.
The Attack Vector / Blast Radius
This is a reliability and availability failure, not a direct CVE — but the blast radius is severe:
- Restart storm: If
--retries 3and--interval 10s, your container is declared unhealthy in 30 seconds. ECS replaces it. The new container also fails. You get a thundering-herd restart loop that exhausts task placement capacity. - Silent misconfiguration masking a real crash: If your app process died but the healthcheck was already broken/misconfigured, Docker reports
unhealthyfor the wrong reason. You chase the healthcheck red herring while the actual application panic goes unnoticed in logs. - Load balancer target deregistration: ALB/NLB target groups tied to ECS health status will deregister all targets if the healthcheck is universally broken across a deployment, taking the service to zero capacity.
- Distroless/minimal image trap: Engineers copy a
curl-based healthcheck from a Ubuntu-based image into agcr.io/distrolessoralpineimage wherecurldoesn't exist. The probe fails 100% of the time from first start.
How to Fix It
Root Cause Checklist
Before touching the Dockerfile, run this inside the container:
# Get a shell into the unhealthy container
docker exec -it <container_id> /bin/sh
# Manually run your healthcheck command
curl -f http://localhost:8080/health
# or
wget -qO- http://localhost:8080/health
# Check if the binary exists
which curl || which wget
# Check if the port is actually listening
ss -tlnp | grep 8080
# or
netstat -tlnp
Basic Fix — curl not found in minimal image
FROM node:18-alpine
- RUN npm ci --omit=dev
+ RUN npm ci --omit=dev && apk add --no-cache curl
EXPOSE 8080
- HEALTHCHECK --interval=10s --timeout=3s \
- CMD curl -f http://localhost:8080/health || exit 1
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
+ CMD curl -f http://localhost:8080/health || exit 1
Why --start-period matters: Without it, Docker starts counting healthcheck failures from container start. A Node/JVM app that takes 10s to boot will fail 1–2 probes before it's even ready, burning through retries before the app is healthy.
Enterprise Best Practice — Use a native process check, no external binary dependency
For distroless, scratch, or hardened images where you cannot install curl:
FROM gcr.io/distroless/nodejs18-debian11
- HEALTHCHECK --interval=10s --timeout=3s \
- CMD curl -f http://localhost:8080/health || exit 1
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
+ CMD ["node", "-e", \
+ "require('http').get('http://localhost:8080/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"]
For Go binaries, compile a dedicated /healthcheck binary into the image:
FROM scratch
COPY --from=builder /app/server /server
+ COPY --from=builder /app/healthcheck /healthcheck
- HEALTHCHECK CMD ["/bin/sh", "-c", "curl localhost:8080"]
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+ CMD ["/healthcheck"]
The healthcheck binary is a minimal Go HTTP client compiled into the image — zero shell, zero curl dependency, works in scratch or distroless.
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Hadolint — Dockerfile linter catches bad HEALTHCHECK patterns
# .github/workflows/lint.yml
- name: Lint Dockerfile
uses: hadolint/[email protected]
with:
dockerfile: Dockerfile
failure-threshold: warning
Hadolint rule DL3059 flags missing --start-period. Rule DL4006 flags shell form healthchecks that can silently swallow errors.
2. Checkov — IaC policy scan for ECS task definitions
checkov -d . --check CKV_DOCKER_2
# CKV_DOCKER_2: Ensure that HEALTHCHECK instructions have been added to container images
3. OPA/Conftest policy — enforce start-period minimum
# policy/healthcheck.rego
package docker
deny[msg] {
input.Healthcheck.Test
not input.Healthcheck.StartPeriod
msg := "HEALTHCHECK must define --start-period to prevent false failures during app startup"
}
deny[msg] {
input.Healthcheck.Interval < 15000000000 # 15s in nanoseconds
msg := "HEALTHCHECK --interval must be >= 15s to avoid restart storms"
}
conftest test docker-image-config.json --policy policy/
4. Integration test — assert container reaches healthy state in CI
#!/bin/bash
docker build -t myapp:test .
docker run -d --name test_container myapp:test
# Poll for healthy status, fail pipeline if it never reaches healthy
for i in $(seq 1 12); do
STATUS=$(docker inspect --format='{{.State.Health.Status}}' test_container)
[ "$STATUS" = "healthy" ] && echo "Container healthy" && exit 0
echo "Attempt $i: status=$STATUS"
sleep 5
done
echo "FATAL: Container never reached healthy state"
docker logs test_container
docker inspect test_container | jq '.State.Health'
exit 1