Why does my Docker HEALTHCHECK fail immediately even though the app is running?

The most common cause is a missing `--start-period` parameter. Without it, Docker begins evaluating health from the moment the container starts. If your application takes 5–20 seconds to initialize (Node.js, JVM, Python Django), the first 1–3 probes will fail before the app is ready to serve traffic, burning through your `--retries` budget and marking the container unhealthy before it ever had a chance. Add `--start-period= s` to give the app a grace period before failures count.

How do I fix 'curl: not found' in a Docker HEALTHCHECK on Alpine or distroless images?

You have two options. Option 1 (Alpine): Add `apk add --no-cache curl` to your RUN layer before the HEALTHCHECK instruction. Option 2 (distroless/scratch — preferred): Replace the curl-based probe with a native runtime call. For Node.js use `CMD ["node", "-e", "require('http').get(...)"]`. For Go, compile a dedicated lightweight healthcheck binary and COPY it into the image. Never install curl into a hardened distroless image just for health probing — it expands your attack surface.

What is the difference between Docker HEALTHCHECK exit code 1 and exit code 2?

Exit code `0` = healthy. Exit code `1` = unhealthy (your probe ran successfully but determined the service is not healthy — e.g., HTTP 500, connection refused). Exit code `2` = reserved, indicates the healthcheck mechanism itself is broken or the probe binary crashed unexpectedly. In practice, most production failures are exit `1` from `curl -f` receiving a non-2xx HTTP response, or the TCP connection being refused because the app hasn't bound to the port yet.

How to Fix Docker HEALTHCHECK Command Failed Exit 1: Diagnosing Unhealthy Containers

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–15 mins

TL;DR

What broke: The HEALTHCHECK command in your Dockerfile exits with code 1, so Docker marks the container unhealthy, triggering restart loops or orchestrator evictions (ECS task replacement, K8s pod kill).
How to fix it: Verify the health probe binary exists inside the image, the target port/path is correct, and --start-period/--interval/--timeout are tuned to your app's actual startup time.
Fast path: Use our Client-Side Sandbox below to auto-refactor this — paste your Dockerfile and get a corrected HEALTHCHECK instruction without sending your config to a third-party server.

The Incident (What Does the Error Mean?)

Raw output from docker inspect or docker events:

Status: unhealthy
Log:
  ExitCode: 1
  Output: "curl: (7) Failed to connect to localhost port 8080: Connection refused"
  ExitCode: 1
  Output: "/bin/sh: curl: not found"

Docker executes the HEALTHCHECK command inside the running container at each --interval. An exit code of 1 is the explicit unhealthy signal. An exit code of 2 means reserved/failed probe. Exit 0 is the only healthy state.

Immediate consequence: In standalone Docker, the container stays running but is flagged unhealthy. In ECS, Docker Swarm, or Kubernetes (via liveness probes mapped to healthcheck), the orchestrator will drain and replace the task/pod after --retries consecutive failures — causing a real outage if your entire service rolls over simultaneously.

The Attack Vector / Blast Radius

This is a reliability and availability failure, not a direct CVE — but the blast radius is severe:

Restart storm: If --retries 3 and --interval 10s, your container is declared unhealthy in 30 seconds. ECS replaces it. The new container also fails. You get a thundering-herd restart loop that exhausts task placement capacity.
Silent misconfiguration masking a real crash: If your app process died but the healthcheck was already broken/misconfigured, Docker reports unhealthy for the wrong reason. You chase the healthcheck red herring while the actual application panic goes unnoticed in logs.
Load balancer target deregistration: ALB/NLB target groups tied to ECS health status will deregister all targets if the healthcheck is universally broken across a deployment, taking the service to zero capacity.
Distroless/minimal image trap: Engineers copy a curl-based healthcheck from a Ubuntu-based image into a gcr.io/distroless or alpine image where curl doesn't exist. The probe fails 100% of the time from first start.

How to Fix It

Root Cause Checklist

Before touching the Dockerfile, run this inside the container:

# Get a shell into the unhealthy container
docker exec -it <container_id> /bin/sh

# Manually run your healthcheck command
curl -f http://localhost:8080/health
# or
wget -qO- http://localhost:8080/health

# Check if the binary exists
which curl || which wget

# Check if the port is actually listening
ss -tlnp | grep 8080
# or
netstat -tlnp

Basic Fix — curl not found in minimal image

  FROM node:18-alpine

- RUN npm ci --omit=dev
+ RUN npm ci --omit=dev && apk add --no-cache curl

  EXPOSE 8080

- HEALTHCHECK --interval=10s --timeout=3s \
-   CMD curl -f http://localhost:8080/health || exit 1
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
+   CMD curl -f http://localhost:8080/health || exit 1

Why --start-period matters: Without it, Docker starts counting healthcheck failures from container start. A Node/JVM app that takes 10s to boot will fail 1–2 probes before it's even ready, burning through retries before the app is healthy.

Enterprise Best Practice — Use a native process check, no external binary dependency

For distroless, scratch, or hardened images where you cannot install curl:

  FROM gcr.io/distroless/nodejs18-debian11

- HEALTHCHECK --interval=10s --timeout=3s \
-   CMD curl -f http://localhost:8080/health || exit 1
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
+   CMD ["node", "-e", \
+     "require('http').get('http://localhost:8080/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"]

For Go binaries, compile a dedicated /healthcheck binary into the image:

  FROM scratch
  COPY --from=builder /app/server /server
+ COPY --from=builder /app/healthcheck /healthcheck

- HEALTHCHECK CMD ["/bin/sh", "-c", "curl localhost:8080"]
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+   CMD ["/healthcheck"]

The healthcheck binary is a minimal Go HTTP client compiled into the image — zero shell, zero curl dependency, works in scratch or distroless.

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.

Prevention in CI/CD

1. Hadolint — Dockerfile linter catches bad HEALTHCHECK patterns

# .github/workflows/lint.yml
- name: Lint Dockerfile
  uses: hadolint/[email protected]
  with:
    dockerfile: Dockerfile
    failure-threshold: warning

Hadolint rule DL3059 flags missing --start-period. Rule DL4006 flags shell form healthchecks that can silently swallow errors.

2. Checkov — IaC policy scan for ECS task definitions

checkov -d . --check CKV_DOCKER_2
# CKV_DOCKER_2: Ensure that HEALTHCHECK instructions have been added to container images

3. OPA/Conftest policy — enforce start-period minimum

# policy/healthcheck.rego
package docker

deny[msg] {
  input.Healthcheck.Test
  not input.Healthcheck.StartPeriod
  msg := "HEALTHCHECK must define --start-period to prevent false failures during app startup"
}

deny[msg] {
  input.Healthcheck.Interval < 15000000000  # 15s in nanoseconds
  msg := "HEALTHCHECK --interval must be >= 15s to avoid restart storms"
}

conftest test docker-image-config.json --policy policy/

4. Integration test — assert container reaches healthy state in CI

#!/bin/bash
docker build -t myapp:test .
docker run -d --name test_container myapp:test

# Poll for healthy status, fail pipeline if it never reaches healthy
for i in $(seq 1 12); do
  STATUS=$(docker inspect --format='{{.State.Health.Status}}' test_container)
  [ "$STATUS" = "healthy" ] && echo "Container healthy" && exit 0
  echo "Attempt $i: status=$STATUS"
  sleep 5
done

echo "FATAL: Container never reached healthy state"
docker logs test_container
docker inspect test_container | jq '.State.Health'
exit 1