How to Fix Nginx 'connect() failed (111: Connection refused)' After Backend Container Restart
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–15 mins
TL;DR
- What broke: Nginx tried to proxy to 127.0.0.1:3000 only milliseconds after the backend container restarted. The app process was not yet bound to the port, so the kernel returned ECONNREFUSED (111).
- How to fix it: Add proxy_next_upstream error timeout retry logic in Nginx, enforce a healthcheck plus depends_on: condition: service_healthy in docker-compose, and set a startup grace period in your process supervisor.
- Fast path: Use our Client-Side Sandbox below to auto-refactor your Nginx upstream block and docker-compose service definition without sending your config to any external server.
The Incident (What does the error mean?)
Raw log entry from /var/log/nginx/error.log:
2024/01/15 03:42:17 [error] 29#29: *1482 connect() failed (111: Connection refused)
while connecting to upstream, client: 10.0.1.45, server: api.example.com,
request: "POST /api/v2/orders HTTP/1.1", upstream: "http://127.0.0.1:3000/api/v2/orders",
host: "api.example.com"
errno 111 = ECONNREFUSED. The TCP SYN packet hit the loopback interface, the kernel found no process listening on port 3000, and immediately returned a RST. Nginx had no valid upstream to hand the request to and returned HTTP 502 Bad Gateway to every client.
This is not a network partition. This is a race condition: your orchestrator (Docker, Kubernetes, systemd) marked the container/service as running the instant the entrypoint process forked — before the Node.js/Python/Go application inside completed its initialization, ran DB migrations, and called listen(3000).
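To see the same errno outside of Nginx, the following Python sketch connects to a loopback port that nothing is listening on. The helper names free_port and try_connect are made up for this example, and the numeric value 111 assumes Linux.

```python
import errno
import socket


def free_port() -> int:
    """Ask the kernel for an unused loopback port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def try_connect(host: str, port: int, timeout: float = 1.0) -> int:
    """Return 0 if the TCP connect succeeds, otherwise the errno of the failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port))


if __name__ == "__main__":
    port = free_port()  # the socket is closed again, so nothing listens here
    code = try_connect("127.0.0.1", port)
    # Prints the symbolic errno name for the failed connect.
    print(errno.errorcode.get(code, code))
```

This is the same situation Nginx is in during the restart window: the host is reachable, but no process has called listen() on the port yet.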
The Attack Vector / Blast Radius
This failure mode is deceptively catastrophic in production:
- Zero-downtime deploys become zero-uptime deploys. A rolling restart with docker-compose up -d --no-deps backend will cause a 502 window on every single deploy if Nginx has no retry logic.
- Thundering herd on recovery. Once the backend finally binds, clients that were retrying failed requests all reconnect at the same moment. If your app has a slow DB connection pool warm-up, this second wave can OOM-kill the freshly started container.
- Health check bypass. If you rely solely on Nginx's passive health checking (max_fails/fail_timeout), the upstream is only marked down after real user traffic has already received 502s. There is no proactive detection.
- Lost writes. POST and PATCH requests that hit during the restart window fail with a 502, and by default Nginx will not re-send a non-idempotent request that has already been passed to a server. Unless the client retries, your order, payment, or write operation never reaches the backend.
- Cascading alert storm. Monitoring systems fire simultaneously on 5xx rate, upstream latency, and pod restart count, obscuring the single root cause and slowing MTTR.
How to Fix It (The Solution)
Root Cause Checklist
Before applying any fix, confirm which layer is broken:
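- curl -v http://127.0.0.1:3000/health from inside the Nginx container → if it fails, the app is not up yet. (If Nginx and the backend run in separate containers without a shared network namespace, 127.0.0.1 is the Nginx container itself; use the backend's service name instead.)
- ss -tlnp | grep 3000 inside the backend container → confirms whether the port is bound.
- docker inspect --format='{{.State.Health.Status}}' <backend_container> → if it prints none, or fails because the container has no Health state, you have no healthcheck defined.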
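The port check from the list above can also be scripted, for example as an entrypoint or CI helper that waits until the backend accepts TCP connections. This is a minimal sketch; wait_for_port and its default timeout and interval are illustrative choices, not part of any tool mentioned here.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.2) -> bool:
    """Poll until a TCP connect to host:port succeeds or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process has called listen().
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: back off briefly and retry.
            time.sleep(interval)
    return False
```

A bound port only proves that listen() was called. It does not prove that migrations have finished, which is why the healthcheck-based fixes later in this article probe an HTTP endpoint instead.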
Basic Fix — Nginx Upstream Retry + Keepalive
# /etc/nginx/conf.d/upstream.conf
  upstream backend_pool {
-     server 127.0.0.1:3000;
+     server 127.0.0.1:3000 max_fails=3 fail_timeout=10s;
+     keepalive 32;
  }

  server {
      listen 80;
      server_name api.example.com;

      location / {
          proxy_pass http://backend_pool;
+         proxy_next_upstream error timeout http_502 http_503;
+         proxy_next_upstream_tries 3;
+         proxy_next_upstream_timeout 10s;
+         proxy_connect_timeout 2s;
+         proxy_read_timeout 30s;
+         proxy_http_version 1.1;
+         proxy_set_header Connection "";
      }
  }
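What this does: proxy_next_upstream tells Nginx to retry the request on the next available upstream peer after a connection failure or a 502/503 response. With the single server shown here there is no other peer to fail over to, and Nginx ignores max_fails and fail_timeout for a one-server group, so the retry only pays off once backend_pool lists a second instance or a backup server. proxy_connect_timeout 2s stops Nginx from waiting up to the default 60 seconds on an unresponsive host. keepalive 32 keeps up to 32 idle upstream connections per worker process open for reuse, eliminating per-request handshake overhead.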
⚠️ By default, proxy_next_upstream does not re-send requests with non-idempotent methods (POST, LOCK, PATCH) once the request has been passed to a server. To retry those as well, you must explicitly add non_idempotent, but understand the implications: this can cause duplicate writes if your backend handlers are not idempotent.
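One common way to make retried writes safe is an idempotency key that the client sends with each POST. The sketch below is a hypothetical in-memory handler, not code from any real backend: OrderService, create_order, and the key format are assumptions, and a production version would persist the keys in a shared store.

```python
from dataclasses import dataclass, field


@dataclass
class OrderService:
    """Hypothetical order handler that deduplicates retried POSTs."""
    orders: list = field(default_factory=list)
    _seen: dict = field(default_factory=dict)  # idempotency key -> first response

    def create_order(self, idempotency_key: str, payload: dict) -> dict:
        # A retried request carries the same key, so return the stored
        # response instead of creating a second order.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        order = {"id": len(self.orders) + 1, **payload}
        self.orders.append(order)
        self._seen[idempotency_key] = order
        return order
```

With a scheme like this in place, enabling non_idempotent is less risky: if Nginx re-sends a POST, the second delivery returns the original order rather than writing a duplicate.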
Enterprise Best Practice — Docker Compose Healthcheck + Startup Dependency
# docker-compose.yml
  services:
    backend:
      image: myapp:latest
      ports:
        - "3000:3000"
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
+       interval: 5s
+       timeout: 3s
+       retries: 5
+       start_period: 15s

    nginx:
      image: nginx:1.25-alpine
      ports:
        - "80:80"
        - "443:443"
      depends_on:
-       - backend
+       backend:
+         condition: service_healthy
      volumes:
        - ./nginx.conf:/etc/nginx/nginx.conf:ro
+     restart: on-failure:3
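What this does: condition: service_healthy blocks the Nginx container from starting until the backend's healthcheck reports healthy, which happens on the first successful probe of /health. retries: 5 is the number of consecutive failed probes after which Docker marks the container unhealthy. start_period: 15s gives the app a grace window during which failed health checks do not count toward retries, which is critical for apps with slow DB migration on startup. Note that depends_on only orders container startup; it does not hold traffic back when the backend restarts later, so the Nginx retry settings are still needed.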
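The healthcheck is only as good as the endpoint it probes. As a rough illustration, here is a standard-library Python sketch of a /health handler that answers 503 until initialization has finished and 200 afterwards. The ready flag, HealthHandler, and start_server are names invented for this example; a real app would set the flag after migrations and connection pool warm-up.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

ready = threading.Event()  # set once migrations and warm-up have finished


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # 503 while starting makes `curl -f` exit non-zero, so the probe fails.
        status = 200 if ready.is_set() else 503
        body = b"ok\n" if status == 200 else b"starting\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep probe traffic out of stderr.
        pass


def start_server(host: str = "0.0.0.0", port: int = 3000) -> ThreadingHTTPServer:
    """Serve /health in a background thread and return the server object."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The design choice here is to bind the port early and report "not ready" over HTTP, so a probe sees a clear 503 instead of a refused connection while the app is still initializing.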
Enterprise Best Practice — Kubernetes (if applicable)
# deployment.yaml (backend container spec)
  containers:
    - name: backend
      image: myapp:latest
+     readinessProbe:
+       httpGet:
+         path: /health
+         port: 3000
+       initialDelaySeconds: 10
+       periodSeconds: 5
+       failureThreshold: 3
+     livenessProbe:
+       httpGet:
+         path: /health
+         port: 3000
+       initialDelaySeconds: 30
+       periodSeconds: 10
+       failureThreshold: 3
+     startupProbe:
+       httpGet:
+         path: /health
+         port: 3000
+       failureThreshold: 30
+       periodSeconds: 3
Kubernetes will not add a pod to the Service endpoints (and therefore to Nginx's upstream pool, if Nginx routes through the Service) until the readinessProbe passes. The startupProbe gives slow-starting apps up to 90 seconds (30 * 3s) to come up; liveness and readiness checks are held off until it succeeds, which prevents premature kill loops.
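When tuning these values, it can help to compute the startup budget rather than guess it. A small sketch of the arithmetic (the function name startup_budget_seconds is made up for this example):

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Worst-case time a container gets to pass its startupProbe."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Values from the startupProbe above: 30 failures * 3s period.
print(startup_budget_seconds(failure_threshold=30, period_seconds=3))  # 90
```

If your slowest observed startup, including migrations, approaches this number, raise failureThreshold rather than periodSeconds so that a fast start is still detected quickly.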
Prevention in CI/CD
1. Lint Nginx Config in Pipeline
# .github/workflows/nginx-lint.yml
- name: Validate Nginx Config
  run: |
    docker run --rm \
      -v "$(pwd)/nginx:/etc/nginx:ro" \
      nginx:1.25-alpine nginx -t
This catches syntax errors before deploy. It does not catch logical misconfigs like missing proxy_next_upstream.
2. Enforce Upstream Health Directives with OPA/Conftest
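# policy/nginx_upstream.rego
# Assumes the Nginx config has been converted to JSON with top-level
# "upstreams" and "locations" keys before it is passed to conftest.
package nginx

deny[msg] {
    # Bind the server once so both checks refer to the same entry.
    server := input.upstreams[_].servers[s]
    not server.max_fails
    msg := sprintf("Upstream server '%v' missing max_fails directive", [s])
}

deny[msg] {
    location := input.locations[_]
    not location.proxy_next_upstream
    msg := "Location block missing proxy_next_upstream directive"
}
As far as we know, Conftest's built-in parsers do not cover nginx.conf syntax, so convert the config to JSON first (nginx.json below is a placeholder name). Then run conftest test nginx.json --policy policy/ --namespace nginx in your CI pipeline; the --namespace flag is needed because the policy's package is nginx rather than Conftest's default, main.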
3. Integration Test: Simulate Restart Race Condition
#!/bin/bash
# ci/test-restart-resilience.sh
set -u

docker-compose up -d
sleep 5

# Simulate backend restart while sending traffic
docker-compose restart backend &
RESTART_PID=$!

for i in $(seq 1 20); do
    # -w prints only the HTTP status code of the response
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
    if [ "$STATUS" = "502" ]; then
        echo "FAIL: Got 502 during restart at request $i"
        exit 1
    fi
    sleep 0.5
done

# Make sure the restart itself has finished before declaring success
wait "$RESTART_PID"
echo "PASS: No 502s during backend restart window"
Add this to your integration test suite. A 502 during the restart loop means your healthcheck grace period or Nginx retry config is insufficient.
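4. Checkov for Dockerfile HEALTHCHECK Enforcement
pip install checkov
checkov -f Dockerfile --check CKV_DOCKER_2
# CKV_DOCKER_2: Ensure that HEALTHCHECK instructions have been added to container images
Add checkov to your pre-commit hooks and CI pipeline to catch missing healthchecks before they reach staging. CKV_DOCKER_2 is a Dockerfile check, so it verifies the HEALTHCHECK instruction baked into the image rather than the healthcheck: key in docker-compose.yml.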