
How to Fix Nginx 'connect() failed (111: Connection refused)' After Backend Container Restart

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–15 mins


TL;DR

  • What broke: Nginx tried to proxy to 127.0.0.1:3000 milliseconds after the backend container restarted — the app process was not yet bound to the port, so the kernel returned ECONNREFUSED (111).
  • How to fix it: Add proxy_next_upstream error timeout retry logic in Nginx (it needs a second upstream peer to retry against), enforce a healthcheck + depends_on: condition: service_healthy in docker-compose, and set a startup grace period in your process supervisor.

The Incident (What does the error mean?)

Raw log entry from /var/log/nginx/error.log:

2024/01/15 03:42:17 [error] 29#29: *1482 connect() failed (111: Connection refused)
while connecting to upstream, client: 10.0.1.45, server: api.example.com,
request: "POST /api/v2/orders HTTP/1.1", upstream: "http://127.0.0.1:3000/api/v2/orders",
host: "api.example.com"

errno 111 = ECONNREFUSED. The TCP SYN packet hit the loopback interface, the kernel found no process listening on port 3000, and immediately returned a RST. Nginx had no valid upstream to hand the request to and returned HTTP 502 Bad Gateway to every client.

This is not a network partition. This is a race condition: your orchestrator (Docker, Kubernetes, systemd) marked the container/service as running the instant the entrypoint process forked — before the Node.js/Python/Go application inside completed its initialization, ran DB migrations, and called listen(3000).
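
The gap between "process started" and "port bound" can be closed by polling the port before traffic is routed to it. The following is a minimal sketch, not a drop-in tool: the function name wait_for_port and its defaults are illustrative assumptions, and it only proves that something is listening, not that the app is fully initialized.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.2) -> bool:
    """Return True once a TCP connect to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process has called listen() on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers ConnectionRefusedError (errno 111 on Linux) and connect timeouts.
            time.sleep(interval)
    return False
```

An entrypoint or deploy script could call wait_for_port("127.0.0.1", 3000) and only reload or start Nginx once it returns True. An HTTP check against /health is stricter, since a bound port does not guarantee that migrations have finished.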


The Attack Vector / Blast Radius

This failure mode is deceptively catastrophic in production:

  1. Zero-downtime deploys become zero-uptime deploys. A rolling restart with docker-compose up -d --no-deps backend will cause a 502 window on every single deploy if Nginx has no retry logic.
  2. Thundering herd on recovery. Nginx does not queue requests while the upstream is down; it fails them immediately. Once the backend finally binds, clients and load balancers that were retrying during the 502 window reconnect at roughly the same time. If your app has a slow DB connection pool warm-up, this second wave can OOM-kill the freshly started container.
  3. Health check bypass. If you rely solely on Nginx's passive health checking (max_fails / fail_timeout), the upstream is only marked down after real user traffic has already received 502s. There is no proactive detection.
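  4. Silent data loss. Write requests that hit during the restart window are dropped. Nginx does not retry non-idempotent methods (POST, LOCK, PATCH) by default once a request has been sent upstream, and with a single upstream server there is no other peer to retry against for any method. Unless the client retries, your order, payment, or write operation is silently discarded.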
  5. Cascading alert storm. Monitoring systems fire simultaneously on 5xx rate, upstream latency, and pod restart count — obscuring the single root cause and slowing MTTR.

How to Fix It (The Solution)

Root Cause Checklist

Before applying any fix, confirm which layer is broken:

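  • curl -v http://127.0.0.1:3000/health from inside the Nginx container → if it fails, the app is not up yet or is not reachable at that address. Use the same host and port as your proxy_pass target; on a Compose bridge network that is usually the service name (for example backend:3000), because 127.0.0.1 refers to the Nginx container itself.
  • ss -tlnp | grep 3000 inside the backend container → confirms whether the port is bound.
  • Check docker inspect --format='{{.State.Health.Status}}' <backend_container> → if the command prints no status or fails with a template error (typically because the container has no Health entry), you have no healthcheck defined.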

Basic Fix — Nginx Upstream Retry + Keepalive

# /etc/nginx/conf.d/upstream.conf

  upstream backend_pool {
-     server 127.0.0.1:3000;
+     server 127.0.0.1:3000 max_fails=3 fail_timeout=10s;
+     keepalive 32;
  }

  server {
      listen 80;
      server_name api.example.com;

      location / {
          proxy_pass http://backend_pool;
-
+         proxy_next_upstream error timeout http_502 http_503;
+         proxy_next_upstream_tries 3;
+         proxy_next_upstream_timeout 10s;
+         proxy_connect_timeout 2s;
+         proxy_read_timeout 30s;
+         proxy_http_version 1.1;
+         proxy_set_header Connection "";
      }
  }

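What this does: proxy_next_upstream tells Nginx to pass the request to the next upstream peer on a connection error, timeout, or 502/503 response. A retry needs another peer: with a single server in the upstream block, Nginx ignores max_fails and fail_timeout and has no next server to try, so run a second backend instance (or a backup server) for the retry to take effect. proxy_connect_timeout 2s caps each connect attempt at 2 seconds instead of the 60-second default. keepalive 32 keeps up to 32 idle upstream connections per worker process for reuse, which avoids a TCP handshake on every request; it depends on the proxy_http_version 1.1 and empty Connection header settings in the location block.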

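⚠️ By default, proxy_next_upstream does not pass requests with non-idempotent methods (POST, LOCK, PATCH) to the next server once the request has been sent to an upstream. Idempotent methods such as GET, HEAD, PUT, and DELETE are retried. For POST retries, you must explicitly add non_idempotent, but understand the implications: this can cause duplicate writes if your backend does not deduplicate them.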
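
If you do enable non_idempotent, the backend has to tolerate seeing the same write twice. One common approach is an idempotency key supplied by the client. The sketch below is an assumption-laden illustration: the class name IdempotentHandler, the create_order callback, and the in-memory dictionary are all hypothetical, and a real service would need a shared store with expiry rather than process memory.

```python
import threading
from typing import Callable, Dict


class IdempotentHandler:
    """Run a write at most once per idempotency key and replay the stored result.

    Hypothetical sketch: in-memory only, so it does not survive restarts
    or work across multiple backend instances.
    """

    def __init__(self, create_order: Callable[[dict], dict]) -> None:
        self._create_order = create_order
        self._results: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def handle(self, idempotency_key: str, payload: dict) -> dict:
        # Hold the lock across check and write so two concurrent retries
        # with the same key cannot both execute the write.
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            result = self._create_order(payload)
            self._results[idempotency_key] = result
            return result
```

With this in place, a POST that Nginx replays to a second peer carries the same key and receives the stored response instead of creating a second order, provided both peers share the key store.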


Enterprise Best Practice — Docker Compose Healthcheck + Startup Dependency

# docker-compose.yml

  services:
    backend:
      image: myapp:latest
      ports:
        - "3000:3000"
+     healthcheck:
+       test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
+       interval: 5s
+       timeout: 3s
+       retries: 5
+       start_period: 15s

    nginx:
      image: nginx:1.25-alpine
      ports:
        - "80:80"
        - "443:443"
      depends_on:
-       - backend
+       backend:
+         condition: service_healthy
      volumes:
        - ./nginx.conf:/etc/nginx/nginx.conf:ro
+     restart: on-failure:3

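What this does: condition: service_healthy blocks the Nginx container from starting until the backend's healthcheck reports healthy, which happens on the first successful probe (here, curl -f succeeding against /health). retries: 5 is the number of consecutive failed probes after which the container is marked unhealthy, not a count of required successes. start_period: 15s gives the app a grace window during which failed health checks do not count toward retries, which is critical for apps with slow DB migrations on startup. Two caveats: the test command requires curl to exist in the backend image, and depends_on only orders startup, so it does not protect Nginx when the backend is restarted later on its own.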
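
The healthcheck is only as good as the /health endpoint behind it. The endpoint should fail until initialization is actually complete. Here is a minimal sketch using only the Python standard library; the ready flag, HealthHandler, and make_server are illustrative names, and your framework will have its own way of wiring this up.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set this once migrations, connection pools, etc. are done (assumed startup steps).
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        # 503 while starting makes `curl -f` exit non-zero, so the probe fails.
        status = 200 if ready.is_set() else 503
        body = b"ok\n" if status == 200 else b"starting\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep the sketch quiet; real services should log


def make_server(port: int = 3000) -> ThreadingHTTPServer:
    """Bind the health server; call serve_forever() on the result to run it."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

The application would start the server in a background thread, run its initialization, and then call ready.set(). Until that point the healthcheck fails, and Compose keeps Nginx waiting.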


Enterprise Best Practice — Kubernetes (if applicable)

# deployment.yaml — backend container spec

  containers:
  - name: backend
    image: myapp:latest
+   readinessProbe:
+     httpGet:
+       path: /health
+       port: 3000
+     initialDelaySeconds: 10
+     periodSeconds: 5
+     failureThreshold: 3
+   livenessProbe:
+     httpGet:
+       path: /health
+       port: 3000
+     initialDelaySeconds: 30
+     periodSeconds: 10
+     failureThreshold: 3
+   startupProbe:
+     httpGet:
+       path: /health
+       port: 3000
+     failureThreshold: 30
+     periodSeconds: 3

Kubernetes will not add a pod to the Service endpoints (and therefore Nginx's upstream pool) until the readinessProbe passes. The startupProbe gives slow-starting apps up to 90 seconds (30 * 3s) before liveness checks begin — preventing premature kill loops.
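
The 90-second figure is simple arithmetic on the probe fields. The helper below is a rough sketch for sanity-checking probe settings; the function name is hypothetical, and it treats the budget as initialDelaySeconds plus failureThreshold times periodSeconds, ignoring probe timeouts and scheduling jitter.

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets before a probe gives up on it."""
    return initial_delay_seconds + failure_threshold * period_seconds


# startupProbe from the manifest above: failureThreshold=30, periodSeconds=3
print(startup_budget_seconds(30, 3))  # 90
```

If your slowest observed startup (including migrations) is close to that number, raise failureThreshold rather than removing the probe.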



Prevention in CI/CD

1. Lint Nginx Config in Pipeline

# .github/workflows/nginx-lint.yml
- name: Validate Nginx Config
  run: |
    docker run --rm \
      -v $(pwd)/nginx:/etc/nginx:ro \
      nginx:1.25-alpine nginx -t

This catches syntax errors before deploy. It does not catch logical misconfigs like missing proxy_next_upstream.

2. Enforce Upstream Health Directives with OPA/Conftest

# policy/nginx_upstream.rego
package nginx

deny[msg] {
  # Bind the server once so both lines refer to the same upstream entry
  server := input.upstreams[_].servers[s]
  not server.max_fails
  msg := sprintf("Upstream server '%v' missing max_fails directive", [s])
}

deny[msg] {
  # Same pattern: bind the location, then test the field on that binding
  location := input.locations[_]
  not location.proxy_next_upstream
  msg := "Location block missing proxy_next_upstream directive"
}

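Run with conftest test nginx.json --policy policy/ --namespace nginx in your CI pipeline. The --namespace flag is needed because the policy lives in package nginx rather than Conftest's default main. The rules assume the config has already been converted to JSON with top-level upstreams and locations keys; that shape is an assumption of this policy, so adapt the paths to whatever your nginx.conf-to-JSON converter emits.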

3. Integration Test: Simulate Restart Race Condition

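#!/bin/bash
# ci/test-restart-resilience.sh
set -u

docker-compose up -d
sleep 5  # crude warm-up; tune to your stack's startup time

# Simulate backend restart while sending traffic
docker-compose restart backend &
RESTART_PID=$!

for i in $(seq 1 20); do
  # curl prints 000 when it received no HTTP response at all
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
  if [ "$STATUS" = "502" ]; then
    echo "FAIL: Got 502 during restart at request $i"
    exit 1
  fi
  sleep 0.5
done

# Make sure the restart itself has finished before declaring success
wait "$RESTART_PID"
echo "PASS: No 502s during backend restart window"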

Add this to your integration test suite. A 502 during the restart loop means your healthcheck grace period or Nginx retry config is insufficient.

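4. Checkov for Dockerfile HEALTHCHECK Enforcement

pip install checkov
checkov -f Dockerfile --check CKV_DOCKER_2
# CKV_DOCKER_2: Ensure that HEALTHCHECK instructions have been added to container images

CKV_DOCKER_2 is a Dockerfile check, so point it at the backend's Dockerfile rather than docker-compose.yml; a healthcheck defined only in Compose will not satisfy it. Add checkov to your pre-commit hooks and CI pipeline to catch missing healthchecks before they reach staging.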
