
How to Fix Nginx 'upstream timed out (110: Connection timed out)' While Reading Response Header

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause layer

TL;DR

  • What broke: Nginx waited 60 seconds (the proxy_read_timeout default) for the first byte of an HTTP response header from http://backend:8080 and got nothing. Nginx's own read timer expired, it logged the error with errno 110 (ETIMEDOUT), and your users got a 504 Gateway Timeout.
  • How to fix it: Identify whether the timeout is in Nginx config (too short), the backend app (slow query, thread starvation, GC pause), or the network layer (DNS, container networking, firewall). Then tune proxy_read_timeout, fix the upstream bottleneck, and add upstream health checks.

The Incident (What Does This Error Mean?)

Raw log output:

2024/01/15 03:47:22 [error] 31#31: *18423 upstream timed out (110: Connection timed out)
while reading response header from upstream,
client: 10.0.1.45, server: api.example.com, request: "POST /api/v2/process HTTP/1.1",
upstream: "http://172.17.0.3:8080/api/v2/process", host: "api.example.com"

What this means precisely: Nginx successfully opened a TCP connection to the backend and forwarded the full request. The backend accepted the connection and received the data, but never sent back even the first byte of the HTTP response header within the proxy_read_timeout window (60s by default). This is not a connection refusal. The backend is alive but not responding in time.

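Immediate consequence: Nginx closes the upstream socket, returns HTTP 504 Gateway Timeout to the client, and logs this error. If the upstream group has more than one server, the default proxy_next_upstream setting (error timeout) lets Nginx retry the request on the next server, multiplying backend load during an already-degraded state. Non-idempotent requests such as the POST above are retried only when non_idempotent is added to proxy_next_upstream.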
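
To see which upstream addresses and routes are timing out, the error log can be tallied with a small parser. The following is a minimal sketch, assuming the standard single-line error log format; the function name parse_timeout and the group names are illustrative, and the pattern only targets the header-read phase described above.

```python
import re

# Matches the "upstream timed out ... while reading response header" entry.
# Group names (phase, client, request, upstream) are illustrative choices.
TIMEOUT_RE = re.compile(
    r'upstream timed out \(110: [^)]*\) while (?P<phase>[a-z ]+?) from upstream'
    r'.*?client: (?P<client>[^,]+)'
    r'.*?request: "(?P<request>[^"]*)"'
    r'.*?upstream: "(?P<upstream>[^"]*)"'
)


def parse_timeout(line):
    """Return phase, client, request and upstream for a timeout entry, or None."""
    match = TIMEOUT_RE.search(line)
    return match.groupdict() if match else None


sample = (
    '2024/01/15 03:47:22 [error] 31#31: *18423 upstream timed out '
    '(110: Connection timed out) while reading response header from upstream, '
    'client: 10.0.1.45, server: api.example.com, '
    'request: "POST /api/v2/process HTTP/1.1", '
    'upstream: "http://172.17.0.3:8080/api/v2/process", host: "api.example.com"'
)
print(parse_timeout(sample))
```

Feeding every line of the error log through parse_timeout and counting by the upstream field shows whether one backend instance or one route accounts for most of the timeouts.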


The Attack Vector / Blast Radius

This is a cascading failure scenario, not a one-off timeout. Here is the failure chain:

  1. Backend thread/goroutine pool exhaustion: One slow database query holds a thread. New requests queue. Queue wait exceeds the timeout window. All requests start timing out. Nginx retries multiply backend load by up to proxy_next_upstream_tries attempts per request. You now have a thundering herd.

  2. Connection pool starvation: If your backend uses a DB connection pool (HikariCP, PgBouncer, etc.) and the pool is saturated, every new request blocks waiting for a connection. 60 seconds pass. Nginx gives up and closes the upstream connection. The backend thread is still blocked because it doesn't know Nginx gave up. You now have zombie threads holding DB connections.

  3. Nginx worker connection saturation: Each timed-out request holds its Nginx connections for the full 60s. With worker_connections 1024 and a 60s timeout, a single worker can sustain at most ~17 stalled req/s (1024 / 60), and closer to ~8 req/s once you count that every proxied request uses two connections (client and upstream). Your proxy layer becomes the bottleneck, not just the backend.

  4. Security angle — slowloris-variant against your own stack: A misconfigured proxy_read_timeout that is too long (e.g., 300s) means a slow backend (or a deliberately slow attacker-controlled internal service in a compromised environment) can hold Nginx workers open for 5 minutes each. This is effectively a resource exhaustion vector from within your own cluster.

Blast radius: Full service degradation within 2–3 minutes of initial backend slowdown. All downstream clients see 504s. Upstream retries amplify backend load. Recovery requires backend restart AND Nginx connection drain.
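
The capacity and retry figures in the failure chain follow from simple arithmetic. The sketch below is a rough illustration, not a sizing tool: it assumes a single worker process, treats every request as stalling for the full timeout, and uses hypothetical function names. The conns_per_request default of 2 reflects that a proxied request holds a client connection and an upstream connection, both counted against worker_connections.

```python
def max_stalled_rps(worker_connections, timeout_s, conns_per_request=2):
    """Little's law: sustainable arrival rate = concurrent slots / hold time."""
    return worker_connections / conns_per_request / timeout_s


def backend_attempts(client_rps, tries):
    """Worst-case upstream attempts per second if every request uses all tries."""
    return client_rps * tries


# Counting one connection per request gives the optimistic bound.
print(round(max_stalled_rps(1024, 60, conns_per_request=1), 1))  # 17.1
# Counting the client and upstream connections halves it.
print(round(max_stalled_rps(1024, 60), 1))  # 8.5
# 100 client req/s with proxy_next_upstream_tries 2 can become 200 attempts/s.
print(backend_attempts(100, 2))  # 200
```

Shortening the timeout raises the ceiling proportionally, which is one reason strict per-route timeouts help during an incident.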


How to Fix It (The Solution)

Step 0: Confirm Which Layer Is Timing Out

Before touching config, run this from inside the Nginx container/pod:

# Time an HTTP request to the backend health endpoint (includes TCP connect)
curl -v --max-time 5 http://backend:8080/healthz

# If curl also hangs, the problem is the backend, not Nginx timeout config
# If curl succeeds fast, the problem is specific to your slow endpoint

# Check backend thread/connection state
curl http://backend:8080/actuator/metrics/jvm.threads.live  # Spring Boot
curl 'http://backend:8080/debug/pprof/goroutine?debug=1'     # Go (URL quoted so the shell does not expand "?")
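
If curl is not available in the container, the same layer check can be done with a few lines of Python. This is a minimal sketch, assuming plain HTTP without TLS; the probe function and its defaults are hypothetical. It separates the TCP connect time from the wait for the first response byte, which is exactly the phase that proxy_read_timeout guards.

```python
import socket
import time


def probe(host, port, path="/healthz", timeout=5.0):
    """Return (connect_seconds, first_byte_seconds or None) for an HTTP GET.

    first_byte_seconds is None when the server accepted the connection but
    sent nothing within `timeout`, the same symptom Nginx reports.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        connect_s = time.monotonic() - start
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        sent = time.monotonic()
        try:
            data = sock.recv(1)
        except socket.timeout:
            return connect_s, None
        if not data:
            raise ConnectionError("upstream closed the connection without responding")
        return connect_s, time.monotonic() - sent
```

Calling probe("backend", 8080) from inside the Nginx container gives both numbers. A fast connect with a slow or missing first byte points at the backend application; a slow connect points at DNS or the network layer.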

Basic Fix — Tune Nginx Timeout Values

Only do this if your backend legitimately needs more time (e.g., long-running export jobs). This is a band-aid, not a fix.

 http {
   upstream backend_pool {
     server backend:8080;
+    keepalive 32;
+    keepalive_requests 1000;
+    keepalive_timeout 75s;
   }

   server {
     location /api/ {
       proxy_pass http://backend_pool;

-      # Default or previously missing — Nginx falls back to 60s
+      proxy_connect_timeout  10s;
+      proxy_send_timeout     60s;
+      proxy_read_timeout    120s;  # Increase ONLY if backend genuinely needs it

+      proxy_http_version 1.1;
+      proxy_set_header Connection "";
     }
   }
 }

Enterprise Best Practice — Upstream Health Checks, Circuit Breaking, and Per-Route Timeouts

 http {
   upstream backend_pool {
     zone backend_zone 64k;
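-    server backend:8080;
-    # No health checks: dead backends stay in rotation
+    # Passive health check: mark a server unavailable for 30s after 3 failed attempts
+    # (ignored when the group contains only one server; active health_check needs NGINX Plus)
+    server backend:8080 max_fails=3 fail_timeout=30s;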
+    keepalive 32;
+    keepalive_requests 1000;
+    keepalive_timeout 75s;
   }

+  # Separate upstream for long-running async endpoints
+  upstream backend_async_pool {
+    zone backend_async_zone 64k;
+    server backend:8080;
+    keepalive 8;
+  }

   server {
     # Fast API endpoints — strict timeout
     location /api/v2/ {
       proxy_pass         http://backend_pool;
       proxy_http_version 1.1;
       proxy_set_header   Connection "";
       proxy_set_header   Host $host;
       proxy_set_header   X-Real-IP $remote_addr;

-      proxy_read_timeout 60s;
+      proxy_connect_timeout  5s;
+      proxy_send_timeout    30s;
+      proxy_read_timeout    30s;

+      # Circuit-break: retry on connection errors and 502/503, but not on timeout,
+      # so a slow backend is not hit a second time (avoids thundering herd)
+      proxy_next_upstream error http_502 http_503;
+      proxy_next_upstream_tries 2;
+      proxy_next_upstream_timeout 10s;

+      # Serve a clean JSON error body for 504s (requires a matching /errors/ location)
+      proxy_intercept_errors on;
+      error_page 504 /errors/504.json;
     }

+    # Long-running jobs get their own location with relaxed timeout
+    location /api/v2/export {
+      proxy_pass         http://backend_async_pool;
+      proxy_http_version 1.1;
+      proxy_set_header   Connection "";
+      proxy_connect_timeout   5s;
+      proxy_send_timeout     30s;
+      proxy_read_timeout    300s;  # Explicit, documented, scoped
+      proxy_next_upstream    off;  # No retry for idempotency-unsafe ops
+    }
   }
 }

For Kubernetes (nginx-ingress annotations):

 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
   name: backend-ingress
   annotations:
     nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
-    # Missing — inherits 60s default
+    nginx.ingress.kubernetes.io/proxy-send-timeout: "30"
+    nginx.ingress.kubernetes.io/proxy-read-timeout: "30"
+    nginx.ingress.kubernetes.io/proxy-next-upstream: "error"
+    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "2"
+    # Upstream keepalive is not a per-Ingress annotation. Set upstream-keepalive-connections
+    # and upstream-keepalive-timeout in the ingress-nginx controller ConfigMap instead.

Fix the backend — this is the real fix:

 # Spring Boot application.properties
-# No connection pool config — using HikariCP defaults
+spring.datasource.hikari.maximum-pool-size=20
+spring.datasource.hikari.connection-timeout=3000
+spring.datasource.hikari.idle-timeout=300000
+spring.datasource.hikari.max-lifetime=1200000
+# Log a warning when a connection is held out of the pool for more than 30s (leak detection)
+spring.datasource.hikari.leak-detection-threshold=30000
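
Pool size only helps if connections are returned quickly. As a rough sketch (function names are hypothetical, and the model ignores queueing variance), the throughput a pool can sustain is its size divided by the mean time each connection is held:

```python
def pool_throughput(pool_size, mean_hold_s):
    """Approximate max requests/sec when each request holds one connection."""
    return pool_size / mean_hold_s


def pool_saturated(arrival_rps, pool_size, mean_hold_s):
    """True when requests arrive faster than the pool can release connections."""
    return arrival_rps > pool_throughput(pool_size, mean_hold_s)


# 20 connections held for 50 ms each: roughly 400 requests/s.
print(pool_throughput(20, 0.05))
# The same pool with a 5 s slow query: roughly 4 requests/s.
print(pool_throughput(20, 5.0))
# 50 requests/s against that slow query saturates the pool.
print(pool_saturated(50, 20, 5.0))
```

With connection-timeout set to 3000 ms, a request that cannot get a connection fails after 3 seconds instead of blocking until Nginx gives up, which keeps backend threads from piling up behind a saturated pool.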



Prevention in CI/CD

1. Lint Nginx Config in Pipeline

# In your Dockerfile or CI step: catch syntax errors and invalid directives at build time
nginx -t -c /etc/nginx/nginx.conf

# Use gixy (Nginx static analyzer) for security and config issues
pip install gixy
gixy /etc/nginx/nginx.conf

2. OPA/Conftest Policy — Enforce Timeout Directives

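# policy/nginx_timeouts.rego
# Assumes the Nginx config has already been converted to JSON with this shape.
package nginx

deny[msg] {
  loc := input.http.server.location[_]
  not loc.proxy_read_timeout  # a missing key is undefined in Rego, not null
  msg := "POLICY VIOLATION: proxy_read_timeout must be explicitly set on all proxy locations instead of relying on the 60s default."
}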

deny[msg] {
  timeout := input.http.server.location[_].proxy_read_timeout
  to_number(trim(timeout, "s")) > 120
  msg := sprintf("POLICY VIOLATION: proxy_read_timeout of %v exceeds 120s maximum. Use a dedicated async upstream for long-running operations.", [timeout])
}
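
The intent of the two rules can be checked outside OPA as well, for example in a unit test. The following is a minimal sketch under loud assumptions: the input mirrors the input.http.server.location shape used by the policy, the "path" key and the timeout_violations name are hypothetical, and timeout values are expected in seconds with an optional "s" suffix (values such as "2m" are not handled).

```python
def timeout_violations(config, max_seconds=120):
    """Yield one message per location that breaks the timeout policy."""
    for location in config["http"]["server"]["location"]:
        value = location.get("proxy_read_timeout")
        path = location.get("path", "<unknown>")
        if value is None:
            yield f"{path}: proxy_read_timeout must be set explicitly"
        elif float(value.rstrip("s")) > max_seconds:
            yield f"{path}: proxy_read_timeout {value} exceeds {max_seconds}s"


# Hypothetical parsed config: one compliant, one too long, one missing.
sample_config = {
    "http": {
        "server": {
            "location": [
                {"path": "/api/v2/", "proxy_read_timeout": "30s"},
                {"path": "/api/v2/export", "proxy_read_timeout": "300s"},
                {"path": "/legacy/"},
            ]
        }
    }
}

for message in timeout_violations(sample_config):
    print(message)
```

Note that a 300s export location like the one shown earlier would trip the 120s rule, so a real policy needs either an allowlist for such routes or a higher ceiling.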

3. Load Test Timeout Boundaries in Staging

// k6 script: validate that 504s stay below the expected threshold
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    timeout_boundary: {
      executor: 'constant-vus',
      vus: 50,
      duration: '120s',
    },
  },
  thresholds: {
    // Fail the pipeline if more than 1% of requests return 504
    // (the 'checks' rate reflects the 'not a 504' check in the default function)
    checks: ['rate>0.99'],
    'http_req_duration{status:200}': ['p(99)<25000'],
  },
};

export default function () {
  let res = http.post('https://staging.api.example.com/api/v2/process',
    JSON.stringify({ payload: 'test' }),
    { headers: { 'Content-Type': 'application/json' }, timeout: '35s' }
  );
  check(res, { 'not a 504': (r) => r.status !== 504 });
}

4. Alerting — Don't Wait for Users to Report It

# Prometheus alerting rule
groups:
  - name: nginx_upstream_timeouts
    rules:
      - alert: NginxUpstreamTimeoutRateHigh
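        # Metric name depends on your exporter; adjust it to what yours exposes.
        # sum() on both sides is required, otherwise label matching compares 504s only with 504s.
        expr: |
          sum(rate(nginx_upstream_response_time_seconds_count{status="504"}[5m]))
            /
          sum(rate(nginx_upstream_response_time_seconds_count[5m])) > 0.05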
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx upstream timeout rate exceeds 5% — backend degraded"
          runbook: "https://wiki.internal/runbooks/nginx-504"
