Fix Nginx 502 Bad Gateway: readv() failed (104: Connection reset by peer) KeepAlive Timeout Mismatch

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins


TL;DR

  • What broke: Nginx is reusing a keepalive connection to an upstream (Node, Gunicorn, uWSGI, etc.) that the upstream already silently closed — Nginx reads on a dead socket and gets a TCP RST (errno 104), returning a 502 to the client.
  • How to fix it: Make the upstream server's keepalive timeout longer than the keepalive_timeout in Nginx's upstream {} block, set proxy_http_version 1.1 with an empty Connection header, and size the keepalive pool in the upstream {} block.

The Incident (What Does This Error Mean?)

You're seeing this in your Nginx error log:

2024/01/15 03:42:17 [error] 31#31: *18423 readv() failed (104: Connection reset by peer)
  while reading response header from upstream,
  client: 10.0.1.45, server: api.internal,
  request: "POST /api/v2/process HTTP/1.1",
  upstream: "http://127.0.0.1:8000/api/v2/process",
  host: "api.internal"

What just happened, precisely:

  1. Nginx pulled an idle connection from its upstream keepalive pool.
  2. It wrote the HTTP request headers to that socket.
  3. It called readv() to read the upstream's response.
  4. The upstream (Gunicorn/uWSGI/Node/Spring) had already closed that connection because its own keepalive timeout expired first, so its kernel answered the new request with a TCP RST.
  5. readv() failed with errno 104 (ECONNRESET): connection reset by peer.
  6. Nginx has no valid response to proxy back → 502 Bad Gateway to your client.

Immediate consequence: Every request that lands on a stale pooled connection fails with a 502. Under load, with a large keepalive pool and a short upstream timeout, this can affect 5–30% of requests in a rolling window — enough to trigger SLA breaches and alert storms.
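
To gauge how many requests are hitting stale connections, one option is to count these errors per upstream in the error log. The following is a minimal sketch, not a definitive tool: the function name `count_resets` is hypothetical, and it assumes each log entry sits on a single line, as Nginx writes it (the sample above is wrapped for display).

```python
import re
from collections import Counter
from typing import Iterable

RESET_MARKER = "readv() failed (104: Connection reset by peer)"
# Captures host:port from a field such as upstream: "http://127.0.0.1:8000/path"
UPSTREAM_RE = re.compile(r'upstream: "\w+://([^/"]+)')


def count_resets(lines: Iterable[str]) -> Counter:
    """Count readv() 104 errors per upstream address."""
    counts: Counter = Counter()
    for line in lines:
        if RESET_MARKER not in line:
            continue  # some other error or an info line
        match = UPSTREAM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A file object can be passed directly, for example `count_resets(open("/var/log/nginx/error.log"))`, where the path is an assumption about your setup. Comparing the result with the total request count for the same period gives a rough error share.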


The Attack Vector / Blast Radius

This is a race condition at the TCP layer, not a code bug. The blast radius scales with:

Factor | Why It Makes It Worse
High upstream keepalive pool size | More stale connections sitting in the pool
Low upstream server keepalive timeout (e.g., Gunicorn default: 2s) | Connections go stale faster than Nginx detects
HTTP/1.0 proxying (Nginx default) | Connection: close forces reconnect overhead, masking the real fix
Long keepalive_timeout in the Nginx upstream {} block (default: 60s) | Nginx keeps idle connections the upstream has already closed
Load balancer health checks not using HTTP/1.1 | Stale connections not flushed between checks

Cascading failure scenario: Under sustained traffic, Nginx's keepalive pool fills with stale connections. Every worker that picks a stale connection wastes a request. Upstream gets hammered with new TCP connections to compensate. Connection table exhausts. You go from intermittent 502s to a full outage in under 60 seconds on a high-QPS service.

The core rule you violated (or your framework defaulted to breaking):

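The keepalive_timeout in Nginx's upstream {} block MUST be strictly less than the upstream server's keepalive timeout. Always. No exceptions. (The http-level keepalive_timeout only governs client connections, so it is not the one that matters here.)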
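
As an illustration of the rule, the sketch below flags backends whose idle timeout is not strictly greater than the idle timeout Nginx applies to pooled upstream connections. The `Backend` class, the `stale_risk` function, and the backend names are hypothetical; the 2s and 5s values are the Gunicorn and Node.js defaults mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    idle_timeout_s: float  # how long the app server keeps an idle connection open


def stale_risk(nginx_idle_timeout_s: float, backends: list[Backend]) -> list[str]:
    """Return names of backends that may close an idle connection before Nginx does.

    A backend is safe only if its timeout is strictly greater than the Nginx one.
    """
    return [b.name for b in backends if b.idle_timeout_s <= nginx_idle_timeout_s]


backends = [
    Backend("gunicorn-default", 2),
    Backend("node-default", 5),
    Backend("gunicorn-tuned", 65),
]
print(stale_risk(55, backends))
```

With Nginx at 55s, the two default-configured backends are reported and the tuned one is not. Equal timeouts are treated as unsafe because the rule calls for a strict inequality.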


How to Fix It

Basic Fix — Nginx upstream {} Block

The minimum viable change: force HTTP/1.1 toward upstream, enable keepalive pooling correctly, and set a safe timeout.

 http {

     keepalive_timeout 75s;  # Governs client connections only; no change needed for this fix

     upstream app_backend {
         server 127.0.0.1:8000;
+        keepalive 32;           # Pool size: max idle connections per worker
+        keepalive_timeout 60s;  # Idle limit for pooled connections (nginx 1.15.3+); must be < the upstream server's keepalive timeout
+        keepalive_requests 1000;
     }

     server {
         listen 80;

         location / {
             proxy_pass http://app_backend;
-            # Missing HTTP/1.1 — defaults to 1.0 which sends Connection: close
+            proxy_http_version 1.1;
+            proxy_set_header Connection "";
+            proxy_connect_timeout 10s;
+            proxy_read_timeout 60s;
+            proxy_send_timeout 60s;
+            proxy_next_upstream error timeout http_502;
+            proxy_next_upstream_tries 2;
         }
     }
 }

Why proxy_set_header Connection ""? By default Nginx sends Connection: close to the upstream on every proxied request, so the upstream closes the connection after each response and the keepalive pool is never used. Setting the header to an empty string removes it, which lets HTTP/1.1 connections to the upstream stay open for reuse.


Enterprise Best Practice — Full Hardened Config

For production: add retry logic, tune per upstream type, and set proxy_next_upstream to recover from stale connections transparently.

 upstream app_backend {
     server 10.0.1.10:8000 weight=5 max_fails=3 fail_timeout=30s;
     server 10.0.1.11:8000 weight=5 max_fails=3 fail_timeout=30s;
+    keepalive 64;
+    keepalive_timeout 55s;   # 55s < the 65s configured on the upstream servers below
+    keepalive_requests 2000;
 }

 server {
     location /api/ {
         proxy_pass         http://app_backend;
+        proxy_http_version 1.1;
+        proxy_set_header   Connection "";
+        proxy_set_header   Host $host;
+        proxy_set_header   X-Real-IP $remote_addr;

-        proxy_read_timeout 90s;  # Too long — masks upstream hangs
+        proxy_read_timeout 30s;
+        proxy_connect_timeout 5s;

         # Retry only idempotent requests on connection-level errors. Adding
         # non_idempotent here would also replay POST/LOCK/PATCH requests.
+        proxy_next_upstream error timeout;
+        proxy_next_upstream_tries 2;
+        proxy_next_upstream_timeout 10s;
     }
 }

Upstream server-side fixes (you MUST also do this):

 # Gunicorn (gunicorn.conf.py)
-keepalive = 2       # Default — kills connections after 2 seconds of idle
+keepalive = 65      # Must be > Nginx upstream keepalive_timeout (55s); ignored by the sync worker class

 # uWSGI (uwsgi.ini)
-# http-keepalive not set
+http-keepalive = 1
+http-auto-chunked = 1
+add-header = Connection: Keep-Alive
+http-timeout = 65

 # Node.js (http/https server)
-// Default keepAliveTimeout = 5000 ms
+server.keepAliveTimeout = 65000;  // 65s — must exceed Nginx's 55s
+server.headersTimeout = 66000;    // Must be > keepAliveTimeout
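
To confirm what idle timeout an upstream actually enforces after these changes, a small probe can open one keep-alive connection and time how long the server leaves it open. This is a rough sketch under stated assumptions: the function name `measure_idle_timeout` is hypothetical, the upstream is assumed to speak plain HTTP and answer `GET /`, and the result includes a little network and scheduling delay.

```python
import socket
import time
from typing import Optional


def measure_idle_timeout(host: str, port: int, max_wait: float = 120.0) -> Optional[float]:
    """Return how many seconds the server kept an idle keep-alive connection open.

    Returns None if the connection was still open after max_wait seconds of silence.
    """
    request = (
        "GET / HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: keep-alive\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=max_wait) as sock:
        sock.sendall(request)
        last_data = time.monotonic()
        while True:
            try:
                chunk = sock.recv(65536)
            except socket.timeout:
                return None  # server never closed within max_wait
            except ConnectionResetError:
                # Server aborted with RST instead of FIN; it still closed.
                return time.monotonic() - last_data
            if not chunk:
                # Orderly close (FIN): the idle period is over.
                return time.monotonic() - last_data
            last_data = time.monotonic()  # response bytes are still arriving
```

For example, `measure_idle_timeout("127.0.0.1", 8000)` against a default Gunicorn setup would be expected to return roughly 2 seconds. The value you get should sit comfortably above the keepalive_timeout in the Nginx upstream {} block.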


Prevention in CI/CD

1. Lint Nginx Configs in Your Pipeline

# Use gixy — Nginx security/config static analyzer
pip install gixy
gixy /etc/nginx/nginx.conf

# Also: nginx -t catches syntax errors before deploy
nginx -t -c /etc/nginx/nginx.conf

2. Enforce Keepalive Rules with a Custom OPA Policy

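# opa/nginx_keepalive.rego
# Expects input shaped as {"upstreams": [{"name": ..., "keepalive": ...}],
# "locations": [{"path": ..., "proxy_http_version": ...}]}.
package nginx.keepalive

# Rego v1 syntax: needs OPA 0.59 or later; the import is redundant on OPA 1.0+
import rego.v1

deny contains msg if {
    # Fail if upstream block has no keepalive directive
    upstream := input.upstreams[_]
    not upstream.keepalive
    msg := sprintf("Upstream '%v' missing keepalive pool directive", [upstream.name])
}

deny contains msg if {
    # Fail if proxy_http_version is missing or is not 1.1
    location := input.locations[_]
    not location.proxy_http_version == "1.1"
    msg := sprintf("Location '%v' must set proxy_http_version 1.1", [location.path])
}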
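
The policy above reads `input.upstreams` and `input.locations`, so something has to turn nginx.conf into that JSON shape first. The following is a deliberately naive sketch of one way to do it with regular expressions. The helper names are hypothetical, and it assumes blocks without nested braces and no `#` characters inside quoted strings; a real parser would be more robust.

```python
import json
import re
from typing import Optional

UPSTREAM_RE = re.compile(r"(?<!\w)upstream\s+(\S+)\s*\{([^{}]*)\}")
LOCATION_RE = re.compile(r"(?<!\w)location\s+(?:[=~*^]+\s+)?(\S+)\s*\{([^{}]*)\}")


def directive(body: str, name: str) -> Optional[str]:
    """Return the argument text of the first `name ...;` in a block body, or None."""
    match = re.search(rf"(?<!\w){re.escape(name)}\s+([^;]+);", body)
    return match.group(1).strip() if match else None


def build_policy_input(conf_text: str) -> dict:
    """Build the {"upstreams": [...], "locations": [...]} document the policy reads."""
    text = re.sub(r"#[^\n]*", "", conf_text)  # naive comment stripping
    upstreams = []
    for name, body in UPSTREAM_RE.findall(text):
        entry = {"name": name}
        keepalive = directive(body, "keepalive")
        if keepalive is not None:
            entry["keepalive"] = keepalive
        upstreams.append(entry)
    locations = []
    for path, body in LOCATION_RE.findall(text):
        entry = {"path": path}
        version = directive(body, "proxy_http_version")
        if version is not None:
            entry["proxy_http_version"] = version
        locations.append(entry)
    return {"upstreams": upstreams, "locations": locations}
```

Writing `json.dumps(build_policy_input(conf_text))` to a file gives a document that can be supplied to OPA as input. Missing directives are left out of the JSON on purpose, so the policy sees them as undefined.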

3. Synthetic Monitoring — Catch Stale Pool Errors Before Users Do

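# Prometheus alert: fire before the 502 ratio crosses the SLO
# (the metric name depends on your exporter; adjust it to match)
- alert: NginxUpstream502Spike
  expr: |
    sum(rate(nginx_upstream_responses_total{status="502"}[2m]))
      / sum(rate(nginx_upstream_responses_total[2m])) > 0.01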
  for: 1m
  labels:
    severity: page
  annotations:
    summary: "502 rate from upstream exceeds 1% — check keepalive timeout mismatch"

4. Terraform / Ansible — Codify the Timeout Relationship

# variables.tf: enforce the invariant in infrastructure code
# Referencing another variable inside a validation block needs Terraform 1.9 or later.
variable "upstream_keepalive_timeout" {
  type        = number
  default     = 65
  description = "Upstream server idle timeout in seconds. Must be set higher than nginx_upstream_keepalive_timeout."
}

variable "nginx_upstream_keepalive_timeout" {
  type    = number
  default = 55
  validation {
    condition     = var.nginx_upstream_keepalive_timeout < var.upstream_keepalive_timeout
    error_message = "nginx_upstream_keepalive_timeout must be strictly less than upstream_keepalive_timeout to prevent readv() 104 errors."
  }
}

The invariant to encode everywhere:

upstream_server_keepalive_timeout
  > nginx_upstream_keepalive_timeout (in upstream {} block)

The http-level keepalive_timeout applies to client connections and is not part of this chain.

Burn this into your runbook. Every time you add a new upstream type (Lambda via ALB, gRPC, FastAPI), verify this chain holds.
