Fix Nginx 502 Bad Gateway: readv() failed (104: Connection reset by peer) KeepAlive Timeout Mismatch
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins
TL;DR
- What broke: Nginx is reusing a keepalive connection to an upstream (Node, Gunicorn, uWSGI, etc.) that the upstream already silently closed — Nginx reads on a dead socket and gets a TCP RST (errno 104), returning a 502 to the client.
- How to fix it: Set keepalive_timeout on the upstream server higher than Nginx's keepalive_timeout, enable proxy_http_version 1.1, and tune the keepalive pool size in the upstream {} block.
- Fastest path: Use our Client-Side Sandbox below to auto-refactor this: paste your nginx.conf and get a corrected diff without sending your config to any external server.
The Incident (What Does This Error Mean?)
You're seeing this in your Nginx error log:
2024/01/15 03:42:17 [error] 31#31: *18423 readv() failed (104: Connection reset by peer)
while reading response header from upstream,
client: 10.0.1.45, server: api.internal,
request: "POST /api/v2/process HTTP/1.1",
upstream: "http://127.0.0.1:8000/api/v2/process",
host: "api.internal"
What just happened, precisely:
- Nginx pulled an idle connection from its upstream keepalive pool.
- It wrote the HTTP request headers to that socket.
- It called readv() to read the upstream's response.
- The upstream (Gunicorn/uWSGI/Node/Spring) had already sent a FIN or RST on that connection because its own keepalive timeout expired first.
- Nginx received TCP error code 104 (ECONNRESET): connection reset by peer.
- Nginx has no valid response to proxy back, so it returns 502 Bad Gateway to your client.
Immediate consequence: Every request that lands on a stale pooled connection fails with a 502. Under load, with a large keepalive pool and a short upstream timeout, this can affect 5–30% of requests in a rolling window — enough to trigger SLA breaches and alert storms.
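To see the failure outside of Nginx, here is a minimal Python sketch of the same race on loopback. It is not how Nginx is implemented; it only assumes a toy upstream (the hypothetical idle_closing_server) that closes a connection after idle_timeout seconds, and a client that reuses the socket after client_wait seconds. Depending on the OS and on timing, the reused socket either raises a connection error or reads an empty response, so the sketch treats both as a stale connection.

```python
import socket
import threading
import time


def idle_closing_server(listener, idle_timeout):
    """Toy upstream: answer requests on one connection, close it once idle too long."""
    conn, _ = listener.accept()
    conn.settimeout(idle_timeout)
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                break  # client closed first
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    except socket.timeout:
        pass  # idle timeout expired: the upstream gives up on the connection
    finally:
        conn.close()


def reuse_after_idle(idle_timeout, client_wait):
    """Send one request, wait, then reuse the same socket like a keepalive pool would.

    Returns (first_response, outcome) where outcome is "ok", "closed", or "reset".
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    threading.Thread(
        target=idle_closing_server, args=(listener, idle_timeout), daemon=True
    ).start()

    client = socket.create_connection(listener.getsockname())
    client.settimeout(2)
    request = b"GET / HTTP/1.1\r\nHost: demo\r\n\r\n"
    try:
        client.sendall(request)
        first = client.recv(4096)
        time.sleep(client_wait)  # the connection sits idle "in the pool"
        try:
            client.sendall(request)
            second = client.recv(4096)
        except ConnectionError:  # covers ECONNRESET and EPIPE
            return first, "reset"
        return first, ("closed" if second == b"" else "ok")
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    # Upstream idles out at 0.2s, client reuses the socket after 0.6s: stale.
    print(reuse_after_idle(idle_timeout=0.2, client_wait=0.6)[1])
    # Upstream idles out at 1.0s, client reuses the socket after 0.1s: fine.
    print(reuse_after_idle(idle_timeout=1.0, client_wait=0.1)[1])
```

The only thing that changes between the two calls is which side's idle timer is shorter, which is the whole point of the timeout rule in this article.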
The Attack Vector / Blast Radius
This is a race condition at the TCP layer, not a code bug. The blast radius scales with:
| Factor | Why It Makes It Worse |
|---|---|
| High upstream keepalive pool size | More stale connections sitting in the pool |
| Low upstream server keepalive timeout (e.g., Gunicorn default: 2s) | Connections go stale faster than Nginx detects |
| HTTP/1.0 proxying (Nginx default) | Connection: close forces reconnect overhead, masking the real fix |
| Long Nginx keepalive_timeout in the upstream {} block (default: 60s) | Nginx holds connections the upstream already killed |
| Load balancer health checks not using HTTP/1.1 | Stale connections not flushed between checks |
Cascading failure scenario: Under sustained traffic, Nginx's keepalive pool fills with stale connections. Every worker that picks a stale connection wastes a request. Upstream gets hammered with new TCP connections to compensate. Connection table exhausts. You go from intermittent 502s to a full outage in under 60 seconds on a high-QPS service.
The core rule you violated (or your framework defaulted to breaking):
Nginx's idle timeout for upstream connections (keepalive_timeout inside the upstream {} block) MUST be strictly less than the upstream server's keepalive timeout. Always. No exceptions.
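The rule is easy to state and easy to forget, so it helps to have it as a function you can call from a deploy script or a unit test. The sketch below is one possible shape, not an established tool: the function name, the argument names, and the 5-second default margin are all assumptions made for illustration.

```python
def check_keepalive_invariant(nginx_upstream_idle_s, server_keepalive_s, margin_s=5.0):
    """Return a list of problems; an empty list means the timeout pair looks safe.

    nginx_upstream_idle_s: how long Nginx keeps an idle upstream connection pooled.
    server_keepalive_s:    how long the upstream server keeps an idle connection open.
    margin_s:              hypothetical safety gap to absorb clock and network jitter.
    """
    problems = []
    if nginx_upstream_idle_s >= server_keepalive_s:
        problems.append(
            f"Nginx holds idle upstream connections for {nginx_upstream_idle_s}s but the "
            f"upstream closes them after {server_keepalive_s}s: stale sockets guaranteed"
        )
    elif server_keepalive_s - nginx_upstream_idle_s < margin_s:
        problems.append(
            f"Only {server_keepalive_s - nginx_upstream_idle_s}s of headroom; "
            f"aim for at least {margin_s}s"
        )
    return problems


if __name__ == "__main__":
    # Nginx pooling for 60s against an upstream that idles out after 2s.
    for problem in check_keepalive_invariant(60, 2):
        print("FAIL:", problem)
    # The 55s / 65s pairing used later in this article.
    print(check_keepalive_invariant(55, 65) or "OK")
```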
How to Fix It
Basic Fix — Nginx upstream {} Block
The minimum viable change: force HTTP/1.1 toward upstream, enable keepalive pooling correctly, and set a safe timeout.
http {
- keepalive_timeout 75s;
+ keepalive_timeout 65s; # Client-facing idle timeout; upstream sockets use the upstream {} value below
upstream app_backend {
server 127.0.0.1:8000;
+ keepalive 32; # Pool size: max idle connections per worker
+ keepalive_timeout 60s; # Drop idle upstream connections after 60s
+ keepalive_requests 1000;
}
server {
listen 80;
location / {
proxy_pass http://app_backend;
- # Missing HTTP/1.1 — defaults to 1.0 which sends Connection: close
+ proxy_http_version 1.1;
+ proxy_set_header Connection "";
+ proxy_connect_timeout 10s;
+ proxy_read_timeout 60s;
+ proxy_send_timeout 60s;
+ proxy_next_upstream error timeout http_502;
+ proxy_next_upstream_tries 2;
}
}
}
Why proxy_set_header Connection ""? By default Nginx sends Connection: close on every proxied request, which tells the upstream to close the socket after it responds and defeats the keepalive pool. Setting the header to an empty string removes it from the proxied request, so the HTTP/1.1 connection stays open and can be reused.
Enterprise Best Practice — Full Hardened Config
For production: add retry logic, tune per upstream type, and set proxy_next_upstream to recover from stale connections transparently.
upstream app_backend {
server 10.0.1.10:8000 weight=5 max_fails=3 fail_timeout=30s;
server 10.0.1.11:8000 weight=5 max_fails=3 fail_timeout=30s;
+ keepalive 64;
+ keepalive_timeout 55s; # 55s < the 65s configured on the upstream servers below
+ keepalive_requests 2000;
}
server {
location /api/ {
proxy_pass http://app_backend;
+ proxy_http_version 1.1;
+ proxy_set_header Connection "";
+ proxy_set_header Host $host;
+ proxy_set_header X-Real-IP $remote_addr;
- proxy_read_timeout 90s; # Too long — masks upstream hangs
+ proxy_read_timeout 30s;
+ proxy_connect_timeout 5s;
# Retry ONLY on connection-level errors. Leaving out non_idempotent keeps Nginx from re-sending POST/PATCH requests that already reached an upstream
+ proxy_next_upstream error timeout;
+ proxy_next_upstream_tries 2;
+ proxy_next_upstream_timeout 10s;
}
}
Upstream server-side fixes (you MUST also do this):
# Gunicorn (gunicorn.conf.py)
-keepalive = 2 # Default — kills connections after 2 seconds of idle
+keepalive = 65 # Must be > Nginx upstream keepalive_timeout (55s)
# uWSGI (uwsgi.ini)
-# http-keepalive not set
+http-keepalive = 1
+http-auto-chunked = 1
+add-header = Connection: Keep-Alive
+http-timeout = 65
# Node.js (http/https server)
-// Default keepAliveTimeout = 5000ms (5s)
+server.keepAliveTimeout = 65000; // 65s — must exceed Nginx's 55s
+server.headersTimeout = 66000; // Must be > keepAliveTimeout
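Because gunicorn.conf.py is plain Python, the Gunicorn timeout can be derived from the Nginx value instead of being hard-coded in two places. The sketch below assumes your deploy pipeline exports the Nginx upstream {} keepalive_timeout in a hypothetical NGINX_UPSTREAM_KEEPALIVE_S environment variable; the 10-second margin, the bind address, and the thread count are illustrative choices too. Note that Gunicorn's default sync worker does not support persistent connections and ignores keepalive, so the sketch picks a threaded worker.

```python
# gunicorn.conf.py (sketch)
import os

# Assumption: the pipeline exports Nginx's upstream {} keepalive_timeout in seconds
# under this hypothetical variable name. 55 matches the hardened config above.
NGINX_UPSTREAM_KEEPALIVE_S = int(os.environ.get("NGINX_UPSTREAM_KEEPALIVE_S", "55"))

# Hypothetical safety margin so Gunicorn always outlives Nginx's idle timer.
SAFETY_MARGIN_S = 10

bind = "127.0.0.1:8000"

# The sync worker ignores keepalive; a threaded or async worker is needed
# for the setting to have any effect.
worker_class = "gthread"
threads = 4

# Seconds Gunicorn waits for the next request on a keep-alive connection.
keepalive = NGINX_UPSTREAM_KEEPALIVE_S + SAFETY_MARGIN_S
```

With the default of 55 this yields keepalive = 65, the same pairing as the diff above, and the two values can no longer drift apart silently.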
Prevention in CI/CD
1. Lint Nginx Configs in Your Pipeline
# Use gixy — Nginx security/config static analyzer
pip install gixy
gixy /etc/nginx/nginx.conf
# Also: nginx -t catches syntax errors before deploy
nginx -t -c /etc/nginx/nginx.conf
2. Enforce Keepalive Rules with a Custom OPA Policy
# opa/nginx_keepalive.rego
package nginx.keepalive
deny[msg] {
# Fail if upstream block has no keepalive directive
upstream := input.upstreams[_]
not upstream.keepalive
msg := sprintf("Upstream '%v' missing keepalive pool directive", [upstream.name])
}
deny[msg] {
# Fail if proxy_http_version is missing (Nginx then defaults to 1.0) or is not 1.1
location := input.locations[_]
not location.proxy_http_version == "1.1"
msg := sprintf("Location '%v' must set proxy_http_version 1.1", [location.path])
}
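The policy above expects a JSON document with upstreams and locations arrays, and something has to produce that from nginx.conf. Here is a rough Python sketch of such an extractor. It is an assumption about how you might feed the policy, not part of OPA or Nginx: it is a naive tokenizer that ignores include files, treats every # as a comment start, and will misread regex locations that contain braces, so treat it as a starting point rather than a real parser.

```python
import json
import re

# Quoted strings, structural characters, or bare words.
TOKEN = re.compile(r'"[^"]*"|\'[^\']*\'|[{};]|[^\s{};]+')

# Directives the policy looks at.
TRACKED = ("keepalive", "proxy_http_version")


def extract_policy_input(conf_text):
    """Build {"upstreams": [...], "locations": [...]} from nginx.conf text."""
    text = re.sub(r"#[^\n]*", "", conf_text)  # naive comment stripping
    result = {"upstreams": [], "locations": []}
    stack = []  # one entry per open block: the dict being filled, or None
    words = []  # tokens of the statement currently being read
    for tok in TOKEN.findall(text):
        if tok == "{":
            block = None
            if len(words) > 1 and words[0] == "upstream":
                block = {"name": words[1]}
                result["upstreams"].append(block)
            elif len(words) > 1 and words[0] == "location":
                block = {"path": words[-1]}  # last word skips modifiers like ~ or =
                result["locations"].append(block)
            stack.append(block)
            words = []
        elif tok == "}":
            if stack:
                stack.pop()
            words = []
        elif tok == ";":
            block = stack[-1] if stack else None
            if block is not None and len(words) >= 2 and words[0] in TRACKED:
                block[words[0]] = words[1]
            words = []
        else:
            words.append(tok)
    return result


if __name__ == "__main__":
    sample = """
    upstream app_backend {
        server 127.0.0.1:8000;
        keepalive 32;
    }
    server {
        location /api/ {
            proxy_pass http://app_backend;
            proxy_http_version 1.1;
        }
    }
    """
    print(json.dumps(extract_policy_input(sample), indent=2))
```

The printed JSON can be written to a file and passed to OPA as the input document. Values are kept as strings, which matches the "1.1" comparison in the policy.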
3. Synthetic Monitoring — Catch Stale Pool Errors Before Users Do
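# Prometheus alert: fire before the 502 ratio crosses the SLO
# (metric and label names depend on your exporter)
- alert: NginxUpstream502Spike
expr: |
sum(rate(nginx_upstream_responses_total{status="502"}[2m]))
/ sum(rate(nginx_upstream_responses_total[2m])) > 0.01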
for: 1m
labels:
severity: page
annotations:
summary: "502 rate from upstream exceeds 1% — check keepalive timeout mismatch"
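A 502 alert tells you something is wrong but not that the keepalive race is the cause. If you do not have an exporter for the error log, a small script can confirm it by counting the reset errors per upstream. The sketch below assumes each log entry is on a single line in the format shown in the incident section; the function name is made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the errno 104 entries and captures the upstream URL from the same line.
RESET_RE = re.compile(r'(?:readv|recv)\(\) failed \(104: .*?upstream: "([^"]+)"')


def count_resets_by_upstream(log_text):
    """Count 'connection reset by peer' errors per upstream host:port."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            counts[urlsplit(match.group(1)).netloc] += 1
    return counts


if __name__ == "__main__":
    sample = (
        '2024/01/15 03:42:17 [error] 31#31: *18423 readv() failed '
        '(104: Connection reset by peer) while reading response header from upstream, '
        'client: 10.0.1.45, server: api.internal, '
        'request: "POST /api/v2/process HTTP/1.1", '
        'upstream: "http://127.0.0.1:8000/api/v2/process", host: "api.internal"\n'
    )
    for upstream, hits in count_resets_by_upstream(sample).most_common():
        print(upstream, hits)
```

If one upstream dominates the counts, check that server's keepalive setting first.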
4. Terraform / Ansible — Codify the Timeout Relationship
# variables.tf — enforce the invariant in infrastructure code
variable "upstream_keepalive_timeout" {
default = 65
description = "Must be set higher than nginx_upstream_keepalive_timeout"
}
variable "nginx_upstream_keepalive_timeout" {
default = 55
# Referencing another variable inside validation requires Terraform 1.9 or later
validation {
condition = var.nginx_upstream_keepalive_timeout < var.upstream_keepalive_timeout
error_message = "nginx_upstream_keepalive_timeout must be strictly less than upstream_keepalive_timeout to prevent readv() 104 errors."
}
}
The invariant to encode everywhere:
upstream_server_keepalive_timeout
> nginx_upstream_keepalive_timeout (in upstream {} block)
The keepalive_timeout in the global http {} block governs client connections only, so it is not part of this chain.
Burn this into your runbook. Every time you add a new upstream type (Lambda via ALB, gRPC, FastAPI), verify this chain holds.