How to Fix Nginx 504 Gateway Timeout: Debugging 'upstream timed out' with keepalive 32 in Upstream Block
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Nginx is recycling keepalive connections from the pool of 32 faster than backends can respond, or proxy_read_timeout is too short for your upstream's actual response latency. Connections die mid-flight with upstream timed out (110: Connection timed out).
- How to fix it: Tune keepalive, keepalive_requests, keepalive_timeout, and proxy_read_timeout to match real backend SLAs. Ensure proxy_http_version 1.1 and correct Connection header clearing are set. Without these, keepalive is silently disabled.
- Shortcut: Use our Client-Side Sandbox above to paste your upstream block and auto-generate the corrected config without sending your internal hostnames anywhere.
The Incident (What Does the Error Mean?)
Your Nginx error log is printing something like this:
2024/01/15 03:42:17 [error] 31#31: *10423 upstream timed out (110: Connection timed out)
while reading response header from upstream, client: 10.0.1.45,
server: api.internal, request: "POST /api/v2/process HTTP/1.1",
upstream: "http://10.0.2.11:8080/api/v2/process",
host: "api.internal"
Immediate consequence: Nginx sent the request to the upstream worker over a socket from the keepalive pool, waited for a response header, and the upstream never responded within the timeout window. Nginx closes the connection and returns HTTP 504 to the client. With keepalive 32, each worker process caches up to 32 idle connections per upstream block. If upstream workers are saturated or slow, requests sent over those cached connections stall, and Nginx burns the full timeout (plus any retries) before failing.
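To make the log entry easier to triage, the sketch below pulls out the phase in which the timeout fired and maps it to the proxy timeout most likely responsible. It is a minimal illustration, not an Nginx tool: the function name and the phase-to-directive table are assumptions made here, and it expects the entry on a single line, as Nginx writes it.

```python
import re

# Matches the "upstream timed out" entry; assumes the entry is on one line.
TIMEOUT_RE = re.compile(
    r'upstream timed out \((?P<errno>\d+): (?P<reason>[^)]*)\) '
    r'while (?P<phase>[^,]+), '
    r'.*?upstream: "(?P<upstream>[^"]*)"'
)

# Assumed mapping from the logged phase to the directive that usually governs it.
PHASE_TO_DIRECTIVE = {
    "connecting to upstream": "proxy_connect_timeout",
    "sending request to upstream": "proxy_send_timeout",
    "reading response header from upstream": "proxy_read_timeout",
    "reading upstream": "proxy_read_timeout",
}


def parse_upstream_timeout(line):
    """Return errno, reason, phase, upstream and a suspect directive, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["suspect_directive"] = PHASE_TO_DIRECTIVE.get(fields["phase"], "unknown")
    return fields


# The log entry from this section, joined onto one line.
SAMPLE = (
    '2024/01/15 03:42:17 [error] 31#31: *10423 upstream timed out '
    '(110: Connection timed out) while reading response header from upstream, '
    'client: 10.0.1.45, server: api.internal, '
    'request: "POST /api/v2/process HTTP/1.1", '
    'upstream: "http://10.0.2.11:8080/api/v2/process", host: "api.internal"'
)

if __name__ == "__main__":
    print(parse_upstream_timeout(SAMPLE))
```

For the entry above, the phase is "reading response header from upstream", which points at proxy_read_timeout rather than the connect or send timeouts.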
The Attack Vector / Blast Radius
This is a cascading saturation failure, not an isolated timeout. Here's the kill chain:
- Upstream worker pool saturates (GC pause, DB lock, thread starvation: pick your poison). Response times spike from 200ms to 8s.
- Nginx's proxy_read_timeout (default: 60s) hasn't fired yet, so in-flight requests pile up.
- The keepalive pool of 32 connections per worker has no idle sockets left to reuse. The keepalive directive caps only the idle cache, not the total number of connections, so new requests open fresh connections, hammering the upstream further.
- The keepalive_requests default (100 in older Nginx, 1000 in 1.19.10+) closes each socket once it has served that many requests, forcing reconnects mid-burst that the upstream may register as client disconnects.
- If you have multiple Nginx workers (e.g., worker_processes 8), that is up to 8 × 32 = 256 cached connections, plus every uncached one opened under load, piling onto a struggling upstream. The upstream dies. Everything 504s.
The keepalive 32 line itself is not wrong — it's the missing companion directives that make it lethal.
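The worker arithmetic in the last step can be sketched as a quick calculation. The helper below is illustrative only: backend_max_conn is a hypothetical figure for how many connections the upstream can hold, not something Nginx reports, and the result bounds only the idle cache, since keepalive does not cap the total number of connections a worker may open.

```python
def idle_cache_ceiling(worker_processes, keepalive):
    """Maximum idle upstream connections cached across all workers."""
    return worker_processes * keepalive


def cache_fits_backend(worker_processes, keepalive, backend_max_conn):
    """True if the idle cache alone stays below the backend's connection limit.

    backend_max_conn is a hypothetical capacity figure supplied by the operator.
    """
    return idle_cache_ceiling(worker_processes, keepalive) < backend_max_conn


if __name__ == "__main__":
    print(idle_cache_ceiling(8, 32))       # 8 workers x keepalive 32
    print(cache_fits_backend(8, 32, 200))  # hypothetical backend limit of 200
```

With 8 workers and keepalive 32, the idle cache alone can reach 256 connections, which already exceeds a backend that accepts only 200.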
How to Fix It
Root Cause Checklist (Run These First)
# Check if upstream is actually responding slowly
curl -o /dev/null -s -w "time_starttransfer: %{time_starttransfer}\n" http://upstream-host:8080/healthz
# Check Nginx worker connection states
ss -s
netstat -an | grep :8080 | awk '{print $6}' | sort | uniq -c
# Follow upstream timeout errors as they are logged
tail -f /var/log/nginx/error.log | grep 'upstream timed out'
If time_starttransfer exceeds your proxy_read_timeout, that's your primary fix target — the backend is the bottleneck, not Nginx.
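One way to turn repeated time_starttransfer measurements into a timeout value is the rule of thumb this article uses in its enterprise config: p99 backend latency plus a 20% buffer. The sketch below assumes you have collected the samples yourself (for example by looping the curl command); the function names and the nearest-rank percentile method are choices made here for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100) in integer math
    return ordered[max(rank, 1) - 1]


def suggest_read_timeout(samples, buffer=0.20):
    """p99 latency plus a buffer, rounded up to whole seconds."""
    return math.ceil(percentile(samples, 99) * (1 + buffer))


if __name__ == "__main__":
    # Hypothetical measurements in seconds: mostly 200ms, with two slow outliers.
    measured = [0.2] * 98 + [8.0, 9.0]
    print(suggest_read_timeout(measured))
```

If the suggested value is far above what your clients will tolerate, raising proxy_read_timeout only hides the problem; the backend latency itself is what needs fixing.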
Basic Fix
The minimum viable correction: add the required HTTP/1.1 directives that actually activate keepalive, and align timeouts to reality.
  upstream backend_pool {
      server 10.0.2.11:8080;
      server 10.0.2.12:8080;
      keepalive 32;
+     keepalive_requests 1000;
+     keepalive_timeout 65s;
  }

  server {
      location /api/ {
          proxy_pass http://backend_pool;
+         proxy_http_version 1.1;
+         proxy_set_header Connection "";
-         # proxy_read_timeout was absent (defaulting to 60s)
+         proxy_read_timeout 120s;
+         proxy_connect_timeout 10s;
+         proxy_send_timeout 120s;
      }
  }
Critical: Without proxy_http_version 1.1 and proxy_set_header Connection "", Nginx sends Connection: close on every proxied request. The upstream tears down the socket immediately. Your keepalive 32 pool is completely inert — you're paying the TCP handshake cost on every single request while thinking you have connection pooling.
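A rough check for this condition can be scripted. The sketch below is a deliberately crude text scan, not an Nginx parser: it looks at the whole file rather than matching each location to its upstream, it strips comments naively, and the function name is made up for illustration.

```python
import re

KEEPALIVE_RE = re.compile(r'\bkeepalive\s+\d+\s*;')
HTTP11_RE = re.compile(r'\bproxy_http_version\s+1\.1\s*;')
CLEAR_CONNECTION_RE = re.compile(
    r'''\bproxy_set_header\s+Connection\s+(?:""|'')\s*;'''
)


def missing_keepalive_companions(conf_text):
    """List the companion directives absent from a config that uses keepalive."""
    text = re.sub(r'#.*', '', conf_text)  # naive comment strip, line by line
    if not KEEPALIVE_RE.search(text):
        return []  # no upstream keepalive pool, nothing to check
    missing = []
    if not HTTP11_RE.search(text):
        missing.append('proxy_http_version 1.1')
    if not CLEAR_CONNECTION_RE.search(text):
        missing.append('proxy_set_header Connection ""')
    return missing


BROKEN = """
upstream backend_pool {
    server 10.0.2.11:8080;
    keepalive 32;
}
server {
    location /api/ {
        proxy_pass http://backend_pool;
    }
}
"""

if __name__ == "__main__":
    print(missing_keepalive_companions(BROKEN))
```

An empty list means both directives appear somewhere in the file; it does not prove they sit in the location that proxies to the keepalive upstream, so treat a clean result as a hint rather than a guarantee.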
Enterprise Best Practice
For production systems handling variable load with SLA requirements:
  upstream backend_pool {
      zone backend_pool 64k; # Shared memory zone; required for dynamic reconfiguration + health checks (NGINX Plus)
      server 10.0.2.11:8080 weight=1 max_fails=3 fail_timeout=30s;
      server 10.0.2.12:8080 weight=1 max_fails=3 fail_timeout=30s;
+     server 10.0.2.13:8080 backup; # Failover target
-     keepalive 32;
+     keepalive 64; # Scale to (worker_processes * keepalive) < upstream max_conn
+     keepalive_requests 10000; # Prevent premature socket recycling under burst
+     keepalive_timeout 75s; # Must be LESS than upstream idle timeout (e.g., Gunicorn keepalive)
  }

  server {
      location /api/ {
          proxy_pass http://backend_pool;
+         proxy_http_version 1.1;
+         proxy_set_header Connection "";
+         proxy_set_header Host $host;
+         proxy_set_header X-Real-IP $remote_addr;
          # Timeouts tuned to p99 backend latency + 20% buffer
-         # (no explicit timeouts set)
+         proxy_connect_timeout 5s;
+         proxy_send_timeout 60s;
+         proxy_read_timeout 90s;
          # Buffer tuning to prevent upstream blocking on slow clients
+         proxy_buffering on;
+         proxy_buffer_size 16k;
+         proxy_buffers 8 16k;
+         proxy_busy_buffers_size 32k;
          # Retry/failover behavior (POST and other non-idempotent requests are retried only if non_idempotent is added)
+         proxy_next_upstream error timeout http_502 http_503 http_504;
+         proxy_next_upstream_tries 2;
+         proxy_next_upstream_timeout 10s; # Total retry budget: a 90s read timeout exhausts it, so only fast failures are retried
      }
  }
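Key constraint: keepalive_timeout in the upstream block must be set lower than the idle connection timeout on your backend server. If Gunicorn, uWSGI, or Node.js closes idle connections after 60s and your Nginx keepalive_timeout is 75s, Nginx will hand you a dead socket from the pool and the first request after an idle period fails. When the backend answers the stale socket with a reset, that usually surfaces as a 502; when the close is lost silently (for example behind a firewall or load balancer that drops idle flows), the request hangs until proxy_read_timeout and you get exactly this 504. This is one of the most common misconfigurations in this failure class.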
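The constraint can be checked mechanically once both values are known. The sketch below is an illustration under assumptions: backend_idle_seconds is whatever your application server is configured to use (the 60s figure from the paragraph above serves as the example), the 5-second safety margin is an arbitrary choice, and the parser handles only a single value with an ms, s, m, or h suffix rather than Nginx's full time syntax.

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_nginx_time(value):
    """Convert a simple Nginx time such as '75s' or '2m' to seconds.

    Handles one number with an optional ms/s/m/h suffix; a bare number is
    treated as seconds. Combined forms such as '1h 30m' are not supported.
    """
    match = re.fullmatch(r"(\d+)(ms|s|m|h)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "s"]


def keepalive_timeout_is_safe(nginx_keepalive_timeout, backend_idle_seconds, margin=5):
    """True if Nginx drops idle sockets at least `margin` seconds before the backend."""
    return parse_nginx_time(nginx_keepalive_timeout) + margin <= backend_idle_seconds


if __name__ == "__main__":
    print(keepalive_timeout_is_safe("75s", 60))  # Nginx outlives the backend's idle timeout
    print(keepalive_timeout_is_safe("55s", 60))  # Nginx closes first
```

The margin exists because the two timers do not start at the same instant; closing comfortably before the backend does avoids the race where both sides act on the same socket at once.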
Prevention in CI/CD
1. Lint Nginx Configs in Your Pipeline
# gixy — static security and misconfiguration analyzer for Nginx
pip install gixy
gixy /etc/nginx/nginx.conf
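# Catches: SSRF vectors, HTTP splitting, host spoofing, alias traversal
# Note: the upstream gixy plugin set has no keepalive or proxy_http_version check; the policy gate in step 3 covers that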
2. Automated Config Validation Gate
# .github/workflows/nginx-lint.yml
name: nginx-lint
on: [push, pull_request]
jobs:
  nginx-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Nginx syntax
        run: docker run --rm -v "$PWD/nginx:/etc/nginx" nginx nginx -t
      - name: Run gixy analysis
        run: docker run --rm -v "$PWD/nginx:/etc/nginx" yandex/gixy /etc/nginx/nginx.conf
3. OPA/Conftest Policy for Keepalive Hygiene
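# policy/nginx_keepalive.rego
package nginx

# Helper rules: negating an expression that iterates with [_] leaves the
# wildcard unsafe, so each "some location sets X" check lives in its own rule.
has_http_1_1 { input.server.location[_].proxy_http_version == "1.1" }
clears_connection_header { input.server.location[_].proxy_set_header_connection == "" }

deny[msg] {
    input.upstream[_].keepalive
    not has_http_1_1
    msg := "FATAL: keepalive in upstream block requires proxy_http_version 1.1 in location block"
}

deny[msg] {
    input.upstream[_].keepalive
    not clears_connection_header
    msg := "FATAL: keepalive requires proxy_set_header Connection empty string"
}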
4. Monitor the Right Metrics
Set alerts on these before you get paged at 3am:
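# Prometheus Nginx exporter: alert on the upstream 5xx ratio (504s are counted in the 5xx class)
- alert: NginxUpstream504Spike
  # Ratio of 5xx to all upstream responses; a bare rate() > 0.05 would mean 0.05 req/s, not 5%
  expr: |
    sum(rate(nginx_upstream_responses_total{status="5xx"}[5m]))
      / sum(rate(nginx_upstream_responses_total[5m])) > 0.05
  for: 2m
  annotations:
    summary: "Upstream 5xx ratio above 5%: check for 504s, the keepalive pool, and proxy_read_timeout"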