Does `keepalive 32` in the upstream block mean Nginx opens 32 connections to the backend?

No. `keepalive 32` sets the maximum number of *idle* keepalive connections Nginx will cache *per worker process* after a request completes. It does not cap total concurrent connections. With 8 worker processes and `keepalive 32`, you can have up to 256 idle cached connections sitting open to your upstream at any time. Active in-flight connections are unlimited by this directive.
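
The arithmetic above can be checked against a real config. The following is a minimal sketch, not an official tool: the helper name `max_idle_upstream_conns` is hypothetical, and it assumes a single upstream block and a numeric `worker_processes` value (it fails on `auto`).

```shell
# Prints worker_processes * keepalive for an nginx config read from stdin.
# Assumption: one upstream block, numeric worker_processes (not "auto").
max_idle_upstream_conns() {
    awk '
        { gsub(/;/, "") }                         # drop semicolons so fields are bare values
        $1 == "worker_processes" { workers = $2 }
        $1 == "keepalive"        { pool = $2 }    # exact match: skips keepalive_timeout etc.
        END {
            if (workers !~ /^[0-9]+$/ || pool !~ /^[0-9]+$/) exit 1
            print workers * pool
        }
    '
}

# Usage (path is an example):
#   max_idle_upstream_conns < /etc/nginx/nginx.conf
```

The result is only the ceiling on idle cached connections. Active in-flight connections are not counted and are not limited by `keepalive`.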

Why does my 504 only happen after periods of low traffic, not during peak load?

Classic keepalive timeout mismatch. During idle periods, your backend server (Gunicorn, uWSGI, Node, etc.) closes connections that have been idle too long. Nginx may not notice in time and keeps those dead sockets in its keepalive pool. The next request after idle gets handed a closed socket, Nginx attempts to write to it, and the request fails: typically as a 502 if the backend answers with a reset, or as a 504 if the packets are silently dropped (for example by a firewall or NAT that expired the idle flow) and Nginx waits out its timeout. Fix: set `keepalive_timeout` in your upstream block (available since Nginx 1.15.3) to a value strictly lower than your backend's idle connection timeout.

Should I increase `keepalive 32` to a higher number to fix the timeouts?

Not without profiling first. Blindly increasing keepalive pool size can worsen upstream saturation by holding more connections open against an already-struggling backend. The correct sequence is: (1) confirm `proxy_http_version 1.1` and `proxy_set_header Connection ""` are set so keepalive is actually active, (2) align `keepalive_timeout` to be below backend idle timeout, (3) tune `proxy_read_timeout` to match real p99 backend latency. Only increase the keepalive pool number if connection reuse metrics show pool exhaustion under normal load.
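
Step (1) of that sequence can be scripted as a quick pre-check. This is a rough sketch under stated assumptions: `keepalive_prereqs_present` is a hypothetical helper name, and it only greps the whole config text for both directives, so it cannot tell which location block they belong to.

```shell
# Succeeds only if the config on stdin contains both directives that upstream
# keepalive depends on. Assumption: a whole-file text match is good enough;
# this does not verify that every proxied location sets them.
keepalive_prereqs_present() (
    conf=$(cat)
    printf '%s\n' "$conf" |
        grep -Eq '^[[:space:]]*proxy_http_version[[:space:]]+1\.1[[:space:]]*;' &&
    printf '%s\n' "$conf" |
        grep -Eq '^[[:space:]]*proxy_set_header[[:space:]]+Connection[[:space:]]+""[[:space:]]*;'
)

# Usage (path is an example):
#   keepalive_prereqs_present < /etc/nginx/conf.d/api.conf || echo "keepalive is inert"
```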

How to Fix Nginx 504 Gateway Timeout: Debugging 'upstream timed out' with keepalive 32 in Upstream Block

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: either proxy_read_timeout is shorter than your upstream's actual response latency, or Nginx is reusing cached keepalive connections (the pool of 32) that the backend has already closed or can no longer serve in time. Either way the request dies with upstream timed out (110: Connection timed out).
How to fix it: Tune keepalive, keepalive_requests, keepalive_timeout, and proxy_read_timeout to match real backend SLAs. Ensure proxy_http_version 1.1 and correct Connection header clearing are set — without these, keepalive is silently disabled.

The Incident (What Does the Error Mean?)

Your Nginx error log is printing something like this:

2024/01/15 03:42:17 [error] 31#31: *10423 upstream timed out (110: Connection timed out)
while reading response header from upstream, client: 10.0.1.45,
server: api.internal, request: "POST /api/v2/process HTTP/1.1",
upstream: "http://10.0.2.11:8080/api/v2/process",
host: "api.internal"

Immediate consequence: Nginx sent the request to the upstream, waited for a response header, and the upstream did not respond within proxy_read_timeout. Nginx closes the connection and returns HTTP 504 to the client. With keepalive 32, each worker process caches up to 32 idle connections. If upstream workers are saturated or slow, or the backend has already closed some of those cached connections, requests sent over them stall and Nginx waits out the timeout before failing.
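
To see whether the timeouts concentrate on one backend, the log entries can be tallied per upstream address. This is an illustrative sketch: `count_upstream_timeouts` is a hypothetical helper, and it assumes each error-log entry sits on a single line, as Nginx writes it (the sample above is wrapped for display).

```shell
# Reads an nginx error log on stdin and prints "count upstream" pairs for
# every "upstream timed out" entry, busiest upstream first.
count_upstream_timeouts() {
    grep 'upstream timed out' |
        # keep only scheme://host:port from the upstream: "..." field
        sed -n 's|.*upstream: "\([a-z]*://[^/"]*\).*|\1|p' |
        sort | uniq -c | sort -rn
}

# Usage (path is an example):
#   count_upstream_timeouts < /var/log/nginx/error.log
```

If one address dominates the output, look at that backend first. An even spread points to a shared dependency or to the Nginx timeout settings.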

The Attack Vector / Blast Radius

This is a cascading saturation failure, not an isolated timeout. Here's the kill chain:

Upstream worker pool saturates (GC pause, DB lock, thread starvation — pick your poison). Response times spike from 200ms to 8s.
Nginx's proxy_read_timeout (default: 60s) hasn't fired yet, so connections queue.
The idle keepalive pool (up to 32 connections per worker) drains because every cached connection is busy. keepalive does not cap active connections, so new requests open fresh connections and hammer the upstream further.
keepalive_requests (default 100 in older Nginx, 1000 since 1.19.10) closes each connection once it has served that many requests. During a burst this forces extra TCP handshakes against an upstream that is already slow.
If you have multiple Nginx workers (e.g., worker_processes 8), the upstream holds up to 256 idle cached connections plus an unbounded number of active ones. The upstream dies. Everything 504s.

The keepalive 32 line itself is not wrong — it's the missing companion directives that make it lethal.

How to Fix It

Root Cause Checklist (Run These First)

# Check if upstream is actually responding slowly
curl -o /dev/null -s -w "time_starttransfer: %{time_starttransfer}\n" http://upstream-host:8080/healthz

# Check Nginx worker connection states
ss -s
netstat -an | grep :8080 | awk '{print $6}' | sort | uniq -c

# Follow upstream timeout errors as they are logged
tail -f /var/log/nginx/error.log | grep 'upstream timed out'

If time_starttransfer exceeds your proxy_read_timeout, that's your primary fix target — the backend is the bottleneck, not Nginx.
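
That comparison can be automated. The sketch below is an assumption-laden example: `ttfb_exceeds_timeout` is a hypothetical helper that takes the measured time to first byte and the configured proxy_read_timeout, both in seconds, and uses awk because plain shell arithmetic cannot compare decimals.

```shell
# Succeeds when the measured time to first byte ($1, seconds) is greater than
# the configured proxy_read_timeout ($2, seconds).
ttfb_exceeds_timeout() {
    awk -v ttfb="$1" -v limit="$2" 'BEGIN { exit !((ttfb + 0) > (limit + 0)) }'
}

# Usage, reusing the curl probe from the checklist:
#   ttfb=$(curl -o /dev/null -s -w '%{time_starttransfer}' http://upstream-host:8080/healthz)
#   ttfb_exceeds_timeout "$ttfb" 60 && echo "backend is slower than proxy_read_timeout"
```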

Basic Fix

The minimum viable correction: add the required HTTP/1.1 directives that actually activate keepalive, and align timeouts to reality.

upstream backend_pool {
    server 10.0.2.11:8080;
    server 10.0.2.12:8080;
    keepalive 32;
+   keepalive_requests 1000;
+   keepalive_timeout 55s;     # assumes a 60s backend idle timeout; must stay below it
}

server {
    location /api/ {
        proxy_pass http://backend_pool;
+       proxy_http_version 1.1;
+       proxy_set_header Connection "";
-       # proxy_read_timeout was absent (defaulting to 60s)
+       proxy_read_timeout 120s;
+       proxy_connect_timeout 10s;
+       proxy_send_timeout 120s;
    }
}

Critical: Without proxy_http_version 1.1 and proxy_set_header Connection "", Nginx sends Connection: close on every proxied request. The upstream tears down the socket immediately. Your keepalive 32 pool is completely inert — you're paying the TCP handshake cost on every single request while thinking you have connection pooling.
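
One way to check whether pooling is actually happening is to sample the established connections to the upstream port twice and see how many survive. This is a minimal sketch: `upstream_conn_ids` is a hypothetical helper, and it assumes the default `ss -tn` column layout (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port).

```shell
# Reads `ss -tn` output on stdin and prints the local address:port of every
# ESTABLISHED connection whose peer port equals $1, sorted for use with comm.
upstream_conn_ids() {
    awk -v port="$1" '
        $1 == "ESTAB" {
            n = split($5, peer, ":")          # peer port is the last ":" field
            if (peer[n] == port) print $4
        }' | sort
}

# Usage on the Nginx host while traffic is flowing (file paths are examples):
#   ss -tn | upstream_conn_ids 8080 > /tmp/before
#   sleep 10
#   ss -tn | upstream_conn_ids 8080 > /tmp/after
#   comm -12 /tmp/before /tmp/after | wc -l   # connections present in both samples
```

If almost no connection appears in both samples under steady traffic, each request is probably opening its own socket and the keepalive pool is not being used.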

Enterprise Best Practice

For production systems handling variable load with SLA requirements:

upstream backend_pool {
    zone backend_pool 64k;          # Shared memory zone; needed for NGINX Plus dynamic reconfiguration + active health checks
    server 10.0.2.11:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.2.12:8080 weight=1 max_fails=3 fail_timeout=30s;
+   server 10.0.2.13:8080 backup;  # Failover target

-   keepalive 32;
+   keepalive 64;                   # Scale to (worker_processes * keepalive) < upstream max_conn
+   keepalive_requests 10000;       # Prevent premature socket recycling under burst
+   keepalive_timeout 55s;          # Must be LESS than upstream idle timeout (assumes 60s here; e.g., Gunicorn keepalive)
}

server {
    location /api/ {
        proxy_pass          http://backend_pool;
+       proxy_http_version  1.1;
+       proxy_set_header    Connection      "";
+       proxy_set_header    Host            $host;
+       proxy_set_header    X-Real-IP       $remote_addr;

        # Timeouts tuned to p99 backend latency + 20% buffer
-       # (no explicit timeouts set)
+       proxy_connect_timeout   5s;
+       proxy_send_timeout      60s;
+       proxy_read_timeout      90s;

        # Buffer tuning to prevent upstream blocking on slow clients
+       proxy_buffering         on;
+       proxy_buffer_size       16k;
+       proxy_buffers           8 16k;
+       proxy_busy_buffers_size 32k;

        # Retry/failover behavior (POST and other non-idempotent requests are not retried unless non_idempotent is added)
+       proxy_next_upstream     error timeout http_502 http_503 http_504;
+       proxy_next_upstream_tries   2;
+       proxy_next_upstream_timeout 10s;
    }
}
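
The "p99 backend latency + 20% buffer" rule in the config comments can be worked through from measured latencies. This is a rough sketch, not a tuning tool: `suggest_read_timeout` is a hypothetical helper, it assumes one latency in seconds per line on stdin, and it uses the nearest-rank method for the percentile.

```shell
# Reads one latency (seconds) per line on stdin and prints p99 * 1.2,
# rounded up to whole seconds, as a candidate proxy_read_timeout value.
suggest_read_timeout() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((NR * 99 + 99) / 100)      # nearest-rank p99: ceil(0.99 * NR)
            t = v[rank] * 1.2                     # add the 20% buffer
            secs = int(t); if (secs < t) secs++   # round up to whole seconds
            print secs "s"
        }'
}

# Usage (latencies.txt is an example file, e.g. exported from access logs):
#   suggest_read_timeout < latencies.txt
```

With 98 samples at 0.2s and two at 8s, the p99 is 8s and the suggested value is 10s. Treat the result as a starting point and confirm it against the backend's own SLA.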

Key constraint: keepalive_timeout in the upstream block must be set lower than the idle connection timeout on your backend server. If Gunicorn, uWSGI, or Node.js closes idle connections after 60s and your Nginx keepalive_timeout is 75s, Nginx will hand you a dead socket from the pool and you'll get exactly this 504 error on the first request after idle. This is the single most common misconfiguration in this failure class.
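
The constraint can be enforced with a small check before a deploy. This is a sketch under assumptions: `upstream_keepalive_timeout_ok` is a hypothetical helper, the config has a single upstream block, and the timeout is written in plain seconds (for example `55s`), not in units such as `1m`.

```shell
# Succeeds when the keepalive_timeout inside the upstream block of the config
# on stdin is strictly lower than the backend idle timeout given as $1 (seconds).
upstream_keepalive_timeout_ok() (
    backend_idle=$1
    nginx_idle=$(awk '
        $1 == "upstream" { in_upstream = 1 }
        in_upstream && $1 == "keepalive_timeout" {
            value = $2
            gsub(/[s;]/, "", value)            # "55s;" -> "55"
            print value
        }
        in_upstream && /}/ { in_upstream = 0 }
    ')
    [ -n "$nginx_idle" ] && [ "$nginx_idle" -lt "$backend_idle" ]
)

# Usage, assuming the backend closes idle connections after 60s:
#   upstream_keepalive_timeout_ok 60 < /etc/nginx/conf.d/api.conf || echo "timeout mismatch"
```

The check fails when the directive is missing from the upstream block, which is deliberate: an unset value falls back to the Nginx default and may not match the backend.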


Prevention in CI/CD

1. Lint Nginx Configs in Your Pipeline

# gixy — static security and misconfiguration analyzer for Nginx
pip install gixy
gixy /etc/nginx/nginx.conf
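# Catches security misconfigurations such as SSRF vectors, HTTP splitting, and Host header spoofing.
# It does not validate keepalive settings; the policy check in step 3 covers those.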

2. Automated Config Validation Gate

# .github/workflows/nginx-lint.yml
name: nginx-lint
on: [push, pull_request]

jobs:
  nginx-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Nginx syntax
        # Assumes ./nginx holds a complete config tree (nginx.conf, mime.types, includes)
        run: docker run --rm -v "$PWD/nginx:/etc/nginx" nginx nginx -t
      - name: Run gixy analysis
        run: docker run --rm -v "$PWD/nginx:/etc/nginx" yandex/gixy /etc/nginx/nginx.conf

3. OPA/Conftest Policy for Keepalive Hygiene

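# policy/nginx_keepalive.rego
# Assumes the config was first converted to JSON shaped like
# {"upstream": [...], "server": {"location": [...]}}; adjust the paths to your parser's output.
package nginx

import rego.v1

# True when any upstream block declares keepalive
uses_keepalive if {
    input.upstream[_].keepalive
}

# Bind each location to loc first so the negation is checked per location
deny contains msg if {
    uses_keepalive
    some loc in input.server.location
    not loc.proxy_http_version == "1.1"
    msg := "FATAL: keepalive in upstream block requires proxy_http_version 1.1 in every location block"
}

deny contains msg if {
    uses_keepalive
    some loc in input.server.location
    not loc.proxy_set_header_connection == ""
    msg := "FATAL: keepalive requires proxy_set_header Connection empty string"
}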

4. Monitor the Right Metrics

Set alerts on these before you get paged at 3am:

# Prometheus alert on upstream 5xx ratio (metric and label names depend on your exporter)
- alert: NginxUpstream5xxSpike
  expr: |
    sum(rate(nginx_upstream_responses_total{status="5xx"}[5m]))
      / sum(rate(nginx_upstream_responses_total[5m])) > 0.05
  for: 2m
  annotations:
    summary: "Upstream 5xx ratio above 5%: check keepalive pool and proxy_read_timeout"