Fixing Nginx proxy_next_upstream_tries 0: Why Upstream Retries Are Silently Disabled and How to Restore Failover
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5 mins
TL;DR
- What broke: proxy_next_upstream_tries 0 tells Nginx to make zero retry attempts on upstream failure, so one bad backend response kills the request immediately.
- How to fix it: Set proxy_next_upstream_tries to a value ≥ 1 (typically equal to your upstream pool size) and pair it with a correct proxy_next_upstream error condition list.
- Fast path: Use our Client-Side Sandbox below to auto-refactor this: paste your upstream block and get a corrected config without sending your IPs or secrets anywhere.
The Incident (What Does the Error Mean?)
You're seeing requests fail hard with 502 Bad Gateway or 504 Gateway Timeout even though healthy upstream nodes exist in your pool. The log trace looks like this:
2024/01/15 03:42:11 [error] 1234#1234: *9821 connect() failed (111: Connection refused)
while connecting to upstream, client: 10.0.0.5, server: api.internal,
request: "GET /health HTTP/1.1", upstream: "http://10.0.1.22:8080/health",
host: "api.internal"
Nginx hit one upstream, it refused the connection, and stopped there. No retry. No failover. Request dead.
The culprit in your config:
proxy_next_upstream_tries 0;
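The directive semantics are easy to misread. The Nginx reference describes proxy_next_upstream_tries as a limit on the number of tries for passing a request to the next server, and states that the value 0 turns that limit off; 0 is also the documented default. Confirm the behavior on your Nginx version, and note that failover also depends on proxy_next_upstream (the value off disables it) and proxy_next_upstream_timeout. An explicit, bounded value is still the more predictable configuration.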
The Attack Vector / Blast Radius
This is a cascading availability failure, not a security exploit — but the blast radius in production is severe:
- Single upstream pod restart (routine K8s rolling deploy) → Nginx hits the restarting pod → proxy_next_upstream_tries 0 → no retry to the 4 healthy pods → 100% of in-flight requests to that upstream fail.
- Health check pass-through illusion: Your load balancer health checks may still show green because Nginx itself is healthy. The upstream failure is invisible at the LB layer until error rates spike in APM.
- Thundering herd amplification: If this is behind a retry-capable client (gRPC, Axios with retry), clients will immediately retry at the application layer, hammering the already-struggling upstream pool instead of letting Nginx absorb the retry internally.
- Zero-downtime deploy becomes a lie: Any deployment strategy relying on Nginx upstream failover (blue/green, canary) silently breaks with this setting. Canary rollback does nothing — Nginx won't retry to the stable upstream.
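To put rough numbers on the first and third bullets, here is a minimal Python sketch. It assumes requests are spread evenly across the pool, that each attempt fails independently with the same probability, and that the proxy performs no failover of its own. The function names are hypothetical and the model ignores max_fails, timeouts, and backoff.

```python
def failed_fraction(pool_size: int, bad_nodes: int) -> float:
    """Share of requests landing on a bad node under an even spread,
    assuming the proxy does not fail over to another node."""
    if pool_size <= 0 or not 0 <= bad_nodes <= pool_size:
        raise ValueError("need pool_size > 0 and 0 <= bad_nodes <= pool_size")
    return bad_nodes / pool_size


def expected_attempts(fail_prob: float, client_retries: int) -> float:
    """Expected upstream attempts per client request when the client
    retries immediately, up to client_retries times, after each failure."""
    if not 0.0 <= fail_prob <= 1.0 or client_retries < 0:
        raise ValueError("need 0 <= fail_prob <= 1 and client_retries >= 0")
    # Attempt k happens only if the previous k attempts all failed.
    return sum(fail_prob ** k for k in range(client_retries + 1))


if __name__ == "__main__":
    p = failed_fraction(pool_size=5, bad_nodes=1)
    print(f"requests failing at the proxy: {p:.0%}")
    print(f"attempts per request with 3 client retries: {expected_attempts(p, 3):.3f}")
```

With one of five pods down, a fifth of requests fail at the proxy in this model, and three client-side retries push the load on the pool to roughly 1.25 attempts per request.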
How to Fix It (The Solution)
Basic Fix
Set proxy_next_upstream_tries to match your upstream pool size. For a 3-node pool, set it to 3.
upstream backend_pool {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}
server {
    location /api/ {
        proxy_pass http://backend_pool;
-       proxy_next_upstream_tries 0;
+       proxy_next_upstream_tries 3;
+       proxy_next_upstream error timeout http_502 http_503 http_504;
+       proxy_next_upstream_timeout 10s;
    }
}
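Why the try count should track the pool size can be seen in a deliberately simplified model of bounded failover. This is an illustrative sketch, not Nginx's actual algorithm: it ignores load balancing, timeouts, and the proxy_next_upstream condition list, and it only models a limit of 1 or more. The function name pass_request and the health callback are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pass_request(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool],
    tries: int,
) -> Optional[str]:
    """Try at most `tries` servers in order and return the first healthy
    one, or None if every permitted attempt failed."""
    if tries < 1:
        raise ValueError("this sketch only models a positive try limit")
    for server in servers[:tries]:
        if is_healthy(server):
            return server
    return None


if __name__ == "__main__":
    pool = ["10.0.1.10:8080", "10.0.1.11:8080", "10.0.1.12:8080"]
    down = {"10.0.1.10:8080"}  # assumed failure for the demo
    healthy = lambda s: s not in down
    print(pass_request(pool, healthy, tries=1))  # single attempt, no failover
    print(pass_request(pool, healthy, tries=3))  # fails over to the next peer
```

With a limit of 1 the request dies on the first bad peer; with a limit equal to the pool size, every peer gets a chance before the client sees an error.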
Enterprise Best Practice
In high-throughput environments, combine retry limits with per-upstream failure tracking using max_fails and fail_timeout to prevent Nginx from repeatedly routing to a known-bad upstream:
upstream backend_pool {
-   server 10.0.1.10:8080;
-   server 10.0.1.11:8080;
-   server 10.0.1.12:8080;
+   server 10.0.1.10:8080 max_fails=2 fail_timeout=10s;
+   server 10.0.1.11:8080 max_fails=2 fail_timeout=10s;
+   server 10.0.1.12:8080 max_fails=2 fail_timeout=10s;
+   keepalive 32;
}
server {
    location /api/ {
        proxy_pass http://backend_pool;
-       proxy_next_upstream_tries 0;
+       proxy_next_upstream error timeout non_idempotent http_502 http_503 http_504;
+       proxy_next_upstream_tries 3;
+       proxy_next_upstream_timeout 15s;
+       proxy_connect_timeout 3s;
+       proxy_read_timeout 30s;
+       # Upstream keepalive needs HTTP/1.1 and a cleared Connection header
+       proxy_http_version 1.1;
+       proxy_set_header Connection "";
    }
}
Key decisions here:
- non_idempotent is explicitly added: by default Nginx won't retry requests with non-idempotent methods (POST, LOCK, PATCH) once they have been sent to an upstream. Add this only if your upstream handlers are idempotent or you've confirmed safe retry semantics.
- proxy_next_upstream_timeout 15s caps the total time Nginx will spend passing a request to the next server across all tries, preventing retry storms from holding connections open indefinitely.
- max_fails=2 fail_timeout=10s on each server means that after 2 failed attempts within 10 seconds, Nginx marks that peer unavailable for 10 seconds, so subsequent requests skip it immediately.
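The max_fails and fail_timeout interplay described above can be sketched as a small state machine. This is a simplified model under stated assumptions, not Nginx's implementation: the Peer class and its methods are hypothetical, time is passed in explicitly as seconds, and the real server applies additional rules when a peer comes back.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    max_fails: int = 2
    fail_timeout: float = 10.0
    fails: int = 0
    last_fail: float = float("-inf")

    def record_failure(self, now: float) -> None:
        # Failures older than fail_timeout no longer count toward max_fails.
        if now - self.last_fail > self.fail_timeout:
            self.fails = 0
        self.fails += 1
        self.last_fail = now

    def record_success(self) -> None:
        self.fails = 0

    def available(self, now: float) -> bool:
        # Below the threshold the peer stays in rotation; at the threshold
        # it is skipped until fail_timeout has elapsed since the last failure.
        if self.fails < self.max_fails:
            return True
        return now - self.last_fail >= self.fail_timeout


if __name__ == "__main__":
    peer = Peer("10.0.1.10:8080")
    peer.record_failure(now=0.0)
    peer.record_failure(now=1.0)
    print(peer.available(now=2.0))   # inside the 10 s penalty window
    print(peer.available(now=11.0))  # penalty window has elapsed
```

The point of the model is that a peer marked unavailable costs no try at all: it is skipped before a connection is attempted, which keeps the try budget for peers that might actually answer.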
Prevention in CI/CD
This class of misconfiguration is fully preventable at the pipeline layer.
1. nginx -t is not enough. nginx -t validates syntax, not semantics. proxy_next_upstream_tries 0 passes -t cleanly.
2. Use gixy — Nginx static security/config analyzer:
pip install gixy
gixy /etc/nginx/nginx.conf
Write a custom gixy plugin or grep rule for your pipeline:
# Fail CI if proxy_next_upstream_tries is explicitly 0
if grep -rnE 'proxy_next_upstream_tries[[:space:]]+0[[:space:]]*;' /etc/nginx/; then
echo "FATAL: proxy_next_upstream_tries 0 detected" >&2
exit 1
fi
3. OPA/Conftest policy for Nginx configs (if using Helm/Kustomize with Nginx ingress):
package nginx.upstream
deny[msg] {
input.proxy_next_upstream_tries == 0
msg := "proxy_next_upstream_tries must not be 0; upstream retry is disabled."
}
4. Checkov custom check if your Nginx config is managed via Terraform templatefile():
- Write a CKV_CUSTOM_NGINX_001 check that parses the rendered template output and asserts proxy_next_upstream_tries is never explicitly set to 0.
5. Integration test with chaos: In staging, inject a tc netem network fault or kill -9 one upstream pod during load test. Assert via your APM that error rate stays below SLO threshold. If proxy_next_upstream_tries 0 sneaks back in, this test catches it before production.
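The explicit-zero check that steps 2 through 4 describe can also be written as a small Python function, which is easier to unit test than a grep one-liner and could serve as the core of a custom plugin or check. This is a minimal sketch: the function name find_zero_tries is hypothetical, comment stripping is naive (a # inside a quoted string would be treated as a comment), and include directives are not followed.

```python
import re
from typing import List

# Matches the directive with a numeric argument, e.g. "proxy_next_upstream_tries 0;"
_TRIES = re.compile(r"\bproxy_next_upstream_tries\s+(\d+)\s*;")


def find_zero_tries(config_text: str) -> List[int]:
    """Return the 1-based line numbers where proxy_next_upstream_tries
    is explicitly set to 0, ignoring anything after a '#'."""
    hits: List[int] = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive comment stripping
        for match in _TRIES.finditer(code):
            if int(match.group(1)) == 0:
                hits.append(lineno)
    return hits


if __name__ == "__main__":
    import sys
    from pathlib import Path

    failed = False
    for path in sys.argv[1:]:
        for lineno in find_zero_tries(Path(path).read_text()):
            print(f"FATAL: {path}:{lineno}: proxy_next_upstream_tries 0 detected")
            failed = True
    sys.exit(1 if failed else 0)
```

Run it against the rendered config files in the pipeline (for example, the output of a Terraform templatefile() or Helm render) so the check sees the final directive values rather than template placeholders.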