Does proxy_next_upstream_tries 0 mean unlimited retries in Nginx?

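Per the Nginx documentation, in effect yes: the 0 value turns off the limit on the number of tries, and 0 is also the directive's default. Reading 0 as "zero retries" is a common misreading. With the limit off, tries are in practice still bounded by the servers available in the upstream group and by proxy_next_upstream_timeout. If you want a fixed retry budget, set an explicit positive integer, typically your upstream pool size.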

Will proxy_next_upstream_tries retry POST and PUT requests by default?

Not for every method. PUT is idempotent, so it is handled like GET and HEAD. Since Nginx 1.9.13, a request with a non-idempotent method (POST, LOCK, PATCH) is not passed to the next server once it has been sent to an upstream, unless you explicitly add 'non_idempotent' to the proxy_next_upstream directive. Only do this if your backend handlers are safe to call multiple times for the same request.

What is the difference between proxy_next_upstream_tries and max_fails in Nginx upstream blocks?

They operate at different layers. proxy_next_upstream_tries controls how many upstream servers Nginx will attempt for a single incoming request before returning an error to the client. max_fails controls how many failed attempts against a specific upstream peer, counted within the fail_timeout window, cause Nginx to mark that peer as temporarily unavailable for subsequent requests; the peer then stays unavailable for the fail_timeout duration. You need both configured correctly for robust failover.

Fixing Nginx proxy_next_upstream_tries 0: Why Upstream Retries Are Silently Disabled and How to Restore Failover

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5 mins

TL;DR

What broke: requests die on the first upstream error with no failover, and the config carries proxy_next_upstream_tries 0. Note that the Nginx documentation defines 0 as turning off the limit on tries (it is also the default), so confirm what is actually suppressing retries before changing it.
How to fix it: Set proxy_next_upstream_tries to an explicit value ≥ 1 (typically equal to your upstream pool size) and pair it with a correct proxy_next_upstream error condition list.

The Incident (What Does the Error Mean?)

You're seeing requests fail hard with 502 Bad Gateway or 504 Gateway Timeout even though healthy upstream nodes exist in your pool. The log trace looks like this:

2024/01/15 03:42:11 [error] 1234#1234: *9821 connect() failed (111: Connection refused)
  while connecting to upstream, client: 10.0.0.5, server: api.internal,
  request: "GET /health HTTP/1.1", upstream: "http://10.0.1.22:8080/health",
  host: "api.internal"

Nginx hit one upstream, it refused the connection, and stopped there. No retry. No failover. Request dead.

The culprit in your config:

proxy_next_upstream_tries 0;

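Check this line against the Nginx documentation before treating it as the root cause. The docs state that the 0 value turns off the limit on the number of tries, and 0 is also the default, so reading it as "zero tries permitted" is a misread of the directive semantics. If failover is not happening, also check that proxy_next_upstream is not set to off, that the failing condition is in its list, and that the pool has more than one available server. An explicit positive value is still worth setting, because it makes the retry budget visible in the config.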

The Attack Vector / Blast Radius

This is a cascading availability failure, not a security exploit — but the blast radius in production is severe:

Single upstream pod restart (routine K8s rolling deploy) → Nginx hits the restarting pod → proxy_next_upstream_tries 0 → no retry to the 4 healthy pods → 100% of in-flight requests to that upstream fail.
Health check pass-through illusion: Your load balancer health checks may still show green because Nginx itself is healthy. The upstream failure is invisible at the LB layer until error rates spike in APM.
Thundering herd amplification: If this is behind a retry-capable client (gRPC, Axios with retry), clients will immediately retry at the application layer, hammering the already-struggling upstream pool instead of letting Nginx absorb the retry internally.
Zero-downtime deploy becomes a lie: Any deployment strategy relying on Nginx upstream failover (blue/green, canary) silently breaks with this setting. Canary rollback does nothing — Nginx won't retry to the stable upstream.

How to Fix It (The Solution)

Basic Fix

Set proxy_next_upstream_tries to match your upstream pool size. For a 3-node pool, set it to 3.

 upstream backend_pool {
     server 10.0.1.10:8080;
     server 10.0.1.11:8080;
     server 10.0.1.12:8080;
 }

 server {
     location /api/ {
         proxy_pass http://backend_pool;
-        proxy_next_upstream_tries 0;
+        proxy_next_upstream_tries 3;
+        proxy_next_upstream error timeout http_502 http_503 http_504;
+        proxy_next_upstream_timeout 10s;
     }
 }
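
As a rough aid for picking that number, the pool size can be read from the config itself. The sketch below is a hypothetical helper (count_upstream_servers and SAMPLE are illustrative names, not part of Nginx or any tool mentioned here). It assumes a flat upstream block with no nested braces and counts every server line, including any marked backup or down.

```python
import re


def count_upstream_servers(config_text: str, upstream_name: str) -> int:
    """Count `server` entries inside the named upstream block.

    Assumption: the block contains no nested braces.
    """
    block = re.search(
        r"upstream\s+" + re.escape(upstream_name) + r"\s*\{([^}]*)\}",
        config_text,
    )
    if block is None:
        raise ValueError(f"upstream {upstream_name!r} not found")
    # Each peer is a line of the form `server <address> [params];`.
    # Commented-out lines start with '#', so they do not match.
    return len(
        re.findall(r"^\s*server\s+[^;]+;", block.group(1), flags=re.MULTILINE)
    )


SAMPLE = """
upstream backend_pool {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
    server 10.0.1.12:8080;
}
"""

if __name__ == "__main__":
    pool_size = count_upstream_servers(SAMPLE, "backend_pool")
    # Suggest a tries value equal to the pool size.
    print(f"proxy_next_upstream_tries {pool_size};")
```

Treat the result as a starting point: a pool with backup servers or uneven weights may call for a smaller retry budget than the raw server count.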

Enterprise Best Practice

In high-throughput environments, combine retry limits with per-upstream failure tracking using max_fails and fail_timeout to prevent Nginx from repeatedly routing to a known-bad upstream:

 upstream backend_pool {
-    server 10.0.1.10:8080;
-    server 10.0.1.11:8080;
-    server 10.0.1.12:8080;
+    server 10.0.1.10:8080 max_fails=2 fail_timeout=10s;
+    server 10.0.1.11:8080 max_fails=2 fail_timeout=10s;
+    server 10.0.1.12:8080 max_fails=2 fail_timeout=10s;
+    keepalive 32;
 }

 server {
     location /api/ {
         proxy_pass          http://backend_pool;
-        proxy_next_upstream_tries 0;
+        proxy_next_upstream         error timeout non_idempotent http_502 http_503 http_504;
+        proxy_next_upstream_tries   3;
+        proxy_next_upstream_timeout 15s;
+        proxy_connect_timeout       3s;
+        proxy_read_timeout          30s;
+        proxy_http_version          1.1;
+        proxy_set_header            Connection "";
     }
 }

Key decisions here:

non_idempotent is explicitly added. Without it, Nginx will not pass a POST, LOCK, or PATCH request to the next server once the request has been sent to an upstream. Add this only if your upstream handlers are idempotent or you've confirmed safe retry semantics.
proxy_next_upstream_timeout 15s limits the time window during which Nginx may pass a request to the next server, so retries cannot keep a request open indefinitely.
max_fails=2 fail_timeout=10s on each server means that after 2 failed attempts within a 10-second window, Nginx marks that peer unavailable for 10 seconds, so the load balancer skips it on subsequent requests and no tries are spent on it.
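
To make the interaction between the per-request try budget and per-peer failure tracking concrete, here is a deliberately simplified toy model. It is not Nginx's implementation: Peer and pick_peers are hypothetical names, time is passed in as plain seconds, and load-balancing order is reduced to list order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Peer:
    address: str
    max_fails: int = 2
    fail_timeout: float = 10.0
    fails: int = 0
    last_fail: float = float("-inf")

    def record_failure(self, now: float) -> None:
        # Failures older than fail_timeout no longer count toward max_fails.
        if now - self.last_fail >= self.fail_timeout:
            self.fails = 0
        self.fails += 1
        self.last_fail = now

    def available(self, now: float) -> bool:
        if self.fails < self.max_fails:
            return True
        # Unavailable until fail_timeout has elapsed since the last failure.
        return now - self.last_fail >= self.fail_timeout


def pick_peers(peers: List[Peer], now: float, tries: int) -> List[Peer]:
    """Peers one request may attempt, in order, capped by a positive tries."""
    if tries < 1:
        raise ValueError("this toy model expects an explicit positive tries value")
    candidates = [p for p in peers if p.available(now)]
    return candidates[:tries]


if __name__ == "__main__":
    pool = [Peer("10.0.1.10:8080"), Peer("10.0.1.11:8080"), Peer("10.0.1.12:8080")]
    pool[0].record_failure(0.0)
    pool[0].record_failure(1.0)  # second failure inside the window
    print([p.address for p in pick_peers(pool, now=2.0, tries=3)])
```

In this model the first peer drops out of the candidate list after its second failure, so the request's tries are spent only on the two remaining peers until the 10-second window passes.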

Prevention in CI/CD

This class of misconfiguration is fully preventable at the pipeline layer.

1. nginx -t is not enough. nginx -t validates syntax, not semantics. proxy_next_upstream_tries 0 passes -t cleanly.

2. Use gixy — Nginx static security/config analyzer:

pip install gixy
gixy /etc/nginx/nginx.conf

Write a custom gixy plugin or grep rule for your pipeline:

# Fail CI if proxy_next_upstream_tries is explicitly 0
if grep -rnE 'proxy_next_upstream_tries[[:space:]]+0[[:space:]]*;' /etc/nginx/; then
  echo "FATAL: proxy_next_upstream_tries 0 detected" >&2
  exit 1
fi

3. OPA/Conftest policy for Nginx configs (if using Helm/Kustomize with Nginx ingress):

package nginx.upstream

deny[msg] {
  input.proxy_next_upstream_tries == 0
  msg := "proxy_next_upstream_tries must not be 0; upstream retry is disabled."
}

4. Checkov custom check if your Nginx config is managed via Terraform templatefile():

Write a CKV_CUSTOM_NGINX_001 check that parses the rendered template output and asserts proxy_next_upstream_tries is set explicitly to a value ≥ 1. The documented default is 0, not 1, so decide in the policy whether an absent directive should pass.
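
The core of such a check can be sketched independently of Checkov. The function below is a hypothetical illustration (find_tries_violations and its allow_absent parameter are invented for this example and do not use Checkov's API). It assumes the rendered config is available as a string, and it leaves the treatment of an absent directive to the caller.

```python
import re
from typing import List

# Matches an uncommented `proxy_next_upstream_tries <number>;` line.
DIRECTIVE = re.compile(r"^\s*proxy_next_upstream_tries\s+(\d+)\s*;", re.MULTILINE)


def find_tries_violations(rendered_config: str, allow_absent: bool = True) -> List[str]:
    """Return one message per policy violation found in the rendered config."""
    values = [int(v) for v in DIRECTIVE.findall(rendered_config)]
    if not values:
        if allow_absent:
            return []
        return ["proxy_next_upstream_tries is not set explicitly"]
    # Flag every explicit value below 1.
    return [f"proxy_next_upstream_tries {v} is below 1" for v in values if v < 1]


if __name__ == "__main__":
    rendered = "location /api/ {\n    proxy_next_upstream_tries 0;\n}\n"
    for message in find_tries_violations(rendered):
        print(message)
```

Wiring this into a real Checkov custom check would still require following Checkov's own extension mechanism, which is not shown here.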

5. Integration test with chaos: In staging, inject a tc netem network fault or kill -9 one upstream pod during load test. Assert via your APM that error rate stays below SLO threshold. If proxy_next_upstream_tries 0 sneaks back in, this test catches it before production.
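
The assertion step of that test can be as small as the sketch below. It assumes the load test records one HTTP status code per request and counts any 5xx as an error. The function names and the 1% threshold are placeholders, not values taken from any particular APM or SLO.

```python
from typing import Iterable


def error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of responses with a 5xx status."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no responses recorded")
    errors = sum(1 for code in codes if code >= 500)
    return errors / len(codes)


def assert_within_slo(status_codes: Iterable[int], max_error_rate: float = 0.01) -> float:
    """Raise AssertionError if the observed error rate exceeds the threshold."""
    rate = error_rate(status_codes)
    if rate > max_error_rate:
        raise AssertionError(
            f"error rate {rate:.2%} exceeds SLO threshold {max_error_rate:.2%}"
        )
    return rate


if __name__ == "__main__":
    # Hypothetical sample: 99 successes and one 502 during the fault window.
    observed = [200] * 99 + [502]
    print(f"error rate: {assert_within_slo(observed):.2%}")
```

Run it against the status codes collected while the fault is active, not over the whole test, so a short failover gap is not averaged away.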