Why does Nginx show 'no live upstreams' even though new pods are already running?

Nginx's upstream health state is passive by default: it marks a peer unavailable only after real requests fail against it, not when Kubernetes removes the pod from Endpoints. There is a propagation delay between Kubernetes updating the Endpoints object, kube-proxy reprogramming its iptables/IPVS rules, and Nginx's worker processes observing a connection failure. During this window (typically 2–15 seconds depending on your cloud provider), Nginx continues routing to the terminating pod. The new pods being 'Running' does not help, because Nginx only learns about them when the upstream list is updated dynamically: through a service mesh, the Nginx Plus API, or a DNS-based upstream with a resolver and a short TTL.

Does adding proxy_next_upstream http_502 fully solve the problem, or is it just a band-aid?

It's a critical mitigation but not a root fix. proxy_next_upstream http_502 retries the failed request on a different upstream peer, which hides 502s from end users as long as at least one healthy peer exists. However, a retried request pays the latency of the failed attempt on top of the successful one, retries increase backend load during deploys, and the directive cannot help when all upstreams are in a bad state at once (e.g., you have only one replica). Nginx also does not retry non-idempotent requests such as POST once they have been sent to an upstream, unless non_idempotent is added to the directive. The root fix is combining preStop drain hooks with terminationGracePeriodSeconds tuning so that the pod stops receiving traffic before it stops processing it, which makes the retry unnecessary in the first place.

What is the correct terminationGracePeriodSeconds value to set?

The formula is: terminationGracePeriodSeconds = preStop sleep duration + max P99 request latency + application shutdown time + 10s buffer. For example, if your preStop sleep is 15s, your P99 latency is 5s, and your app takes 5s to flush connections and shut down cleanly, set terminationGracePeriodSeconds to 35–40s. Never set it below 30s for any production workload. The default of 30s is frequently too low for applications with database connection pools or long-running background jobs. If Kubernetes hits terminationGracePeriodSeconds before the process exits, it sends SIGKILL — which immediately drops all in-flight connections with no drain.
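
The formula can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, the 10 s default buffer comes from the formula, and the inputs are the values from the example rather than measurements.

```python
def termination_grace_period(prestop_sleep_s, p99_latency_s, shutdown_s, buffer_s=10):
    """Sum the formula's components (all in seconds): preStop sleep
    + P99 request latency + application shutdown time + buffer."""
    return prestop_sleep_s + p99_latency_s + shutdown_s + buffer_s


# Values from the example: 15 s preStop sleep, 5 s P99, 5 s shutdown.
grace = termination_grace_period(15, 5, 5)
print(grace)  # 35
```

The result is the lower bound of the 35–40s range; rounding up leaves extra headroom for slow shutdowns.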

Fixing Nginx 'no live upstreams while connecting to upstream' After Rolling Deployments

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: Kubernetes (or your orchestrator) sent SIGTERM to upstream pods during a rolling deploy, but Nginx kept routing requests to those terminating pods before health checks marked them down — resulting in no live upstreams while connecting to upstream and a flood of 502s.
How to fix it: Add preStop lifecycle hooks to drain connections, set terminationGracePeriodSeconds above your max request duration, and tune Nginx's passive failure detection with max_fails/fail_timeout to evict dead peers fast.

The Incident (What Does the Error Mean?)

Raw log output from Nginx error log:

2024/01/15 03:42:17 [error] 31#31: *18423 no live upstreams while connecting to upstream,
  client: 10.0.1.45, server: api.internal, request: "POST /v1/process HTTP/1.1",
  upstream: "http://backend_pool", host: "api.internal"

Immediate consequence: Every request hitting that upstream group returns 502 Bad Gateway, with no retry and no failover. Once every server in the upstream group is marked unavailable at the same time, Nginx has nowhere to send the request and fails immediately. (A group with a single server is a special case: Nginx ignores max_fails and fail_timeout for it and never marks it unavailable.) During a rolling deploy, the window between "old pod receives SIGTERM" and "new pod passes readiness probe" is exactly when this fires. Under high RPS, this window lasts seconds, not milliseconds, and every in-flight request in that window dies.

The Attack Vector / Blast Radius

This is a cascading availability failure, not a security exploit — but the blast radius is severe:

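The kill chain: Orchestrator issues SIGTERM → pod enters Terminating state → Nginx upstream list is NOT updated atomically → Nginx continues sending requests to the terminating pod → pod closes its listener → Nginx gets Connection refused or a reset → the peer is marked unavailable, but only once max_fails failed attempts have accumulated within fail_timeout (defaults: max_fails=1, fail_timeout=10s). The peer is then skipped for fail_timeout, after which Nginx tries it again with a live request. When every peer in the group is in that state at once, Nginx logs no live upstreams and returns 502 until a peer becomes eligible again.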
No preStop hook = zero drain time. Without a preStop sleep or /healthz de-registration, the pod stops accepting connections the instant SIGTERM is received. Nginx's upstream health state lags behind reality by the full fail_timeout window.
terminationGracePeriodSeconds too short (default: 30s) combined with a slow application shutdown means the pod gets SIGKILL while Nginx still believes it's alive.
Multiplier effect: With a maxSurge: 1 rolling strategy across 3 replicas, one pod (33% of your upstream capacity) is terminating at any given moment of the rollout. At 1000 RPS, that's roughly 333 requests/sec hitting a dead upstream before Nginx marks it unavailable.
No active health checks configured means Nginx only discovers a dead upstream after a real user request fails; there is no proactive eviction.
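
The multiplier effect is plain arithmetic. The sketch below assumes Nginx spreads requests evenly across peers (round-robin with equal weights); the function name is hypothetical, and the inputs are the figures from the example.

```python
def requests_at_risk_per_sec(total_rps, replicas, terminating=1):
    """Requests per second routed to terminating peers, assuming
    traffic is spread evenly across all replicas."""
    return total_rps * terminating / replicas


at_risk = round(requests_at_risk_per_sec(1000, 3))
print(at_risk)  # 333
```

With more replicas the share per terminating pod shrinks, which is one reason small replica counts feel rolling deploys the hardest.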

How to Fix It (The Solution)

Root Cause Checklist

Missing preStop lifecycle hook on the upstream container
terminationGracePeriodSeconds ≤ max request latency
Nginx upstream block missing max_fails / fail_timeout tuning
No active health checks (the health_check directive from ngx_http_upstream_hc_module in Nginx Plus, or the third-party nginx_upstream_check_module for OSS Nginx)
Nginx keepalive connections holding sockets open to terminating pods

Basic Fix — Kubernetes Deployment YAML

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: backend-api
 spec:
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxUnavailable: 1
-      maxSurge: 1
+      maxUnavailable: 0
+      maxSurge: 1
   template:
     spec:
+      terminationGracePeriodSeconds: 60
       containers:
       - name: backend
         image: backend-api:v2
         readinessProbe:
           httpGet:
             path: /healthz
             port: 8080
+          initialDelaySeconds: 5
+          periodSeconds: 3
+          failureThreshold: 2
+        lifecycle:
+          preStop:
+            exec:
+              command: ["/bin/sh", "-c", "sleep 15"]

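Why sleep 15 in preStop: Kubernetes removes the pod from Endpoints and notifies kube-proxy and other endpoint watchers asynchronously, in parallel with starting pod termination. SIGTERM is only sent after the preStop hook finishes, so the sleep gives the control plane time to propagate the endpoint removal before the process stops accepting connections. The sleep counts against terminationGracePeriodSeconds. 15 seconds covers most cloud provider propagation latency; tune upward if you observe continued 502s. An Nginx upstream block with static server IPs is not updated by this mechanism and still depends on max_fails/fail_timeout or a config reload.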

Enterprise Best Practice — Nginx Upstream Configuration

 upstream backend_pool {
+    zone backend_pool 64k;
+    keepalive 32;
+    keepalive_requests 1000;
+    keepalive_timeout 60s;
 
-    server 10.0.1.10:8080;
-    server 10.0.1.11:8080;
-    server 10.0.1.12:8080;
+    server 10.0.1.10:8080 max_fails=2 fail_timeout=5s;
+    server 10.0.1.11:8080 max_fails=2 fail_timeout=5s;
+    server 10.0.1.12:8080 max_fails=2 fail_timeout=5s;
 }
 
 server {
     location /api/ {
         proxy_pass http://backend_pool;
+        proxy_next_upstream error timeout http_502 http_503;
+        proxy_next_upstream_tries 2;
+        proxy_next_upstream_timeout 10s;
+        proxy_connect_timeout 2s;
+        proxy_read_timeout 30s;
+        proxy_send_timeout 30s;
+        proxy_http_version 1.1;
+        proxy_set_header Connection "";
     }
 }

Critical directives explained:

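max_fails=2 fail_timeout=5s: Nginx considers a peer unavailable after 2 failed attempts within 5 seconds and then skips it for 5 seconds, instead of the default 1 failed attempt / 10s. The shorter fail_timeout halves the time before Nginx tries a recovered peer again; the higher max_fails tolerates a single transient error at the cost of one more failed attempt before eviction.
proxy_next_upstream error timeout http_502 http_503: Retries the request on a different upstream peer on connection errors, timeouts, and 502/503 responses. This is the most impactful single-line fix for end-user impact during deploys. Non-idempotent requests such as POST are not retried once they have been sent to an upstream unless non_idempotent is added.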
proxy_http_version 1.1 + Connection "" — Required for keepalive upstreams. Without this, keepalive is silently disabled and you're creating a new TCP connection per request.
zone backend_pool 64k — Enables shared memory for upstream state, required for health_check (Nginx Plus) and consistent state across worker processes.
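
To make the max_fails / fail_timeout behaviour concrete, here is a deliberately simplified toy model of passive failure tracking. It is not Nginx's implementation: the Peer class and pick_peer function are hypothetical, there is no load balancing or retry logic, and time is passed in explicitly.

```python
class Peer:
    """One upstream server with passive failure tracking (toy model)."""

    def __init__(self, name, max_fails=1, fail_timeout=10.0):
        self.name = name
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.last_fail = None  # time of the most recent failed attempt

    def available(self, now):
        # A peer is skipped once it has reached max_fails, until
        # fail_timeout seconds have passed since its last failure.
        if self.fails < self.max_fails:
            return True
        return now - self.last_fail > self.fail_timeout

    def record_failure(self, now):
        # Failures older than fail_timeout no longer count.
        if self.last_fail is not None and now - self.last_fail > self.fail_timeout:
            self.fails = 0
        self.fails += 1
        self.last_fail = now


def pick_peer(peers, now):
    """Return the first available peer, or None for 'no live upstreams'."""
    for peer in peers:
        if peer.available(now):
            return peer
    return None


peers = [
    Peer("10.0.1.10:8080", max_fails=2, fail_timeout=5.0),
    Peer("10.0.1.11:8080", max_fails=2, fail_timeout=5.0),
]

# Both pods are terminating: every peer fails once, then a second time.
for peer in peers:
    peer.record_failure(now=0.0)
first = pick_peer(peers, now=0.5)    # one failure each is still tolerated
for peer in peers:
    peer.record_failure(now=1.0)
during = pick_peer(peers, now=2.0)   # every peer has reached max_fails
after = pick_peer(peers, now=7.0)    # fail_timeout has elapsed

print(first.name)   # 10.0.1.10:8080
print(during)       # None
print(after.name)   # 10.0.1.10:8080
```

The middle result is the state that produces no live upstreams: every peer has hit max_fails and none is eligible until fail_timeout expires.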

Nginx Plus / OpenResty Active Health Checks

 upstream backend_pool {
     zone backend_pool 64k;
+    # Nginx Plus active health check
+    # For OpenResty, use lua-resty-upstream-healthcheck
 
     server 10.0.1.10:8080 max_fails=2 fail_timeout=5s;
     server 10.0.1.11:8080 max_fails=2 fail_timeout=5s;
 }
 
 server {
     location /api/ {
         proxy_pass http://backend_pool;
+        health_check interval=3s fails=2 passes=1 uri=/healthz;
     }
 }

For OSS Nginx, use the third-party nginx_upstream_check_module, or implement a sidecar health-check exporter that removes pods from the upstream list by rewriting the config and reloading Nginx. The Nginx Plus API is only available in Nginx Plus.

Prevention in CI/CD

1. Conftest / OPA Policy — Block Deployments Without `preStop`

# policy/require_prestop.rego
# Written in Rego v1 syntax (contains/if), the default in OPA 1.0 and later.
package kubernetes.deployment

import rego.v1

# Deny any container that has no preStop lifecycle hook.
deny contains msg if {
  input.kind == "Deployment"
  container := input.spec.template.spec.containers[_]
  not container.lifecycle.preStop
  msg := sprintf(
    "Container '%v' in Deployment '%v' is missing a preStop lifecycle hook. Rolling deploys will cause upstream drain failures.",
    [container.name, input.metadata.name]
  )
}

# Deny an explicitly configured grace period below 30 seconds.
deny contains msg if {
  input.kind == "Deployment"
  input.spec.template.spec.terminationGracePeriodSeconds < 30
  msg := "terminationGracePeriodSeconds must be >= 30 to allow upstream drain during rolling deploys."
}

Enforce in your pipeline:

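# Conftest only evaluates package "main" by default, so name the policy's package explicitly
conftest test deployment.yaml --policy policy/ --namespace kubernetes.deployment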

2. Checkov — Scan for Missing Readiness Probes

checkov -f deployment.yaml --check CKV_K8S_8,CKV_K8S_9
# CKV_K8S_8: Liveness probe configured
# CKV_K8S_9: Readiness probe configured

3. Helm Chart Defaults — Enforce via `values.schema.json`

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "properties": {
    "terminationGracePeriodSeconds": {
      "type": "integer",
      "minimum": 30,
      "description": "Must exceed max request duration to prevent upstream drain 502s"
    }
  },
  "required": ["terminationGracePeriodSeconds"]
}

4. Smoke Test in CD Pipeline

While each rolling deploy is in progress, run a 30-second load test targeting the upstream and assert zero 502s:

# Using k6
k6 run --duration 30s --vus 50 smoke_test.js
# Assert: http_req_failed rate == 0

If http_req_failed > 0 during the deploy window, auto-rollback via:

kubectl rollout undo deployment/backend-api
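
If k6 is not available, the same check can be approximated with a short polling script. This is a minimal sketch, not a replacement for a real load test: the function names are hypothetical, the URL combines the api.internal host and /healthz path used elsewhere in this article as a placeholder, requests are sent sequentially, and connection-level errors are not handled.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError


def count_bad_gateways(url, attempts, fetch=fetch_status):
    """Send `attempts` requests and count how many returned 502."""
    return sum(1 for _ in range(attempts) if fetch(url) == 502)


def should_roll_back(bad_gateways):
    """Any 502 during the deploy window fails the check."""
    return bad_gateways > 0


# Demonstration with a stubbed fetch so the sketch runs without a cluster.
statuses = iter([200, 502, 200, 200])
bad = count_bad_gateways(
    "http://api.internal/healthz", 4, fetch=lambda _url: next(statuses)
)
print(bad, should_roll_back(bad))  # 1 True
```

In a pipeline, a True result from should_roll_back would be the trigger for the kubectl rollout undo command above.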

5. preStop Hook Enforcement via Kyverno

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-graceful-shutdown
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-prestop-hook
    match:
      resources:
        kinds: [Deployment]
    validate:
      message: "preStop hook required on all containers to prevent Nginx upstream 502s during rolling deploys."
      pattern:
        spec:
          template:
            spec:
              containers:
              - lifecycle:
                  preStop:
                    exec:
                      command: "?*"