Fixing Nginx 'no live upstreams while connecting to upstream' After Rolling Deployments

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: Kubernetes (or your orchestrator) sent SIGTERM to upstream pods during a rolling deploy, but Nginx kept routing requests to those terminating pods before health checks marked them down — resulting in no live upstreams while connecting to upstream and a flood of 502s.
  • How to fix it: Add preStop lifecycle hooks to drain connections, set terminationGracePeriodSeconds above your max request duration, and configure Nginx upstream health checks with max_fails/fail_timeout to evict dead peers fast.

The Incident (What Does the Error Mean?)

Raw log output from Nginx error log:

2024/01/15 03:42:17 [error] 31#31: *18423 no live upstreams while connecting to upstream,
  client: 10.0.1.45, server: api.internal, request: "POST /v1/process HTTP/1.1",
  upstream: "http://backend_pool", host: "api.internal"

Immediate consequence: Every request hitting that upstream group returns 502 Bad Gateway, with no retry and no failover. The error fires when every server in the upstream block is marked unavailable at the same moment, so Nginx has nowhere to send the request and fails it immediately. (A group with a single server behaves differently: Nginx ignores max_fails and fail_timeout there and never marks that server unavailable, so you see connect() failures instead.)

During a rolling deploy, the window between "old pod receives SIGTERM" and "new pod passes readiness probe" is exactly when this fires. Under high RPS this window lasts seconds, not milliseconds, and every request that arrives in it fails.
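
To gauge how often this fires during a deploy, the error log can be scanned for the message and grouped by upstream. The following is a minimal Python sketch, assuming each log entry sits on a single line in the format shown above. The function name count_no_live_upstreams and the second sample line are illustrative and not part of any Nginx tooling.

```python
import re
from collections import Counter

# Captures the upstream name from a "no live upstreams" error line.
NO_LIVE = re.compile(
    r'no live upstreams while connecting to upstream.*?upstream: "([^"]+)"'
)


def count_no_live_upstreams(lines):
    """Return a Counter mapping upstream name -> number of matching errors."""
    counts = Counter()
    for line in lines:
        match = NO_LIVE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative input: the first entry mirrors the log above, the second is a
# different upstream error that must not be counted.
sample = [
    '2024/01/15 03:42:17 [error] 31#31: *18423 no live upstreams while '
    'connecting to upstream, client: 10.0.1.45, server: api.internal, '
    'request: "POST /v1/process HTTP/1.1", upstream: "http://backend_pool", '
    'host: "api.internal"',
    '2024/01/15 03:42:18 [error] 31#31: *18424 connect() failed '
    '(111: Connection refused) while connecting to upstream, '
    'client: 10.0.1.45, server: api.internal, '
    'request: "GET /v1/status HTTP/1.1", '
    'upstream: "http://10.0.1.10:8080/v1/status", host: "api.internal"',
]

counts = count_no_live_upstreams(sample)
print(dict(counts))
```

In practice the same function could be fed the lines of the real error log; a count that spikes only around deploy timestamps points at the drain problem described here rather than at a capacity issue.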


The Attack Vector / Blast Radius

This is a cascading availability failure, not a security exploit — but the blast radius is severe:

  1. The kill chain: Orchestrator issues SIGTERM → pod enters Terminating state → Nginx upstream list is NOT updated atomically → Nginx continues sending requests to the terminating pod → pod closes its listener → Nginx gets Connection refused or a reset → the peer is marked unavailable, but only once max_fails failed attempts accumulate within fail_timeout (defaults: max_fails=1, fail_timeout=10s). Nginx then skips that peer for fail_timeout and afterwards probes it again with live traffic. When every peer in the group is in that state at once, requests fail with no live upstreams.

  2. No preStop hook = zero drain time. Without a preStop sleep or /healthz de-registration, the pod stops accepting connections the instant SIGTERM is received. Nginx's upstream health state lags behind reality by the full fail_timeout window.

  3. terminationGracePeriodSeconds too short (default: 30s) combined with a slow application shutdown means the pod gets SIGKILL while Nginx still believes it's alive.

  4. Multiplier effect: With a rolling strategy that replaces one pod at a time (maxSurge: 1) across 3 replicas, roughly 33% of your upstream capacity is terminating at any given moment of the rollout. At 1000 RPS, that's about 333 requests/sec hitting a dead upstream before health checks catch up.

  5. No active health checks configured means Nginx relies on passive checks alone: it discovers a dead upstream only after a real user request fails, so there is no proactive eviction.
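
The multiplier effect in item 4 is simple arithmetic, sketched below in Python. This is a rough upper bound under stated assumptions: traffic is spread evenly across replicas, nothing is retried on another peer, and the 10-second lag window is an illustrative value borrowed from the default fail_timeout. The function name is hypothetical.

```python
def requests_at_risk(rps, replicas, terminating, window_s):
    """Estimate traffic routed to terminating pods.

    Returns (requests per second at risk, total requests over the window).
    Assumes even load distribution and no proxy_next_upstream retries.
    """
    share = terminating / replicas      # fraction of capacity that is going away
    per_second = rps * share
    return per_second, per_second * window_s


per_s, total = requests_at_risk(rps=1000, replicas=3, terminating=1, window_s=10)
print(f"{per_s:.0f} req/s at risk, about {total:.0f} requests over the window")
# prints "333 req/s at risk, about 3333 requests over the window"
```

Plugging in your own replica count and measured propagation lag gives a quick sense of whether the drain fixes below are worth prioritizing.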


How to Fix It (The Solution)

Root Cause Checklist

  • Missing preStop lifecycle hook on the upstream container
  • terminationGracePeriodSeconds ≤ max request latency
  • Nginx upstream block missing max_fails / fail_timeout tuning
  • No active health checks (health_check directive from ngx_http_upstream_hc_module, Nginx Plus only)
  • Nginx keepalive connections holding sockets open to terminating pods

Basic Fix — Kubernetes Deployment YAML

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: backend-api
 spec:
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxUnavailable: 1
-      maxSurge: 1
+      maxUnavailable: 0
+      maxSurge: 1
   template:
     spec:
+      terminationGracePeriodSeconds: 60
       containers:
       - name: backend
         image: backend-api:v2
         readinessProbe:
           httpGet:
             path: /healthz
             port: 8080
+          initialDelaySeconds: 5
+          periodSeconds: 3
+          failureThreshold: 2
+        lifecycle:
+          preStop:
+            exec:
+              command: ["/bin/sh", "-c", "sleep 15"]

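Why sleep 15 in preStop: Kubernetes removes the pod from Endpoints and notifies kube-proxy/Nginx asynchronously. The preStop sleep gives the control plane time to propagate the endpoint removal before the process actually stops accepting connections. 15 seconds covers most cloud provider propagation latency; tune upward if you observe continued 502s. The sleep counts against terminationGracePeriodSeconds, so the grace period (60 here) must cover the 15-second sleep plus the time the application needs to finish in-flight requests and exit.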
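
Because the preStop hook runs inside the termination grace period, the grace period has to budget for the sleep, the longest in-flight request, and some shutdown margin. A minimal Python sketch of that budget follows; the 30-second request ceiling is an assumption taken from the proxy_read_timeout used later in this article, and the 5-second margin and function name are hypothetical.

```python
def min_grace_period(prestop_sleep_s, max_request_s, shutdown_margin_s=5):
    """Smallest terminationGracePeriodSeconds that avoids a premature SIGKILL.

    The grace-period countdown starts before the preStop hook runs, so the
    sleep, the drain of in-flight requests, and process exit all share it.
    """
    return prestop_sleep_s + max_request_s + shutdown_margin_s


needed = min_grace_period(prestop_sleep_s=15, max_request_s=30)
print(needed)          # 50
print(needed <= 60)    # True: the 60s configured above leaves headroom
```

If your slowest request can run longer than 30 seconds, raise terminationGracePeriodSeconds accordingly rather than shortening the preStop sleep.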


Enterprise Best Practice — Nginx Upstream Configuration

 upstream backend_pool {
+    zone backend_pool 64k;
+    keepalive 32;
+    keepalive_requests 1000;
+    keepalive_timeout 60s;
 
-    server 10.0.1.10:8080;
-    server 10.0.1.11:8080;
-    server 10.0.1.12:8080;
+    server 10.0.1.10:8080 max_fails=2 fail_timeout=5s;
+    server 10.0.1.11:8080 max_fails=2 fail_timeout=5s;
+    server 10.0.1.12:8080 max_fails=2 fail_timeout=5s;
 }
 
 server {
     location /api/ {
         proxy_pass http://backend_pool;
+        proxy_next_upstream error timeout http_502 http_503;
+        proxy_next_upstream_tries 2;
+        proxy_next_upstream_timeout 10s;
+        proxy_connect_timeout 2s;
+        proxy_read_timeout 30s;
+        proxy_send_timeout 30s;
+        proxy_http_version 1.1;
+        proxy_set_header Connection "";
     }
 }

Critical directives explained:

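  • max_fails=2 fail_timeout=5s: Nginx marks a peer unavailable after 2 failed attempts within 5 seconds, then skips it for the next 5 seconds (defaults: max_fails=1, fail_timeout=10s). The shorter fail_timeout halves how long a peer stays out of rotation before Nginx tries it again, and max_fails=2 tolerates one transient error before eviction.
  • proxy_next_upstream error timeout http_502 http_503: Retries the request on a different upstream peer on connection errors, timeouts, 502, and 503. This is the most impactful single-line fix for end-user impact during deploys. Requests with non-idempotent methods (POST, LOCK, PATCH) are not retried once they have been sent to a peer unless you also add non_idempotent.
  • proxy_http_version 1.1 + Connection "": Required for keepalive upstreams. Without this, keepalive is silently disabled and you're creating a new TCP connection per request.
  • zone backend_pool 64k: Enables shared memory for upstream state, required for health_check (Nginx Plus) and consistent state across worker processes.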
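
To make the max_fails / fail_timeout interaction concrete, here is a deliberately simplified Python model of passive health checking. It is a teaching sketch, not Nginx's actual implementation: the Peer class and its bookkeeping are hypothetical, and real Nginx tracks failures with different internal counters.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    max_fails: int = 1            # Nginx default
    fail_timeout: float = 10.0    # Nginx default, in seconds
    fail_times: list = field(default_factory=list)
    down_until: float = 0.0

    def available(self, now):
        """A peer is eligible for traffic once its penalty period has passed."""
        return now >= self.down_until

    def record_failure(self, now):
        """Count a failed attempt; evict the peer when the threshold is hit."""
        # Only failures inside the sliding fail_timeout window count.
        self.fail_times = [t for t in self.fail_times
                           if now - t < self.fail_timeout]
        self.fail_times.append(now)
        if len(self.fail_times) >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fail_times.clear()


default = Peer()
default.record_failure(0.0)
print(default.available(9.9), default.available(10.0))    # False True

tuned = Peer(max_fails=2, fail_timeout=5.0)
tuned.record_failure(0.0)
print(tuned.available(0.5))                               # True
tuned.record_failure(1.0)
print(tuned.available(5.9), tuned.available(6.0))         # False True
```

The model shows the trade-off: the tuned peer survives one transient failure and comes back after 5 seconds, while the default peer is evicted on the first failure and stays out for 10.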

Nginx Plus / OpenResty Active Health Checks

 upstream backend_pool {
     zone backend_pool 64k;
+    # Nginx Plus active health check
+    # For OpenResty, use lua-resty-upstream-healthcheck
 
     server 10.0.1.10:8080 max_fails=2 fail_timeout=5s;
     server 10.0.1.11:8080 max_fails=2 fail_timeout=5s;
 }
 
 server {
     location /api/ {
         proxy_pass http://backend_pool;
+        health_check interval=3s fails=2 passes=1 uri=/healthz;
     }
 }

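For OSS Nginx, use the third-party nginx_upstream_check_module (it must be compiled into Nginx), or run a sidecar that watches pod health and removes dead pods from the upstream list by rewriting the config and reloading Nginx. The dynamic upstream API is available in Nginx Plus only.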
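
The config-reload variant of the sidecar boils down to rendering the upstream block from the set of currently ready endpoints. A minimal Python sketch of that rendering step follows. How the ready endpoints are discovered (for example, from the Kubernetes API) is out of scope here, and the function name, the file layout, and the hard-coded addresses are assumptions for illustration.

```python
def render_upstream(name, endpoints, max_fails=2, fail_timeout="5s"):
    """Render an Nginx upstream block for the given (ip, port) pairs.

    Raises ValueError on an empty list, because Nginx rejects an upstream
    block that contains no servers.
    """
    if not endpoints:
        raise ValueError("refusing to render an upstream with no servers")
    lines = [f"upstream {name} {{", f"    zone {name} 64k;"]
    for ip, port in sorted(endpoints):
        lines.append(
            f"    server {ip}:{port} max_fails={max_fails} "
            f"fail_timeout={fail_timeout};"
        )
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical state: 10.0.1.12 is terminating and has been dropped.
ready = [("10.0.1.11", 8080), ("10.0.1.10", 8080)]
config = render_upstream("backend_pool", ready)
print(config)
```

A sidecar built on this would write the result to a file pulled in by an include directive, validate it with nginx -t, and apply it with nginx -s reload only when the rendered text has changed.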


Prevention in CI/CD

1. Conftest / OPA Policy — Block Deployments Without preStop

# policy/require_prestop.rego
package kubernetes.deployment

# Rego v1 syntax (OPA 0.59+), required by recent Conftest releases.
import rego.v1

# Deny any container that lacks a preStop lifecycle hook.
deny contains msg if {
  input.kind == "Deployment"
  some container in input.spec.template.spec.containers
  not container.lifecycle.preStop
  msg := sprintf(
    "Container '%v' in Deployment '%v' is missing a preStop lifecycle hook. Rolling deploys will cause upstream drain failures.",
    [container.name, input.metadata.name]
  )
}

# Deny an explicitly configured grace period shorter than 30 seconds.
deny contains msg if {
  input.kind == "Deployment"
  input.spec.template.spec.terminationGracePeriodSeconds < 30
  msg := "terminationGracePeriodSeconds must be >= 30 to allow upstream drain during rolling deploys."
}

Enforce in your pipeline:

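# Conftest evaluates only the "main" package by default, so name the policy's package explicitly.
conftest test deployment.yaml --policy policy/ --namespace kubernetes.deployment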

2. Checkov — Scan for Missing Readiness Probes

checkov -f deployment.yaml --check CKV_K8S_8,CKV_K8S_9
# CKV_K8S_8: Liveness probe configured
# CKV_K8S_9: Readiness probe configured

3. Helm Chart Defaults — Enforce via values.schema.json

{
  "$schema": "http://json-schema.org/draft-07/schema",
  "properties": {
    "terminationGracePeriodSeconds": {
      "type": "integer",
      "minimum": 30,
      "description": "Must exceed max request duration to prevent upstream drain 502s"
    }
  },
  "required": ["terminationGracePeriodSeconds"]
}

4. Smoke Test in CD Pipeline

During every rolling deploy, run a 30-second load test against the upstream and assert zero 502s:

# Using k6
k6 run --duration 30s --vus 50 smoke_test.js
# Assert: http_req_failed rate == 0 (define it as a threshold in smoke_test.js so k6 exits non-zero on failure)

If the threshold fails during the deploy window, auto-rollback via:

kubectl rollout undo deployment/backend-api
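
The smoke test and the rollback can be glued together in a pipeline step. The following is a minimal Python sketch, assuming smoke_test.js defines a threshold on http_req_failed so that k6 exits with a non-zero code when any request fails. The function name is hypothetical, and the run parameter exists only so the command runner can be swapped out in tests.

```python
import subprocess


def smoke_test_then_rollback(run=subprocess.run, deployment="backend-api"):
    """Run the k6 smoke test and undo the rollout if it fails.

    Returns True when the smoke test passed, False when a rollback was issued.
    """
    result = run(["k6", "run", "--duration", "30s", "--vus", "50",
                  "smoke_test.js"])
    if result.returncode != 0:
        # k6 signals failed thresholds through a non-zero exit code.
        run(["kubectl", "rollout", "undo", f"deployment/{deployment}"],
            check=True)
        return False
    return True
```

Calling smoke_test_then_rollback() from the deploy job and failing the job on a False return keeps the rollback decision in one place instead of scattering it across shell conditionals.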

5. preStop Hook Enforcement via Kyverno

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-graceful-shutdown
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-prestop-hook
    match:
      resources:
        kinds: [Deployment]
    validate:
      message: "preStop hook required on all containers to prevent Nginx upstream 502s during rolling deploys."
      pattern:
        spec:
          template:
            spec:
              containers:
              - lifecycle:
                  preStop:
                    exec:
                      command: "?*"
