Fixing Nginx 'no live upstreams while connecting to upstream' After Rolling Deployments
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Kubernetes (or your orchestrator) sent SIGTERM to upstream pods during a rolling deploy, but Nginx kept routing requests to those terminating pods before health checks marked them down, resulting in 'no live upstreams while connecting to upstream' and a flood of 502s.
- How to fix it: Add preStop lifecycle hooks to drain connections, set terminationGracePeriodSeconds above your max request duration, and configure Nginx upstream health checks with max_fails/fail_timeout to evict dead peers fast.
- Shortcut: Use our Client-Side Sandbox below to auto-refactor your Nginx config and Deployment YAML. No data leaves your browser.
The Incident (What Does the Error Mean?)
Raw log output from Nginx error log:
2024/01/15 03:42:17 [error] 31#31: *18423 no live upstreams while connecting to upstream,
client: 10.0.1.45, server: api.internal, request: "POST /v1/process HTTP/1.1",
upstream: "http://backend_pool", host: "api.internal"
Immediate consequence: Every request hitting that upstream group returns 502 Bad Gateway, with no retry and no failover. Once every server in the upstream block is considered unavailable at the same time, Nginx has nowhere to send the request and fails immediately. (A group with a single server is a special case: Nginx ignores max_fails and fail_timeout for it and never marks it unavailable.) During a rolling deploy, the window between "old pod receives SIGTERM" and "new pod passes readiness probe" is exactly when this fires. Under high RPS, this window is not milliseconds but seconds, and every in-flight request in that window dies.
The Attack Vector / Blast Radius
This is a cascading availability failure, not a security exploit — but the blast radius is severe:
- The kill chain: Orchestrator issues SIGTERM → pod enters Terminating state → Nginx upstream list is NOT updated atomically → Nginx continues sending requests to the terminating pod → pod closes its listener → Nginx gets Connection refused or a reset → the peer is marked unavailable, but only after the max_fails threshold is hit (default: max_fails=1, fail_timeout=10s). A peer that trips the threshold stays out of rotation for fail_timeout, and while every peer in the group is in that state, requests fail with no live upstreams.
- No preStop hook = zero drain time. Without a preStop sleep or /healthz de-registration, the pod can stop accepting connections the instant SIGTERM is received. Nginx's upstream health state lags behind reality until real requests fail against the dead peer.
- terminationGracePeriodSeconds too short (default: 30s) combined with a slow application shutdown means the pod gets SIGKILL while Nginx still believes it's alive.
- Multiplier effect: With maxUnavailable: 1 across 3 replicas, up to a third of your upstream capacity can be terminating at any given moment. At 1000 RPS, that's roughly 330 requests/sec hitting a dead upstream before health checks catch up.
- No active health checks configured means Nginx only discovers a dead upstream after a real user request fails. There is no proactive eviction.
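The multiplier arithmetic above can be sketched as a back-of-the-envelope estimate. This is a rough upper bound, assuming traffic is spread evenly across replicas and that nothing evicts the terminating pod during the detection window. The function name and parameters are illustrative, not part of any Nginx or Kubernetes API.

```python
def requests_at_risk(total_rps: float, replicas: int,
                     terminating: int, detection_window_s: float) -> float:
    """Estimate requests routed to terminating pods before they are evicted.

    Assumes even load distribution (plain round-robin) and that the
    terminating pods receive their full share for the whole window.
    """
    if replicas <= 0 or not 0 <= terminating <= replicas:
        raise ValueError("need replicas > 0 and 0 <= terminating <= replicas")
    share = terminating / replicas          # fraction of traffic hitting dead pods
    return total_rps * share * detection_window_s


if __name__ == "__main__":
    # 1000 RPS, 3 replicas, 1 terminating, 10 s before the peer is evicted
    print(round(requests_at_risk(1000, 3, 1, 10)))
```

Plugging in your own RPS and observed detection delay gives a quick sense of how many user-facing errors a single rollout step can cause.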
How to Fix It (The Solution)
Root Cause Checklist
- Missing preStop lifecycle hook on the upstream container
- terminationGracePeriodSeconds ≤ max request latency
- Nginx upstream block missing max_fails/fail_timeout tuning
- No active health checks (health_check directive from ngx_http_upstream_hc_module in Nginx Plus, or a third-party module on OSS Nginx)
- Nginx keepalive connections holding sockets open to terminating pods
Basic Fix — Kubernetes Deployment YAML
 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: backend-api
 spec:
   strategy:
     type: RollingUpdate
     rollingUpdate:
-      maxUnavailable: 1
+      maxUnavailable: 0
       maxSurge: 1
   template:
     spec:
+      terminationGracePeriodSeconds: 60
       containers:
         - name: backend
           image: backend-api:v2
           readinessProbe:
             httpGet:
               path: /healthz
               port: 8080
+            initialDelaySeconds: 5
+            periodSeconds: 3
+            failureThreshold: 2
+          lifecycle:
+            preStop:
+              exec:
+                command: ["/bin/sh", "-c", "sleep 15"]
Why sleep 15 in preStop: Kubernetes removes the pod from Endpoints and notifies kube-proxy/Nginx asynchronously. The preStop sleep gives the control plane time to propagate the endpoint removal before the process actually stops accepting connections. 15 seconds covers most cloud provider propagation latency; tune upward if you observe continued 502s.
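Because the preStop hook runs inside the termination grace period, the sleep eats into the time the application has left to finish in-flight requests. A minimal sanity check of that budget might look like the sketch below. The function name and the 5-second safety margin are assumptions for illustration; the 30-second maximum request duration is also an assumption, so substitute your own.

```python
def grace_period_ok(grace_s: int, pre_stop_sleep_s: int,
                    max_request_s: int, margin_s: int = 5) -> bool:
    """Check that terminationGracePeriodSeconds covers the whole shutdown.

    The grace period must fit the preStop sleep, the longest in-flight
    request, and a small safety margin before SIGKILL is sent.
    """
    return grace_s >= pre_stop_sleep_s + max_request_s + margin_s


if __name__ == "__main__":
    # Values from the Deployment above, assuming requests take at most 30 s.
    print(grace_period_ok(grace_s=60, pre_stop_sleep_s=15, max_request_s=30))
    # The Kubernetes default of 30 s would not be enough for the same workload.
    print(grace_period_ok(grace_s=30, pre_stop_sleep_s=15, max_request_s=30))
```

If the check fails, raise terminationGracePeriodSeconds rather than shortening the preStop sleep.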
Enterprise Best Practice — Nginx Upstream Configuration
 upstream backend_pool {
+    zone backend_pool 64k;
+    keepalive 32;
+    keepalive_requests 1000;
+    keepalive_timeout 60s;
-    server 10.0.1.10:8080;
-    server 10.0.1.11:8080;
-    server 10.0.1.12:8080;
+    server 10.0.1.10:8080 max_fails=2 fail_timeout=5s;
+    server 10.0.1.11:8080 max_fails=2 fail_timeout=5s;
+    server 10.0.1.12:8080 max_fails=2 fail_timeout=5s;
 }

 server {
     location /api/ {
         proxy_pass http://backend_pool;
+        proxy_next_upstream error timeout http_502 http_503;
+        proxy_next_upstream_tries 2;
+        proxy_next_upstream_timeout 10s;
+        proxy_connect_timeout 2s;
+        proxy_read_timeout 30s;
+        proxy_send_timeout 30s;
+        proxy_http_version 1.1;
+        proxy_set_header Connection "";
     }
 }
Critical directives explained:
- max_fails=2 fail_timeout=5s: Nginx marks a peer unavailable after 2 failed attempts within 5 seconds and keeps it out of rotation for 5 seconds, instead of the default 1 failure / 10s. This halves the time a peer stays evicted and stops a single transient error from removing a healthy pod.
- proxy_next_upstream error timeout http_502 http_503: Retries the request on a different upstream peer after a connection error, a timeout, or a 502/503 response. This is the most impactful single-line fix for end-user impact during deploys.
- proxy_http_version 1.1 + Connection "": Required for keepalive upstreams. Without this, keepalive is silently disabled and you're creating a new TCP connection per request.
- zone backend_pool 64k: Enables shared memory for upstream state, required for health_check (Nginx Plus) and consistent state across worker processes.
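To make the max_fails / fail_timeout accounting concrete, here is a deliberately simplified model of passive health checking. It is an illustration of the idea only, not Nginx's actual implementation: the Peer class, pick_peer function, and first-available selection are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Peer:
    """Simplified passive health state for one upstream server."""
    name: str
    max_fails: int = 1          # Nginx default
    fail_timeout: float = 10.0  # Nginx default, in seconds
    fails: int = 0
    first_fail_at: float = 0.0
    down_until: float = 0.0

    def record_failure(self, now: float) -> None:
        # Start a fresh counting window if none is open or the old one expired.
        if self.fails == 0 or now - self.first_fail_at > self.fail_timeout:
            self.fails = 0
            self.first_fail_at = now
        self.fails += 1
        if self.fails >= self.max_fails:
            # Threshold reached: take the peer out of rotation for fail_timeout.
            self.down_until = now + self.fail_timeout
            self.fails = 0

    def available(self, now: float) -> bool:
        return now >= self.down_until


def pick_peer(peers: List[Peer], now: float) -> Optional[Peer]:
    """Return the first available peer, or None (the 'no live upstreams' case)."""
    for peer in peers:
        if peer.available(now):
            return peer
    return None
```

With max_fails=2 and fail_timeout=5, a peer survives one failure, is evicted on the second failure inside the window, and becomes eligible again 5 seconds later. When pick_peer returns None, every peer is evicted at once, which corresponds to the error this article is about.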
Nginx Plus / OpenResty Active Health Checks
 upstream backend_pool {
     zone backend_pool 64k;
+    # Nginx Plus active health check (health_check in the location below)
+    # For OpenResty, use lua-resty-upstream-healthcheck
     server 10.0.1.10:8080 max_fails=2 fail_timeout=5s;
     server 10.0.1.11:8080 max_fails=2 fail_timeout=5s;
 }

 server {
     location /api/ {
         proxy_pass http://backend_pool;
+        health_check interval=3s fails=2 passes=1 uri=/healthz;
     }
 }
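For OSS Nginx, use the third-party nginx_upstream_check_module, or run a sidecar health checker that probes each pod and takes failing ones out of the upstream list by rewriting the config and reloading Nginx. The Nginx Plus API can do the same without a reload, but it is only available on Nginx Plus.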
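The sidecar approach can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not a production implementation: the probe and render_upstream names are hypothetical, pods are assumed to expose /healthz over plain HTTP, and unhealthy servers are marked with the down parameter rather than deleted so the upstream block stays valid even if every probe fails.

```python
from http.client import HTTPException
from typing import Callable, Iterable
from urllib.request import urlopen


def probe(address: str, path: str = "/healthz", timeout: float = 1.0) -> bool:
    """Return True if http://<address><path> answers with a 2xx status."""
    try:
        with urlopen(f"http://{address}{path}", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, HTTPException):
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False


def render_upstream(name: str, servers: Iterable[str],
                    is_healthy: Callable[[str], bool] = probe) -> str:
    """Render an upstream block, marking servers that fail the probe as down."""
    lines = [f"upstream {name} {{"]
    for address in servers:
        suffix = "" if is_healthy(address) else " down"
        lines.append(f"    server {address} max_fails=2 fail_timeout=5s{suffix};")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pool = ["10.0.1.10:8080", "10.0.1.11:8080", "10.0.1.12:8080"]
    print(render_upstream("backend_pool", pool))
```

A sidecar would write this output to a file included from the main config, validate it with nginx -t, and apply it with nginx -s reload only when the rendered text has changed.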
Prevention in CI/CD
1. Conftest / OPA Policy — Block Deployments Without preStop
# policy/require_prestop.rego
package kubernetes.deployment

deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.lifecycle.preStop
    msg := sprintf(
        "Container '%v' in Deployment '%v' is missing a preStop lifecycle hook. Rolling deploys will cause upstream drain failures.",
        [container.name, input.metadata.name]
    )
}

deny[msg] {
    input.kind == "Deployment"
    input.spec.template.spec.terminationGracePeriodSeconds < 30
    msg := "terminationGracePeriodSeconds must be >= 30 to allow upstream drain during rolling deploys."
}
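Enforce in your pipeline (the --namespace flag is needed because Conftest only evaluates the main package by default):
conftest test deployment.yaml --policy policy/ --namespace kubernetes.deployment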
2. Checkov — Scan for Missing Readiness Probes
checkov -f deployment.yaml --check CKV_K8S_8,CKV_K8S_9
# CKV_K8S_8: Liveness probe configured
# CKV_K8S_9: Readiness probe configured
3. Helm Chart Defaults — Enforce via values.schema.json
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "properties": {
    "terminationGracePeriodSeconds": {
      "type": "integer",
      "minimum": 30,
      "description": "Must exceed max request duration to prevent upstream drain 502s"
    }
  },
  "required": ["terminationGracePeriodSeconds"]
}
4. Smoke Test in CD Pipeline
On every rolling deploy, run a 30-second load test against the upstream while the rollout is in progress and assert zero 502s:
# Using k6
k6 run --duration 30s --vus 50 smoke_test.js
# Assert: http_req_failed rate == 0
If http_req_failed > 0 during the deploy window, auto-rollback via:
kubectl rollout undo deployment/backend-api
5. Deployment Annotation Enforcement via Kyverno
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-graceful-shutdown
spec:
  validationFailureAction: enforce
  rules:
    - name: check-prestop-hook
      match:
        resources:
          kinds: [Deployment]
      validate:
        message: "preStop hook required on all containers to prevent Nginx upstream 502s during rolling deploys."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - lifecycle:
                      preStop:
                        exec:
                          command: "?*"