Fixing 502 Bad Gateway After Nginx Reload with worker_processes auto: Root Cause & Production Fix
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins
TL;DR
- What broke: On nginx -s reload, new worker processes spawn while old workers drain. With worker_processes auto, CPU count detection inside containers returns the host CPU count, not the cgroup-limited count, spawning excess workers that exhaust worker_connections and file descriptors and cause upstream connection failures (502).
- How to fix it: Pin worker_processes to your actual cgroup CPU quota, set worker_rlimit_nofile explicitly, and add proxy_next_upstream error timeout with an upstream keepalive pool.
- Shortcut: Use our Client-Side Sandbox above to auto-refactor your nginx.conf. It detects the mismatch locally without sending your config anywhere.
The Incident (What Does the Error Mean?)
Your monitoring fires. Upstream services are healthy. But clients are getting:
502 Bad Gateway
nginx/1.25.x
In /var/log/nginx/error.log:
2024/01/15 03:42:17 [error] 3847#3847: *198423 connect() failed (99: Cannot assign requested address) while connecting to upstream
2024/01/15 03:42:17 [error] 3851#3851: *198441 no live upstreams while connecting to upstream, client: 10.0.1.45
2024/01/15 03:42:18 [warn] 3847#3847: *198450 upstream server temporarily disabled while connecting to upstream
This fires during and immediately after nginx -s reload. The upstream pods never went down. Nginx killed itself.
The Attack Vector / Blast Radius
Why worker_processes auto is a trap in containerized environments:
Nginx resolves auto by calling sysconf(_SC_NPROCESSORS_ONLN), which reports the number of online CPUs the kernel sees. A CFS quota such as resources.limits.cpu: "2" in a Kubernetes pod does not change that number, so the call returns the node's CPU count: say, 96 vCPUs on a c5.24xlarge. Nginx spawns 96 worker processes.
Each worker can hold up to worker_connections connections (default: 512, or your configured value), and every connection needs a file descriptor. 96 workers × 512 connections = 49,152 FDs in the worst case. Your container's ulimit -n is probably 1024 or 4096. Once the limits are hit, the OS starts rejecting socket() and connect() calls. Every proxied request to upstream fails. Every failure is a 502.
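The worst-case arithmetic above can be reproduced with a one-line helper. This is only a sketch: conn_slots is a hypothetical name, not an nginx tool, and it multiplies the two numbers exactly as the paragraph does.

```shell
#!/usr/bin/env bash
# conn_slots <workers> <worker_connections>
# Worst-case number of concurrent connection slots across all workers.
conn_slots() {
  echo $(( $1 * $2 ))
}

conn_slots 96 512   # 96 auto-detected workers at the default worker_connections
conn_slots 2 512    # 2 workers pinned to the pod's CPU limit
```

The first call prints 49152 and the second prints 1024, which is the gap between what auto asks for on a 96-CPU node and what a 2-CPU pod was sized for.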
Cascading failure chain:
- nginx -s reload → the master signals old workers to drain
- New workers spawn (96 of them) → immediately exhaust FD limits
- Old workers still draining → total FD pressure doubles during overlap window
- Upstream keepalive pool is destroyed and rebuilt → cold connection storm hits upstream
- Upstream's own connection queue fills → upstream starts returning 503/504
- Now both layers are degraded. Recovery takes minutes, not seconds.
On non-containerized bare metal, auto is usually correct. The blast radius is still real if worker_rlimit_nofile is unset and the system ulimit is low.
How to Fix It
Basic Fix: Pin Worker Count and File Descriptor Limits
- worker_processes auto;
+ worker_processes 2; # Match your cgroup CPU limit exactly
+ worker_rlimit_nofile 65535; # Main context only. Per-worker limit: keep it above worker_connections * 2
events {
- worker_connections 512;
+ worker_connections 4096;
+ use epoll;
+ multi_accept on;
}
Set worker_processes to the integer value of your resources.limits.cpu. For fractional limits like "1500m", use 1.
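As a sketch of that rounding rule, the helper below maps a Kubernetes CPU quantity to a worker count. The function name workers_from_limit is hypothetical, and clamping anything below one core to a single worker is an assumption on our part (nginx needs at least one worker).

```shell
#!/usr/bin/env bash
# workers_from_limit <cpu-limit>
# Map a Kubernetes CPU quantity ("2", "1500m", "1.5") to an integer worker count.
workers_from_limit() {
  local limit=$1 n
  case "$limit" in
    *m)  n=$(( ${limit%m} / 1000 )) ;;  # millicores: "1500m" -> 1
    *.*) n=${limit%%.*}; n=${n:-0} ;;   # decimal cores: "1.5" -> 1
    *)   n=$limit ;;                    # whole cores: "2" -> 2
  esac
  [ "$n" -ge 1 ] || n=1                 # assumption: never go below one worker
  echo "$n"
}
```

For example, workers_from_limit 1500m prints 1 and workers_from_limit 2 prints 2, so the result can be templated straight into the worker_processes directive.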
Enterprise Best Practice: Upstream Resilience + Keepalive Pool
The 502 window during reload is also caused by the upstream keepalive pool being torn down. Fix the proxy layer too:
upstream backend {
server 10.0.2.10:8080;
server 10.0.2.11:8080;
+ keepalive 32; # Idle upstream connections cached per worker; new workers start with an empty pool
+ keepalive_requests 1000;
+ keepalive_timeout 60s;
}
server {
location / {
proxy_pass http://backend;
+ proxy_http_version 1.1;
+ proxy_set_header Connection "";
+ proxy_next_upstream error timeout http_502 http_503;
+ proxy_next_upstream_tries 3;
+ proxy_next_upstream_timeout 10s;
- proxy_connect_timeout 60s;
+ proxy_connect_timeout 5s; # Fail fast, let next_upstream retry
+ proxy_read_timeout 30s;
}
}
For Kubernetes deployments, keep the CPU limit in your nginx container spec in sync with worker_processes:
containers:
  - name: nginx
    resources:
      limits:
        cpu: "2" # worker_processes MUST match this integer value
    # Don't rely on sysctls (they are a pod-level securityContext field, not a container one); set worker_rlimit_nofile in nginx.conf
If you must use auto (dynamic environments), detect the cgroup limit at startup:
- worker_processes auto;
+ # In your entrypoint.sh (bash), before nginx starts:
+ # QUOTA=$(cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us) # cgroup v1 path; -1 means no limit
+ # PERIOD=$(cat /sys/fs/cgroup/cpu/cpu.cfs_period_us)
+ # [ "$QUOTA" -gt 0 ] && CPUS=$(( QUOTA / PERIOD )) || CPUS=$(nproc)
+ # [ "$CPUS" -ge 1 ] || CPUS=1 # limits below one core would otherwise yield 0 workers
+ # sed -i "s/worker_processes auto/worker_processes ${CPUS}/" /etc/nginx/nginx.conf
+ worker_processes auto; # Only safe if NOT running in CPU-limited containers
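The entrypoint snippet above reads the cgroup v1 files. On hosts using cgroup v2 the quota lives in a single cpu.max file instead, so a slightly more defensive sketch checks both layouts. The function name detect_workers and the optional root argument (there mainly to make the function testable) are assumptions, not part of nginx.

```shell
#!/usr/bin/env bash
# detect_workers [cgroup_root]
# Print a worker count derived from the CPU quota, or "auto" when no limit is set.
detect_workers() {
  local root="${1:-/sys/fs/cgroup}" quota="" period=""
  if [ -r "$root/cpu.max" ]; then
    # cgroup v2: "<quota> <period>"; quota is the word "max" when unlimited
    read -r quota period < "$root/cpu.max"
  elif [ -r "$root/cpu/cpu.cfs_quota_us" ]; then
    # cgroup v1: separate files; quota is -1 when unlimited
    quota=$(cat "$root/cpu/cpu.cfs_quota_us")
    period=$(cat "$root/cpu/cpu.cfs_period_us")
  fi
  case "$quota" in
    ""|max|-1) echo auto; return ;;     # no limit found: leave the decision to nginx
  esac
  [ "${period:-0}" -gt 0 ] || { echo auto; return; }
  local cpus=$(( quota / period ))      # integer division rounds down
  [ "$cpus" -ge 1 ] || cpus=1           # never emit 0 workers
  echo "$cpus"
}
```

In an entrypoint this could feed the same substitution, for example sed -i "s/worker_processes auto/worker_processes $(detect_workers)/" /etc/nginx/nginx.conf, which leaves the directive unchanged when no quota is detected.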
Prevention in CI/CD
1. Lint nginx.conf in your pipeline with gixy:
# Dockerfile.ci
RUN pip install gixy && gixy /etc/nginx/nginx.conf
# Catches security misconfigurations such as SSRF vectors, HTTP splitting, and alias traversal. It does not check worker_processes or proxy_next_upstream.
2. Enforce worker_processes policy with OPA/Conftest:
# policy/nginx_workers.rego
package nginx
deny[msg] {
input.worker_processes == "auto"
msg := "worker_processes 'auto' is banned in containerized deployments. Pin to CPU limit integer."
}
deny[msg] {
not input.worker_rlimit_nofile
msg := "worker_rlimit_nofile must be explicitly set. Default ulimit is unsafe."
}
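# Conftest has no nginx parser, so feed it a JSON rendering of the directives (nginx.json here).
# The policy lives in package nginx, so the namespace must be selected explicitly.
conftest test nginx.json --policy policy/ --namespace nginx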
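The Rego policy reads input.worker_processes and input.worker_rlimit_nofile, which assumes the config has already been turned into structured data. One way to get there is a small extraction step, sketched below. The function name nginx_main_to_json is hypothetical, it only looks at those two main-context directives, and it expects one directive per line, so treat it as a starting point and not as a real nginx parser.

```shell
#!/usr/bin/env bash
# nginx_main_to_json <nginx.conf>
# Emit worker_processes / worker_rlimit_nofile from the main context as a JSON object.
nginx_main_to_json() {
  awk '
    BEGIN { depth = 0; n = 0 }
    {
      sub(/#.*/, "")                      # strip comments
      if (depth == 0 && $0 ~ /^[ \t]*(worker_processes|worker_rlimit_nofile)[ \t]+[^;]+;/) {
        key = $1; val = $2
        sub(/;$/, "", val)                # "auto;" -> "auto"
        out[n++] = sprintf("\"%s\": \"%s\"", key, val)
      }
      # track brace depth so directives inside events/http blocks are ignored
      depth += gsub(/[{]/, "{") - gsub(/[}]/, "}")
    }
    END {
      printf "{"
      for (i = 0; i < n; i++) printf "%s%s", (i ? ", " : ""), out[i]
      print "}"
    }
  ' "$1"
}
```

A pipeline step could then run nginx_main_to_json nginx.conf > nginx.json before invoking conftest on nginx.json. Values are emitted as strings, which matches the policy's comparison against "auto".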
3. Smoke-test reload in staging:
#!/bin/bash
# ci/reload_test.sh
# Trigger a reload, then probe the health endpoint 100 times and assert zero 502s
nginx -s reload
for i in $(seq 1 100); do
STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/healthz)
if [ "$STATUS" == "502" ]; then
echo "FAIL: 502 detected post-reload"
exit 1
fi
done
echo "PASS: No 502s during reload window"
4. Prometheus alert on reload-correlated 502 spikes:
# alerts/nginx_reload_502.yaml
- alert: NginxReload502Spike
  expr: |
    increase(nginx_http_requests_total{status="502"}[2m]) > 10
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "502 spike detected, likely nginx reload with worker starvation"