Fixing 502 Bad Gateway in OpenResty: Lua Upstream Module Errors Diagnosed and Resolved
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: OpenResty's lua-resty-upstream or balancer_by_lua_block failed to select a valid peer. Either the keepalive pool is exhausted, the upstream peer is marked down, or ngx.balancer was called outside a valid phase, so every client receives a 502.
- How to fix it: Audit your balancer_by_lua_block logic, increase or properly release keepalive connections, reset failed peer state, and validate upstream health-check thresholds.
The Incident (What Does the Error Mean?)
Your logs look like this:
2024/01/15 03:42:17 [error] 1234#0: *9182 lua balancer: failed to get peer: no available peer, ...
client: 10.0.1.55, server: api.internal, request: "POST /v2/process HTTP/1.1",
upstream: "http://backend_pool/v2/process", host: "api.internal"
2024/01/15 03:42:17 [warn] 1234#0: *9182 upstream server temporarily disabled while connecting to upstream
Or the lua-resty-upstream-healthcheck variant:
[error] peers in upstream "backend_pool" are all marked down after 3 consecutive failures
Immediate consequence: Every request hitting that server {} block returns HTTP 502. OpenResty has no valid peer to proxy to. If you are running a single upstream group without fallback, this is a total service outage.
The Attack Vector / Blast Radius
This is not a security exploit — it is a cascading availability failure with the following blast radius:
- Keepalive pool exhaustion: lua-resty-upstream maintains a per-worker keepalive queue. If your Lua code calls upstream:set_keepalive() inconsistently, or never calls it on error paths, sockets leak. Under load, the pool hits pool_size and new connections are rejected, producing 502s even when backend servers are healthy.
- Peer marked-down death spiral: lua-resty-upstream-healthcheck uses a shared memory zone (lua_shared_dict) to track peer failure counts. If the failure threshold is too low (e.g., fall = 1 in the checker, or max_fails=1 fail_timeout=10s on the server lines), a single slow backend response marks the peer down. With only 2–3 upstreams, one bad deploy marks all peers down simultaneously.
- balancer_by_lua_block phase violation: Calling ngx.balancer.set_current_peer() outside the balancer phase, or failing to call it at all on a code path, leaves Nginx with no peer set, which always produces 502. This is a silent logic bug that only surfaces under specific routing conditions.
- Cascading retry amplification: If proxy_next_upstream is set to retry on 502, a single bad upstream causes Nginx to fan out retries across all peers, multiplying backend load by up to your upstream count and accelerating the death spiral.
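The retry amplification in the last item can be estimated with simple arithmetic. The sketch below is a rough upper bound only: it assumes each request is tried at most once per peer, and the function name worst_case_attempts is made up for illustration. The tries argument models the proxy_next_upstream_tries directive, where 0 means no limit.

```lua
-- Rough upper bound on backend attempts when every attempt fails and is retried.
-- requests: client requests in the window
-- peers:    number of servers in the upstream group
-- tries:    value of proxy_next_upstream_tries (0 or nil = unlimited)
local function worst_case_attempts(requests, peers, tries)
  local per_request = peers
  if tries and tries > 0 and tries < peers then
    per_request = tries
  end
  return requests * per_request
end

print(worst_case_attempts(1000, 3, 0)) -- unlimited retries: 3000 attempts
print(worst_case_attempts(1000, 3, 1)) -- retries capped at one try: 1000 attempts
```

Capping the number of tries trades a little resilience against a single bad peer for a bounded load on the rest of the pool.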
How to Fix It (The Solution)
Root Cause 1: Keepalive Pool Leak
Basic Fix — Always release the connection on every code path:
- local ok, err = upstream:set_keepalive()
- if not ok then
-     ngx.log(ngx.ERR, "failed to set keepalive: ", err)
-     -- BUG: falls through, socket leaked
- end
+ local ok, err = upstream:set_keepalive(60000, 100) -- 60s timeout, pool size 100
+ if not ok then
+     ngx.log(ngx.ERR, "failed to set keepalive: ", err)
+     upstream:close() -- CRITICAL: always close on keepalive failure
+ end
Enterprise Best Practice — Wrap all upstream calls in a pcall with guaranteed cleanup:
- local res, err = upstream:request(opts)
- if not res then
-     ngx.status = 502
-     ngx.exit(502)
- end
+ -- pcall returns: ok, then either the thrown error or request()'s own (res, err)
+ local ok, res, err = pcall(upstream.request, upstream, opts)
+ if not ok or not res then
+     -- thrown error lives in `res`; a normal failure reports it in `err`
+     ngx.log(ngx.ERR, "upstream request failed: ", ok and err or res)
+     upstream:close() -- release socket unconditionally
+     ngx.status = 502
+     return ngx.exit(502)
+ end
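The two fixes above can be combined into one helper so that no code path leaves a socket behind. This is a minimal sketch, not the library's API: request_with_cleanup is a hypothetical name, and it assumes upstream is any object exposing request(), set_keepalive() and close() as in the snippets above, and that the response body has already been consumed before the connection is pooled. The logger is injected so the function can be exercised outside OpenResty.

```lua
-- Hypothetical helper: always ends with either set_keepalive() or close().
local function request_with_cleanup(upstream, opts, log)
  log = log or function() end

  -- pcall turns a thrown Lua error into a return value instead of a crash
  local ok, res, err = pcall(upstream.request, upstream, opts)
  if not ok or not res then
    -- a thrown error arrives in `res`; a normal failure arrives in `err`
    local reason = ok and err or res
    log("upstream request failed: " .. tostring(reason))
    upstream:close()
    return nil, reason
  end

  -- success path: return the socket to the pool, close it if pooling fails
  local kept, kerr = upstream:set_keepalive(60000, 100)
  if not kept then
    log("failed to set keepalive: " .. tostring(kerr))
    upstream:close()
  end
  return res
end
```

Inside OpenResty the logger could be function(msg) ngx.log(ngx.ERR, msg) end, and a nil result would be turned into ngx.exit(502) by the caller.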
Root Cause 2: Peers Marked Down — Misconfigured Health Check Thresholds
  lua_shared_dict healthcheck 1m;
  init_worker_by_lua_block {
      local hc = require "resty.upstream.healthcheck"
      local ok, err = hc.spawn_checker({
          shm = "healthcheck",
          upstream = "backend_pool",
          type = "http",
          http_req = "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n",
-         interval = 1000, -- 1s: too aggressive, causes flapping
-         timeout = 500,
-         fall = 1, -- marks down after 1 failure: too sensitive
-         rise = 1,
+         interval = 5000, -- 5s: stable polling interval
+         timeout = 2000, -- 2s: realistic backend timeout
+         fall = 3, -- require 3 consecutive failures before marking down
+         rise = 2, -- require 2 successes to mark back up
          valid_statuses = {200, 204},
          concurrency = 10,
      })
      if not ok then
          ngx.log(ngx.ERR, "healthcheck spawn failed: ", err)
      end
  }
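To see why fall = 1 flaps while fall = 3 stays stable, the thresholds can be replayed against a sequence of check results. The function below is a simplified model written for illustration, not the health checker's actual code: it assumes a peer starts up, goes down after fall consecutive failures, and comes back after rise consecutive successes.

```lua
-- Replays health-check results (true = pass, false = fail) against
-- fall/rise thresholds and counts how often the peer changes state.
local function count_transitions(results, fall, rise)
  local up, fails, oks, transitions = true, 0, 0, 0
  for _, passed in ipairs(results) do
    if passed then
      oks, fails = oks + 1, 0
      if not up and oks >= rise then
        up, transitions = true, transitions + 1
      end
    else
      fails, oks = fails + 1, 0
      if up and fails >= fall then
        up, transitions = false, transitions + 1
      end
    end
  end
  return transitions, up
end

-- A backend that is healthy but occasionally slow:
local checks = { true, false, true, true, false, true }
print(count_transitions(checks, 1, 1)) -- 4 state changes: the peer flaps
print(count_transitions(checks, 3, 2)) -- 0 state changes: the peer stays up
```

With the aggressive settings every isolated slow response takes the peer out of rotation, which is exactly the condition that lets one bad moment mark a small pool entirely down.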
Root Cause 3: balancer_by_lua_block Missing Peer Assignment
  upstream backend_pool {
      server 0.0.0.1; # placeholder, required by Nginx parser
      balancer_by_lua_block {
          local balancer = require "ngx.balancer"
          local upstream_obj = ngx.ctx.upstream
-         if upstream_obj.peer then
-             balancer.set_current_peer(upstream_obj.peer.host, upstream_obj.peer.port)
-             -- BUG: no else branch, so Nginx has no peer if condition is false → 502
-         end
+         local peer = upstream_obj and upstream_obj.peer
+         if not peer then
+             ngx.log(ngx.ERR, "balancer: no peer available in context")
+             return ngx.exit(502)
+         end
+         local ok, err = balancer.set_current_peer(peer.host, peer.port)
+         if not ok then
+             ngx.log(ngx.ERR, "balancer: set_current_peer failed: ", err)
+             return ngx.exit(502)
+         end
+         -- set timeouts explicitly to avoid inheriting nginx defaults
+         balancer.set_timeouts(1, 5, 5) -- connect, send, read (seconds)
      }
  }
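The balancer block reads ngx.ctx.upstream.peer, but where that value is set is not shown here. One possible arrangement, sketched under assumptions, is to choose the peer in an earlier phase such as access_by_lua_block and skip anything marked down. The pick_peer function, the peers table layout ({ host, port, down }) and the start index are all hypothetical names introduced for this example.

```lua
-- Hypothetical selector: scans the peer list once, starting at `start`,
-- and returns the first peer that is not marked down.
local function pick_peer(peers, start)
  local n = #peers
  for i = 0, n - 1 do
    local peer = peers[(start + i - 1) % n + 1]
    if not peer.down then
      return peer
    end
  end
  return nil, "no available peer"
end

local peers = {
  { host = "10.0.0.1", port = 8080, down = true },
  { host = "10.0.0.2", port = 8080 },
}
local peer, err = pick_peer(peers, 1)
print(peer and peer.host or err) -- 10.0.0.2
```

In an access-phase handler the result would be stored as ngx.ctx.upstream = { peer = peer }, and a nil result can be rejected there, before the request ever reaches the balancer phase.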
Prevention in CI/CD
1. Lua Static Analysis in Pipeline
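Add luacheck to your CI. It has no built-in rule for a missing close() call, but it catches undefined globals, unused variables, and shadowed locals, which are common signs of a broken error path:
-- .luacheckrc
std = "ngx_lua"
ignore = {"611", "612"} -- whitespace-only warnings
globals = {"ngx", "upstream", "balancer"}
-- luacheck cannot pair upstream:request with close/set_keepalive on its own;
-- enforce that rule in code review or with a custom check.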
# .github/workflows/lint.yml
- name: Lua upstream lint
  run: |
    luarocks install luacheck
    # luacheck exits non-zero on any warning, which fails this step
    luacheck lua/ --config .luacheckrc
2. Nginx Config Validation Gate
# Run in CI before every deploy
docker run --rm -v "$(pwd)/nginx:/etc/nginx" openresty/openresty:alpine \
    openresty -t -c /etc/nginx/nginx.conf
3. Shared Dict Sizing Check (OPA Policy)
If you use Terraform to provision OpenResty config maps, add an OPA/Conftest policy:
# policy/openresty.rego
package openresty

deny[msg] {
    input.lua_shared_dict_size_mb < 5
    msg := "lua_shared_dict for healthcheck must be >= 5MB to prevent peer state eviction under load"
}
4. Synthetic Canary on Upstream Health Endpoint
Deploy a Prometheus blackbox exporter probe against /health on each upstream. Alert after 2 consecutive failures, one fewer than the fall = 3 threshold, so the alert fires before OpenResty marks the peer down. That leaves roughly one check interval to intervene before 502s hit clients.
# prometheus/blackbox_targets.yml
- targets:
    - http://backend-1.internal:8080/health
    - http://backend-2.internal:8080/health
  labels:
    upstream_pool: backend_pool
    alert_threshold: "2_consecutive_failures"
Rule: If OpenResty's fall counter trips before your synthetic probe alerts, the probe interval is too long. Tighten the probe interval, not the fall threshold.
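The lead time between a probe alert and the peer being marked down can be estimated with simple arithmetic. This sketch assumes the probe and the health checker see the same consecutive failures at the same interval, which is an idealization; warning_window is a name made up for the example.

```lua
-- Seconds between an alert after `alert_after` consecutive failures
-- and the peer being marked down after `fall` consecutive failures.
local function warning_window(interval_s, fall, alert_after)
  if alert_after >= fall then
    return 0 -- the alert fires no earlier than the mark-down
  end
  return (fall - alert_after) * interval_s
end

print(warning_window(5, 3, 2))  -- 5: one interval of lead time at 5s checks
print(warning_window(10, 3, 2)) -- 10: a longer interval widens the window
print(warning_window(5, 3, 3))  -- 0: alerting at the fall threshold gives no warning
```

A longer interval buys more reaction time but also delays detection of a genuinely dead peer, so the two settings should be tuned together.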