Fixing 502 Bad Gateway in OpenResty: Lua Upstream Module Errors Diagnosed and Resolved

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

  • What broke: OpenResty's lua-resty-upstream or balancer_by_lua_block failed to select a valid peer — either the keepalive pool is exhausted, the upstream peer is marked down, or ngx.balancer was called outside a valid phase, returning 502 to every client.
  • How to fix it: Audit your balancer_by_lua_block logic, increase or properly release keepalive connections, reset failed peer state, and validate upstream health-check thresholds.

The Incident (What Does the Error Mean?)

Your logs look like this:

2024/01/15 03:42:17 [error] 1234#0: *9182 lua balancer: failed to get peer: no available peer, ...
client: 10.0.1.55, server: api.internal, request: "POST /v2/process HTTP/1.1",
upstream: "http://backend_pool/v2/process", host: "api.internal"
2024/01/15 03:42:17 [warn]  1234#0: *9182 upstream server temporarily disabled while connecting to upstream

Or the lua-resty-upstream-healthcheck variant:

[error] peers in upstream "backend_pool" are all marked down after 3 consecutive failures

Immediate consequence: Every request hitting that server {} block returns HTTP 502. OpenResty has no valid peer to proxy to. If you are running a single upstream group without fallback, this is a total service outage.


The Attack Vector / Blast Radius

This is not a security exploit — it is a cascading availability failure with the following blast radius:

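  1. Keepalive pool exhaustion: The keepalive pool behind lua-resty-upstream is per worker. If your Lua code calls upstream:set_keepalive() inconsistently, or never calls it on error paths, connections are not returned to the pool and every request opens a new one. Under load this exhausts file descriptors, ephemeral ports, or backend connection limits, and if connect() is configured with pool_size and backlog, new connections are rejected once the queue is full. Either way the result is 502s even when backend servers are healthy.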

  2. Peer marked-down death spiral: lua-resty-upstream-healthcheck uses a shared memory zone (lua_shared_dict) to track peer failure counts. If fall is too low and timeout is too short (e.g., fall = 1, timeout = 500), a single slow backend response marks the peer down. The passive max_fails and fail_timeout parameters on server lines have the same effect when set too tightly (e.g., max_fails=1, fail_timeout=10s). With only 2–3 upstreams, one bad deploy marks all peers down simultaneously.

  3. balancer_by_lua_block phase violation: Calling ngx.balancer.set_current_peer() outside the balancer phase fails with an error, and a code path that never calls it leaves Nginx to fall back to the servers declared in the upstream block. When that is only an unroutable placeholder such as 0.0.0.1, the connection attempt fails and the client gets a 502. This is a silent logic bug that only surfaces under specific routing conditions.

  4. Cascading retry amplification: If proxy_next_upstream includes http_502, a single bad upstream causes Nginx to retry each failed request on the next peer. Without a proxy_next_upstream_tries limit, this can multiply backend load by up to your upstream count, accelerating the death spiral.
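
The amplification in item 4 can be estimated with simple arithmetic. What follows is a minimal sketch in plain Lua, assuming every attempt fails and each request is tried at most once per peer; the function name and the traffic numbers are hypothetical, not taken from any real deployment.

```lua
-- Worst-case upstream attempts per second when every attempt fails and
-- Nginx moves on to the next peer.
-- tries: the proxy_next_upstream_tries value; 0 means "no limit", so the
--        number of peers is the effective cap in this simplified model.
local function worst_case_attempts(client_rps, peer_count, tries)
    local per_request = peer_count
    if tries > 0 and tries < peer_count then
        per_request = tries
    end
    return client_rps * per_request
end

print(worst_case_attempts(200, 3, 0))  -- no try limit, 3 peers
print(worst_case_attempts(200, 3, 2))  -- proxy_next_upstream_tries 2
```

Capping the try count is usually cheaper than removing http_502 from proxy_next_upstream, because a single retry still masks one-off backend failures.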


How to Fix It (The Solution)

Root Cause 1: Keepalive Pool Leak

Basic Fix — Always release the connection on every code path:

- local ok, err = upstream:set_keepalive()
- if not ok then
-     ngx.log(ngx.ERR, "failed to set keepalive: ", err)
-     -- BUG: falls through, socket leaked
- end

+ local ok, err = upstream:set_keepalive(60000, 100)  -- 60s timeout, pool size 100
+ if not ok then
+     ngx.log(ngx.ERR, "failed to set keepalive: ", err)
+     upstream:close()  -- CRITICAL: always close on keepalive failure
+ end

Enterprise Best Practice — Wrap all upstream calls in a pcall with guaranteed cleanup:

- local res, err = upstream:request(opts)
- if not res then
-     ngx.status = 502
-     ngx.exit(502)
- end

+ local ok, res, err = pcall(upstream.request, upstream, opts)
+ if not ok then
+     err, res = res, nil        -- pcall failed: its second value is the error
+ end
+ if not res then
+     ngx.log(ngx.ERR, "upstream request failed: ", err)
+     upstream:close()           -- release socket unconditionally
+     ngx.status = 502
+     return ngx.exit(502)
+ end
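
The "release on every code path" rule can also be factored into one helper so individual call sites cannot forget it. This is a sketch under stated assumptions: with_connection and new_stub are hypothetical names, and the connection is assumed to expose set_keepalive() and close() the way the upstream object in the snippets above does. The stub exists only so the example runs outside OpenResty.

```lua
-- with_connection(conn, fn): run fn(conn) and release conn on every path.
-- Returns fn's first result on success, or nil plus an error.
local function with_connection(conn, fn)
    local ok, res, err = pcall(fn, conn)
    if not ok then
        conn:close()            -- fn raised: res holds the error value
        return nil, res
    end
    if not res then
        conn:close()            -- fn returned nil, err: do not reuse the socket
        return nil, err
    end
    local kept = conn:set_keepalive(60000, 100)
    if not kept then
        conn:close()            -- could not pool it, so close it explicitly
    end
    return res
end

-- Hypothetical stand-in connection that counts how it was released.
local function new_stub()
    local stub = { closed = 0, pooled = 0 }
    function stub:close() self.closed = self.closed + 1; return true end
    function stub:set_keepalive() self.pooled = self.pooled + 1; return 1 end
    return stub
end

local conn = new_stub()
local res, err = with_connection(conn, function() return nil, "timeout" end)
print(res, err, conn.closed, conn.pooled)
```

The design choice here is that a failed request always closes the socket instead of pooling it, since a connection in an unknown state should not be handed to the next request.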

Root Cause 2: Peers Marked Down — Misconfigured Health Check Thresholds

  lua_shared_dict healthcheck 1m;

  init_worker_by_lua_block {
      local hc = require "resty.upstream.healthcheck"
      local ok, err = hc.spawn_checker({
          shm = "healthcheck",
          upstream = "backend_pool",
          type = "http",
          http_req = "GET /health HTTP/1.0\r\nHost: backend\r\n\r\n",
-         interval = 1000,       -- 1s: too aggressive, causes flapping
-         timeout = 500,
-         fall = 1,              -- marks down after 1 failure: too sensitive
-         rise = 1,
+         interval = 5000,       -- 5s: stable polling interval
+         timeout = 2000,        -- 2s: realistic backend timeout
+         fall = 3,              -- require 3 consecutive failures before marking down
+         rise = 2,              -- require 2 successes to mark back up
          valid_statuses = {200, 204},
          concurrency = 10,
      })
      if not ok then
          ngx.log(ngx.ERR, "healthcheck spawn failed: ", err)
      end
  }
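
To see why fall = 1 flaps, it helps to replay a probe history against the thresholds. The sketch below is a simplified model of consecutive-failure counting in plain Lua, not the healthcheck library's actual code; count_mark_downs and the sample history are hypothetical.

```lua
-- Replay probe results (true = healthy, false = failed) against fall/rise
-- thresholds. Returns how many times the peer was marked down and whether
-- it is still down at the end.
local function count_mark_downs(results, fall, rise)
    local down, fails, oks, mark_downs = false, 0, 0, 0
    for _, healthy in ipairs(results) do
        if healthy then
            oks, fails = oks + 1, 0
            if down and oks >= rise then
                down = false
            end
        else
            fails, oks = fails + 1, 0
            if not down and fails >= fall then
                down = true
                mark_downs = mark_downs + 1
            end
        end
    end
    return mark_downs, down
end

local one_slow_probe = { true, true, false, true, true }
print(count_mark_downs(one_slow_probe, 1, 1))  -- old thresholds
print(count_mark_downs(one_slow_probe, 3, 2))  -- new thresholds
```

With the old thresholds the single failed probe takes the peer out of rotation once; with fall = 3 the same history never marks it down.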

Root Cause 3: balancer_by_lua_block Missing Peer Assignment

  upstream backend_pool {
      server 0.0.0.1;  # placeholder, required by Nginx parser
      balancer_by_lua_block {
          local balancer = require "ngx.balancer"
          local upstream_obj = ngx.ctx.upstream

-         if upstream_obj.peer then
-             balancer.set_current_peer(upstream_obj.peer.host, upstream_obj.peer.port)
-             -- BUG: no else branch; when the condition is false Nginx falls back to the 0.0.0.1 placeholder → 502
-         end

+         local peer = upstream_obj and upstream_obj.peer
+         if not peer then
+             ngx.log(ngx.ERR, "balancer: no peer available in context")
+             return ngx.exit(502)
+         end
+         local ok, err = balancer.set_current_peer(peer.host, peer.port)
+         if not ok then
+             ngx.log(ngx.ERR, "balancer: set_current_peer failed: ", err)
+             return ngx.exit(502)
+         end
+         -- set timeouts explicitly to avoid inheriting nginx defaults
+         balancer.set_timeouts(1, 5, 5)  -- connect, send, read (seconds)
      }
  }
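
The balancer block above only reads ngx.ctx.upstream.peer, so something earlier in the request has to choose that peer. A minimal sketch of such a selection step in plain Lua follows; pick_peer and the peer table shape (host, port, down) are assumptions for illustration, not part of any library API.

```lua
-- pick_peer(peers, start): return the first peer at or after index `start`
-- (wrapping around) that is not flagged down, plus its index.
-- Returns nil and a reason when every peer is down or the list is empty.
local function pick_peer(peers, start)
    local n = #peers
    for offset = 0, n - 1 do
        local idx = (start - 1 + offset) % n + 1
        local peer = peers[idx]
        if not peer.down then
            return peer, idx
        end
    end
    return nil, "no available peer"
end

local peers = {
    { host = "10.0.0.1", port = 8080, down = true },
    { host = "10.0.0.2", port = 8080 },
}
local peer, idx = pick_peer(peers, 1)
print(peer.host, peer.port, idx)
```

Returning nil with a reason lets the caller respond with an explicit error or a fallback pool before the request ever reaches the balancer phase.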


Prevention in CI/CD

1. Lua Static Analysis in Pipeline

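Add luacheck to your CI. It flags undefined globals, unused variables, and shadowed locals in your upstream code. It does not track whether every request() is paired with close() or set_keepalive(), so keep reviewing error paths by hand:

-- .luacheckrc
std = "ngx_lua"
ignore = {"611", "612"}
globals = {"ngx", "upstream", "balancer"}
-- luacheck has no rule for unpaired close/set_keepalive calls; review those paths manually

# .github/workflows/lint.yml
- name: Lua upstream lint
  run: |
    luarocks install luacheck
    # luacheck exits non-zero on any warning, which fails the job
    luacheck lua/ --config .luacheckrc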

2. Nginx Config Validation Gate

# Run in CI before every deploy
docker run --rm -v "$(pwd)/nginx:/etc/nginx" openresty/openresty:alpine \
    openresty -t -c /etc/nginx/nginx.conf

3. Shared Dict Sizing Check (OPA Policy)

If you use Terraform to provision OpenResty config maps, add an OPA/Conftest policy:

# policy/openresty.rego
package openresty

deny[msg] {
    input.lua_shared_dict_size_mb < 5
    msg := "lua_shared_dict for healthcheck must be >= 5MB to prevent peer state eviction under load"
}

4. Synthetic Canary on Upstream Health Endpoint

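Deploy a Prometheus blackbox exporter probe against /health on each upstream. Alert after 2 consecutive probe failures, one short of the fall = 3 threshold, so the alert fires before OpenResty marks the peer down. With interval = 5000 and a probe on the same schedule, that leaves roughly one check interval (about 5 seconds) to intervene before 502s hit clients, and more if the probe runs on a shorter interval.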

# prometheus/blackbox_targets.yml
- targets:
  - http://backend-1.internal:8080/health
  - http://backend-2.internal:8080/health
  labels:
    upstream_pool: backend_pool
    alert_threshold: "2_consecutive_failures"

Rule: If OpenResty marks a peer down before your synthetic probe has fired, the probe interval is too long. Tighten the probe interval, not the fall threshold.
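
The lead time between the probe alert and the peer being marked down can be estimated before you commit to an interval. The sketch below is plain Lua under simplifying assumptions: both checkers see the first failure at the same moment, timeouts are ignored, and lead_time_ms is a hypothetical helper name.

```lua
-- Milliseconds between the probe alert and OpenResty marking the peer down.
-- check_interval_ms / fall:         health-check settings on the OpenResty side
-- probe_interval_ms / alert_after:  probe interval and failures needed to alert
local function lead_time_ms(check_interval_ms, fall, probe_interval_ms, alert_after)
    local marked_down_at = (fall - 1) * check_interval_ms
    local alert_at = (alert_after - 1) * probe_interval_ms
    return marked_down_at - alert_at
end

print(lead_time_ms(5000, 3, 5000, 2))  -- probe on the same 5s interval
print(lead_time_ms(5000, 3, 2000, 2))  -- probe tightened to 2s
```

A zero or negative result means the peer is already down by the time the alert fires, which is the case where the probe interval needs tightening.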
