Fixing 'No Server Is Available for Upstream' in Nginx Dynamic Upstream with Consul DNS Resolver
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Nginx's dynamic upstream resolution lost every healthy Consul endpoint. Either the cached DNS answer expired with no valid= override, the resolver directive is missing or misconfigured, or all Consul service instances deregistered and Nginx has no fallback.
- How to fix it: Add a resolver directive pointing at your Consul agent, set valid= to a short TTL, use set $upstream variable-based proxying to force runtime DNS re-resolution, and configure proxy_next_upstream with a fallback.
The Incident (What Does the Error Mean?)
You will see this in /var/log/nginx/error.log:
2024/01/15 03:42:17 [error] 1234#1234: *58291 no servers are available while connecting to upstream,
client: 10.0.1.45, server: api.internal, request: "GET /health HTTP/1.1",
upstream: "consul_backend", host: "api.internal"
Nginx evaluated the upstream group named consul_backend (or equivalent) at request time and found zero resolvable, reachable peers. This is a hard 502. Every in-flight request to this upstream fails instantly — there is no retry to another backend, no queue, no graceful degradation. Your service is down.
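To gauge how widespread the failure is, it can help to count these errors per upstream. The following is a minimal sketch, assuming each error occupies a single line in the log and uses the exact wording shown above; the function name count_upstream_failures is hypothetical.

```python
import re
from collections import Counter

# Captures the quoted upstream name from an error-log line like the sample above.
UPSTREAM_RE = re.compile(
    r'no servers are available while connecting to upstream.*?upstream: "([^"]+)"'
)

def count_upstream_failures(lines):
    """Return a Counter mapping upstream name -> number of matching error lines."""
    counts = Counter()
    for line in lines:
        match = UPSTREAM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for /var/log/nginx/error.log as lines works, since the function only iterates over it once.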
The Attack Vector / Blast Radius
This is a cascading availability failure, not a single-service blip:
- DNS TTL expiry with no re-resolution: If you used a static upstream {} block with a Consul FQDN (e.g., server myservice.service.consul), Nginx resolves this once at startup. When Consul rotates IPs after a deployment or health-check failure, Nginx is still pointing at stale, dead IPs. The resolver directive with valid= is the only escape hatch.
- Consul health check cascade: A rolling deployment deregisters old instances before new ones pass their health checks. The window, even 10–30 seconds, is enough for Nginx to have zero valid upstreams. Without proxy_next_upstream error timeout and a backup server or keepalive pool, every request in that window hard-fails.
- Shared memory zone exhaustion: If zone consul_zone 64k is undersized for the number of dynamic peers being tracked, Nginx silently drops peer state. Under high upstream churn (frequent Consul re-registrations), this causes intermittent no servers available errors even when Consul reports healthy nodes.
- Missing resolve flag on server directive (nginx-plus or OpenResty): Without the resolve keyword, the upstream server address is resolved once. This is the single most common misconfiguration in Consul-integrated setups.
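The difference between resolving once at startup and re-resolving after valid= seconds can be illustrated with a small simulation. This is a sketch of the caching idea only, not Nginx's resolver implementation; the TtlResolver class, the fake clock, and the 10.0.0.x addresses are all hypothetical.

```python
import time

class TtlResolver:
    """Caches a lookup result for `valid` seconds, then looks it up again."""

    def __init__(self, lookup, valid, clock=time.monotonic):
        self._lookup = lookup   # callable: name -> list of addresses
        self._valid = valid     # seconds a cached answer may be reused
        self._clock = clock     # injectable so the behaviour can be simulated
        self._cache = {}        # name -> (expires_at, addresses)

    def resolve(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[0]:
            return cached[1]    # still within the valid= window
        addresses = self._lookup(name)
        self._cache[name] = (now + self._valid, addresses)
        return addresses

# Simulated deployment: the service moves from one address to another.
records = {"myservice.service.consul": ["10.0.0.1"]}
now = [0.0]
resolver = TtlResolver(lambda n: list(records[n]), valid=10, clock=lambda: now[0])

startup_answer = resolver.resolve("myservice.service.consul")  # all a static upstream ever sees
records["myservice.service.consul"] = ["10.0.0.2"]             # Consul rotates the instance
now[0] = 5.0
within_ttl = resolver.resolve("myservice.service.consul")      # cached, still the old address
now[0] = 11.0
after_ttl = resolver.resolve("myservice.service.consul")       # expired, picks up the new address
```

A static upstream block behaves like startup_answer for the lifetime of the worker, which is why the stale address persists until a reload.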
How to Fix It
Basic Fix — Variable-Based Proxy Pass (OSS Nginx)
OSS Nginx releases before 1.27.3 do not support the resolve flag on server directives. On those versions, the only way to force runtime DNS re-resolution is to use a variable in proxy_pass. Nginx then consults the resolver at request time and reuses each answer for the valid= interval.
- upstream consul_backend {
- server myservice.service.consul:8080;
- }
-
- server {
- location /api/ {
- proxy_pass http://consul_backend;
- }
- }
+ # Step 1: Define the Consul agent as the resolver
+ # Use your local Consul agent (127.0.0.1:8600) or dnsmasq forwarding port 53
+ resolver 127.0.0.1:8600 valid=10s ipv6=off;
+ resolver_timeout 5s;
+
+ server {
+ location /api/ {
+ # Step 2: Assign FQDN to a variable — this is what forces re-resolution
+ set $consul_upstream "myservice.service.consul";
+
+ proxy_pass http://$consul_upstream:8080;
+ proxy_connect_timeout 3s;
+ proxy_read_timeout 60s;
+
+ # Step 3: Retry on failure, but NOT on non-idempotent methods
+ proxy_next_upstream error timeout http_502 http_503;
+ proxy_next_upstream_tries 3;
+ proxy_next_upstream_timeout 10s;
+ }
+ }
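⚠️ Critical: When you use set $upstream and proxy_pass http://$upstream, Nginx bypasses the upstream {} block entirely. You lose the load-balancing controls that block provides (balancing method, keepalive pool, failure state shared across requests); Nginx simply picks among the addresses in the current DNS answer. See the Enterprise fix below for the correct multi-instance pattern.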
Enterprise Best Practice — OpenResty / Nginx Plus with resolve Flag + Keepalive + Zone
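For production multi-instance setups, use Nginx Plus or OpenResty with lua-resty-dns to get proper dynamic upstream resolution with health checks. The configuration below relies on the resolve flag; OpenResty builds based on an Nginx core older than 1.27.3 do not accept that flag and need the resolution done in Lua instead.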
- upstream consul_backend {
- server myservice.service.consul:8080;
- keepalive 32;
- }
+ # Nginx Plus / OpenResty configuration
+ upstream consul_backend {
+ # Balancing method: must be declared before 'keepalive'
+ least_conn;
+
+ # 'resolve' flag enables continuous DNS re-resolution via the resolver directive
+ server myservice.service.consul:8080 resolve;
+
+ # Shared memory zone: size for ~1000 peers = 128k minimum
+ zone consul_backend_zone 256k;
+
+ # Keepalive pool: reuse connections to Consul-registered instances
+ keepalive 64;
+ keepalive_requests 1000;
+ keepalive_timeout 60s;
+
+ # Hard fallback: if a known-stable instance exists, add it here as a 'backup' server
+ }
+
+ # Resolver MUST be defined at http{} context for 'resolve' to work
+ resolver 127.0.0.1:8600 valid=5s ipv6=off;
+ resolver_timeout 3s;
+
+ server {
+ location /api/ {
+ proxy_pass http://consul_backend;
+
+ proxy_http_version 1.1;
+ proxy_set_header Connection "";
+
+ proxy_connect_timeout 3s;
+ proxy_send_timeout 30s;
+ proxy_read_timeout 60s;
+
+ proxy_next_upstream error timeout http_502 http_503 http_504;
+ proxy_next_upstream_tries 3;
+ proxy_next_upstream_timeout 15s;
+
+ # Serve a custom body instead of the Nginx default error page (the original status code is kept)
+ error_page 502 503 504 /upstream_unavailable.html;
+ }
+ }
Consul-side fix (tighten the health check interval and timeout):
{
"service": {
"name": "myservice",
"port": 8080,
"check": {
"http": "http://localhost:8080/health",
- "interval": "30s",
- "timeout": "10s"
+ "interval": "5s",
+ "timeout": "2s",
+ "deregister_critical_service_after": "30s"
}
}
}
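The deregister_critical_service_after field is the critical one for catalog hygiene: without it, a dead instance stays in Consul's catalog indefinitely in the critical state. Consul's DNS interface already leaves out instances whose checks are critical, so it is the short check interval that limits how long Nginx can be handed a dead address. Note that Consul enforces a minimum of one minute for this field and reaps critical services about every 30 seconds, so a 30s value takes effect somewhat later than written.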
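To see what the tighter check values buy, a rough upper bound on the stale window can be worked out by hand. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a formula from Consul or Nginx: it assumes the check fails on the first attempt, Consul's own DNS TTL is left at zero, and the resolver uses valid=10s as in the basic fix. The function name is hypothetical.

```python
def worst_case_stale_seconds(check_interval, check_timeout, resolver_valid):
    """Rough upper bound on how long Nginx may keep using a dead instance.

    Assumes the instance dies right after a passing check, the next check
    starts a full interval later and takes the whole timeout to fail, and
    Nginx cached the DNS answer just before Consul marked the check critical.
    """
    return check_interval + check_timeout + resolver_valid

before = worst_case_stale_seconds(30, 10, 10)  # original check settings
after = worst_case_stale_seconds(5, 2, 10)     # tightened check settings
```

Under these assumptions the window shrinks from 50 seconds to 17 seconds; lowering valid= to 5s, as in the Enterprise configuration, brings it to 12.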
Prevention in CI/CD
1. Lint Nginx Configs in Pre-Merge Checks
Use nginx -t in a Docker stage gate and crossplane for structured config parsing:
# In your CI pipeline (GitHub Actions / GitLab CI)
- name: Validate Nginx Config
  run: |
    docker run --rm \
      -v "$(pwd)/nginx:/etc/nginx:ro" \
      nginx:stable nginx -t
    pip install crossplane
    # Parse the checked-out config; jq -e fails the step when errors are present
    crossplane parse nginx/nginx.conf | jq -e '.errors | length == 0'
2. OPA Policy — Enforce resolver Directive When Consul FQDNs Are Present
package nginx.consul

deny[msg] {
    # Detect .service.consul FQDNs in proxy_pass without a resolver directive
    block := input.config.servers[_].locations[_]
    contains(block.proxy_pass, ".service.consul")
    not input.config.resolver
    msg := "POLICY VIOLATION: proxy_pass to .service.consul FQDN requires a resolver directive with valid= TTL"
}

deny[msg] {
    # Enforce variable-based proxy_pass for OSS Nginx (no 'resolve' flag support)
    block := input.config.servers[_].locations[_]
    contains(block.proxy_pass, ".service.consul")
    not startswith(block.proxy_pass, "http://$")
    not input.config.upstream_plus_resolve
    msg := "POLICY VIOLATION: OSS Nginx requires variable-based proxy_pass for dynamic Consul resolution"
}
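If OPA is not available in your pipeline, the same intent can be approximated directly on crossplane's JSON output. The following is a sketch, assuming the directive tree shape crossplane emits (dictionaries with directive, args, line, and an optional block list); the function names are hypothetical, and the upstream server check is an addition that the Rego rules above do not cover.

```python
CONSUL_SUFFIX = ".service.consul"

def walk(directives):
    """Yield every directive in a crossplane-style parse tree, depth first."""
    for d in directives:
        yield d
        yield from walk(d.get("block", []))

def consul_violations(parsed):
    """Return policy violation messages for one file's 'parsed' directive list."""
    directives = list(walk(parsed))
    has_resolver = any(d["directive"] == "resolver" for d in directives)
    uses_consul = False
    violations = []

    for d in directives:
        args = d.get("args", [])
        if not any(CONSUL_SUFFIX in a for a in args):
            continue
        uses_consul = True
        line = d.get("line", "?")
        if d["directive"] == "proxy_pass" and "$" not in args[0]:
            # A literal hostname in proxy_pass is resolved only when the config loads.
            violations.append(f"line {line}: literal Consul FQDN in proxy_pass")
        elif d["directive"] == "server" and "resolve" not in args:
            violations.append(f"line {line}: upstream server lacks the 'resolve' parameter")

    if uses_consul and not has_resolver:
        violations.append("Consul FQDN used but no resolver directive found")
    return violations
```

Each entry in the config list of crossplane's output carries its own parsed list, so a resolver defined in a different included file would be missed; merging the lists before calling consul_violations avoids that false positive.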
3. Consul Health Check Validation in Terraform
resource "consul_service" "myservice" {
name = "myservice"
port = 8080
check {
- interval = "30s"
+ interval = "5s"
+ timeout = "2s"
+ deregister_critical_service_after = "30s"
http = "http://localhost:8080/health"
}
}
4. Synthetic Canary Monitor
Deploy a Prometheus blackbox exporter probe that hits your Nginx upstream endpoint every 15 seconds and alerts if probe_http_status_code returns 502 for more than 30 seconds. This catches the Consul deregistration window before it becomes a full outage.
# prometheus/blackbox_rules.yml
- alert: NginxConsulUpstreamDead
  expr: probe_http_status_code{job="nginx_consul_probe"} == 502
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Nginx Consul upstream returning 502: no healthy servers available"
    runbook: "https://wiki.internal/runbooks/nginx-consul-upstream"