Fixing 'No Server Is Available for Upstream' in Nginx Dynamic Upstream with Consul DNS Resolver

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: Nginx's dynamic upstream resolver lost all healthy Consul endpoints — either DNS TTL expired with no valid= override, the resolver directive is missing/misconfigured, or all Consul service instances deregistered and Nginx has no fallback.
  • How to fix it: Add a resolver directive pointing at your Consul agent, set valid= TTL, use set $upstream variable-based proxying to force runtime DNS re-resolution, and configure proxy_next_upstream with a fallback.

The Incident (What Does the Error Mean?)

You will see this in /var/log/nginx/error.log:

2024/01/15 03:42:17 [error] 1234#1234: *58291 no servers are available while connecting to upstream,
client: 10.0.1.45, server: api.internal, request: "GET /health HTTP/1.1",
upstream: "consul_backend", host: "api.internal"

Nginx evaluated the upstream group named consul_backend (or your equivalent) at request time and found no resolvable, reachable peers. It returns a 502 immediately: with no peer left to try, proxy_next_upstream has nothing to retry against, and there is no queue and no graceful degradation. Until a peer reappears, every request to this upstream fails.
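To gauge how widespread the failure is, it can help to count these errors per upstream name. The following is a minimal sketch: the regular expression is an assumption derived from the sample line above, and the function name is illustrative. Adjust both to your actual log format.

```python
import re
from collections import Counter

# Matches the error shown above and captures the upstream name.
# The pattern is an assumption based on that sample line.
PATTERN = re.compile(
    r'no servers are available while connecting to upstream.*?upstream: "([^"]+)"'
)

def count_upstream_failures(lines):
    """Return a Counter mapping upstream name -> number of matching errors."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Nginx writes each error as a single line; the sample is joined accordingly.
sample = [
    '2024/01/15 03:42:17 [error] 1234#1234: *58291 no servers are available '
    'while connecting to upstream, client: 10.0.1.45, server: api.internal, '
    'request: "GET /health HTTP/1.1", upstream: "consul_backend", host: "api.internal"',
    '2024/01/15 03:42:18 [notice] 1234#1234: signal process started',
]
print(count_upstream_failures(sample))
```

In practice you would pass an open file handle for /var/log/nginx/error.log instead of the sample list.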


The Attack Vector / Blast Radius

This is a cascading availability failure, not a single-service blip:

  1. DNS TTL expiry with no re-resolution: If you use a static upstream {} block with a Consul FQDN (e.g., server myservice.service.consul), Nginx resolves the name once at startup or reload and ignores the DNS TTL afterwards. When Consul rotates IPs after a deployment or health-check failure, Nginx keeps pointing at stale, dead IPs. A resolver directive with valid= helps only when something triggers runtime resolution: a variable in proxy_pass, or the resolve flag on the server line.

  2. Consul health check cascade: A rolling deployment can deregister old instances before new ones pass their health checks. That window, even at 10–30 seconds, is enough for Nginx to have zero valid upstreams. Without proxy_next_upstream error timeout and a backup server to fall back on, every request in that window hard-fails.

  3. Shared memory zone exhaustion: If zone consul_zone 64k is undersized for the number of dynamic peers being tracked, Nginx silently drops peer state. Under high upstream churn (frequent Consul re-registrations), this causes intermittent no servers available even when Consul reports healthy nodes.

  4. Missing resolve flag on server directive (Nginx Plus, or open-source Nginx 1.27.3 and later): Without the resolve keyword, the upstream server address is resolved once, at startup or reload. This is the single most common misconfiguration in Consul-integrated setups.
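The difference between resolving once and re-resolving on a TTL can be sketched as a small simulation. This is a toy model, not Nginx's implementation: the class names, the fake DNS object, and the integer clock are all assumptions made for illustration.

```python
class FakeDns:
    """Stand-in for Consul DNS: returns whatever IPs are currently registered."""
    def __init__(self, ips):
        self.ips = list(ips)

    def lookup(self, name):
        return list(self.ips)


class ResolveOnce:
    """Mimics a static upstream server: resolved at startup, never again."""
    def __init__(self, dns, name):
        self.peers = dns.lookup(name)

    def get_peers(self, now):
        return self.peers


class ResolveWithTtl:
    """Mimics runtime resolution with a cache lifetime, like resolver valid=10s."""
    def __init__(self, dns, name, valid):
        self.dns, self.name, self.valid = dns, name, valid
        self.peers = dns.lookup(name)   # first lookup happens at t=0
        self.expires = valid

    def get_peers(self, now):
        if now >= self.expires:         # cache expired: ask DNS again
            self.peers = self.dns.lookup(self.name)
            self.expires = now + self.valid
        return self.peers


dns = FakeDns(["10.0.0.1"])
static = ResolveOnce(dns, "myservice.service.consul")
dynamic = ResolveWithTtl(dns, "myservice.service.consul", valid=10)

dns.ips = ["10.0.0.2"]  # a deployment replaces the instance right after startup

stale_static = static.get_peers(now=60)    # still the old IP
before_expiry = dynamic.get_peers(now=5)   # cached answer, still the old IP
after_expiry = dynamic.get_peers(now=12)   # cache expired, new IP picked up
print(stale_static, before_expiry, after_expiry)
```

The static variant never recovers without a reload, while the TTL variant is stale for at most the valid= period.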


How to Fix It

Basic Fix — Variable-Based Proxy Pass (OSS Nginx)

Open-source Nginx before 1.27.3 does not support the resolve flag on server directives. On those versions, the only way to force runtime DNS re-resolution is to use a variable in proxy_pass. Nginx then resolves the name at request time through the resolver and caches the answer for the valid= period.

- upstream consul_backend {
-     server myservice.service.consul:8080;
- }
-
- server {
-     location /api/ {
-         proxy_pass http://consul_backend;
-     }
- }

+ # Step 1: Define the Consul agent as the resolver
+ # Use your local Consul agent (127.0.0.1:8600) or dnsmasq forwarding port 53
+ resolver 127.0.0.1:8600 valid=10s ipv6=off;
+ resolver_timeout 5s;
+
+ server {
+     location /api/ {
+         # Step 2: Assign FQDN to a variable — this is what forces re-resolution
+         set $consul_upstream "myservice.service.consul";
+
+         proxy_pass          http://$consul_upstream:8080;
+         proxy_connect_timeout 3s;
+         proxy_read_timeout    60s;
+
+         # Step 3: Retry on failure, but NOT on non-idempotent methods
+         proxy_next_upstream error timeout http_502 http_503;
+         proxy_next_upstream_tries 3;
+         proxy_next_upstream_timeout 10s;
+     }
+ }

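⚠️ Critical: When you use set $upstream and proxy_pass http://$upstream, Nginx bypasses the upstream {} block entirely. It still rotates across the A records Consul returns, but you lose everything the upstream {} block provides: keepalive connection pooling, balancing methods such as least_conn, and per-peer failure tracking. See the Enterprise fix below for the multi-instance pattern that keeps them.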


Enterprise Best Practice — OpenResty / Nginx Plus with resolve Flag + Keepalive + Zone

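For production multi-instance setups, use the resolve flag (Nginx Plus, or open-source Nginx 1.27.3 and later) to get dynamic upstream resolution while keeping the upstream {} block. OpenResty does not implement the resolve flag; it achieves the same effect in Lua, for example with lua-resty-dns and balancer_by_lua_block. The configuration below uses the resolve flag.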

- upstream consul_backend {
-     server myservice.service.consul:8080;
-     keepalive 32;
- }

+ # Nginx Plus (or open-source Nginx 1.27.3+) configuration
+ upstream consul_backend {
+     # Shared memory zone, required by 'resolve'; size for ~1000 peers = 128k minimum
+     zone consul_backend_zone 256k;
+
+     # least_conn suits Consul-heavy traffic; a balancing method must precede 'keepalive'
+     least_conn;
+
+     # 'resolve' flag enables continuous DNS re-resolution via the resolver directive
+     server myservice.service.consul:8080 resolve;
+
+     # No static fallback is defined here; add a 'backup' server only if a
+     # known-stable instance exists
+
+     # Keepalive pool: reuse connections to Consul-registered instances
+     keepalive 64;
+     keepalive_requests 1000;
+     keepalive_timeout 60s;
+ }
+
+ # Resolver MUST be defined at http{} context for 'resolve' to work
+ resolver 127.0.0.1:8600 valid=5s ipv6=off;
+ resolver_timeout 3s;
+
+ server {
+     location /api/ {
+         proxy_pass http://consul_backend;
+
+         proxy_http_version 1.1;
+         proxy_set_header Connection "";
+
+         proxy_connect_timeout 3s;
+         proxy_send_timeout    30s;
+         proxy_read_timeout    60s;
+
+         proxy_next_upstream       error timeout http_502 http_503 http_504;
+         proxy_next_upstream_tries 3;
+         proxy_next_upstream_timeout 15s;
+
+         # Serve a custom body for 502/503/504 instead of the Nginx default error page
+         error_page 502 503 504 /upstream_unavailable.html;
+     }
+ }

Consul-side fix. Keep the health check interval and timeout short:

  {
    "service": {
      "name": "myservice",
      "port": 8080,
      "check": {
        "http": "http://localhost:8080/health",
-       "interval": "30s",
-       "timeout": "10s"
+       "interval": "5s",
+       "timeout": "2s",
+       "deregister_critical_service_after": "30s"
      }
    }
  }

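The deregister_critical_service_after field handles cleanup: without it, a dead instance stays registered in Consul's catalog indefinitely. Consul DNS already omits instances whose checks are critical, so how quickly Nginx stops routing to a dead instance depends mainly on the check interval and timeout plus the resolver's valid= period. Note that Consul enforces a minimum of one minute for this field and reaps critical services roughly every 30 seconds, so a configured 30s behaves as about one minute.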
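As a rough illustration of why the shorter interval and timeout matter, the sketch below adds up the delays between an instance dying and Nginx dropping it from rotation. The function name and the simple additive model are assumptions, not guarantees from Consul or Nginx; the inputs are the check settings shown above and the valid= values from the two Nginx configs.

```python
def worst_case_stale_seconds(check_interval, check_timeout, resolver_valid):
    """Rough upper bound on how long Nginx may keep routing to a dead instance.

    Assumes the instance dies right after a passing check, the next check runs
    to its full timeout, and Nginx refreshed its DNS cache just before Consul
    marked the instance critical. Consul's own DNS TTL is assumed to be 0.
    """
    return check_interval + check_timeout + resolver_valid

# Old check settings (30s / 10s) with valid=10s
before = worst_case_stale_seconds(check_interval=30, check_timeout=10, resolver_valid=10)
# Tightened check settings (5s / 2s) with valid=5s
after = worst_case_stale_seconds(check_interval=5, check_timeout=2, resolver_valid=5)
print(before, after)
```

Under these assumptions the exposure window shrinks from about 50 seconds to about 12.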




Prevention in CI/CD

1. Lint Nginx Configs in Pre-Merge Checks

Use nginx -t in a Docker stage gate and crossplane for structured config parsing:

# In your CI pipeline (GitHub Actions / GitLab CI)
- name: Validate Nginx Config
  run: |
    docker run --rm \
      -v "$(pwd)/nginx:/etc/nginx:ro" \
      nginx:stable nginx -t
    pip install crossplane
    # Parse the config from the repo checkout; jq -e exits non-zero when errors exist
    crossplane parse nginx/nginx.conf | jq -e '.errors | length == 0'

2. OPA Policy — Enforce resolver Directive When Consul FQDNs Are Present

package nginx.consul

deny[msg] {
    # Detect .service.consul FQDNs in proxy_pass without a resolver directive
    block := input.config.servers[_].locations[_]
    contains(block.proxy_pass, ".service.consul")
    not input.config.resolver
    msg := "POLICY VIOLATION: proxy_pass to .service.consul FQDN requires a resolver directive with valid= TTL"
}

deny[msg] {
    # Enforce variable-based proxy_pass for OSS Nginx (no 'resolve' flag support)
    block := input.config.servers[_].locations[_]
    contains(block.proxy_pass, ".service.consul")
    not startswith(block.proxy_pass, "http://$")
    not input.config.upstream_plus_resolve
    msg := "POLICY VIOLATION: OSS Nginx requires variable-based proxy_pass for dynamic Consul resolution"
}
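The policy assumes an input document shaped like input.config.servers[].locations[].proxy_pass, which the article does not define. To make that shape concrete, here is a hedged Python mirror of the two rules together with sample inputs. The function takes the object found at input.config; the field names are copied from the policy, everything else is illustrative.

```python
def deny(config):
    """Return violation messages, mirroring the two Rego rules above."""
    messages = []
    for server in config.get("servers", []):
        for location in server.get("locations", []):
            target = location.get("proxy_pass", "")
            if ".service.consul" not in target:
                continue
            # Rule 1: a Consul FQDN needs a resolver directive
            if not config.get("resolver"):
                messages.append(
                    "proxy_pass to .service.consul requires a resolver directive"
                )
            # Rule 2: without the 'resolve' flag, proxy_pass must use a variable
            if not target.startswith("http://$") and not config.get("upstream_plus_resolve"):
                messages.append("OSS Nginx requires variable-based proxy_pass")
    return messages


bad = {
    "servers": [
        {"locations": [{"proxy_pass": "http://myservice.service.consul:8080"}]}
    ]
}
good = {
    "resolver": "127.0.0.1:8600 valid=10s",
    "upstream_plus_resolve": True,
    "servers": [
        {"locations": [{"proxy_pass": "http://myservice.service.consul:8080"}]}
    ],
}
print(deny(bad))
print(deny(good))
```

One limitation worth noting: a variable-based proxy_pass such as http://$consul_upstream:8080 no longer contains the FQDN, which then lives in the set directive, so neither rule fires for it. Covering that case would mean inspecting set values as well.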

3. Consul Health Check Validation in Terraform

 resource "consul_service" "myservice" {
   name = "myservice"
   port = 8080

   check {
-    interval = "30s"
+    interval = "5s"
+    timeout  = "2s"
+    deregister_critical_service_after = "30s"
     http     = "http://localhost:8080/health"
   }
 }

4. Synthetic Canary Monitor

Deploy a Prometheus blackbox exporter probe that hits your Nginx upstream endpoint every 15 seconds and alerts if probe_http_status_code returns 502 for more than 30 seconds. This catches the Consul deregistration window before it becomes a full outage.

# prometheus/blackbox_rules.yml
groups:
  - name: nginx_consul_upstream   # group name is illustrative
    rules:
      - alert: NginxConsulUpstreamDead
        expr: probe_http_status_code{job="nginx_consul_probe"} == 502
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Nginx Consul upstream returning 502: no healthy servers available"
          runbook: "https://wiki.internal/runbooks/nginx-consul-upstream"
