How to Fix Nginx 'upstream failed (111: Connection refused)' After a Security Group Change

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–15 mins


TL;DR

  • What broke: A security group (or host firewall) rule change dropped TCP access from the Nginx host to the backend on port 8080, causing immediate errno 111: Connection refused on every proxied request.
  • How to fix it: Restore an inbound rule on the backend's security group allowing TCP 8080 from the Nginx host's private IP or security group ID. Verify with nc -zv 10.0.0.1 8080 before touching Nginx.

The Incident (What Does the Error Mean?)

Raw error from /var/log/nginx/error.log:

2024/01/15 03:42:17 [error] 1234#1234: *8821 connect() failed (111: Connection refused)
while connecting to upstream, client: 203.0.113.45, server: api.example.com,
request: "POST /api/v2/orders HTTP/1.1",
upstream: "http://10.0.0.1:8080/api/v2/orders",
 host: "api.example.com"

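errno 111 = ECONNREFUSED. The TCP SYN from Nginx reached the destination host (10.0.0.1) and was actively refused: the host answered with a TCP RST (or an ICMP unreachable). That happens when nothing is listening on the port, or when a host firewall (iptables, nftables) has a REJECT rule in front of it. A filter that silently drops packets, which is how AWS security groups behave, normally surfaces as 110: Connection timed out instead. So when 111 appears immediately after a security group change, check both layers: the security group rule that was edited, and the listener and host firewall on the backend. The application process is often still running, which Step 0 below confirms.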
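
A quick probe run from the Nginx host can show which outcome you are actually getting, refused or timed out. The following is a minimal sketch using only the Python standard library; the function name classify_connect and the 3-second timeout are illustrative assumptions, not part of any Nginx or AWS tooling.

```python
import errno
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and report how it ended.

    Returns "open", "refused" (ECONNREFUSED, errno 111 on Linux),
    "timeout", or "error:<ERRNO_NAME>" for anything else.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: no listener, or a REJECT rule.
        return "refused"
    except socket.timeout:
        # No answer at all: packets are most likely being dropped in transit.
        return "timeout"
    except OSError as exc:
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


if __name__ == "__main__":
    # 10.0.0.1:8080 is the upstream from the error log above.
    print(classify_connect("10.0.0.1", 8080))
```

A "refused" result points at the listener or a host firewall on the backend; a "timeout" result points at a filter that drops packets somewhere on the path.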

Immediate consequence: Every request Nginx proxies to this upstream returns 502 Bad Gateway to end users. If upstream is a single server with no fallback, your service is 100% down.


The Attack Vector / Blast Radius

This is a misconfiguration-induced outage, not an active exploit — but the blast radius is severe and the security implications cut both ways:

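Outage vector: AWS security groups are stateful and allow-only. There are no deny rules and no rule ordering, so traffic passes only if some rule permits it. (Ordered deny rules belong to network ACLs, which can cause the same outage.) A removed allow rule or a CIDR tightening (e.g., /0 to /32 with the wrong IP) silently kills all backend traffic. Nginx has no circuit breaker by default: with a single server in the upstream block, max_fails and fail_timeout are ignored, so it retries the failing port on every request and error rates spike instantly.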

The dangerous inverse scenario — why you can't just re-open everything: The temptation during an outage is to open 0.0.0.0/0 on port 8080 to restore service fast. Do not do this. Port 8080 running an internal app server (Tomcat, Gunicorn, Node) is almost never hardened for public exposure. It likely:

  • Has no TLS
  • Exposes internal API endpoints without auth
  • Leaks stack traces and internal hostnames
  • Is running as a service user with broad filesystem access

An attacker scanning for open 8080s (Shodan indexes millions) gets direct backend access, bypassing Nginx auth, WAF rules, and rate limiting entirely.
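
To check whether a panic fix like this is already in place, one option is to scan the JSON printed by aws ec2 describe-security-groups for rules open to the whole internet. The sketch below assumes that JSON has been loaded into a dict; the function name world_open_rules and the sample group IDs are hypothetical.

```python
WORLD = {"0.0.0.0/0", "::/0"}


def world_open_rules(security_groups: dict) -> list:
    """Return (group_id, from_port, to_port) for each ingress rule open to the world.

    `security_groups` is the parsed output of `aws ec2 describe-security-groups`.
    Ports are None for all-protocol ("-1") rules, which carry no port range.
    """
    findings = []
    for group in security_groups.get("SecurityGroups", []):
        for perm in group.get("IpPermissions", []):
            cidrs = {r.get("CidrIp") for r in perm.get("IpRanges", [])}
            cidrs |= {r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])}
            if cidrs & WORLD:
                findings.append(
                    (group["GroupId"], perm.get("FromPort"), perm.get("ToPort"))
                )
    return findings


# Illustrative input: one world-open rule on 8080, one SG-scoped rule on 8080.
sample = {
    "SecurityGroups": [
        {
            "GroupId": "sg-0backend123",
            "IpPermissions": [
                {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
                 "IpRanges": [{"CidrIp": "0.0.0.0/0"}], "UserIdGroupPairs": []},
                {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
                 "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-0nginx456"}]},
            ],
        }
    ]
}

if __name__ == "__main__":
    for finding in world_open_rules(sample):
        print("world-open:", finding)
```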

Cascading failure risk: A refused connect() fails fast, but if the rule change makes packets drop instead, every proxied request holds an Nginx worker connection until proxy_connect_timeout expires (60s by default). Under load, this exhausts the worker connection pool, and the proxy starts refusing new connections at the edge: a full cascading failure from a single firewall rule.


How to Fix It

Step 0: Confirm the actual problem before touching anything

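# Run FROM the Nginx host, not your laptop (-w 3 caps the wait at 3 seconds)
nc -zv -w 3 10.0.0.1 8080
# Expected broken output: nc: connect to 10.0.0.1 port 8080 (tcp) failed: Connection refused
# A timeout instead of a refusal means packets are being dropped (typical of a security group or NACL block)
# Expected fixed output:  Connection to 10.0.0.1 8080 port [tcp/*] succeeded!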

# Also verify the backend process is actually running on the target host
ssh 10.0.0.1 'ss -tlnp | grep 8080'

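If ss shows nothing on 8080, your app process crashed and the security group is a red herring. If it shows LISTEN on 0.0.0.0, * or the host's private IP, the process is healthy and a firewall layer is the issue. If it shows LISTEN only on 127.0.0.1, the app is bound to loopback and will refuse remote connections no matter what the firewall allows.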
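
The ss check can be scripted when you have many backends to look at. This is a rough sketch that parses the text printed by ss -tlnp; the helper names bind_addresses and loopback_only are hypothetical, and the sample output is illustrative rather than captured from a real host.

```python
def bind_addresses(ss_output: str, port: int) -> list:
    """Return the local addresses that have a LISTEN socket on `port`."""
    addrs = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows look like: LISTEN 0 128 <local addr:port> <peer addr:port> ...
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        addr, _, found_port = fields[3].rpartition(":")
        if found_port == str(port):
            addrs.append(addr)
    return addrs


def loopback_only(addrs: list) -> bool:
    """True when there are listeners but every one is bound to loopback."""
    def is_loopback(addr: str) -> bool:
        bare = addr.split("%")[0].strip("[]")  # drop %iface suffix and IPv6 brackets
        return bare.startswith("127.") or bare == "::1"
    return bool(addrs) and all(is_loopback(a) for a in addrs)


# Illustrative ss -tlnp output for an app bound to loopback only.
SAMPLE = """\
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process
LISTEN  0       128     127.0.0.1:8080      0.0.0.0:*          users:(("java",pid=2211,fd=45))
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*          users:(("sshd",pid=801,fd=3))
"""

if __name__ == "__main__":
    found = bind_addresses(SAMPLE, 8080)
    print(found, "loopback only:", loopback_only(found))
```

An empty list means nothing is listening on the port; a loopback-only result means the bind address needs fixing before any firewall rule will help.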


Basic Fix — AWS Security Group (Console or CLI)

Add an inbound rule to the backend instance's security group allowing TCP 8080 from the Nginx instance's security group ID (not a CIDR — SG-to-SG rules don't break when IPs change).

# AWS Security Group: sg-0backend123 (attached to 10.0.0.1)
# Inbound Rules

- Type: Custom TCP  Port: 8080  Source: 0.0.0.0/0   # What someone panic-added before
+ Type: Custom TCP  Port: 8080  Source: sg-0nginx456  # Nginx instance's security group ID

AWS CLI equivalent:

# Remove the overly permissive rule
aws ec2 revoke-security-group-ingress \
  --group-id sg-0backend123 \
  --protocol tcp --port 8080 \
  --cidr 0.0.0.0/0

# Add the scoped rule
aws ec2 authorize-security-group-ingress \
  --group-id sg-0backend123 \
  --protocol tcp --port 8080 \
  --source-group sg-0nginx456
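
After changing the rules, it can help to confirm on paper that the backend's ingress set now admits the Nginx host. The sketch below evaluates the IpPermissions list from aws ec2 describe-security-groups output; the function name ingress_allows and the source IP 10.0.0.5 are hypothetical, and only IPv4 CIDRs and SG-to-SG references are handled.

```python
import ipaddress


def ingress_allows(permissions: list, port: int,
                   source_ip: str = None, source_sg: str = None) -> bool:
    """Return True if any ingress permission admits TCP `port` from the source.

    `permissions` is the IpPermissions list of one security group.
    Security groups are allow-only, so a single matching rule is enough.
    """
    for perm in permissions:
        proto = perm.get("IpProtocol")
        if proto not in ("tcp", "-1"):
            continue
        if proto == "tcp":
            if not perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535):
                continue
        if source_sg is not None:
            if any(pair.get("GroupId") == source_sg
                   for pair in perm.get("UserIdGroupPairs", [])):
                return True
        if source_ip is not None:
            ip = ipaddress.ip_address(source_ip)
            for ip_range in perm.get("IpRanges", []):
                if ip in ipaddress.ip_network(ip_range["CidrIp"]):
                    return True
    return False


# Rule set after the fix: port 8080 from the Nginx security group only.
fixed = [
    {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
     "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-0nginx456"}]},
]

if __name__ == "__main__":
    print(ingress_allows(fixed, 8080, source_sg="sg-0nginx456"))  # allowed via SG reference
    print(ingress_allows(fixed, 8080, source_ip="10.0.0.5"))      # no CIDR rule, so not allowed
```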

Enterprise Best Practice — Terraform with Least-Privilege SG Rule

# terraform/modules/backend/security_groups.tf

 resource "aws_security_group_rule" "backend_from_nginx" {
   type                     = "ingress"
   from_port                = 8080
   to_port                  = 8080
   protocol                 = "tcp"
   security_group_id        = aws_security_group.backend.id
-  cidr_blocks              = ["10.0.0.0/8"]  # Too broad — entire RFC1918 range
+  source_security_group_id = aws_security_group.nginx.id  # Scoped to Nginx SG only
+  description              = "Allow Nginx proxy to reach app backend on 8080"
 }

# nginx/conf.d/upstream.conf

 upstream backend_pool {
+  keepalive 32;
   server 10.0.0.1:8080 max_fails=3 fail_timeout=10s;
+  server 10.0.0.2:8080 max_fails=3 fail_timeout=10s backup;  # Add a backup upstream
 }

 server {
   location /api/ {
     proxy_pass http://backend_pool;
+    proxy_http_version 1.1;          # Upstream keepalive needs HTTP/1.1
+    proxy_set_header Connection "";  # Clear the header so connections are reused
+    proxy_connect_timeout 3s;        # Fail fast, don't queue
+    proxy_next_upstream error timeout http_502;
   }
 }



Prevention in CI/CD

1. Checkov — Block public exposure of internal ports at plan time

# .checkov.yml
check:
  - CKV_AWS_25   # No security group allows ingress from 0.0.0.0/0 to port 3389
  - CKV_AWS_277  # No security group allows ingress from 0.0.0.0/0 to all ports (-1)
# No built-in check targets port 8080 specifically; cover it with a custom policy (see item 2)

Run in your pipeline:

# Exits non-zero on any failed check; severity thresholds such as --hard-fail-on HIGH require a platform API key
checkov -d ./terraform --framework terraform --check CKV_AWS_25,CKV_AWS_277

2. OPA/Conftest Policy — Enforce SG-to-SG rules only

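# policies/no_cidr_ingress_on_app_ports.rego
# Written against Terraform plan JSON: terraform show -json tfplan > tfplan.json
# Run with: conftest test tfplan.json --policy policies --all-namespaces
package terraform.aws.security_group

import rego.v1

deny contains msg if {
  some rc in input.resource_changes
  rc.type == "aws_security_group_rule"
  rule := rc.change.after
  rule.type == "ingress"
  rule.from_port <= 8080       # Also catch port ranges that include 8080
  rule.to_port >= 8080
  count(rule.cidr_blocks) > 0  # Undefined (no violation) when cidr_blocks is null
  msg := sprintf("%s: SG rule covering port 8080 must use source_security_group_id, not cidr_blocks. Found: %v", [rc.address, rule.cidr_blocks])
}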

3. Nginx Upstream Monitoring — Catch this in seconds, not minutes

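# Add to your alerting stack (Prometheus + Alertmanager)
# The metric name and labels depend on your Nginx exporter; verify them against its /metrics output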
- alert: NginxUpstreamDown
  expr: nginx_upstream_peers_active{state="unavailable"} > 0
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Nginx upstream {{ $labels.upstream }} has unavailable peers"
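
If your exporter does not expose per-upstream state, the error log itself can serve as a fallback signal. The following sketch counts failed upstream connects per backend from error.log lines in the format shown in the incident section; the function name failures_by_upstream is hypothetical, and shipping the counts to your alerting stack is left out.

```python
import re
from collections import Counter

# Matches e.g.: connect() failed (111: Connection refused) while connecting
# to upstream, ... upstream: "http://10.0.0.1:8080/api/v2/orders", ...
CONNECT_FAILURE = re.compile(
    r'connect\(\) failed \((\d+): [^)]*\) while connecting to upstream'
    r'.*?upstream: "\w+://([^/"]+)'
)


def failures_by_upstream(lines) -> Counter:
    """Count connect() failures keyed by (upstream host:port, errno)."""
    counts = Counter()
    for line in lines:
        match = CONNECT_FAILURE.search(line)
        if match:
            counts[(match.group(2), int(match.group(1)))] += 1
    return counts


if __name__ == "__main__":
    with open("/var/log/nginx/error.log", encoding="utf-8", errors="replace") as log:
        for (upstream, code), n in failures_by_upstream(log).most_common():
            print(f"{upstream} errno={code} count={n}")
```

Keying on the errno as well as the upstream keeps refused (111) and timed-out (110) failures separate, which matters when deciding which layer to inspect.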

4. Terraform State Drift Detection

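Run terraform plan in a scheduled CI job (not just on PR). Security group drift from console changes is a common cause of this class of outage. Use the AWS Config managed rule restricted-common-ports as a compensating control, and add 8080 to its blocked-port parameters since the default port list does not include it.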
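
A scheduled drift job can be as small as a wrapper around terraform plan -detailed-exitcode, which exits 0 when nothing would change, 1 on error, and 2 when the plan contains changes. The sketch below assumes terraform is on PATH and the working directory is already initialized; the function names and the ./terraform path are illustrative.

```python
import subprocess
import sys


def interpret_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    return {0: "in-sync", 1: "error", 2: "drift"}.get(code, "unknown")


def check_drift(workdir: str) -> str:
    """Run a read-only plan in `workdir` and report the status.

    "drift" here means any non-empty plan, so unapplied code changes
    are reported the same way as console edits.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit(result.returncode)


if __name__ == "__main__":
    status = check_drift("./terraform")
    print(status)
    # Fail the CI job on drift or error so someone has to look at it.
    sys.exit(0 if status == "in-sync" else 1)
```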
