
How to Fix Nginx 'connect() failed (110: Connection timed out)' with AWS Network Load Balancer

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: Nginx cannot establish a TCP connection to the upstream target registered behind an AWS NLB. The connect attempt hangs until proxy_connect_timeout expires (60 seconds by default); the 110 in the log is the errno (ETIMEDOUT), not a duration.
  • How to fix it: Align NLB idle timeout, Nginx proxy_connect_timeout/proxy_read_timeout, target group health check ports, and security group ingress rules. Enable NLB proxy protocol only if Nginx is configured to consume it.
  • Fast path: Use our Client-Side Sandbox below to auto-refactor your Nginx upstream block — paste your config, get a corrected diff without sending your IPs or ARNs to any external server.

The Incident (What Does This Error Mean?)

Raw error from /var/log/nginx/error.log:

2024/01/15 03:42:17 [error] 1832#1832: *58291 connect() failed (110: Connection timed out)
while connecting to upstream, client: 10.0.1.45, server: api.internal,
request: "POST /v2/process HTTP/1.1", upstream: "http://10.0.3.22:8080/v2/process",
host: "api.internal"

Errno 110 = ETIMEDOUT. Nginx sent a TCP SYN toward the upstream and no SYN-ACK came back. The connection was never established: this is not a slow response, it is a socket that never opened. Nginx waited proxy_connect_timeout (default: 60s), got nothing, and logged this error. Unless proxy_next_upstream finds another healthy server, each request that hits this upstream during that window fails with a 504 Gateway Timeout.
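To count or triage these entries in bulk, the errno and upstream address can be pulled out of each log entry. The following is a minimal sketch, assuming entries have the shape shown above and that wrapped lines have already been joined into one string; the function name parse_connect_failure is hypothetical.

```python
import re

# Matches the "connect() failed (<errno>: <reason>)" entry and its upstream field.
CONNECT_FAILED = re.compile(
    r'connect\(\) failed \((?P<errno>\d+): (?P<reason>[^)]+)\)'
    r'.*?upstream: "(?P<upstream>[^"]+)"',
    re.DOTALL,
)


def parse_connect_failure(entry: str):
    """Return errno, reason and upstream URL from one error.log entry, or None."""
    match = CONNECT_FAILED.search(entry)
    if match is None:
        return None
    return {
        "errno": int(match.group("errno")),
        "reason": match.group("reason"),
        "upstream": match.group("upstream"),
    }


# The entry from this incident, joined into a single string.
SAMPLE = (
    '2024/01/15 03:42:17 [error] 1832#1832: *58291 connect() failed '
    '(110: Connection timed out) while connecting to upstream, client: 10.0.1.45, '
    'server: api.internal, request: "POST /v2/process HTTP/1.1", '
    'upstream: "http://10.0.3.22:8080/v2/process", host: "api.internal"'
)

if __name__ == "__main__":
    print(parse_connect_failure(SAMPLE))
```

Grouping the results by upstream shows whether one target or the whole pool is timing out.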


The Attack Vector / Blast Radius

This is not a transient blip. Here is the cascading failure path:

  1. NLB does not terminate TCP. Unlike ALB, NLB passes raw TCP to the target. If the target's security group does not allow ingress from the NLB's node IPs (not the NLB's SG — NLBs have no SG in older deployments), the SYN is silently dropped by the target's SG. No RST. No ICMP unreachable. Pure silence.
  2. Connection slots pile up. keepalive 32 only caps the idle connections each worker caches; it does not limit new connection attempts. When the upstream stops answering, every in-flight request holds a client connection and a pending upstream connection for up to proxy_connect_timeout. Nginx workers are event-driven and do not block, but under load those held connections can exhaust worker_connections, at which point new clients are refused and the outage becomes total.
  3. Proxy Protocol mismatch is a silent killer. If the target group has Proxy Protocol v2 enabled but the Nginx instance receiving NLB traffic does not have listen ... proxy_protocol configured, Nginx sees the PP2 binary header as the first bytes of the request and cannot parse it. The TCP handshake still completes, so this usually surfaces as 400 responses or dropped connections rather than errno 110, but it tends to show up in the same incidents and must be ruled out.
  4. NLB idle timeout vs. Nginx keepalive timeout mismatch. NLB's default TCP idle timeout is 350 seconds. If Nginx keeps an idle upstream connection open longer than that, the NLB discards its flow entry without notifying either side. Nginx still believes the connection is alive; the next request sent on it is answered with a RST or goes unanswered. Keep the upstream keepalive_timeout below the NLB idle timeout so Nginx closes idle connections first.
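A quick way to tell a silent drop from an active refusal is to attempt the TCP connect yourself from the Nginx host. The sketch below is illustrative only: probe_tcp is a hypothetical helper, and the address in the usage line is the upstream from the log entry above. A result of "timeout" points at a filter dropping the SYN (security group, NACL, routing); "refused" means the packet arrived but nothing is listening on the port.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connect attempt as 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No SYN-ACK within the timeout: the SYN or its reply was dropped.
        return "timeout"
    except ConnectionRefusedError:
        # A RST came back: the host is reachable but the port is closed.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host or DNS failure.
        return f"error: {exc}"


if __name__ == "__main__":
    print(probe_tcp("10.0.3.22", 8080))
```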

How to Fix It

Step 1 — Verify the Security Group (Most Common Root Cause)

When client IP preservation is disabled, targets see traffic from the NLB node's private IP in each AZ, not from a security group. When it is enabled, traffic arrives with the original client IP as the source, so the target's SG must allow the client ranges instead.

# Get NLB node IPs across AZs
aws ec2 describe-network-interfaces \
  --filters Name=description,Values="ELB net/<your-nlb-name>/*" \
  --query 'NetworkInterfaces[*].PrivateIpAddress'

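Add ingress rules to the target's SG for each NLB node IP on the application port. Alternatively, if the NLB was created with its own security group (supported since August 2023), allow that SG as the source in the target's SG. A security group cannot be attached later to an NLB that was created without one.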
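Adding one rule per node IP by hand is error-prone, so the rule set can be generated from the IP list returned by the command above. This is a sketch under stated assumptions: build_ingress_permissions is a hypothetical helper, the node IPs are IPv4, and the example addresses are placeholders. The output follows the IpPermissions JSON structure that the EC2 API uses for security group rules.

```python
import ipaddress
import json


def build_ingress_permissions(node_ips, port, description="NLB node"):
    """Build an IpPermissions list for one TCP port with one /32 entry per node IP."""
    # IPv4Address() raises ValueError on malformed input, which we want to surface.
    unique = sorted({ipaddress.IPv4Address(ip) for ip in node_ips})
    ranges = [{"CidrIp": f"{addr}/32", "Description": description} for addr in unique]
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": ranges,
        }
    ]


if __name__ == "__main__":
    # Placeholder IPs; substitute the addresses from describe-network-interfaces.
    print(json.dumps(build_ingress_permissions(["10.0.1.5", "10.0.2.7"], 8080), indent=2))
```

The resulting JSON can be saved to a file and passed to aws ec2 authorize-security-group-ingress via --ip-permissions; review it before applying, since node IPs change when an NLB is recreated or gains an AZ.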

Step 2 — Fix Nginx Upstream and Proxy Timeout Config

 upstream backend {
-    server 10.0.3.22:8080;
+    server 10.0.3.22:8080 max_fails=3 fail_timeout=10s;
+    keepalive 16;
+    keepalive_requests 1000;
+    keepalive_timeout 300s;  # Must be < NLB idle timeout (350s)
 }

 server {
     listen 80;
+    # If NLB Proxy Protocol v2 is enabled on target group:
+    # listen 80 proxy_protocol;

     location / {
         proxy_pass http://backend;
-        proxy_connect_timeout 60s;
-        proxy_read_timeout 60s;
-        proxy_send_timeout 60s;
+        proxy_connect_timeout 10s;
+        proxy_read_timeout 120s;
+        proxy_send_timeout 30s;
+        proxy_http_version 1.1;
+        proxy_set_header Connection "";
     }
 }

Why proxy_http_version 1.1 + Connection "": by default Nginx talks HTTP/1.0 to upstreams and sends Connection: close on every request, which tears the connection down after each response. Without this pair, keepalive in the upstream block has no effect.

Step 3 — Validate NLB Target Group Health Check

If the health check port differs from the traffic port, a target can be healthy in the TG while the application port is firewalled or not listening.

# Check registered targets and their health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:<region>:<acct>:targetgroup/<name>/<id>

Expected: "State": "healthy". If you see "State": "unused" or "State": "draining", the target is not receiving traffic regardless of Nginx config.
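With many targets, scanning the JSON by eye is easy to get wrong. The sketch below filters the output of the command above; find_problem_targets is a hypothetical helper, and it assumes the JSON keeps the TargetHealthDescriptions layout that describe-target-health returns. It also flags targets whose health check port differs from the traffic port.

```python
import json
import sys


def find_problem_targets(payload: dict):
    """Return (target_id, issue) pairs for targets that need attention."""
    problems = []
    for desc in payload.get("TargetHealthDescriptions", []):
        target = desc.get("Target", {})
        target_id = target.get("Id", "<unknown>")
        health = desc.get("TargetHealth", {})
        state = health.get("State", "<missing>")
        if state != "healthy":
            reason = health.get("Reason", "no reason given")
            problems.append((target_id, f"state={state} ({reason})"))
        # A healthy state says nothing about the traffic port if the ports differ.
        hc_port = desc.get("HealthCheckPort")
        if hc_port is not None and str(hc_port) != str(target.get("Port")):
            problems.append(
                (target_id, f"health check port {hc_port} != traffic port {target.get('Port')}")
            )
    return problems


if __name__ == "__main__":
    # Usage: aws elbv2 describe-target-health ... --output json | python3 this_script.py
    for target_id, issue in find_problem_targets(json.load(sys.stdin)):
        print(f"{target_id}: {issue}")
```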

Enterprise Best Practice — Terraform-Enforced NLB Config

 resource "aws_lb_target_group" "api" {
   name        = "api-tg"
   port        = 8080
   protocol    = "TCP"
   vpc_id      = var.vpc_id

+  connection_termination = true
+  deregistration_delay   = 30
+  preserve_client_ip     = false  # Set true only if SG rules account for client IPs

   health_check {
     protocol            = "TCP"
     port                = "traffic-port"
-    healthy_threshold   = 10
-    unhealthy_threshold = 10
+    healthy_threshold   = 3
+    unhealthy_threshold = 3
+    interval            = 10
   }
 }

+# proxy_protocol_v2 is an argument of aws_lb_target_group, not of aws_lb_target_group_attachment.
+# Leave it at its default (false) unless Nginx has proxy_protocol on its listen directive.
+# To enable it intentionally, add `proxy_protocol_v2 = true` inside the "api" target group block above.
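The effect of the threshold change is easiest to see as arithmetic: a failing target keeps receiving traffic for roughly unhealthy_threshold × interval seconds before the NLB stops routing to it. The sketch below assumes the original config relied on the AWS default interval of 30 seconds, since it did not set one; detection_window_seconds is a hypothetical name.

```python
def detection_window_seconds(threshold: int, interval_seconds: int) -> int:
    """Approximate time for consecutive failed checks to mark a target unhealthy."""
    return threshold * interval_seconds


if __name__ == "__main__":
    # Before: unhealthy_threshold = 10, interval assumed to be the 30s default.
    print("before:", detection_window_seconds(10, 30), "seconds")
    # After: unhealthy_threshold = 3, interval = 10.
    print("after:", detection_window_seconds(3, 10), "seconds")
```

Under that assumption the window drops from about five minutes to about thirty seconds, which bounds how long Nginx keeps hitting a dead target.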

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

Checkov — Catch Proxy Protocol Mismatch Pre-Deploy

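# Install Checkov and scan the Terraform directory
pip install checkov
checkov -d ./terraform --check CKV_AWS_91,CKV_AWS_92
# Note: confirm these check IDs against the Checkov policy index before relying on them.
# To our knowledge they cover load balancer access logging; a Proxy Protocol
# mismatch check would need a custom policy.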

OPA Policy — Enforce keepalive_timeout < NLB Idle Timeout

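package nginx.nlb

# rego.v1 syntax, so the policy parses on current OPA releases
import rego.v1

# Reject an upstream keepalive_timeout that is not below the NLB idle timeout
deny contains msg if {
  input.upstream.keepalive_timeout_seconds >= 350
  msg := "upstream keepalive_timeout must be < NLB idle timeout (350s) to prevent stale connection routing"
}

# Upstream keepalive only works when Nginx proxies with HTTP/1.1
deny contains msg if {
  input.upstream.keepalive
  not input.proxy_http_version == "1.1"
  msg := "proxy_http_version 1.1 required when upstream keepalive is enabled"
}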
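The policy reads a JSON input document, which has to be produced from the Nginx config somehow. One possible approach, sketched here, extracts the three fields with regular expressions. It is not a real Nginx parser: it assumes a single upstream block, no '#' characters inside quoted strings, and single-unit time values such as 300s or 5m. The function names are hypothetical.

```python
import json
import re

_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value: str):
    """Convert a single-unit Nginx time value ('300s', '5m', '75') to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2) or "s"]


def build_policy_input(conf: str) -> dict:
    """Extract the fields the policy reads from Nginx config text."""
    conf = re.sub(r"#[^\n]*", "", conf)  # naive comment stripping
    block = re.search(r"upstream\s+\S+\s*\{(.*?)\}", conf, re.DOTALL)
    body = block.group(1) if block else ""

    upstream = {}
    keepalive = re.search(r"\bkeepalive\s+(\d+)\s*;", body)
    if keepalive:
        upstream["keepalive"] = int(keepalive.group(1))
    timeout = re.search(r"\bkeepalive_timeout\s+(\S+?)\s*;", body)
    # Nginx defaults the upstream keepalive_timeout to 60s when it is not set.
    upstream["keepalive_timeout_seconds"] = to_seconds(timeout.group(1)) if timeout else 60

    version = re.search(r"\bproxy_http_version\s+([\d.]+)\s*;", conf)
    return {
        "upstream": upstream,
        "proxy_http_version": version.group(1) if version else None,
    }


if __name__ == "__main__":
    import sys
    print(json.dumps(build_policy_input(sys.stdin.read()), indent=2))
```

The output can be saved as input.json and evaluated with opa eval -i input.json -d policy.rego "data.nginx.nlb.deny"; an empty result means no rule fired.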

GitHub Actions — Nginx Config Lint on PR

- name: Lint Nginx Config
  run: |
    docker run --rm -v ${{ github.workspace }}/nginx:/etc/nginx:ro \
      nginx:alpine nginx -t
    # Fail fast before any ECS/EC2 deployment

Runbook checklist before every NLB target group change:

  • SG ingress allows NLB node IPs on application port
  • Proxy Protocol setting matches on both NLB TG and Nginx listen directive
  • keepalive_timeout in Nginx upstream < 350s
  • proxy_http_version 1.1 + proxy_set_header Connection "" set in location block
  • Health check port = traffic port unless explicitly separated with matching SG rules
