
Fixing Nginx WebSocket 'recv() failed (104: Connection reset by peer)' — Root Cause & Production Hardening Guide

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: Nginx received TCP RST (errno 104) from the upstream server while proxying a WebSocket connection — the upstream killed the socket before the exchange completed, or Nginx never properly negotiated the HTTP→WebSocket upgrade.
  • How to fix it: Raise proxy_read_timeout and proxy_send_timeout, set proxy_http_version 1.1, add the mandatory Upgrade and Connection proxy headers, and make sure your upstream keepalive pool is configured correctly.

The Incident (What Does the Error Mean?)

You will see this in /var/log/nginx/error.log:

2024/01/15 03:42:17 [error] 1234#1234: *8901 recv() failed (104: Connection reset by peer)
while reading upstream response header from upstream,
client: 203.0.113.45, server: api.example.com,
request: "GET /ws/live-feed HTTP/1.1",
upstream: "http://127.0.0.1:8080/ws/live-feed",
host: "api.example.com"

errno 104 = ECONNRESET. The remote peer (your upstream app server — Node.js, Gunicorn, Go binary, etc.) sent a TCP RST packet. This is not a graceful FIN-based close. It is an abrupt socket termination. The client receives a 502 Bad Gateway or a dropped WebSocket frame with no clean close handshake.
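
To see which upstream is sending the resets, it can help to tally these entries per upstream address. The following is a minimal sketch, assuming each log entry sits on a single line (as Nginx writes it) and follows the shape of the sample above. The names `RESET_RE` and `count_resets` are hypothetical helpers, not part of any Nginx tooling.

```python
import re
from collections import Counter

# Matches the "recv() failed (<errno>: <message>)" error and captures the
# upstream URL. The pattern is an assumption based on the sample entry above.
RESET_RE = re.compile(
    r'recv\(\) failed \((?P<errno>\d+): (?P<msg>[^)]*)\).*?'
    r'upstream: "(?P<upstream>[^"]+)"'
)

def count_resets(lines):
    """Return a Counter of errno 104 (ECONNRESET) errors keyed by upstream URL."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match and match.group("errno") == "104":
            counts[match.group("upstream")] += 1
    return counts

# The sample entry from above, joined onto one line.
sample = (
    '2024/01/15 03:42:17 [error] 1234#1234: *8901 recv() failed '
    '(104: Connection reset by peer) while reading upstream response header '
    'from upstream, client: 203.0.113.45, server: api.example.com, '
    'request: "GET /ws/live-feed HTTP/1.1", '
    'upstream: "http://127.0.0.1:8080/ws/live-feed", host: "api.example.com"'
)

if __name__ == "__main__":
    for upstream, hits in count_resets([sample]).most_common():
        print(hits, upstream)
```

If one upstream address dominates the tally, the problem is more likely a single crashing worker than the proxy configuration as a whole.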

Immediate consequence: Every active WebSocket subscriber on that upstream worker loses their connection simultaneously. If you are running Socket.IO, GraphQL subscriptions, or a live-data feed, all clients reconnect at the same instant — creating a thundering herd that can cascade into a full upstream meltdown.


The Attack Vector / Blast Radius

This is primarily a stability and availability failure, but it has a secondary security surface:

Cascading failure path:

  1. Upstream worker OOMs, crashes, or hits its own idle timeout and sends RST.
  2. Nginx logs errno 104. Clients still in the upgrade handshake receive a 502; WebSocket connections already established on that worker are simply dropped.
  3. All clients implement exponential backoff reconnect — but if the backoff floor is low (e.g., 1s), hundreds of clients hammer the upstream simultaneously.
  4. Upstream respawns under load, immediately OOMs again. Loop.
  5. Result: Full service brownout. CPU pegged. Load balancer health checks fail. Auto-scaling triggers but new instances inherit the same misconfigured timeout.
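
One common way to break the loop in steps 3 and 4 is to randomize the client's reconnect delay so that clients do not return in lockstep. The sketch below shows exponential backoff with full jitter. The function name `reconnect_delay` and the `base` and `cap` defaults are illustrative assumptions, not values taken from any particular client library.

```python
import random

def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized delay in seconds for a 0-based reconnect attempt."""
    # Exponential ceiling, capped so late attempts do not wait unboundedly.
    # The exponent is clamped to avoid float overflow on very large attempts.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: choose uniformly in [0, ceiling] so clients spread out
    # instead of all reconnecting at the same instant.
    return rng.uniform(0.0, ceiling)

if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

With a 1s floor and no jitter, every client retries at 1s, 2s, 4s and so on together. With jitter, the same population is spread across each interval, which gives a respawning worker room to recover.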

Why the misconfiguration makes it worse:

  • Without proxy_read_timeout set high enough for long-lived WebSocket connections, Nginx itself closes the proxied connection after the default 60 seconds without data from the upstream, even if the WebSocket is intentionally idle (waiting for a push event).
  • Without proxy_http_version 1.1 and the Upgrade/Connection header chain, Nginx proxies with HTTP/1.0 (its long-standing default), which cannot carry the Upgrade handshake, so the WebSocket negotiation fails on every attempt.
  • Upgrade and Connection are hop-by-hop headers, so Nginx does not forward them unless told to. A missing or incorrect Upgrade header means the upstream never switches protocols: it responds with HTTP 200 on a non-WebSocket handler and closes the connection immediately.
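
The three conditions above can be checked mechanically before a config ships. The following is a rough sketch, assuming the location block is available as plain text. `check_ws_location` and `REQUIRED` are hypothetical names, and the regexes are a simplification: they ignore `include` files and nested blocks, and they only understand timeouts written in seconds.

```python
import re

# Directives a WebSocket location is expected to contain. These patterns are
# simplified assumptions and do not parse Nginx syntax in full.
REQUIRED = {
    "proxy_http_version 1.1": r"^\s*proxy_http_version\s+1\.1\s*;",
    "Upgrade header": r"^\s*proxy_set_header\s+Upgrade\s+\S+\s*;",
    "Connection header": r"^\s*proxy_set_header\s+Connection\s+\S+\s*;",
}

def check_ws_location(block, min_read_timeout=3600):
    """Return a list of human-readable problems found in a location block."""
    flags = re.MULTILINE | re.IGNORECASE
    problems = [f"missing {name}" for name, pattern in REQUIRED.items()
                if not re.search(pattern, block, flags)]
    match = re.search(r"^\s*proxy_read_timeout\s+(\d+)s?\s*;", block, flags)
    if match is None:
        problems.append("proxy_read_timeout missing or not expressed in seconds")
    elif int(match.group(1)) < min_read_timeout:
        problems.append(
            f"proxy_read_timeout {match.group(1)}s is below {min_read_timeout}s")
    return problems

if __name__ == "__main__":
    broken = """location /ws/ {
        proxy_pass http://ws_upstream;
        proxy_http_version 1.0;
        proxy_read_timeout 60s;
    }"""
    for problem in check_ws_location(broken):
        print(problem)
```

The 3600s threshold is only a default for this sketch. Pick a value that matches how long your connections may legitimately sit idle.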

Security angle: An attacker who knows your WebSocket endpoint is misconfigured can send a high-frequency reconnect flood. Because each reconnect is a full HTTP upgrade handshake, it costs the server far more to process than a bare TCP SYN, and it can slip past L4 rate limiters that are tuned for SYN floods.


How to Fix It

Basic Fix — Correct the WebSocket Proxy Block

The minimum viable fix for a single upstream WebSocket location:

 http {
+    map $http_upgrade $connection_upgrade {
+        default upgrade;
+        ''      close;
+    }
 
     server {
         listen 443 ssl;
         server_name api.example.com;
 
         location /ws/ {
             proxy_pass http://ws_upstream;
-            proxy_http_version 1.0;
+            proxy_http_version 1.1;
 
-            # Missing upgrade headers — upstream never switches protocols
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection $connection_upgrade;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
 
-            proxy_read_timeout 60s;
-            proxy_send_timeout 60s;
+            proxy_read_timeout 3600s;
+            proxy_send_timeout 3600s;
+            proxy_connect_timeout 10s;
 
+            # Disable buffering — WebSocket frames must not be buffered
+            proxy_buffering off;
+            proxy_cache off;
         }
     }
 }

Enterprise Best Practice — Upstream Keepalive Pool + Health Checks

For production environments with multiple upstream workers (Node cluster, PM2, Gunicorn with eventlet, Go net/http):

 http {
 
+    map $http_upgrade $connection_upgrade {
+        default upgrade;
+        ''      close;
+    }
 
     upstream ws_upstream {
-        server 127.0.0.1:8080;
-        server 127.0.0.1:8081;
+        server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
+        server 127.0.0.1:8081 max_fails=3 fail_timeout=30s;
 
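+        # Keepalive pool: reuses idle upstream connections for plain HTTP requests.
+        # Upgraded WebSocket connections are never returned to the pool, and requests
+        # sent upstream with "Connection: close" (the map's '' case) bypass it.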
+        keepalive 64;
+        keepalive_requests 10000;
+        keepalive_timeout 75s;
     }
 
     server {
         listen 443 ssl;
 
+        # Active health checks (NGINX Plus only; for OSS consider lua-resty-healthcheck).
+        # health_check belongs in a location that has proxy_pass, and the upstream
+        # block needs a shared-memory "zone" directive, e.g.:
+        #     health_check interval=5s fails=2 passes=3 uri=/healthz;
 
         location /ws/ {
             proxy_pass http://ws_upstream;
             proxy_http_version 1.1;
 
             proxy_set_header Upgrade $http_upgrade;
             proxy_set_header Connection $connection_upgrade;
             proxy_set_header Host $host;
             proxy_set_header X-Real-IP $remote_addr;
             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
 
             proxy_read_timeout 3600s;
             proxy_send_timeout 3600s;
             proxy_connect_timeout 10s;
 
             proxy_buffering off;
             proxy_cache off;
 
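+            # Enable TCP keepalive probes (SO_KEEPALIVE) on upstream sockets so dead
+            # peers and idle drops by NAT or firewalls are detected. This does not
+            # override proxy_read_timeout.
+            proxy_socket_keepalive on;
 
+            # Limit concurrent WebSocket connections per IP to blunt reconnect-flood abuse
+            limit_conn ws_conn_zone 50;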
         }
     }
 
+    # Define the shared-memory zone used for per-IP connection limiting
+    limit_conn_zone $binary_remote_addr zone=ws_conn_zone:10m;
 }

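Upstream-side fix (Node.js example): Ensure your upstream server's keep-alive idle timeout is longer than the keepalive_timeout of Nginx's upstream pool (75s above). Otherwise Node can close an idle pooled connection just as Nginx reuses it, and Nginx sees a reset:

 const http = require('http');
 const server = http.createServer(app);
-server.keepAliveTimeout = 5000;  // 5s (Node's default): shorter than Nginx's idle timeout, so reused connections get reset
+server.keepAliveTimeout = 80000; // 80s: must exceed Nginx's upstream keepalive_timeout (75s)
+server.headersTimeout = 85000;   // Must be > keepAliveTimeout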



Prevention in CI/CD

Stop this from reaching production again with these enforcement layers:

1. gixy — Static Nginx Security & Config Linter

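# Install and run in your CI pipeline
pip install gixy
gixy /etc/nginx/nginx.conf
# Catches security misconfigurations (SSRF, HTTP splitting, Host header spoofing,
# add_header redefinition, alias traversal). It does not validate timeouts or
# WebSocket headers, so pair it with the checks below.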

Add to your GitHub Actions workflow:

- name: Lint Nginx Config
  run: |
    docker run --rm -v "$(pwd)/nginx:/etc/nginx" yandex/gixy /etc/nginx/nginx.conf

2. nginx -t in Pre-Commit + Deployment Gate

# .github/workflows/deploy.yml
- name: Validate Nginx Config Syntax
  run: nginx -t -c ${{ github.workspace }}/nginx/nginx.conf
  # Fails the pipeline on any config error before deployment

3. Checkov — IaC Policy for Nginx Terraform/Helm Deployments

If you manage Nginx via Helm or Terraform, add a custom Checkov check:

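# checkov/custom_checks/nginx_ws_timeout.py
import re
from checkov.common.models.enums import CheckResult, CheckCategories
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class NginxWebSocketTimeoutCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure Nginx proxy_read_timeout >= 3600 for WebSocket locations"
        id = "CKV_CUSTOM_NGINX_001"
        supported_resources = ['helm_release']
        super().__init__(name=name, id=id, categories=[CheckCategories.NETWORKING],
                         supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        # Assumption: the Nginx snippet is embedded in the release's `values`; only second-based timeouts are recognized.
        match = re.search(r"proxy_read_timeout\s+(\d+)s?\s*;", str(conf.get("values", "")))
        ok = match is not None and int(match.group(1)) >= 3600
        return CheckResult.PASSED if ok else CheckResult.FAILED

check = NginxWebSocketTimeoutCheck()  # Checkov registers the check when it is instantiated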

4. Prometheus Alert — Catch 104 Errors Before Users Do

# prometheus/alerts/nginx.yml
groups:
  - name: nginx_websocket
    rules:
      - alert: NginxUpstreamConnectionReset
        # Metric name depends on your exporter; substitute the upstream-error counter it actually exposes.
        expr: increase(nginx_upstream_connect_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx upstream RST rate elevated — check WebSocket proxy config"
          runbook: "https://wiki.internal/runbooks/nginx-104-econnreset"

5. Integration Test — Simulate Long-Lived WebSocket in CI

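#!/bin/bash
# ci/test_websocket_timeout.sh
# Requires: websocat
# Pings every 30s keep traffic flowing; drop the --ping-* flags to test a fully idle connection.
# stdin is fed from sleep so websocat does not hit EOF and exit early in non-interactive CI.
sleep 130 | websocat --ping-interval 30 --ping-timeout 10 \
  wss://staging.api.example.com/ws/live-feed &
WS_PID=$!
sleep 120  # Hold connection for 2 minutes
# Fail explicitly if websocat already exited: the connection dropped before 120s
if ! kill -0 "$WS_PID" 2>/dev/null; then
  echo "FAIL: WebSocket closed before 120s (timeout misconfigured?)" >&2
  exit 1
fi
kill "$WS_PID"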

Run this in your staging smoke test suite. A properly configured Nginx+upstream stack will hold this connection for the full 120 seconds without a 104 error.
