Fixing Nginx WebSocket 'recv() failed (104: Connection reset by peer)' — Root Cause & Production Hardening Guide
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Nginx received TCP RST (errno 104) from the upstream server while proxying a WebSocket connection — the upstream killed the socket before the exchange completed, or Nginx never properly negotiated the HTTP→WebSocket upgrade.
- How to fix it: Set proxy_read_timeout and proxy_send_timeout, add the mandatory Upgrade and Connection proxy headers, and ensure your upstream keepalive pool is correctly configured.
- Fast path: Use our Client-Side Sandbox below to auto-refactor this: paste your nginx.conf or location block and get a corrected diff without sending your config to any external server.
The Incident (What Does the Error Mean?)
You will see this in /var/log/nginx/error.log:
2024/01/15 03:42:17 [error] 1234#1234: *8901 recv() failed (104: Connection reset by peer)
while reading upstream response header from upstream,
client: 203.0.113.45, server: api.example.com,
request: "GET /ws/live-feed HTTP/1.1",
upstream: "http://127.0.0.1:8080/ws/live-feed",
host: "api.example.com"
errno 104 = ECONNRESET. The remote peer (your upstream app server — Node.js, Gunicorn, Go binary, etc.) sent a TCP RST packet. This is not a graceful FIN-based close. It is an abrupt socket termination. The client receives a 502 Bad Gateway or a dropped WebSocket frame with no clean close handshake.
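To see the difference between a reset and a graceful close in isolation, the following minimal sketch reproduces an RST on the loopback interface. The function name provoke_reset is illustrative, and forcing the reset with SO_LINGER is only one way to trigger it; your upstream may reset for other reasons (crash, full backlog, a proxy in between).

```python
import errno
import socket
import struct


def provoke_reset() -> int:
    """Return the errno a client sees when its peer aborts with RST (0 if none)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    # SO_LINGER enabled with a zero timeout makes close() send RST instead of FIN.
    peer.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    peer.close()
    try:
        client.recv(1)  # a graceful FIN would return b"" here instead of raising
        return 0
    except ConnectionResetError as exc:
        return exc.errno
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    code = provoke_reset()
    print(code, errno.errorcode.get(code))
```

On Linux the returned value corresponds to ECONNRESET, the same condition Nginx reports in the log entry above.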
Immediate consequence: Every active WebSocket subscriber on that upstream worker loses their connection simultaneously. If you are running Socket.IO, GraphQL subscriptions, or a live-data feed, all clients reconnect at the same instant — creating a thundering herd that can cascade into a full upstream meltdown.
The Attack Vector / Blast Radius
This is primarily a stability and availability failure, but it has a secondary security surface:
Cascading failure path:
- Upstream worker OOMs, crashes, or hits its own idle timeout and sends RST.
- Nginx logs 104 and returns 502 to requests that are still in the handshake; clients whose connections were already upgraded are simply disconnected.
- All clients implement exponential backoff reconnect — but if the backoff floor is low (e.g., 1s), hundreds of clients hammer the upstream simultaneously.
- Upstream respawns under load, immediately OOMs again. Loop.
- Result: Full service brownout. CPU pegged. Load balancer health checks fail. Auto-scaling triggers but new instances inherit the same misconfigured timeout.
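One common way to soften the reconnect storm described above is to randomize the client backoff so that clients do not retry in lockstep. The sketch below shows a "full jitter" delay calculation; the function name, the 1 s base, and the 60 s cap are illustrative assumptions, not values taken from any particular client library.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Because every client draws its own random delay, reconnects spread out over the whole window instead of arriving at the same instant.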
Why the misconfiguration makes it worse:
- Without proxy_read_timeout set high enough for long-lived WebSocket connections, Nginx itself closes the connection after the default 60 seconds without data from the upstream, even if the WebSocket is intentionally idle (waiting for a push event).
- Without proxy_http_version 1.1 and the Upgrade/Connection header chain, Nginx proxies with its default HTTP/1.0 and sends Connection: close, so the protocol upgrade cannot be negotiated with the upstream.
- A missing or incorrect Upgrade header means the upstream never switches protocols: it answers with an ordinary non-101 response (for example HTTP 200 from a non-WebSocket handler) and the connection is closed immediately.
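Whether the upstream actually switched protocols can be checked from the response head alone. The helper below is a minimal sketch of that check; the name upstream_switched_protocols is hypothetical, and it inspects only the status code and the Upgrade/Connection headers, not Sec-WebSocket-Accept.

```python
def upstream_switched_protocols(response_head: bytes) -> bool:
    """Return True if a raw HTTP response head is a WebSocket 101 upgrade."""
    lines = response_head.decode("iso-8859-1").split("\r\n")
    status_parts = lines[0].split(" ", 2)
    if len(status_parts) < 2 or status_parts[1] != "101":
        return False
    headers = {}
    for line in lines[1:]:
        if not line:  # blank line ends the header section
            break
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip().lower()
    connection_tokens = [t.strip() for t in headers.get("connection", "").split(",")]
    return headers.get("upgrade") == "websocket" and "upgrade" in connection_tokens


if __name__ == "__main__":
    ok = b"HTTP/1.1 101 Switching Protocols\r\nUpgrade: websocket\r\nConnection: Upgrade\r\n\r\n"
    print(upstream_switched_protocols(ok))
```

A plain 200 response from a non-WebSocket handler fails this check, which is the situation the last bullet describes.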
Security angle: An attacker who knows your WebSocket endpoint is misconfigured can send a high-frequency reconnect flood. Because each reconnect is a full HTTP upgrade handshake, it is significantly more expensive than a plain TCP SYN flood and bypasses many L4 rate limiters.
How to Fix It
Basic Fix — Correct the WebSocket Proxy Block
The minimum viable fix for a single upstream WebSocket location:
  http {
+     map $http_upgrade $connection_upgrade {
+         default upgrade;
+         ''      close;
+     }
      server {
          listen 443 ssl;
          server_name api.example.com;
          location /ws/ {
              proxy_pass http://ws_upstream;
-             proxy_http_version 1.0;
+             proxy_http_version 1.1;
-             # Missing upgrade headers: upstream never switches protocols
+             proxy_set_header Upgrade $http_upgrade;
+             proxy_set_header Connection $connection_upgrade;
+             proxy_set_header Host $host;
+             proxy_set_header X-Real-IP $remote_addr;
+             proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
-             proxy_read_timeout 60s;
-             proxy_send_timeout 60s;
+             proxy_read_timeout 3600s;
+             proxy_send_timeout 3600s;
+             proxy_connect_timeout 10s;
+             # Disable buffering: WebSocket frames must not be buffered
+             proxy_buffering off;
+             proxy_cache off;
          }
      }
  }
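The map block in the fix above decides which Connection header the upstream sees. As a rough illustration of that logic (not of Nginx internals), the sketch below mirrors it in Python; both function names are hypothetical.

```python
from typing import Dict


def connection_upgrade(http_upgrade: str) -> str:
    """Mirror `map $http_upgrade $connection_upgrade`: '' maps to close, anything else to upgrade."""
    return "close" if http_upgrade == "" else "upgrade"


def proxied_headers(client_headers: Dict[str, str]) -> Dict[str, str]:
    """Build the Upgrade/Connection headers that would be forwarded upstream."""
    upgrade = client_headers.get("Upgrade", "")
    headers = {"Connection": connection_upgrade(upgrade)}
    if upgrade:
        # Nginx does not pass a header whose proxy_set_header value is empty.
        headers["Upgrade"] = upgrade
    return headers


if __name__ == "__main__":
    print(proxied_headers({"Upgrade": "websocket"}))
    print(proxied_headers({}))
```

The point of the map is that ordinary HTTP requests hitting the same location still get Connection: close, while WebSocket handshakes get Connection: upgrade.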
Enterprise Best Practice — Upstream Keepalive Pool + Health Checks
For production environments with multiple upstream workers (Node cluster, PM2, Gunicorn with eventlet, Go net/http):
  http {
+     map $http_upgrade $connection_upgrade {
+         default upgrade;
+         ''      close;
+     }
      upstream ws_upstream {
-         server 127.0.0.1:8080;
-         server 127.0.0.1:8081;
+         server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
+         server 127.0.0.1:8081 max_fails=3 fail_timeout=30s;
+         # Keepalive pool: reuses idle TCP connections for regular HTTP requests.
+         # Upgraded WebSocket connections are not returned to the pool.
+         keepalive 64;
+         keepalive_requests 10000;
+         keepalive_timeout 75s;
      }
      server {
          listen 443 ssl;
+         # Active health checks (requires nginx-plus or use lua-resty-healthcheck for OSS).
+         # health_check only works in a location that proxies to the upstream group.
+         # location /nginx-health {
+         #     proxy_pass http://ws_upstream;
+         #     health_check interval=5s fails=2 passes=3 uri=/healthz;
+         # }
          location /ws/ {
              proxy_pass http://ws_upstream;
              proxy_http_version 1.1;
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection $connection_upgrade;
              proxy_set_header Host $host;
              proxy_set_header X-Real-IP $remote_addr;
              proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
              proxy_read_timeout 3600s;
              proxy_send_timeout 3600s;
              proxy_connect_timeout 10s;
              proxy_buffering off;
              proxy_cache off;
+             # Enable TCP keepalive probes (SO_KEEPALIVE) on upstream sockets so dead peers
+             # are detected; this does not override proxy_read_timeout
+             proxy_socket_keepalive on;
+             # Limit concurrent WebSocket connections per IP to prevent RST flood abuse
+             limit_conn ws_conn_zone 50;
          }
      }
+     # Define the shared-memory zone used for per-IP connection limiting
+     limit_conn_zone $binary_remote_addr zone=ws_conn_zone:10m;
  }
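Upstream-side fix (Node.js example): Ensure your upstream server does not close idle keep-alive connections sooner than Nginx does. keepAliveTimeout should exceed the upstream keepalive_timeout (75s above); it governs idle HTTP keep-alive sockets, not connections that have already upgraded to WebSocket:
  const http = require('http');
  const server = http.createServer(app); // app: your existing request handler
- server.keepAliveTimeout = 5000; // 5s: shorter than Nginx's upstream keepalive_timeout, so Nginx reuses a socket Node already closed = RST
+ server.keepAliveTimeout = 80000; // 80s: must exceed Nginx's upstream keepalive_timeout (75s)
+ server.headersTimeout = 85000; // Must be > keepAliveTimeout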
Prevention in CI/CD
Stop this from reaching production again with these enforcement layers:
1. gixy — Static Nginx Security & Config Linter
# Install and run in your CI pipeline
pip install gixy
gixy /etc/nginx/nginx.conf
# Catches security misconfigurations such as SSRF, HTTP splitting, and add_header redefinition; it does not validate WebSocket timeouts
Add to your GitHub Actions workflow:
- name: Lint Nginx Config
  run: |
    docker run --rm -v "$(pwd)/nginx:/etc/nginx" yandex/gixy /etc/nginx/nginx.conf
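A general-purpose linter may not flag WebSocket-specific omissions, so a small project-specific check can complement it. The following is a minimal sketch that scans the text of a location block for the directives this guide relies on; the function name, the REQUIRED table, and the regex-based matching are assumptions for illustration, not a real Nginx parser.

```python
import re
from typing import List

# Directives this guide treats as mandatory for a WebSocket location (illustrative).
REQUIRED = {
    "proxy_http_version 1.1": r"proxy_http_version\s+1\.1\s*;",
    "Upgrade header": r"proxy_set_header\s+Upgrade\s+\$http_upgrade\s*;",
    "Connection header": r"proxy_set_header\s+Connection\s+\S+\s*;",
    "proxy_read_timeout": r"proxy_read_timeout\s+\S+\s*;",
}


def missing_ws_directives(location_block: str) -> List[str]:
    """Return the names of required directives absent from a location block."""
    # Drop comments so a commented-out directive does not count as present.
    active = re.sub(r"#.*", "", location_block)
    return [name for name, pattern in REQUIRED.items()
            if not re.search(pattern, active)]


if __name__ == "__main__":
    block = "location /ws/ { proxy_pass http://ws_upstream; proxy_http_version 1.1; }"
    print(missing_ws_directives(block))
```

A CI step could fail the build whenever the returned list is non-empty for a location that is known to serve WebSocket traffic.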
2. nginx -t in Pre-Commit + Deployment Gate
# .github/workflows/deploy.yml
- name: Validate Nginx Config Syntax
  run: nginx -t -c ${{ github.workspace }}/nginx/nginx.conf
  # Fails the pipeline on any config error before deployment
3. Checkov — IaC Policy for Nginx Terraform/Helm Deployments
If you manage Nginx via Helm or Terraform, add a custom Checkov check:
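# checkov/custom_checks/nginx_ws_timeout.py
import re
from checkov.common.models.enums import CheckResult, CheckCategories
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class NginxWebSocketTimeoutCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure Nginx proxy_read_timeout >= 3600 for WebSocket locations"
        id = "CKV_CUSTOM_NGINX_001"
        supported_resources = ['helm_release']
        super().__init__(name=name, id=id, categories=[CheckCategories.NETWORKING],
                         supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        # Assumption: the timeout is set in the release's inline `values`, as
        # proxy-read-timeout / proxy_read_timeout followed by a number of seconds.
        match = re.search(r"proxy[-_]read[-_]timeout\D*(\d+)", str(conf.get("values", "")))
        return CheckResult.PASSED if match and int(match.group(1)) >= 3600 else CheckResult.FAILED

check = NginxWebSocketTimeoutCheck()  # instantiating the check registers it with Checkov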
4. Prometheus Alert — Catch 104 Errors Before Users Do
# prometheus/alerts/nginx.yml
groups:
  - name: nginx_websocket
    rules:
      - alert: NginxUpstreamConnectionReset
        # The metric name depends on your exporter; substitute the counter it actually exposes
        expr: increase(nginx_upstream_connect_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx upstream RST rate elevated: check WebSocket proxy config"
          runbook: "https://wiki.internal/runbooks/nginx-104-econnreset"
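If no exporter metric is available, the same signal can be derived from the error log itself. The sketch below counts errno 104 entries per upstream; it assumes each log entry occupies a single line in the standard error-log format shown earlier, and the function and pattern names are illustrative.

```python
import re
from collections import Counter

# Matches the reset message regardless of the exact errno description text.
RESET_RE = re.compile(r"recv\(\) failed \(104: [^)]*\)")
UPSTREAM_RE = re.compile(r'upstream: "([^"]+)"')


def count_resets(log_text: str) -> Counter:
    """Count errno 104 entries per upstream URL in Nginx error-log text."""
    counts = Counter()
    for entry in log_text.splitlines():
        if RESET_RE.search(entry):
            match = UPSTREAM_RE.search(entry)
            counts[match.group(1) if match else "unknown"] += 1
    return counts


if __name__ == "__main__":
    with open("/var/log/nginx/error.log", encoding="utf-8", errors="replace") as fh:
        for upstream, total in count_resets(fh.read()).most_common():
            print(total, upstream)
```

Grouping by upstream makes it easier to tell a single misbehaving worker from a proxy-wide configuration problem.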
5. Integration Test — Simulate Long-Lived WebSocket in CI
#!/bin/bash
# ci/test_websocket_timeout.sh
# Requires: websocat
websocat --ping-interval 30 --ping-timeout 10 \
  wss://staging.api.example.com/ws/live-feed &
WS_PID=$!
sleep 120 # Hold connection for 2 minutes
# If websocat has already exited, the connection dropped early: timeout misconfigured
if ! kill -0 "$WS_PID" 2>/dev/null; then
  echo "WebSocket connection dropped before 120s" >&2
  exit 1
fi
kill "$WS_PID"
Run this in your staging smoke test suite. A properly configured Nginx+upstream stack will hold this connection for the full 120 seconds without a 104 error.