How to Fix Nginx 'recv() failed (104: Connection reset by peer)' When Proxying Large POST Requests
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins
TL;DR
- What broke: The upstream reset its TCP connection to Nginx before it finished reading the full POST body or writing the response, typically triggered by undersized buffer directives and aggressive default timeouts.
- How to fix it: Raise proxy_read_timeout and proxy_send_timeout, set proxy_request_buffering off (or size buffers correctly), and bump client_max_body_size to match your payload ceiling.
- Fast path: Use our Client-Side Sandbox above to drop in your nginx.conf and auto-refactor every relevant directive. Secrets stay in your browser and never leave your machine.
The Incident (What Does the Error Mean?)
Your Nginx error log shows:
2024/01/15 03:42:17 [error] 1234#1234: *5678 recv() failed (104: Connection reset by peer)
while reading response header from upstream,
client: 10.0.1.45, server: api.example.com,
request: "POST /api/v2/upload HTTP/1.1",
upstream: "http://10.0.2.10:8080/api/v2/upload",
host: "api.example.com"
Error code 104 = ECONNRESET. The upstream (your app server — Node, Gunicorn, uWSGI, whatever) forcibly closed the TCP socket. This happens in two distinct failure modes:
- Nginx is still sending the request body to upstream when upstream's own read timeout fires and it resets the connection.
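- Nginx buffered the entire request body to disk/memory, then forwarded it, but the backend reset the connection before sending back a single byte of response headers, typically because its own worker or request timeout fired mid-processing. (If proxy_read_timeout on the Nginx side expires first, the log instead shows upstream timed out (110: Connection timed out) and the client gets a 504.)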
The client gets a 502 Bad Gateway. The upload is lost. If this is a payment processor callback, a webhook, or a file ingestion pipeline — that data is gone.
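When triaging many of these entries, it can help to pull the errno, the phase, and the upstream address out of each line. The following is a minimal Python sketch, assuming the log entry has been joined onto a single line; the function name parse_upstream_error and the regular expression are illustrative and not part of any Nginx tooling.

```python
import re

# Matches the parts of an Nginx proxy error line discussed above.
# Assumption: the entry is on one line, in the layout shown in the sample.
ERROR_RE = re.compile(
    r'(?P<syscall>\w+)\(\) failed \((?P<errno>\d+): (?P<message>[^)]+)\) '
    r'while (?P<phase>[^,]+),'
    r'.*?upstream: "(?P<upstream>[^"]+)"'
)


def parse_upstream_error(line):
    """Return syscall, errno, message, phase and upstream as a dict, or None."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["errno"] = int(info["errno"])
    return info


# The sample entry from this article, joined onto one line.
SAMPLE = (
    '2024/01/15 03:42:17 [error] 1234#1234: *5678 recv() failed '
    '(104: Connection reset by peer) while reading response header from upstream, '
    'client: 10.0.1.45, server: api.example.com, '
    'request: "POST /api/v2/upload HTTP/1.1", '
    'upstream: "http://10.0.2.10:8080/api/v2/upload", '
    'host: "api.example.com"'
)

if __name__ == "__main__":
    print(parse_upstream_error(SAMPLE))
```

A phase of "reading response header from upstream" means the reset arrived before Nginx had received any response headers, which is why the client sees a 502 rather than a partial response.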
The Attack Vector / Blast Radius
This is not a security vulnerability — it is a cascading availability failure with a predictable blast radius:
- Retry storms: Clients retry the failed POST. Each retry re-queues a large body. Under load, this saturates upstream worker threads simultaneously, compounding the original timeout.
- Disk I/O spike: With the default proxy_request_buffering on, Nginx spools every body larger than client_body_buffer_size to temp files under client_body_temp_path (often below /var/cache/nginx). A flood of large POSTs fills the temp directory partition. When /var fills, Nginx itself stops logging and can begin refusing connections.
- Silent data loss: If the upstream is a stateful write (database insert, S3 upload initiation, ledger entry), the partial-write scenario leaves orphaned records. The client received a 502, so it retries, and you now have duplicate processing risk unless your backend is idempotent.
- Worker exhaustion: Each hung proxy connection holds worker connection slots (one for the client side and one for the upstream side). With the commonly shipped worker_connections 1024, a sustained upload storm of slow clients can exhaust every slot, taking down every other endpoint on the same Nginx instance.
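The disk and connection pressure described above is easy to estimate with back-of-envelope arithmetic. The sketch below is a rough upper bound in Python; the burst size and payload size are hypothetical numbers chosen for illustration, and the helper names are not from any library.

```python
def connection_slots_needed(concurrent_uploads):
    """Upper bound: each proxied upload can hold a client-side and an upstream-side connection."""
    return 2 * concurrent_uploads


def temp_space_needed_mb(concurrent_uploads, body_size_mb):
    """Worst case with request buffering on: every in-flight body is spooled in full."""
    return concurrent_uploads * body_size_mb


if __name__ == "__main__":
    uploads = 600              # hypothetical burst of simultaneous uploads on one worker
    body_mb = 50               # hypothetical payload size in MB
    worker_connections = 1024  # value used in this article

    slots = connection_slots_needed(uploads)
    print(f"connection slots needed: {slots} (limit per worker: {worker_connections})")
    print(f"temp space needed: {temp_space_needed_mb(uploads, body_mb)} MB")
```

With these assumed numbers the burst needs 1200 slots against a limit of 1024 and roughly 30 GB of temp space, which is the point at which the other endpoints on the instance start to suffer.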
How to Fix It (The Solution)
Basic Fix
The minimum viable change for most deployments. Apply to your location block handling the large POST endpoint.
  http {
-     client_max_body_size 1m;
+     client_max_body_size 512m;
      server {
          location /api/v2/upload {
              proxy_pass http://backend_upstream;
-             # No timeout overrides: using Nginx defaults (60s read, 60s send)
+             proxy_read_timeout 300s;
+             proxy_send_timeout 300s;
+             proxy_connect_timeout 10s;
-             # Default: buffering on, small buffers
+             proxy_request_buffering off;
+             proxy_http_version 1.1;
+             proxy_set_header Connection "";
          }
      }
  }
proxy_request_buffering off is the single highest-impact change. It makes Nginx stream the request body directly to upstream in real time instead of spooling it first. The upstream starts reading immediately, eliminating the race condition between Nginx's spool-then-forward delay and the upstream's read timeout.
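proxy_http_version 1.1 + Connection "" are the prerequisites for keepalive to the upstream pool. They only take effect once the upstream block also declares a keepalive directive (as in the enterprise config below); with that in place they eliminate the per-request TCP handshake overhead that compounds under large-body load.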
Enterprise Best Practice
For production systems handling multi-hundred-MB payloads, file uploads, or high-throughput webhook ingestion:
  http {
-     client_max_body_size 1m;
-     proxy_buffers 4 4k;
-     proxy_buffer_size 4k;
-     proxy_busy_buffers_size 8k;
+     client_max_body_size 1024m;
+     client_body_buffer_size 128k;               # buffer small bodies in RAM, not disk
+     client_body_temp_path /dev/shm/nginx_tmp;   # use tmpfs, not slow disk
+     proxy_buffers 16 16k;
+     proxy_buffer_size 32k;
+     proxy_busy_buffers_size 64k;
      upstream backend_upstream {
          server 10.0.2.10:8080;
+         keepalive 32;              # maintain persistent connections to upstream
+         keepalive_requests 1000;
+         keepalive_timeout 65s;
      }
      server {
          location /api/v2/upload {
              proxy_pass http://backend_upstream;
-             proxy_read_timeout 60s;
-             proxy_send_timeout 60s;
+             proxy_read_timeout 600s;   # accommodate slow processing backends
+             proxy_send_timeout 600s;
+             proxy_connect_timeout 5s;
+             proxy_request_buffering off;
+             proxy_http_version 1.1;
+             proxy_set_header Connection "";
+             proxy_set_header X-Request-ID $request_id;
+             # Limit upload rate to prevent worker exhaustion
+             limit_req zone=upload_zone burst=20 nodelay;
+             limit_req_status 429;
          }
      }
+     limit_req_zone $binary_remote_addr zone=upload_zone:10m rate=5r/s;
  }
Critical tuning notes:
- /dev/shm for temp path: Writes body buffers to RAM-backed tmpfs instead of spinning disk or even NVMe. Eliminates I/O wait during concurrent large uploads.
- limit_req_zone: Without rate limiting, a single client can open 100 concurrent large-body uploads and exhaust all worker connections. 5 req/s per IP with burst 20 is a reasonable starting point; tune to your traffic profile. Note that limit_req caps the request rate, not the number of simultaneous uploads; pair it with limit_conn if you need a hard concurrency cap.
- keepalive 32 on upstream: Without this, every proxied request opens a new TCP connection to your app server. Under load this causes upstream port exhaustion (TIME_WAIT accumulation).
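The three proxy buffer sizes are not independent: Nginx rejects a config at startup when proxy_busy_buffers_size is smaller than the larger of proxy_buffer_size and one proxy_buffers buffer, or larger than all proxy_buffers minus one buffer. The Python sketch below mirrors that rule so values can be sanity-checked before a reload; the function names are hypothetical, and nginx -t remains the authoritative check.

```python
def parse_size(value):
    """Convert an Nginx size such as '16k' or '1m' to bytes (bare numbers are bytes)."""
    units = {"k": 1024, "m": 1024 ** 2}
    value = value.strip().lower()
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def check_proxy_buffers(buffer_count, buffer_size, header_buffer_size, busy_size):
    """Return a list of problems; an empty list means the sizes are consistent.

    buffer_count, buffer_size -> proxy_buffers <count> <size>
    header_buffer_size        -> proxy_buffer_size
    busy_size                 -> proxy_busy_buffers_size
    """
    one_buffer = parse_size(buffer_size)
    header = parse_size(header_buffer_size)
    busy = parse_size(busy_size)
    problems = []
    if busy < max(header, one_buffer):
        problems.append(
            "proxy_busy_buffers_size is smaller than proxy_buffer_size "
            "or a single proxy_buffers buffer"
        )
    if busy > (buffer_count - 1) * one_buffer:
        problems.append(
            "proxy_busy_buffers_size exceeds all proxy_buffers minus one buffer"
        )
    return problems


if __name__ == "__main__":
    # Values from the enterprise config above: 16 x 16k, 32k, 64k.
    print(check_proxy_buffers(16, "16k", "32k", "64k"))
```

Running it against the enterprise values prints an empty list, meaning 64k sits inside the allowed window for 16 buffers of 16k and a 32k header buffer.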
Prevention in CI/CD
This misconfiguration is trivially detectable before it hits production. Add these gates:
1. gixy — Nginx static analysis (add to your pipeline)
# Install
pip install gixy
# Run in CI against your rendered nginx.conf
gixy /etc/nginx/nginx.conf
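# Note: gixy's documented checks target security misconfigurations (SSRF, HTTP splitting, alias traversal and similar);
# client_max_body_size and proxy timeout values need a separate, custom check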
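2. Checkov for the infrastructure code that renders the config (if using Terraform templatefile or Helm charts)
# Checkov has no Nginx framework; scan the Terraform (or Helm) sources that produce the config instead
checkov -d ./nginx-configs --framework terraform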
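3. GitHub Actions gate: fail the PR on any HIGH-severity gixy finding. Failing on a default client_max_body_size or a proxy_read_timeout below 120s for upload endpoints needs a custom check step in addition to gixy:
# .github/workflows/nginx-lint.yml
on: [pull_request]
jobs:
  nginx-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run gixy
        run: |
          pip install gixy
          # --format selects the report format; --output would name a file instead
          gixy --format json nginx/nginx.conf | tee gixy-report.json
          # Fail if any HIGH severity findings
          python -c "
          import json,sys
          r=json.load(open('gixy-report.json'))
          # accept either a bare list of issues or an object with an 'issues' key
          issues=r if isinstance(r,list) else r.get('issues',[])
          highs=[i for i in issues if i['severity']=='HIGH']
          sys.exit(1 if highs else 0)
          "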
4. Load test before every release touching upload paths:
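# k6 smoke test: send a 50MB POST, fail the run on any non-200 response (502s included)
k6 run --vus 10 --duration 30s - <<'EOF'
import http from 'k6/http';
import { check } from 'k6';
// open() is only available in the init context, so load the payload once here
const payload = open('/tmp/50mb_test.bin', 'b');
// Without a threshold, failed checks do not change k6's exit code
export const options = { thresholds: { checks: ['rate==1.0'] } };
export default function () {
  const res = http.post('https://api.example.com/api/v2/upload', payload);
  check(res, { 'status is 200': (r) => r.status === 200 });
}
EOF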
If your k6 run surfaces 502s at >50MB payloads in staging, the PR does not merge. No exceptions.
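The directive gate named in step 3 (client_max_body_size left at its default, or proxy_read_timeout below 120s on an upload endpoint) can be approximated with a small script run as an extra CI step. The following Python sketch is deliberately naive: it uses regular expressions rather than a real Nginx parser, ignores include files and inheritance from outer blocks, and every function name in it is hypothetical.

```python
import re
import sys

TIME_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_seconds(value):
    """Convert a single-unit Nginx time such as '300s' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * TIME_UNITS[match.group(2) or "s"]


def strip_comments(text):
    """Drop '#' comments so commented-out directives are ignored (naive: ignores quoting)."""
    return "\n".join(line.split("#", 1)[0] for line in text.splitlines())


def location_body(conf, path):
    """Return the text inside `location <path> { ... }`, or None if absent."""
    start = re.search(r"location\s+" + re.escape(path) + r"\s*\{", conf)
    if start is None:
        return None
    depth, i = 1, start.end()
    while i < len(conf) and depth:
        depth += {"{": 1, "}": -1}.get(conf[i], 0)
        i += 1
    return conf[start.end():i - 1]


def find_problems(conf_text, upload_path, min_read_timeout=120):
    """Return human-readable problems for the upload location; empty list means pass."""
    conf = strip_comments(conf_text)
    problems = []
    if not re.search(r"\bclient_max_body_size\s+\S+\s*;", conf):
        problems.append("client_max_body_size is not set (default 1m applies)")
    body = location_body(conf, upload_path)
    if body is None:
        problems.append(f"no location block found for {upload_path}")
        return problems
    timeout = re.search(r"\bproxy_read_timeout\s+(\S+)\s*;", body)
    if timeout is None:
        problems.append("proxy_read_timeout is not set (default 60s applies)")
    elif parse_seconds(timeout.group(1)) < min_read_timeout:
        problems.append(
            f"proxy_read_timeout {timeout.group(1)} is below {min_read_timeout}s"
        )
    return problems


def main(conf_path, upload_path):
    with open(conf_path) as handle:
        problems = find_problems(handle.read(), upload_path)
    for problem in problems:
        print(f"FAIL: {problem}")
    return 1 if problems else 0


# Usage (file name is an assumption): python check_upload_conf.py nginx/nginx.conf /api/v2/upload
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Because the script exits non-zero on any finding, it can sit next to the gixy step and block the merge in the same way. For configs split across include files, running it against the output of nginx -T is one option.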