How to Fix Nginx 'recv() failed (104: Connection reset by peer)' When Proxying Large POST Requests

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins


TL;DR

  • What broke: The upstream reset the TCP connection while Nginx was still sending the POST body or waiting for the response headers. Typical triggers are full request buffering combined with short timeouts on either side of the proxy.
  • How to fix it: Raise proxy_read_timeout, proxy_send_timeout, set proxy_request_buffering off (or size buffers correctly), and bump client_max_body_size to match your payload ceiling.

The Incident (What Does the Error Mean?)

Your Nginx error log shows:

2024/01/15 03:42:17 [error] 1234#1234: *5678 recv() failed (104: Connection reset by peer)
  while reading response header from upstream,
  client: 10.0.1.45, server: api.example.com,
  request: "POST /api/v2/upload HTTP/1.1",
  upstream: "http://10.0.2.10:8080/api/v2/upload",
  host: "api.example.com"

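Error code 104 = ECONNRESET. The upstream (your app server: Node, Gunicorn, uWSGI, whatever) forcibly closed the TCP socket. This happens in two distinct failure modes:

  1. Nginx is still sending the request body to upstream when upstream's own read timeout fires and it resets the connection. The log then reads "while sending request to upstream".
  2. Nginx buffered the entire request body to memory or disk and forwarded it, but the backend reset the connection before sending a single byte of response headers, typically because its own worker timeout killed the handler mid-processing. The log reads "while reading response header from upstream", as in the example above. (If proxy_read_timeout expires first, Nginx instead logs "upstream timed out (110: Connection timed out)" and returns 504, not 502.)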

The client gets a 502 Bad Gateway. The upload is lost. If this is a payment processor callback, a webhook, or a file ingestion pipeline — that data is gone.
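
To tell the two failure modes apart in bulk, the errno and the "while ..." phase can be pulled out of each log entry. The following is a minimal sketch, not an official Nginx tool: the names LOG_PATTERN, HINTS, and classify are illustrative, and the hint table only covers the cases discussed in this article.

```python
import re

# Captures the errno and the phase Nginx reports after "while ...".
# \s+ tolerates log entries that were wrapped across lines.
LOG_PATTERN = re.compile(
    r"\((?P<errno>\d+): [^)]*\)\s+while (?P<phase>[a-z ]+upstream)"
)

# Illustrative mapping from (errno, phase) to the failure modes above.
HINTS = {
    (104, "sending request to upstream"):
        "mode 1: upstream reset while Nginx was still sending the body",
    (104, "reading response header from upstream"):
        "mode 2: upstream reset before sending any response headers",
    (110, "reading response header from upstream"):
        "proxy_read_timeout expired: Nginx returns 504, not 502",
}


def classify(entry):
    """Return a hint for one error-log entry, or None if it does not match."""
    match = LOG_PATTERN.search(entry)
    if match is None:
        return None
    key = (int(match.group("errno")), match.group("phase"))
    return HINTS.get(key, "unclassified upstream error")
```

Feeding it the log entry shown earlier returns the "mode 2" hint, because that entry failed while reading the response header.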


The Attack Vector / Blast Radius

This is not a security vulnerability — it is a cascading availability failure with a predictable blast radius:

  • Retry storms: Clients retry the failed POST. Each retry re-queues a large body. Under load, this saturates upstream worker threads simultaneously, compounding the original timeout.
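  • Disk I/O spike: With the default proxy_request_buffering on, Nginx spools any body larger than client_body_buffer_size to temp files under client_body_temp_path (on packaged builds this is usually somewhere under /var, for example /var/cache/nginx). A flood of large POSTs fills the temp directory partition. When /var fills, Nginx can no longer write its logs and starts failing requests whose bodies it cannot spool.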
  • Silent data loss: If the upstream is a stateful write (database insert, S3 upload initiation, ledger entry), the partial-write scenario leaves orphaned records. The client received a 502, so it retries — you now have duplicate processing risk unless your backend is idempotent.
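  • Connection exhaustion: Each in-flight proxied upload holds two connection slots in an Nginx worker, one for the client and one for the upstream. With worker_connections at its compiled-in default of 512 (packaged configs often ship 768 or 1024), a sustained storm of slow uploads can use up every slot, taking down every other endpoint on the same Nginx instance.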
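
The duplicate-processing risk from retried POSTs is usually handled with an idempotency key that the client sends with every attempt. The sketch below shows the idea only; IdempotentUploadStore, its in-memory dict, and the SHA-256 "result" are hypothetical stand-ins for a durable store and your real write path.

```python
import hashlib


class IdempotentUploadStore:
    """In-memory stand-in for a durable store keyed by idempotency key."""

    def __init__(self):
        self._results = {}

    def process(self, idempotency_key, body):
        """Return (result, replayed). The body is processed at most once per key."""
        if idempotency_key in self._results:
            # A retry after a 502: return the stored result, do not write again.
            return self._results[idempotency_key], True
        # Placeholder for the real side effect (DB insert, S3 upload, ...).
        result = hashlib.sha256(body).hexdigest()
        self._results[idempotency_key] = result
        return result, False
```

A client that received a 502 and retries with the same key gets the original result back with replayed set to True, so the write happens once.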

How to Fix It (The Solution)

Basic Fix

The minimum viable change for most deployments. Apply to your location block handling the large POST endpoint.

 http {
-    client_max_body_size 1m;
+    client_max_body_size 512m;

     server {
         location /api/v2/upload {
             proxy_pass http://backend_upstream;

-            # No timeout overrides — using Nginx defaults (60s read, 60s send)
+            proxy_read_timeout    300s;
+            proxy_send_timeout    300s;
+            proxy_connect_timeout 10s;

-            # Default: buffering on, small buffers
+            proxy_request_buffering off;
+            proxy_http_version      1.1;
+            proxy_set_header        Connection "";
         }
     }
 }

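proxy_request_buffering off is the single highest-impact change. It makes Nginx stream the request body to the upstream as it arrives instead of spooling it first. The upstream starts reading immediately, eliminating the race condition between Nginx's spool-then-forward delay and the upstream's read timeout. The trade-off: a slow client now occupies a backend worker for the whole upload, and Nginx cannot pass the request to another upstream server once it has started sending the body.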

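proxy_http_version 1.1 is required for streaming chunked request bodies: without it, Nginx buffers a chunked upload regardless of proxy_request_buffering. Combined with Connection "", it also allows keepalive to the upstream pool, which removes per-request TCP handshake overhead, but only once the upstream block sets a keepalive directive (see Enterprise Best Practice).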
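
To confirm the change from the client side, a large body can be streamed through the proxy and the status code inspected. This is a minimal sketch using only the Python standard library; chunked_body and post_large_body are illustrative names, and the URL you pass is an assumption about your environment.

```python
import http.client
from urllib.parse import urlsplit


def chunked_body(total_bytes, chunk_size=64 * 1024):
    """Yield zero-filled chunks that add up to exactly total_bytes."""
    chunk = bytes(chunk_size)
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk_size, remaining)
        yield chunk[:n]
        remaining -= n


def post_large_body(url, total_bytes, timeout=30.0):
    """POST total_bytes of filler data to url and return the HTTP status code."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # An explicit Content-Length keeps http.client from switching to
        # chunked transfer encoding for the generator body.
        conn.request(
            "POST",
            parts.path or "/",
            body=chunked_body(total_bytes),
            headers={
                "Content-Length": str(total_bytes),
                "Content-Type": "application/octet-stream",
            },
        )
        return conn.getresponse().status
    finally:
        conn.close()
```

For example, post_large_body("https://api.example.com/api/v2/upload", 50 * 1024 * 1024) should stop returning 502 once the directives above are in place and the backend accepts the payload.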


Enterprise Best Practice

For production systems handling multi-hundred-MB payloads, file uploads, or high-throughput webhook ingestion:

 http {
-    client_max_body_size    1m;
-    proxy_buffers           4 4k;
-    proxy_buffer_size       4k;
-    proxy_busy_buffers_size 8k;
+    client_max_body_size    1024m;
+    client_body_buffer_size 128k;       # buffer small bodies in RAM, not disk
+    client_body_temp_path   /dev/shm/nginx_tmp;  # use tmpfs, not slow disk

+    proxy_buffers           16 16k;
+    proxy_buffer_size       32k;
+    proxy_busy_buffers_size 64k;

     upstream backend_upstream {
         server 10.0.2.10:8080;
+        keepalive 32;                   # maintain persistent connections to upstream
+        keepalive_requests 1000;
+        keepalive_timeout  65s;
     }

     server {
         location /api/v2/upload {
             proxy_pass http://backend_upstream;

-            proxy_read_timeout 60s;
-            proxy_send_timeout 60s;
+            proxy_read_timeout    600s;  # accommodate slow processing backends
+            proxy_send_timeout    600s;
+            proxy_connect_timeout  5s;

+            proxy_request_buffering    off;
+            proxy_http_version         1.1;
+            proxy_set_header           Connection "";
+            proxy_set_header           X-Request-ID $request_id;

+            # Limit upload rate to prevent worker exhaustion
+            limit_req zone=upload_zone burst=20 nodelay;
+            limit_req_status 429;
         }
     }

+    limit_req_zone $binary_remote_addr zone=upload_zone:10m rate=5r/s;
 }
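
The three response-buffer directives above are not independent: Nginx refuses to start if proxy_busy_buffers_size is smaller than the larger of proxy_buffer_size and one proxy_buffers buffer, or larger than all proxy_buffers minus one buffer. A small sketch of that check follows; parse_size and check_proxy_buffers are illustrative helpers, not part of Nginx.

```python
def parse_size(value):
    """Convert an Nginx size such as '16k' or '1m' to bytes."""
    units = {"k": 1024, "m": 1024 ** 2}
    value = value.strip().lower()
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def check_proxy_buffers(count, size, buffer_size, busy_size):
    """Return a list of constraint violations (empty when the sizes are valid)."""
    one = parse_size(size)          # size of a single proxy_buffers buffer
    head = parse_size(buffer_size)  # proxy_buffer_size
    busy = parse_size(busy_size)    # proxy_busy_buffers_size
    problems = []
    if busy < max(head, one):
        problems.append(
            "proxy_busy_buffers_size is below max(proxy_buffer_size, one buffer)"
        )
    if busy > (count - 1) * one:
        problems.append(
            "proxy_busy_buffers_size exceeds all proxy_buffers minus one buffer"
        )
    return problems
```

With the values from the config above, check_proxy_buffers(16, "16k", "32k", "64k") returns an empty list. Raising proxy_buffer_size without also raising proxy_busy_buffers_size is the usual way to trip the first constraint.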

Critical tuning notes:

  • /dev/shm for temp path: Writes spooled bodies to RAM-backed tmpfs instead of spinning disk or even NVMe. Eliminates I/O wait during concurrent large uploads. Spooled bodies then count against memory, so size the tmpfs for your peak concurrent upload volume.
  • limit_req_zone: Without rate limiting, a single client can start new large-body uploads as fast as it can open connections. 5 req/s per IP with burst 20 is a reasonable starting point; tune it to your traffic profile. limit_req bounds the rate of new requests, not how many are in flight at once, so add limit_conn if you need a hard cap on concurrent uploads per client.
  • keepalive 32 on upstream: Without this, every proxied request opens a new TCP connection to your app server. Under load this exhausts ephemeral ports on the Nginx host as closed connections accumulate in TIME_WAIT.
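
To get a feel for what rate=5r/s with burst=20 nodelay admits, here is a simplified model of the leaky-bucket accounting for a single client key. It is a sketch under stated assumptions: excess is tracked in thousandths of a request, rejected requests leave the bucket unchanged, and simulate_limit_req is an illustrative name, not an Nginx API.

```python
def simulate_limit_req(arrivals_ms, rate_per_s, burst):
    """Return one bool per arrival time (ms): True if admitted, False if rejected."""
    decisions = []
    excess = 0      # thousandths of a request above the steady rate
    last_ms = None  # time of the last admitted request
    for now in arrivals_ms:
        if last_ms is None:
            # First request for this key starts with an empty bucket.
            decisions.append(True)
            last_ms = now
            continue
        # Drain at rate_per_s since the last admitted request, then add this one.
        candidate = max(excess - rate_per_s * (now - last_ms) + 1000, 0)
        if candidate > burst * 1000:
            decisions.append(False)  # over the burst allowance: rejected
            continue
        excess, last_ms = candidate, now
        decisions.append(True)
    return decisions
```

In this model, 100 requests arriving in the same millisecond yield 21 admissions (the first request plus the burst of 20), and the rest are rejected with the configured limit_req_status.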



Prevention in CI/CD

This misconfiguration is trivially detectable before it hits production. Add these gates:

1. gixy — Nginx static analysis (add to your pipeline)

# Install
pip install gixy

# Run in CI against your rendered nginx.conf
gixy /etc/nginx/nginx.conf
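# gixy's stock plugins target security misconfigurations (SSRF, HTTP splitting, alias traversal, ...)
# It does not inspect client_max_body_size or proxy timeouts, so pair it with a directive check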

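2. Checkov for the infrastructure code that ships the config (Terraform templatefile or Helm charts)

# Checkov has no nginx framework; scan the Helm chart (or use --framework terraform) that renders the config
checkov -d ./nginx-configs --framework helm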

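3. GitHub Actions gate: fail the PR on any HIGH-severity gixy finding. Failing on a default client_max_body_size or on proxy_read_timeout < 120s for upload endpoints needs a separate directive check, because gixy does not inspect those values:

# .github/workflows/nginx-lint.yml
name: nginx-lint
on: pull_request
jobs:
  nginx-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run gixy
        run: |
          pip install gixy
          # gixy may exit non-zero when it reports issues; defer the verdict to the filter below
          gixy --format json nginx/nginx.conf > gixy-report.json || true
          # Fail if any HIGH severity findings
          python -c "
          import json,sys
          r=json.load(open('gixy-report.json'))
          # Accept a bare list of issues or an object with an 'issues' key
          issues=r if isinstance(r,list) else r.get('issues',[])
          highs=[i for i in issues if i.get('severity')=='HIGH']
          sys.exit(1 if highs else 0)
          "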
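
The directive thresholds named in this gate (a non-default client_max_body_size and proxy_read_timeout of at least 120s) can be checked with a small script. The sketch below is a naive regex scan, not a real Nginx parser: it does not follow include directives, does not scope values per location, and only understands single-unit time values. The function names are illustrative.

```python
import re

SIZE_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
TIME_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def to_bytes(value):
    """Convert an Nginx size such as '512m' to bytes."""
    value = value.lower()
    if value[-1] in SIZE_UNITS:
        return int(value[:-1]) * SIZE_UNITS[value[-1]]
    return int(value)


def to_seconds(value):
    """Convert a single-unit Nginx time such as '300s' or '2m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.lower())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * TIME_UNITS[match.group(2) or "s"]


def directive_values(conf_text, name):
    """Return every value given to directive `name`, skipping commented lines."""
    pattern = re.compile(
        r"^\s*" + re.escape(name) + r"\s+([^;\s]+)\s*;", re.MULTILINE
    )
    return pattern.findall(conf_text)


def lint(conf_text, min_read_timeout_s=120):
    """Return human-readable findings; an empty list means the gate passes."""
    findings = []
    sizes = directive_values(conf_text, "client_max_body_size")
    if not sizes:
        findings.append("client_max_body_size not set (Nginx default is 1m)")
    for raw in sizes:
        if to_bytes(raw) == SIZE_UNITS["m"]:
            findings.append(f"client_max_body_size {raw} equals the 1m default")
    timeouts = directive_values(conf_text, "proxy_read_timeout")
    if not timeouts:
        findings.append("proxy_read_timeout not set (Nginx default is 60s)")
    for raw in timeouts:
        if to_seconds(raw) < min_read_timeout_s:
            findings.append(f"proxy_read_timeout {raw} is below {min_read_timeout_s}s")
    return findings
```

In a pipeline step, one would read the rendered nginx.conf, call lint() on its text, print the findings, and exit non-zero if the list is not empty.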

4. Load test before every release touching upload paths:

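# k6 smoke test: send 50MB POST, fail the run on any non-200 (including 502)
k6 run --vus 10 --duration 30s - <<'EOF'
import http from 'k6/http';
import { check } from 'k6';

// open() only works in the init context, so load the payload once at top level
const payload = open('/tmp/50mb_test.bin', 'b');

// check() alone never fails the run; this threshold makes k6 exit non-zero
export const options = { thresholds: { checks: ['rate==1.0'] } };

export default function () {
  const res = http.post('https://api.example.com/api/v2/upload', payload, {
    headers: { 'Content-Type': 'application/octet-stream' },
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
}
EOF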

If your k6 run surfaces 502s at 50MB payloads in staging, the PR does not merge. No exceptions.
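
The k6 script expects /tmp/50mb_test.bin to exist before it runs. A minimal sketch for generating it follows; write_test_payload is an illustrative helper, and random bytes are used on the assumption that you do not want compression anywhere in the path to shrink the payload.

```python
import os


def write_test_payload(path, size_bytes, chunk_size=1024 * 1024):
    """Write size_bytes of random data to path and return the resulting file size."""
    remaining = size_bytes
    with open(path, "wb") as fh:
        while remaining > 0:
            n = min(chunk_size, remaining)
            fh.write(os.urandom(n))  # written in chunks to keep memory use flat
            remaining -= n
    return os.path.getsize(path)
```

Calling write_test_payload("/tmp/50mb_test.bin", 50 * 1024 * 1024) as a setup step produces the file the smoke test reads.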
