
Fixing Nginx gRPC Passthrough 'recv() failed (104: Connection reset by peer)' — Root Cause & Production Fix

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: The TCP connection between Nginx and your gRPC upstream is being torn down mid-flight. The upstream (your gRPC server or a pod behind a Kubernetes service) reset the connection before Nginx finished reading the response header. Error code 104 is ECONNRESET.
  • How to fix it: Proxy with grpc_pass so Nginx speaks HTTP/2 to the upstream, enable and tune the upstream keepalive pool, and keep the upstream keepalive_timeout below your gRPC server's MaxConnectionAge and MaxConnectionIdle.

The Incident (What Does the Error Mean?)

Raw error log:

2024/01/15 03:42:17 [error] 31#31: *18423 recv() failed (104: Connection reset by peer)
while reading response header from upstream,
client: 10.0.1.45, server: api.internal,
request: "POST /helloworld.Greeter/SayHello HTTP/2.0",
upstream: "grpc://10.96.0.42:50051",
host: "api.internal"

What this means operationally:

Error 104 (ECONNRESET) is the kernel telling you the remote end sent a TCP RST packet. In a gRPC passthrough context, this is almost never a transient blip — it signals a structural misconfiguration between Nginx's connection lifecycle and your gRPC server's connection lifecycle.

Nginx established (or reused) a TCP connection to the upstream, sent the HTTP/2 HEADERS frame for the gRPC call, and then the upstream closed the connection before sending back a single byte of response headers. Nginx received the RST, logged 104, and returned a 502 to the client.
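
To see where errno 104 comes from independently of Nginx, the sequence above can be reproduced on loopback. This is a minimal sketch, not how Nginx or gRPC are implemented: a throwaway listener plays the upstream and uses SO_LINGER with a zero timeout to force an RST on close, and the client plays Nginx reusing that connection. The function name provoke_reset is hypothetical, and ECONNRESET is what Linux is expected to report; other platforms may surface a different errno.

```python
import errno
import socket
import struct
import threading
from typing import Optional


def provoke_reset() -> Optional[int]:
    """Return the errno seen by a client whose peer closed the socket with an RST."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    peer_closed = threading.Event()

    def upstream() -> None:
        conn, _ = listener.accept()
        # l_onoff=1, l_linger=0: close() discards the socket and sends RST, not FIN.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        conn.close()
        peer_closed.set()

    worker = threading.Thread(target=upstream)
    worker.start()
    client = socket.create_connection(listener.getsockname(), timeout=5)
    try:
        peer_closed.wait(timeout=5)
        # Reuse the connection after the peer reset it, as a proxy would from its pool.
        client.sendall(b"request on a connection the peer already reset")
        client.recv(1024)  # the step Nginx logs as "reading response header"
        return None
    except OSError as exc:
        return exc.errno
    finally:
        client.close()
        worker.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    code = provoke_reset()
    print(code, errno.errorcode.get(code))
```

Whether the error surfaces on the send or on the receive depends on timing, which is why both calls sit inside the same try block.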

Immediate consequence: Every affected gRPC call returns HTTP 502. In high-traffic environments, this triggers retry storms. If your client uses exponential backoff without jitter, you now have a thundering herd amplifying the outage.
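
One common client-side mitigation for the thundering herd is to randomize each retry delay, a scheme often called full jitter. The sketch below is illustrative only: the function name full_jitter_delays and the base and cap values are assumptions, not settings taken from any gRPC client library.

```python
import random
from typing import Callable, List


def full_jitter_delays(
    attempts: int,
    base: float = 0.1,
    cap: float = 5.0,
    rand: Callable[[], float] = random.random,
) -> List[float]:
    """Delay before each retry, drawn uniformly from [0, min(cap, base * 2**attempt))."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # plain exponential backoff value
        delays.append(rand() * ceiling)            # jitter spreads clients across the window
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(full_jitter_delays(5)):
        print(f"retry {attempt + 1}: sleep {delay:.3f}s")
```

Without the random factor, every client that failed at the same moment retries at the same moment; with it, retries are spread across the whole backoff window.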


The Attack Vector / Blast Radius

This is a cascading availability failure, not a security exploit — but its blast radius is severe:

Root cause chain (most common):

  1. Keepalive connection reuse on a dead socket. Nginx's keepalive directive pools idle connections to the upstream. If the upstream gRPC server (e.g., gRPC-Go, gRPC-Java) is configured with a MaxConnectionAge of 30s but the upstream keepalive_timeout in Nginx is left at its 60s default, Nginx will attempt to reuse a connection the server has already closed. The server sends RST. Nginx logs 104.

  2. Proxying with proxy_pass instead of grpc_pass. gRPC requires HTTP/2. There is no http2 directive for upstream blocks; the protocol on the upstream leg is decided by the proxying directive. grpc_pass with a grpc:// or grpcs:// target speaks HTTP/2 to the upstream, while proxy_pass with proxy_http_version 1.1 speaks HTTP/1.1. The gRPC server, expecting HTTP/2 frames, immediately resets the connection.

  3. Kubernetes pod restarts / rolling deployments. A pod terminates mid-connection. The kernel on the pod sends RST. Nginx has no health-check configured and keeps routing to the dead endpoint from its keepalive pool.

  4. Upstream gRPC server enforcing MaxConnectionIdle. A gRPC server configured with MaxConnectionIdle closes connections that stay idle past that limit. If Nginx holds a pooled connection past the server's idle timeout, the next request on that connection gets RST.

Blast radius:

  • All gRPC clients behind this Nginx can receive 502s at about the same time, because each stale connection in the pool fails the first request that reuses it.
  • Retry logic amplifies upstream load 3–10x.
  • Health checks on the upstream may pass (TCP handshake succeeds to a new connection) while application-level requests fail on reused connections — making this invisible to naive monitoring.

How to Fix It (The Solution)

Basic Fix — Align Keepalive Timeouts and Enable HTTP/2

The single most common fix: set the upstream keepalive_timeout in Nginx below your gRPC server's MaxConnectionAge, and proxy with grpc_pass so the upstream leg uses HTTP/2.

 upstream grpc_backend {
     server 10.96.0.42:50051;
-    # No keepalive configured: every request opens a new upstream connection
+    keepalive 32;                  # Pool up to 32 idle connections per worker
+    keepalive_requests 1000;       # Max requests per keepalive connection
+    keepalive_timeout 20s;         # MUST be less than gRPC server MaxConnectionAge
 }

 server {
     listen 443 ssl http2;
     server_name api.internal;

     location /helloworld.Greeter/ {
-        proxy_pass http://grpc_backend;
-        proxy_http_version 1.1;
+        grpc_pass grpc://grpc_backend;
+        grpc_set_header Host $host;
+        grpc_set_header X-Real-IP $remote_addr;
+
+        # Critical: set connect/read timeouts for long-lived gRPC streams
+        grpc_connect_timeout 5s;
+        grpc_read_timeout 300s;
+        grpc_send_timeout 300s;
     }
 }
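
The timeout relationship in this fix can be checked mechanically. The following is a rough sketch rather than a real Nginx parser: it uses regular expressions, handles only single-unit time values, and assumes upstream blocks contain no nested braces. The function names and the 30s server limit in the usage lines are hypothetical.

```python
import re
from typing import Optional

_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def nginx_time_to_seconds(value: str) -> float:
    """Convert a single-unit Nginx time such as '20s', '500ms' or '1m' to seconds.

    Simplification: combined values like '1h 30m' are not handled.
    """
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or "s"]  # a bare number means seconds


def upstream_keepalive_timeout(config_text: str, upstream: str) -> Optional[float]:
    """Return keepalive_timeout (seconds) set inside the named upstream block, if any."""
    block = re.search(
        r"upstream\s+" + re.escape(upstream) + r"\s*\{(.*?)\}", config_text, re.DOTALL
    )
    if block is None:
        return None
    directive = re.search(
        r"^\s*keepalive_timeout\s+(\S+?)\s*;", block.group(1), re.MULTILINE
    )
    return nginx_time_to_seconds(directive.group(1)) if directive else None


def is_safe(config_text: str, upstream: str, server_limit_s: float) -> bool:
    """True only if the upstream sets keepalive_timeout below the server's limit."""
    timeout = upstream_keepalive_timeout(config_text, upstream)
    return timeout is not None and timeout < server_limit_s


if __name__ == "__main__":
    sample = """
    upstream grpc_backend {
        server 10.96.0.42:50051;
        keepalive 32;
        keepalive_timeout 20s;
    }
    """
    # 30.0 stands in for the gRPC server's MaxConnectionAge in seconds (assumed).
    print(is_safe(sample, "grpc_backend", 30.0))
```

A missing keepalive_timeout is treated as unsafe on purpose, so the check fails closed when the directive is left at its default.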

Enterprise Best Practice — Full Production-Grade gRPC Passthrough

For production: add active health checks (these require Nginx Plus or the third-party nginx_upstream_check_module and are not shown in the config below), error interception, and TLS on the upstream leg (grpcs://) if mTLS is required.

 upstream grpc_backend {
     zone grpc_backend 64k;
     server 10.96.0.42:50051;
     server 10.96.0.43:50051;        # Multiple pods
+    keepalive 64;
+    keepalive_requests 10000;
+    keepalive_timeout 20s;           # Below gRPC server MaxConnectionAge (typically 30s)
 }

 server {
     listen 443 ssl http2;
     server_name api.internal;

     ssl_certificate     /etc/nginx/certs/server.crt;
     ssl_certificate_key /etc/nginx/certs/server.key;
     ssl_protocols       TLSv1.2 TLSv1.3;

+    # Intercept upstream errors for proper gRPC status mapping
+    grpc_intercept_errors on;

     location /helloworld.Greeter/ {
-        proxy_pass http://grpc_backend;
-        proxy_set_header Upgrade $http_upgrade;
-        proxy_set_header Connection "upgrade";
+        grpc_pass grpcs://grpc_backend;  # grpcs:// if upstream uses TLS
+
+        grpc_set_header Host             $host;
+        grpc_set_header X-Real-IP        $remote_addr;
+        grpc_set_header X-Forwarded-For  $proxy_add_x_forwarded_for;
+        grpc_set_header X-Forwarded-Proto $scheme;
+
+        grpc_connect_timeout  5s;
+        grpc_read_timeout     300s;   # Match your longest expected streaming RPC
+        grpc_send_timeout     300s;
+
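+        # Retry on another server after an upstream error. gRPC calls are POSTs, so
+        # Nginx will not retry once the request has been sent unless non_idempotent
+        # is also listed; add it only if your RPCs are safe to repeat.
+        grpc_next_upstream     error timeout invalid_header;
+        grpc_next_upstream_tries 3;
+        grpc_next_upstream_timeout 10s;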
     }

+    # Proper gRPC error page mapping
+    error_page 502 = /grpc_error_502;
+    location = /grpc_error_502 {
+        internal;
+        default_type application/grpc;
+        add_header grpc-status 14;     # UNAVAILABLE
+        add_header grpc-message "upstream unavailable";
+        return 204;
+    }
 }

Key parameters explained:

Parameter | Why It Matters
keepalive_timeout 20s | Must be shorter than the gRPC server's MaxConnectionAge. gRPC-Go applies no limit by default (infinity); this guide assumes a server configured with 30s, so Nginx is set to 20s.
grpc_next_upstream error timeout | Passes a failed request to the next server. gRPC calls are POSTs, so once the request has been sent Nginx retries only if non_idempotent is also listed; enable that only for RPCs that are safe to repeat.
keepalive_requests 10000 | Prevents Nginx from closing connections too aggressively under high RPC volume.
grpc_read_timeout 300s | Server-streaming and bidirectional RPCs stay open for minutes. The 60s default closes any stream on which the upstream sends nothing for 60s.



Prevention in CI/CD

Don't let this regress. Enforce these checks in your pipeline:

1. Nginx Config Linting with nginx -t in CI

# .github/workflows/nginx-lint.yml
- name: Validate Nginx Config
  run: |
    docker run --rm -v $(pwd)/nginx:/etc/nginx:ro nginx:latest \
      nginx -t -c /etc/nginx/nginx.conf

2. Conftest / OPA Policy — Enforce grpc_pass and keepalive

# policy/nginx_grpc.rego
package nginx.grpc

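deny[msg] {
    upstream := input.upstreams[_]
    # object.get supplies "" when the key is absent, so a missing directive is caught too
    object.get(upstream, "keepalive_timeout", "") == ""
    msg := "UPSTREAM BLOCK: keepalive_timeout must be explicitly set for gRPC upstreams. Stale connections cause error 104."
}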

deny[msg] {
    loc := input.servers[_].locations[_]
    contains(loc.path, "grpc") 
    not startswith(loc.grpc_pass, "grpc")
    msg := sprintf("LOCATION '%v': gRPC paths must use grpc_pass with grpc:// or grpcs:// scheme, not proxy_pass.", [loc.path])
}
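
The policy reads input.upstreams and input.servers[_].locations[_], but nothing here shows how an nginx.conf becomes that JSON. The sketch below assumes one possible input shape, with field names inferred only from the policy's references, and mirrors the two rules in Python for quick local experiments. It treats a missing keepalive_timeout the same as an empty one. SAMPLE_INPUT and violations are hypothetical names, not part of Conftest or OPA.

```python
import json
from typing import Any, Dict, List

# Hypothetical input document; the field names come from the policy's references,
# not from a real nginx-to-JSON converter.
SAMPLE_INPUT: Dict[str, Any] = {
    "upstreams": [{"name": "grpc_backend", "keepalive_timeout": ""}],
    "servers": [
        {
            "locations": [
                {"path": "/grpc/helloworld.Greeter/", "proxy_pass": "http://grpc_backend"}
            ]
        }
    ],
}


def violations(doc: Dict[str, Any]) -> List[str]:
    """Python mirror of the two deny rules."""
    found = []
    for upstream in doc.get("upstreams", []):
        if not upstream.get("keepalive_timeout"):
            found.append("upstream: keepalive_timeout must be set")
    for server in doc.get("servers", []):
        for loc in server.get("locations", []):
            is_grpc_path = "grpc" in loc.get("path", "")
            if is_grpc_path and not loc.get("grpc_pass", "").startswith("grpc"):
                found.append(f"location {loc['path']}: use grpc_pass")
    return found


if __name__ == "__main__":
    print(json.dumps(SAMPLE_INPUT, indent=2))  # could be saved as input for a policy run
    print(violations(SAMPLE_INPUT))
```

Note that the second rule only fires for location paths containing "grpc", so a path such as /helloworld.Greeter/ would pass unchecked under this matching.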

3. Integration Test — Validate gRPC Passthrough End-to-End

# Use grpcurl in your smoke test suite.
# No -plaintext here: the listener on 443 terminates TLS.
output=$(grpcurl \
  -d '{"name": "ci-probe"}' \
  -max-time 5 \
  api.internal:443 \
  helloworld.Greeter/SayHello 2>&1)
status=$?

# Assert exit code 0 and no 'Code: Unavailable' in output
if [ "$status" -ne 0 ] || echo "$output" | grep -q 'Code: Unavailable'; then
  echo "FAIL: gRPC upstream returning UNAVAILABLE or probe failed; check Nginx keepalive config"
  echo "$output"
  exit 1
fi

4. Alert on 104 Errors in Production

# Prometheus alerting rule
- alert: NginxGrpcConnectionReset
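  # Assumes ingress-nginx metrics and backend Service names containing "grpc".
  # The requests counter has no upstream_addr label, so match on service instead.
  expr: |
    increase(nginx_ingress_controller_requests{
      status="502",
      service=~".*grpc.*"
    }[5m]) > 10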
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Nginx gRPC upstream returning 502 — likely recv() 104 connection reset"
    runbook: "https://wiki.internal/runbooks/nginx-grpc-104"
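
For plain Nginx without ingress controller metrics, the same signal can be derived from error.log. The sketch below counts 104 resets per upstream address; it assumes each entry sits on a single line, as Nginx writes it (the excerpt earlier in this article is wrapped for display). The name count_resets and the log path in the usage lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the 104 reset message and captures the quoted upstream address that follows.
RESET_LINE = re.compile(
    r'recv\(\) failed \(104: [^)]*\).*?upstream: "(?P<upstream>[^"]+)"'
)


def count_resets(lines: Iterable[str]) -> Counter:
    """Count 'recv() failed (104: ...)' entries per upstream address."""
    counts: Counter = Counter()
    for line in lines:
        match = RESET_LINE.search(line)
        if match:
            counts[match.group("upstream")] += 1
    return counts


if __name__ == "__main__":
    # Path is an assumption; adjust to wherever error_log points in your setup.
    with open("/var/log/nginx/error.log", encoding="utf-8", errors="replace") as log:
        for upstream, total in count_resets(log).most_common():
            print(f"{total:6d}  {upstream}")
```

Grouping by upstream address helps separate one misbehaving pod from a timeout mismatch that affects every backend equally.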

The definitive checklist before deploying any Nginx gRPC config:

  • grpc_pass uses grpc:// or grpcs:// scheme (NOT proxy_pass)
  • keepalive_timeout set below upstream gRPC server MaxConnectionAge
  • keepalive connection pool size tuned to expected concurrency
  • grpc_next_upstream error timeout configured, with non_idempotent added only where RPCs are safe to repeat (gRPC calls are POSTs and are otherwise not retried once sent)
  • grpc_read_timeout matches longest expected streaming RPC duration
  • Active health checks configured (Nginx Plus) or grpc_next_upstream_tries > 1 as fallback
  • grpc_intercept_errors on with proper gRPC status code error pages
