What is the exact cause of 'recv() failed (104: Connection reset by peer)' in Nginx gRPC passthrough?

Error 104 is ECONNRESET: the upstream gRPC server (or something on the path to it) sent a TCP RST, closing the connection before Nginx received any response headers. The most common causes are: (1) Nginx reusing a pooled keepalive connection that the gRPC server has already closed because of its MaxConnectionIdle or MaxConnectionAge limit, (2) using proxy_pass instead of grpc_pass, which makes Nginx speak HTTP/1.x to a server that expects HTTP/2 and rejects it, or (3) a Kubernetes pod restart mid-connection. Fix it by keeping Nginx's upstream keepalive_timeout below the server's connection limits and switching to grpc_pass with the grpc:// scheme.

How do I set the correct keepalive_timeout value for Nginx gRPC upstreams?

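The upstream keepalive_timeout in Nginx is an idle timeout: it closes a pooled connection that has carried no request for that long. Set it strictly below the gRPC server's MaxConnectionIdle, and below MaxConnectionAge if that is the only limit configured. gRPC-Go enforces neither limit by default (both default to infinity), so check what your server actually sets through keepalive.ServerParameters; the examples here assume a server configured with a MaxConnectionAge of 30 seconds, for which an Nginx keepalive_timeout of 20s leaves a safe margin. For gRPC-Java servers, check the maxConnectionAge() configuration on the server builder. Because MaxConnectionAge caps total connection lifetime rather than idle time, a busy connection can still reach it; on Nginx 1.19.10 and later, the upstream keepalive_time directive (default 1h) caps the lifetime of a pooled connection and should also sit below MaxConnectionAge. If you control the gRPC server, you can raise MaxConnectionAge to 300s or higher for long-lived connections, but you must still keep the Nginx values below it.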

Does grpc_next_upstream fix the 502 errors caused by stale keepalive connections?

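Partly. grpc_next_upstream lets Nginx retry a failed request on another server, or on a fresh connection, when it hits a reset socket from the keepalive pool. Two caveats apply. First, 'error' and 'timeout' are already the default conditions, so 'grpc_next_upstream error timeout invalid_header' only adds invalid_header. Second, gRPC calls are POST requests, and Nginx does not retry a non-idempotent request that has already been sent upstream unless you also list non_idempotent. Add non_idempotent only where a duplicate call is harmless: that is usually acceptable for idempotent unary RPCs, but risky for RPCs with side effects and for streaming RPCs. Set grpc_next_upstream_tries to 2 or 3 and grpc_next_upstream_timeout to 10s to bound the retries. Treat this as a safety net; correct keepalive timeouts remain the actual fix.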

Fixing Nginx gRPC Passthrough 'recv() failed (104: Connection reset by peer)' — Root Cause & Production Fix

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: the upstream (your gRPC server, or a pod behind a Kubernetes Service) reset the TCP connection before Nginx finished reading the response header. Error code 104 is ECONNRESET.
How to fix it: proxy with grpc_pass (which always speaks HTTP/2 to the upstream) instead of proxy_pass, tune the upstream keepalive pool, and keep Nginx's upstream keepalive timeouts below your gRPC server's MaxConnectionIdle and MaxConnectionAge.

The Incident (What Does the Error Mean?)

Raw error log:

2024/01/15 03:42:17 [error] 31#31: *18423 recv() failed (104: Connection reset by peer)
while reading response header from upstream,
client: 10.0.1.45, server: api.internal,
request: "POST /helloworld.Greeter/SayHello HTTP/2.0",
upstream: "grpc://10.96.0.42:50051",
host: "api.internal"

What this means operationally:

Error 104 (ECONNRESET) is the kernel telling you the remote end sent a TCP RST packet. In a gRPC passthrough context, this is almost never a transient blip — it signals a structural misconfiguration between Nginx's connection lifecycle and your gRPC server's connection lifecycle.

Nginx established (or reused) a TCP connection to the upstream, sent the HTTP/2 HEADERS frame for the gRPC call, and then the upstream closed the connection before sending back a single byte of response headers. Nginx received the RST, logged 104, and returned a 502 to the client.

Immediate consequence: Every affected gRPC call returns HTTP 502. In high-traffic environments, this triggers retry storms. If your client uses exponential backoff without jitter, you now have a thundering herd amplifying the outage.
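
One common way to add the missing jitter is "full jitter" backoff: wait a random time between zero and an exponentially growing ceiling. The sketch below is illustrative only. The function name backoff_delay and its base and cap arguments are assumptions made for this example, not part of any gRPC client library, and a client's built-in retry policy should be preferred where one exists.

```shell
# Full-jitter backoff: print a random delay in [0, min(cap, base * 2^attempt)] seconds.
# attempt is zero-based; base and cap are whole seconds.
backoff_delay() {
  attempt=$1 base=$2 cap=$3
  ceiling=$(( base * (1 << attempt) ))
  if [ "$ceiling" -gt "$cap" ]; then
    ceiling=$cap
  fi
  # Seed awk's generator from the clock, the shell PID, and the attempt number
  # so that calls made within the same second do not all return the same value.
  awk -v max="$ceiling" -v seed="$(( $(date +%s) + $$ + attempt ))" \
    'BEGIN { srand(seed); print int(rand() * (max + 1)) }'
}
```

A retry loop would call something like `sleep "$(backoff_delay "$attempt" 1 30)"` between attempts, so that clients recovering from the same outage spread their retries out instead of arriving together.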

The Attack Vector / Blast Radius

This is a cascading availability failure, not a security exploit — but its blast radius is severe:

Root cause chain (most common):

1. Keepalive connection reuse on a dead socket. Nginx's keepalive directive pools idle connections to the upstream. If the gRPC server (e.g., gRPC-Go, gRPC-Java) is configured with a MaxConnectionAge of 30s but Nginx keeps idle upstream connections for its default keepalive_timeout of 60s, Nginx will eventually reuse a connection the server has already closed. The server answers with RST and Nginx logs 104.
2. proxy_pass instead of grpc_pass. gRPC requires HTTP/2. Nginx has no http2 directive for upstreams: grpc_pass always speaks HTTP/2 to the upstream, while proxy_pass speaks HTTP/1.x. If a gRPC location is proxied with proxy_pass, the gRPC server, expecting HTTP/2 frames, rejects the request and closes the connection.
3. Kubernetes pod restarts / rolling deployments. A pod terminates mid-connection and its kernel sends RST. Without health checks, Nginx keeps routing to the dead endpoint and reusing connections from its keepalive pool.
4. Upstream gRPC server enforcing MaxConnectionIdle. When this limit is set, the server closes connections that stay idle past it. If Nginx holds a pooled connection longer than the server's idle timeout, the next request on that connection gets RST.

Blast radius:

Every gRPC client behind this Nginx can receive 502s at the same time, because all of them share the same pool of stale upstream connections.
Client retry logic can multiply upstream load several times over.
Health checks on the upstream may pass (TCP handshake succeeds to a new connection) while application-level requests fail on reused connections — making this invisible to naive monitoring.
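
Because connection-level checks can look healthy while reused connections fail, the Nginx error log is often the more reliable signal. The following is a minimal sketch that tallies reset errors per upstream. It assumes each log entry sits on a single line (as Nginx writes it; the sample above is wrapped for readability), and count_resets_by_upstream is a hypothetical helper name.

```shell
# Tally "104: Connection reset by peer" errors per upstream address.
# Reads Nginx error-log lines on stdin and prints "<upstream> <count>",
# most frequent first.
count_resets_by_upstream() {
  grep -F '(104: Connection reset by peer)' \
    | sed -n 's/.*upstream: "\([^"]*\)".*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2, $1 }'
}
```

Typical use would be `count_resets_by_upstream < /var/log/nginx/error.log`, where the log path is an assumption about your installation. A count concentrated on one address points at a single bad pod; counts spread evenly across all upstreams point at a timeout mismatch.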

How to Fix It (The Solution)

Basic Fix — Align Keepalive Timeouts and Enable HTTP/2

The single most common fix: set the upstream keepalive_timeout in Nginx below your gRPC server's MaxConnectionIdle / MaxConnectionAge, and proxy with grpc_pass so that the upstream leg uses HTTP/2.

 upstream grpc_backend {
     server 10.96.0.42:50051;
-    # No keepalive tuning: pooled connections can outlive the server's connection limits
+    keepalive 32;                  # Pool up to 32 idle connections per worker
+    keepalive_requests 1000;       # Max requests per keepalive connection
+    keepalive_timeout 20s;         # Idle timeout; MUST be below the gRPC server's MaxConnectionIdle / MaxConnectionAge
 }

 server {
     listen 443 ssl http2;
     server_name api.internal;

     location /helloworld.Greeter/ {
-        proxy_pass http://grpc_backend;
-        proxy_http_version 1.1;
+        grpc_pass grpc://grpc_backend;
+        grpc_set_header Host $host;
+        grpc_set_header X-Real-IP $remote_addr;
+
+        # Critical: set connect/read timeouts for long-lived gRPC streams
+        grpc_connect_timeout 5s;
+        grpc_read_timeout 300s;
+        grpc_send_timeout 300s;
     }
 }
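
Before rolling the change out, it can help to check the two values against each other mechanically. The sketch below compares an Nginx timeout with a server-side limit. Both helper names (to_seconds, keepalive_is_safe) are made up for this example, and it handles only single-unit values such as 20s, 5m, 1h, or a bare number of seconds, not compound durations.

```shell
# Convert a single-unit duration (20s, 5m, 1h, or bare seconds) to seconds.
to_seconds() {
  case "$1" in
    *s) echo "${1%s}" ;;
    *m) echo $(( ${1%m} * 60 )) ;;
    *h) echo $(( ${1%h} * 3600 )) ;;
    *)  echo "$1" ;;
  esac
}

# Succeed only if the Nginx timeout ($1) is strictly below the server limit ($2).
keepalive_is_safe() {
  nginx_timeout=$(to_seconds "$1")
  server_limit=$(to_seconds "$2")
  [ "$nginx_timeout" -lt "$server_limit" ]
}
```

For example, `keepalive_is_safe 20s 30s` succeeds, while `keepalive_is_safe 60s 30s` fails and could be used to abort a deployment script.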

Enterprise Best Practice — Full Production-Grade gRPC Passthrough

For production: add active health checks (these require Nginx Plus; the third-party nginx_upstream_check_module is an open-source alternative), error interception, and TLS on the upstream leg (grpcs://) if mTLS is required.

 upstream grpc_backend {
     zone grpc_backend 64k;
     server 10.96.0.42:50051;
     server 10.96.0.43:50051;        # Multiple pods
+    keepalive 64;
+    keepalive_requests 10000;
+    keepalive_timeout 20s;           # Below the gRPC server's MaxConnectionIdle / MaxConnectionAge (30s assumed here)
 }

 server {
     listen 443 ssl http2;
     server_name api.internal;

     ssl_certificate     /etc/nginx/certs/server.crt;
     ssl_certificate_key /etc/nginx/certs/server.key;
     ssl_protocols       TLSv1.2 TLSv1.3;

+    # Intercept upstream errors for proper gRPC status mapping
+    grpc_intercept_errors on;

     location /helloworld.Greeter/ {
-        proxy_pass http://grpc_backend;
-        proxy_set_header Upgrade $http_upgrade;
-        proxy_set_header Connection "upgrade";
+        grpc_pass grpcs://grpc_backend;  # grpcs:// if upstream uses TLS
+
+        grpc_set_header Host             $host;
+        grpc_set_header X-Real-IP        $remote_addr;
+        grpc_set_header X-Forwarded-For  $proxy_add_x_forwarded_for;
+        grpc_set_header X-Forwarded-Proto $scheme;
+
+        grpc_connect_timeout  5s;
+        grpc_read_timeout     300s;   # Match your longest expected streaming RPC
+        grpc_send_timeout     300s;
+
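+        # Retry on another server or a fresh connection after a reset.
+        # gRPC calls are POSTs: a request already sent upstream is retried only if
+        # non_idempotent is added, which is safe only for idempotent RPCs.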
+        grpc_next_upstream     error timeout invalid_header;
+        grpc_next_upstream_tries 3;
+        grpc_next_upstream_timeout 10s;
     }

+    # Proper gRPC error page mapping
+    error_page 502 = /grpc_error_502;
+    location = /grpc_error_502 {
+        internal;
+        default_type application/grpc;
+        add_header grpc-status 14;     # UNAVAILABLE
+        add_header grpc-message "upstream unavailable";
+        return 204;
+    }
 }

Key parameters explained:

Parameter	Why It Matters
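`keepalive_timeout 20s`	Idle timeout for pooled upstream connections. Must be shorter than the gRPC server's `MaxConnectionIdle` / `MaxConnectionAge`. gRPC-Go enforces neither by default; 20s assumes a server configured with 30s.
`grpc_next_upstream error timeout`	Retries on another server or a fresh connection when Nginx hits a reset socket from the pool. gRPC calls are POSTs, so a request already sent upstream is retried only if `non_idempotent` is also listed.
`keepalive_requests 10000`	Prevents Nginx from closing connections too aggressively under high RPC volume.
`grpc_read_timeout 300s`	Applies between two successive reads from the upstream, not to the whole response. Server-streaming and bidirectional RPCs that stay quiet for longer than the default 60s get cut off.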
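
To confirm which of these values a running Nginx has actually loaded, the full configuration can be dumped with `nginx -T` and filtered. The helper below is a rough sketch with a made-up name (show_grpc_directives). It is a plain text filter rather than a config parser, so it will also list keepalive directives from unrelated contexts.

```shell
# Print the keepalive and grpc_* directives from a config dump on stdin,
# with indentation and trailing comments removed.
show_grpc_directives() {
  grep -E '^[[:space:]]*(keepalive(_timeout|_time|_requests)?|grpc_[a-z_]+)[[:space:]]' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*#.*$//'
}
```

Typical use would be `nginx -T 2>/dev/null | show_grpc_directives`, run on the host or inside the container that serves the traffic.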


Prevention in CI/CD

Don't let this regress. Enforce these checks in your pipeline:

1. Nginx Config Linting with `nginx -t` in CI

# .github/workflows/nginx-lint.yml
- name: Validate Nginx Config
  run: |
    docker run --rm -v $(pwd)/nginx:/etc/nginx:ro nginx:latest \
      nginx -t -c /etc/nginx/nginx.conf

2. Conftest / OPA Policy — Enforce `grpc_pass` and `keepalive`

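# policy/nginx_grpc.rego
# Assumes the Nginx config has been converted to JSON with `upstreams` and
# `servers[].locations[]` keys; adjust the paths to match your converter's output.
package nginx.grpc

deny[msg] {
    upstream := input.upstreams[_]
    # object.get returns "" when the key is absent, so missing and empty values are both caught
    object.get(upstream, "keepalive_timeout", "") == ""
    msg := "UPSTREAM BLOCK: keepalive_timeout must be explicitly set for gRPC upstreams. Stale connections cause error 104."
}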

deny[msg] {
    loc := input.servers[_].locations[_]
    contains(loc.path, "grpc") 
    not startswith(loc.grpc_pass, "grpc")
    msg := sprintf("LOCATION '%v': gRPC paths must use grpc_pass with grpc:// or grpcs:// scheme, not proxy_pass.", [loc.path])
}

3. Integration Test — Validate gRPC Passthrough End-to-End

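# Use grpcurl in your smoke test suite.
# Port 443 terminates TLS, so -plaintext is not used; add -insecure only for self-signed test certs.
# Without -proto, grpcurl relies on server reflection being enabled on the upstream.
output=$(grpcurl \
  -d '{"name": "ci-probe"}' \
  -max-time 5 \
  api.internal:443 \
  helloworld.Greeter/SayHello 2>&1)
status=$?

# Assert exit code 0 and no 'Code: Unavailable' in output
if [ "$status" -ne 0 ] || echo "$output" | grep -q 'Code: Unavailable'; then
  echo "FAIL: gRPC upstream returning UNAVAILABLE or call failed; check Nginx keepalive config"
  exit 1
fi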

4. Alert on 104 Errors in Production

# Prometheus alerting rule
- alert: NginxGrpcConnectionReset
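  expr: |
    # Assumes ingress-nginx metrics. "grpc-api" is a placeholder Ingress name;
    # ingress-nginx exposes no upstream_addr label, so filter by ingress or service.
    sum(increase(nginx_ingress_controller_requests{
      status="502",
      ingress="grpc-api"
    }[5m])) > 10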
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Nginx gRPC upstream returning 502 — likely recv() 104 connection reset"
    runbook: "https://wiki.internal/runbooks/nginx-grpc-104"

The definitive checklist before deploying any Nginx gRPC config:

grpc_pass uses grpc:// or grpcs:// scheme (NOT proxy_pass)
Upstream keepalive_timeout (and keepalive_time, where available) set below the gRPC server's MaxConnectionIdle / MaxConnectionAge
keepalive connection pool size tuned to expected concurrency
grpc_next_upstream configured for retry on stale connections, with non_idempotent added only where retrying the RPC is safe
grpc_read_timeout matches longest expected streaming RPC duration
Active health checks configured (Nginx Plus) or grpc_next_upstream_tries > 1 as fallback
grpc_intercept_errors on with proper gRPC status code error pages
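
Parts of this checklist can be automated with a plain text scan of the config. The sketch below is an assumption-laden starting point rather than a real parser: check_grpc_config is a hypothetical helper, it cannot tell which context a directive appears in (a client-side keepalive_timeout in the http block would satisfy it), and it flags any proxy_pass even in locations that do not serve gRPC.

```shell
# Report checklist violations for the Nginx config text on stdin.
# Prints one line per problem and nothing when every check passes.
check_grpc_config() {
  conf=$(sed 's/#.*$//')   # read all of stdin, dropping comments
  # Succeeds when the pattern appears at line start or after whitespace, ';' or '{'.
  has() { printf '%s\n' "$conf" | grep -Eq "(^|[[:space:];{])$1"; }

  has 'grpc_pass[[:space:]]+grpcs?://' || echo "missing: grpc_pass with grpc:// or grpcs:// scheme"
  has 'keepalive[[:space:]]+[0-9]+'    || echo "missing: keepalive pool size"
  has 'keepalive_timeout[[:space:]]'   || echo "missing: keepalive_timeout"
  has 'grpc_read_timeout[[:space:]]'   || echo "missing: grpc_read_timeout"
  has 'grpc_next_upstream[[:space:]]'  || echo "missing: grpc_next_upstream"
  if has 'proxy_pass[[:space:]]'; then
    echo "found: proxy_pass (gRPC locations need grpc_pass)"
  fi
  return 0
}
```

In CI this could run as `check_grpc_config < nginx/nginx.conf`, failing the job when the output is non-empty; the file path is an assumption about the repository layout.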