Fixing Nginx gRPC Passthrough 'recv() failed (104: Connection reset by peer)' — Root Cause & Production Fix
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: The gRPC upstream (your gRPC server or a pod behind a Kubernetes service) reset the TCP connection before Nginx finished reading the response header. Nginx did not tear the connection down; it is reporting the reset it received. Error code 104 is ECONNRESET.
- How to fix it: Proxy with grpc_pass (which always speaks HTTP/2 to the upstream) instead of proxy_pass, tune the upstream keepalive connection pool, and align the upstream keepalive timeouts with your gRPC server's MaxConnectionIdle and MaxConnectionAge settings.
The Incident (What Does the Error Mean?)
Raw error log:
2024/01/15 03:42:17 [error] 31#31: *18423 recv() failed (104: Connection reset by peer)
while reading response header from upstream,
client: 10.0.1.45, server: api.internal,
request: "POST /helloworld.Greeter/SayHello HTTP/2.0",
upstream: "grpc://10.96.0.42:50051",
host: "api.internal"
What this means operationally:
Error 104 (ECONNRESET) is the kernel telling you the remote end sent a TCP RST packet. In a gRPC passthrough context, this is almost never a transient blip — it signals a structural misconfiguration between Nginx's connection lifecycle and your gRPC server's connection lifecycle.
Nginx established (or reused) a TCP connection to the upstream, sent the HTTP/2 HEADERS frame for the gRPC call, and then the upstream closed the connection before sending back a single byte of response headers. Nginx received the RST, logged 104, and returned a 502 to the client.
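To see what the kernel reports when a peer answers with RST, here is a minimal sketch using plain sockets on loopback. It is not Nginx or gRPC code; the function name provoke_reset is made up for illustration, and the forced reset via SO_LINGER is just one convenient way to provoke an RST.

```python
import errno
import socket
import struct


def provoke_reset() -> int:
    """Return the errno a client sees when its peer answers with a TCP RST."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    client.settimeout(5)  # fail fast instead of hanging if no RST arrives

    try:
        client.sendall(b"request bytes the server never reads")
        # SO_LINGER with l_onoff=1, l_linger=0 makes close() abort the
        # connection with an RST instead of the normal FIN handshake.
        server_side.setsockopt(
            socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0)
        )
        server_side.close()
        try:
            client.recv(1024)
        except ConnectionResetError as exc:
            return exc.errno
        return 0  # clean EOF, no reset observed
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    code = provoke_reset()
    print(code, errno.errorcode.get(code))
```

On Linux the returned value should equal errno.ECONNRESET, which is 104, the same number Nginx prints in the log line. Other platforms use a different numeric value for the same condition.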
Immediate consequence: Every affected gRPC call returns HTTP 502. In high-traffic environments, this triggers retry storms. If your client uses exponential backoff without jitter, you now have a thundering herd amplifying the outage.
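For the client side of that retry storm, the following is a small sketch of exponential backoff with full jitter. The function name backoff_delays and the base and cap values are illustrative assumptions, not part of any gRPC client library; real gRPC clients usually configure this through their retry policy instead.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Return one sleep duration (seconds) per retry attempt, using full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)], so
    clients that failed at the same moment do not retry at the same moment.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(5)):
        print(f"retry {attempt + 1}: sleep up to {delay:.3f}s")
```

Without the random draw, every client sleeps for the same 0.1s, 0.2s, 0.4s sequence and the retries arrive in synchronized waves.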
The Attack Vector / Blast Radius
This is a cascading availability failure, not a security exploit — but its blast radius is severe:
Root cause chain (most common):
1. Keepalive connection reuse on a dead socket. Nginx's upstream keepalive directive pools connections to the upstream. If the upstream gRPC server (e.g., gRPC-Go, gRPC-Java) is configured with a MaxConnectionAge of 30s but the upstream keepalive_timeout in Nginx is left at its 60s default, Nginx will attempt to reuse a connection the server already closed. The server sends RST. Nginx logs 104.
2. proxy_pass used instead of grpc_pass. gRPC requires HTTP/2, and Nginx has no http2 directive for the upstream side: it speaks HTTP/2 to an upstream only through grpc_pass with the grpc:// or grpcs:// scheme. With proxy_pass, Nginx sends an HTTP/1.x request. The gRPC server, expecting HTTP/2 frames, immediately closes or resets the connection.
3. Kubernetes pod restarts / rolling deployments. A pod terminates mid-connection. The kernel on the pod sends RST. Nginx has no health check configured and keeps routing to the dead endpoint from its keepalive pool.
4. Upstream gRPC server enforcing MaxConnectionIdle. gRPC servers actively close idle connections. If Nginx holds a pooled connection past the server's idle timeout, the next request on that connection gets RST.
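The timeout mismatch behind causes 1 and 4 can be checked mechanically. The sketch below assumes you already know both values; the helper names, the 5s safety margin, and the reduced set of time suffixes are illustrative choices, not an Nginx or gRPC API.

```python
import re

# Seconds per nginx time suffix (subset only: ms, s, m, h, d).
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def nginx_time_to_seconds(value):
    """Convert a single-unit nginx time such as '20s' or '5m' to seconds.

    A bare number is treated as seconds, as nginx does.
    """
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)?", value.strip())
    if match is None:
        raise ValueError(f"unsupported nginx time value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "s"]


def check_alignment(nginx_keepalive_timeout, server_limit_s, margin_s=5):
    """True if Nginx drops idle connections safely before the server does.

    server_limit_s is the smaller of the server's MaxConnectionIdle and
    MaxConnectionAge, in seconds (values you supply, not read from the server).
    """
    proxy_s = nginx_time_to_seconds(nginx_keepalive_timeout)
    return proxy_s + margin_s <= server_limit_s


if __name__ == "__main__":
    for timeout in ("20s", "60s"):
        print(timeout, "ok" if check_alignment(timeout, 30) else "MISALIGNED")
```

With a 30s server limit, a 20s Nginx timeout passes and the 60s default is flagged, which matches the scenario in item 1.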
Blast radius:
- All gRPC clients behind this Nginx start receiving 502s at roughly the same time, because the pooled connections go stale together.
- Retry logic amplifies upstream load 3–10x.
- Health checks on the upstream may pass (TCP handshake succeeds to a new connection) while application-level requests fail on reused connections — making this invisible to naive monitoring.
How to Fix It (The Solution)
Basic Fix — Align Keepalive Timeouts and Enable HTTP/2
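The single most common fix: proxy with grpc_pass (Nginx then speaks HTTP/2 to the upstream; http2 on the listen directive only covers the client side), and set the upstream keepalive_timeout below your gRPC server's MaxConnectionIdle and MaxConnectionAge. keepalive_timeout only bounds idle time, so on Nginx 1.19.10+ also consider keepalive_time to cap total connection lifetime below MaxConnectionAge.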
 upstream grpc_backend {
     server 10.96.0.42:50051;
-    # No explicit keepalive settings: a pool left on the 60s default idle timeout can outlive the server's limits
+    keepalive 32;              # Pool up to 32 idle connections per worker
+    keepalive_requests 1000;   # Max requests per keepalive connection
+    keepalive_timeout 20s;     # MUST be less than gRPC server MaxConnectionIdle / MaxConnectionAge
 }
 server {
     listen 443 ssl http2;
     server_name api.internal;
     location /helloworld.Greeter/ {
-        proxy_pass http://grpc_backend;
-        proxy_http_version 1.1;
+        grpc_pass grpc://grpc_backend;
+        grpc_set_header Host $host;
+        grpc_set_header X-Real-IP $remote_addr;
+
+        # Critical: set connect/read timeouts for long-lived gRPC streams
+        grpc_connect_timeout 5s;
+        grpc_read_timeout 300s;
+        grpc_send_timeout 300s;
     }
 }
Enterprise Best Practice — Full Production-Grade gRPC Passthrough
For production: add active health checks (requires Nginx Plus or the third-party nginx_upstream_check_module), error interception, and TLS on the upstream leg if mTLS is required.
 upstream grpc_backend {
     zone grpc_backend 64k;
     server 10.96.0.42:50051;
     server 10.96.0.43:50051;   # Multiple pods
+    keepalive 64;
+    keepalive_requests 10000;
+    keepalive_timeout 20s;     # Below gRPC server MaxConnectionAge (30s in this example)
 }
 server {
     listen 443 ssl http2;
     server_name api.internal;
     ssl_certificate /etc/nginx/certs/server.crt;
     ssl_certificate_key /etc/nginx/certs/server.key;
     ssl_protocols TLSv1.2 TLSv1.3;
+    # Intercept upstream errors for proper gRPC status mapping
+    grpc_intercept_errors on;
     location /helloworld.Greeter/ {
-        proxy_pass http://grpc_backend;
-        proxy_set_header Upgrade $http_upgrade;
-        proxy_set_header Connection "upgrade";
+        grpc_pass grpcs://grpc_backend;   # grpcs:// if upstream uses TLS
+
+        grpc_set_header Host $host;
+        grpc_set_header X-Real-IP $remote_addr;
+        grpc_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        grpc_set_header X-Forwarded-Proto $scheme;
+
+        grpc_connect_timeout 5s;
+        grpc_read_timeout 300s;           # Match your longest expected streaming RPC
+        grpc_send_timeout 300s;
+
+        # Retry on connection errors so a stale keepalive connection does not surface as a 502.
+        # gRPC calls are POSTs: once a request has been sent upstream, Nginx retries it only
+        # if non_idempotent is added here. Add it only for RPCs that are safe to repeat.
+        grpc_next_upstream error timeout invalid_header;
+        grpc_next_upstream_tries 3;
+        grpc_next_upstream_timeout 10s;
     }
+    # Proper gRPC error page mapping
+    error_page 502 = /grpc_error_502;
+    location = /grpc_error_502 {
+        internal;
+        default_type application/grpc;
+        add_header grpc-status 14;        # UNAVAILABLE
+        add_header grpc-message "upstream unavailable";
+        return 204;
+    }
 }
Key parameters explained:
| Parameter | Why It Matters |
|---|---|
| keepalive_timeout 20s | Must be shorter than the gRPC server's MaxConnectionIdle and MaxConnectionAge. gRPC-Go sets no limit by default, so use the value your server actually configures (30s in this example) and keep Nginx below it. |
| grpc_next_upstream error timeout | Retries on a fresh connection when Nginx hits a reset socket from the pool. Because gRPC calls are POSTs, a request that was already sent upstream is retried only if non_idempotent is also set. |
| keepalive_requests 10000 | Prevents Nginx from closing connections too aggressively under high RPC volume. |
| grpc_read_timeout 300s | Server-streaming and bidirectional RPCs can stay open for minutes. The 60s default closes any stream that stays silent for more than 60s between reads. |
Prevention in CI/CD
Don't let this regress. Enforce these checks in your pipeline:
1. Nginx Config Linting with nginx -t in CI
# .github/workflows/nginx-lint.yml
- name: Validate Nginx Config
  run: |
    docker run --rm -v "$(pwd)/nginx:/etc/nginx:ro" nginx:latest \
      nginx -t -c /etc/nginx/nginx.conf
2. Conftest / OPA Policy — Enforce grpc_pass and keepalive
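# policy/nginx_grpc.rego
# Assumes the Nginx config has been converted to JSON with "upstreams" and "servers" keys.
package nginx.grpc

deny[msg] {
    upstream := input.upstreams[_]
    # object.get covers both a missing field and an empty string
    object.get(upstream, "keepalive_timeout", "") == ""
    msg := "UPSTREAM BLOCK: keepalive_timeout must be explicitly set for gRPC upstreams. Stale connections cause error 104."
}

deny[msg] {
    loc := input.servers[_].locations[_]
    # Heuristic: only matches locations whose path contains "grpc"; adapt to your naming
    contains(loc.path, "grpc")
    not startswith(loc.grpc_pass, "grpc")
    msg := sprintf("LOCATION '%v': gRPC paths must use grpc_pass with grpc:// or grpcs:// scheme, not proxy_pass.", [loc.path])
}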
3. Integration Test — Validate gRPC Passthrough End-to-End
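# Use grpcurl in your smoke test suite.
# Port 443 terminates TLS, so -plaintext must not be used here;
# add -insecure only for self-signed test certificates.
# Without server reflection on the backend, also pass -proto with your .proto file.
if ! output=$(grpcurl \
    -d '{"name": "ci-probe"}' \
    -max-time 5 \
    api.internal:443 \
    helloworld.Greeter/SayHello 2>&1); then
  echo "$output"
  # Non-zero exit: flag the UNAVAILABLE case explicitly
  if echo "$output" | grep -q 'Code: Unavailable'; then
    echo "FAIL: gRPC upstream returning UNAVAILABLE, check Nginx keepalive config"
  fi
  exit 1
fi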
4. Alert on 104 Errors in Production
# Prometheus alerting rule
# Assumes ingress-nginx controller metrics; adjust the metric and label names to your exporter.
- alert: NginxGrpcConnectionReset
  expr: |
    increase(nginx_ingress_controller_requests{
      status="502",
      service=~"grpc.*"
    }[5m]) > 10
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Nginx gRPC upstream returning 502, likely recv() 104 connection reset"
    runbook: "https://wiki.internal/runbooks/nginx-grpc-104"
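The rule above alerts on 502s, which is only a proxy for the 104 error itself. As a complement, here is a rough sketch that counts the actual reset lines per upstream in an Nginx error log. It assumes the log format shown in the incident section with one entry per line; the function name and the two regular expressions are illustrative, not part of any Nginx tooling.

```python
import re
import sys
from collections import Counter

# Matches the reset message as it appears in the error log shown earlier.
RESET_RE = re.compile(r"recv\(\) failed \(104: [^)]*\)")
# Captures the upstream address, e.g. grpc://10.96.0.42:50051
UPSTREAM_RE = re.compile(r'upstream: "([^"]+)"')


def count_resets_by_upstream(lines):
    """Count 'recv() failed (104: ...)' entries, grouped by upstream address."""
    counts = Counter()
    for line in lines:
        if RESET_RE.search(line):
            match = UPSTREAM_RE.search(line)
            counts[match.group(1) if match else "unknown"] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_resets.py < /var/log/nginx/error.log
    for upstream, total in count_resets_by_upstream(sys.stdin).most_common():
        print(f"{total:6d}  {upstream}")
```

A single upstream dominating the counts points at one bad pod; an even spread across upstreams points back at the keepalive timeout mismatch.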
The definitive checklist before deploying any Nginx gRPC config:
- grpc_pass uses grpc:// or grpcs:// scheme (NOT proxy_pass)
- keepalive_timeout set below upstream gRPC server MaxConnectionAge
- keepalive connection pool size tuned to expected concurrency
- grpc_next_upstream error timeout configured for automatic retry on stale connections
- grpc_read_timeout matches longest expected streaming RPC duration
- Active health checks configured (Nginx Plus) or grpc_next_upstream_tries > 1 as fallback
- grpc_intercept_errors on with proper gRPC status code error pages