Fixing 504 Gateway Timeout in Kubernetes Nginx Ingress for Long-Running API Requests

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10–20 mins

TL;DR

  • What broke: Nginx Ingress controller is terminating upstream connections after its default 60-second proxy timeout, returning 504 Gateway Timeout to clients before your long-running API (batch job, ML inference, file upload, webhook) finishes.
  • How to fix it: Patch the Ingress resource with nginx.ingress.kubernetes.io/proxy-read-timeout and nginx.ingress.kubernetes.io/proxy-send-timeout annotations, and align your upstream Service and pod-level keep-alive settings.

The Incident (What Does the Error Mean?)

The raw error your client or upstream proxy receives:

HTTP/1.1 504 Gateway Time-out
Server: nginx/1.25.3
Date: Mon, 01 Jul 2024 14:32:07 GMT
Content-Type: text/html
Content-Length: 167

<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.25.3</center>
</body>
</html>

And in the Nginx Ingress controller pod logs:

2024/07/01 14:32:07 [error] 31#31: *10482 upstream timed out (110: Connection timed out)
while reading response header from upstream, client: 10.0.1.45,
server: api.example.com, request: "POST /api/v1/batch-process HTTP/1.1",
upstream: "http://10.96.14.22:8080/api/v1/batch-process",
host: "api.example.com"

Immediate consequence: Every API call whose upstream stays silent for longer than 60 seconds (the default proxy-read-timeout) is cut off at the proxy layer. Your upstream pod usually keeps processing, unaware that the client is gone. That means orphaned goroutines/threads on the backend, wasted compute, and a client that must retry from scratch.
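To see which routes are hitting the timeout, it can help to tally these log entries by request. The following is a minimal sketch, not part of the controller's tooling: the name count_timeouts and the regular expression are illustrative, and it assumes the error-log layout shown in the excerpt above (real logs normally keep each entry on one line, which the pattern also accepts).

```python
import re
from collections import Counter

# One error-log entry. The excerpt above is wrapped across lines, so the
# pattern tolerates whitespace (including newlines) between the fields.
TIMEOUT_ENTRY = re.compile(
    r'upstream timed out \(110: [^)]*\)\s+while (?P<phase>[^,]+),'
    r'.*?request:\s+"(?P<method>\S+)\s+(?P<path>\S+)[^"]*"',
    re.DOTALL,
)


def count_timeouts(log_text: str) -> Counter:
    """Count upstream timeouts per (method, path, phase)."""
    return Counter(
        (m["method"], m["path"], m["phase"].strip())
        for m in TIMEOUT_ENTRY.finditer(log_text)
    )
```

Feeding it the controller pod's log output gives a per-route count, which shows whether one slow endpoint or many are affected before you change any timeout.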


The Attack Vector / Blast Radius

This is a cascading availability failure, not a one-shot timeout:

  1. Client retries amplify backend load. A frustrated client (or an SDK with automatic retry logic) re-submits the same expensive request. Your backend now has two in-flight copies of the same job. Multiply by N users and you have a thundering herd.
  2. Orphaned upstream requests exhaust worker pools. Nginx closes its side of the upstream connection on timeout, but most application servers only notice when they try to write the response, so the worker stays busy until the job finishes or its own timeout fires. Under load, this drains your Gunicorn/uWSGI/Tomcat worker pool entirely, causing subsequent fast requests to queue and time out as well.
  3. HPA fires on CPU/memory, not on the real cause. Horizontal Pod Autoscaler sees high resource usage from the orphaned workers and spins up more pods. Each new pod inherits the same misconfigured Ingress. Scaling does nothing; you burn money and still serve 504s.
  4. Data integrity risk on non-idempotent endpoints. If your batch job writes to a database mid-flight when the proxy kills the connection, you may get partial writes. Without proper transaction rollback, this is a silent data corruption vector.
  5. SLO breach. A 504 is logged as a server-side error. Your error budget drains. If you have an alerting threshold on 5xx rate, this triggers PagerDuty at 3am.

The blast radius is not just one endpoint — a single misconfigured Ingress resource with pathType: Prefix on / covers every route in that virtual host.
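The retry amplification in point 1 can be put into rough numbers. The sketch below is a back-of-the-envelope model under stated assumptions, not a measurement: clients retry immediately after each 504, the backend runs every copy of the job to completion, and the function name peak_in_flight and its parameters are illustrative.

```python
import math


def peak_in_flight(job_seconds: float, proxy_timeout: float,
                   max_retries: int, clients: int = 1) -> int:
    """Estimate the peak number of concurrent copies of the same job.

    A new attempt starts every `proxy_timeout` seconds (one per 504), and
    each attempt keeps a backend worker busy for `job_seconds`.
    """
    attempts = max_retries + 1                       # first try plus retries
    overlapping = math.ceil(job_seconds / proxy_timeout)
    return clients * min(attempts, overlapping)
```

For a 150-second job behind the 60-second default, a client with two retries ends up with three copies running at once (attempts start at 0s, 60s, and 120s, and the first does not finish until 150s). Fifty such clients occupy 150 workers for work that needed 50.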


How to Fix It (The Solution)

Root Cause Mapping

Nginx Directive       | Ingress Annotation / ConfigMap Key                | Default (s) | What It Controls
proxy_read_timeout    | nginx.ingress.kubernetes.io/proxy-read-timeout    | 60          | Maximum gap between two successive reads from the upstream (not the total response time)
proxy_send_timeout    | nginx.ingress.kubernetes.io/proxy-send-timeout    | 60          | Maximum gap between two successive writes from Nginx to the upstream
proxy_connect_timeout | nginx.ingress.kubernetes.io/proxy-connect-timeout | 5           | Time allowed to establish the connection to the upstream pod
keepalive_timeout     | ConfigMap key keep-alive                          | 75          | How long idle client keep-alive connections persist
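Picking the annotation values is mostly arithmetic on the longest period the upstream is expected to stay silent. The helper below is a sketch of one way to derive them; the function name, the 20% headroom, and the 10-second connect timeout are assumptions chosen for illustration, not recommendations from the ingress-nginx project.

```python
import math

ANNOTATION_PREFIX = "nginx.ingress.kubernetes.io/"


def timeout_annotations(max_silent_seconds: int, connect_seconds: int = 10,
                        headroom_percent: int = 20) -> dict:
    """Build timeout annotations from the longest expected upstream silence.

    Annotation values must be strings, so the numbers are stringified.
    """
    if max_silent_seconds <= 0:
        raise ValueError("max_silent_seconds must be positive")
    read_send = math.ceil(max_silent_seconds * (100 + headroom_percent) / 100)
    return {
        ANNOTATION_PREFIX + "proxy-read-timeout": str(read_send),
        ANNOTATION_PREFIX + "proxy-send-timeout": str(read_send),
        ANNOTATION_PREFIX + "proxy-connect-timeout": str(connect_seconds),
    }
```

With a job that can stay silent for up to 250 seconds, this yields "300" for the read and send timeouts.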

Basic Fix — Ingress Annotation Patch

Apply timeout annotations directly to the Ingress resource for the affected route.

 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
   name: api-ingress
   namespace: production
   annotations:
     kubernetes.io/ingress.class: "nginx"
+    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
+    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
+    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
+    nginx.ingress.kubernetes.io/proxy-next-upstream-timeout: "300"
 spec:
   rules:
   - host: api.example.com
     http:
       paths:
       - path: /api/v1/batch-process
         pathType: Prefix
         backend:
           service:
             name: batch-api-service
             port:
               number: 8080

⚠️ Scope your annotations tightly. Annotations apply to every path in the Ingress resource they are set on, so keep long timeouts on an Ingress that contains only the routes that need them. Do not set 300s on a catch-all Ingress, or you will mask hung connections on every other route.


Enterprise Best Practice — Scoped Ingress + ConfigMap + Upstream Alignment

Step 1: Split your long-running routes into a dedicated Ingress resource.

This prevents a single proxy-read-timeout: 300 annotation from bleeding into your fast API paths.

-# Old: single Ingress covering all paths
 apiVersion: networking.k8s.io/v1
 kind: Ingress
 metadata:
   name: api-ingress-batch
   namespace: production
   annotations:
     kubernetes.io/ingress.class: "nginx"
-    # No timeout overrides — inherits 60s default
+    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
+    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
+    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
+    nginx.ingress.kubernetes.io/proxy-buffering: "off"
+    nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
 spec:
   ingressClassName: nginx
   rules:
   - host: api.example.com
     http:
       paths:
       - path: /api/v1/batch-process
         pathType: Exact
         backend:
           service:
             name: batch-api-service
             port:
               number: 8080

Step 2: Align the Nginx Ingress Controller ConfigMap.

The per-Ingress annotations override ConfigMap values for matched routes, but the ConfigMap governs global keep-alive behavior that affects all connections:

 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: ingress-nginx-controller
   namespace: ingress-nginx
 data:
   use-forwarded-headers: "true"
   compute-full-forwarded-for: "true"
+  keep-alive: "75"
+  keep-alive-requests: "10000"
+  upstream-keepalive-connections: "200"
+  upstream-keepalive-timeout: "300"
+  upstream-keepalive-requests: "10000"
-  # proxy-read-timeout: "60"   <-- this was the global kill switch
+  proxy-read-timeout: "60"     # Keep global default conservative
+  proxy-send-timeout: "60"     # Long-running routes override via annotations
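The precedence described in this step (annotation first, then ConfigMap, then the controller default) can be written down as a small lookup. This is an illustrative model of the rule, not the controller's code; the function name effective_setting is hypothetical and the defaults are the ones listed in the root-cause table.

```python
ANNOTATION_PREFIX = "nginx.ingress.kubernetes.io/"

# Controller defaults in seconds, as listed in the root-cause table.
BUILTIN_DEFAULTS = {
    "proxy-read-timeout": "60",
    "proxy-send-timeout": "60",
    "proxy-connect-timeout": "5",
}


def effective_setting(key: str, ingress_annotations: dict,
                      configmap_data: dict) -> str:
    """Resolve a timeout the way this step describes the override order."""
    annotated = ingress_annotations.get(ANNOTATION_PREFIX + key)
    if annotated is not None:
        return annotated                      # per-Ingress override wins
    return configmap_data.get(key, BUILTIN_DEFAULTS[key])
```

For the batch Ingress from Step 1 the read timeout resolves to "300", while any Ingress without the annotation falls back to the ConfigMap's "60".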

Step 3: Set terminationGracePeriodSeconds on your upstream Deployment.

If Kubernetes kills the pod mid-request (during a rolling deploy or scale-down), the upstream dies before Nginx times out, and the client sees a different gateway error, typically a 502 Bad Gateway rather than a 504. Align the pod grace period with your maximum expected request duration:

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: batch-api
   namespace: production
 spec:
   template:
     spec:
+      terminationGracePeriodSeconds: 330
       containers:
       - name: batch-api
         image: myrepo/batch-api:v2.1.0
+        lifecycle:
+          preStop:
+            exec:
+              command: ["/bin/sh", "-c", "sleep 5"]

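The preStop sleep delays SIGTERM by 5 seconds, giving the ingress controller time to drop the pod from its upstream list so that no new requests are routed to it. The sleep counts against terminationGracePeriodSeconds, which is why the grace period above is set higher than the 300-second proxy timeout.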
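The 330-second figure can be reproduced as a simple sum. The sketch below assumes the grace period has to cover the preStop sleep, the longest request, and some slack for shutdown; the 25-second buffer is an assumption picked so that the numbers match the manifest above, and both function names are illustrative.

```python
def termination_grace_period(max_request_seconds: int,
                             pre_stop_seconds: int = 5,
                             buffer_seconds: int = 25) -> int:
    """Grace period = preStop sleep + longest request + shutdown slack."""
    return pre_stop_seconds + max_request_seconds + buffer_seconds


def grace_period_covers(grace_seconds: int, proxy_read_timeout: int,
                        pre_stop_seconds: int = 5) -> bool:
    """True if an in-flight request can run up to the proxy timeout
    before the kubelet force-kills the container."""
    return grace_seconds >= pre_stop_seconds + proxy_read_timeout
```

With the Kubernetes default grace period of 30 seconds, grace_period_covers(30, 300) is False, which is the situation this step fixes.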


Prevention in CI/CD

1. OPA/Gatekeeper Policy — Enforce Timeout Annotations on Long-Running Path Ingresses

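Tag Ingress resources that serve async/batch routes with a label, then enforce the annotation contract. The rules below read the AdmissionReview from input.request, as a plain OPA admission webhook does; a Gatekeeper ConstraintTemplate would read input.review.object and emit violation[{"msg": msg}] instead: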

# opa/policies/ingress-timeout-policy.rego
package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Ingress"
  input.request.object.metadata.labels["route-type"] == "long-running"
  not input.request.object.metadata.annotations["nginx.ingress.kubernetes.io/proxy-read-timeout"]
  msg := "Long-running Ingress routes MUST declare nginx.ingress.kubernetes.io/proxy-read-timeout"
}

deny[msg] {
  input.request.kind.kind == "Ingress"
  input.request.object.metadata.labels["route-type"] == "long-running"
  timeout := to_number(input.request.object.metadata.annotations["nginx.ingress.kubernetes.io/proxy-read-timeout"])
  timeout < 120
  msg := sprintf("proxy-read-timeout value %v is below minimum 120s for long-running routes", [timeout])
}

2. Checkov — Static Analysis in Pull Requests

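Add a custom Checkov check to your CI pipeline. CKV_K8S_INGRESS_TIMEOUT is not a built-in check, so the step must also point Checkov at the directory that holds your custom check (the path below is a placeholder):

# .github/workflows/k8s-lint.yml
- name: Run Checkov on Kubernetes manifests
  uses: bridgecrewio/checkov-action@master
  with:
    directory: k8s/
    framework: kubernetes
    external_checks_dirs: checkov/custom_checks   # placeholder path to your custom checks
    check: CKV_K8S_INGRESS_TIMEOUT   # custom check ID
    soft_fail: false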
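The custom check ID in the workflow has to be backed by code you write yourself. As a starting point, here is a sketch of the logic such a check would need, mirroring the two Rego rules from item 1. It is plain standard-library Python operating on a parsed manifest, deliberately not written against Checkov's base classes (take their exact API from the Checkov documentation); the function name check_ingress is hypothetical.

```python
READ_TIMEOUT = "nginx.ingress.kubernetes.io/proxy-read-timeout"


def check_ingress(manifest: dict, minimum_seconds: int = 120) -> list:
    """Return a list of violation messages for one parsed manifest."""
    if manifest.get("kind") != "Ingress":
        return []
    metadata = manifest.get("metadata") or {}
    labels = metadata.get("labels") or {}
    if labels.get("route-type") != "long-running":
        return []  # the contract only applies to labelled Ingresses
    annotations = metadata.get("annotations") or {}
    raw = annotations.get(READ_TIMEOUT)
    if raw is None:
        return [f"{READ_TIMEOUT} is missing"]
    try:
        seconds = int(raw)
    except (TypeError, ValueError):
        return [f"{READ_TIMEOUT} value {raw!r} is not a whole number of seconds"]
    if seconds < minimum_seconds:
        return [f"{READ_TIMEOUT} value {seconds} is below minimum {minimum_seconds}s"]
    return []
```

Unlike the Rego version, this sketch also reports a non-numeric annotation value instead of silently passing it.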

3. Helm Values Schema Validation

If you manage Ingress via Helm, enforce timeout values in values.schema.json:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "ingress": {
      "type": "object",
      "properties": {
        "proxyReadTimeout": {
          "type": "integer",
          "minimum": 60,
          "maximum": 600,
          "description": "Nginx proxy-read-timeout in seconds. Must be set explicitly."
        }
      },
      "required": ["proxyReadTimeout"]
    }
  },
  "required": ["ingress"]
}

4. Synthetic Monitoring Canary

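Deploy a Prometheus Blackbox Exporter probe that fires a synthetic long-running request (e.g., a /healthz/slow?delay=90 endpoint on your service) every 5 minutes. Give the probe a timeout longer than the delay (both the Blackbox module timeout and the Prometheus scrape_timeout), otherwise the probe gives up before the request completes and never sees the status code. Alert if it returns 504: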

# prometheus/rules/ingress-timeout.yml
groups:
- name: ingress.timeout
  rules:
  - alert: LongRunningEndpoint504
    expr: probe_http_status_code{job="blackbox-batch-api"} == 504
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "504 on long-running API endpoint — check Ingress proxy-read-timeout"

This catches timeout regressions the moment a new Ingress manifest is deployed, before real users hit it.
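The canary needs an endpoint that deliberately responds slowly. If your service does not have one, a minimal version might look like the sketch below. It uses only the Python standard library; the /healthz/slow path and delay parameter come from the example above, while the 600-second cap, the handler name, and the serve helper are assumptions for illustration.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse

MAX_DELAY_SECONDS = 600.0  # assumed cap so the endpoint cannot be abused


def parse_delay(query: str) -> float:
    """Return the requested delay in seconds, clamped to [0, MAX_DELAY_SECONDS]."""
    raw = parse_qs(query).get("delay", ["0"])[0]
    try:
        delay = float(raw)
    except ValueError:
        return 0.0
    return max(0.0, min(delay, MAX_DELAY_SECONDS))


class SlowHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/healthz/slow":
            self.send_error(404)
            return
        time.sleep(parse_delay(url.query))  # stay silent, like a long job
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of stderr


def serve(port: int = 8080) -> None:
    """Blocking entry point; call this from your own startup code."""
    ThreadingHTTPServer(("0.0.0.0", port), SlowHealthHandler).serve_forever()
```

Because the handler sends nothing until the sleep finishes, a request with delay=90 exercises the same silent-upstream path that triggers proxy-read-timeout.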
