Initializing Enclave...

Fixing Ingress-Nginx Readiness Probe 503 Failures After Deployment Rollout

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins


TL;DR

  • What broke: Ingress-nginx pods are returning HTTP 503 on their readiness probe endpoint (/healthz or /nginx-status), causing Kubernetes to mark them NotReady and pull them from the load balancer — your rollout is deadlocked or your cluster has zero healthy ingress replicas.
  • How to fix it: Verify the probe port matches the controller's actual --health-check-path and healthz-port args; confirm upstream Services and Endpoints exist and are populated before the controller starts; check for externalTrafficPolicy: Local with missing node-level endpoints.
  • Sandbox: Use our Client-Side Sandbox below to auto-refactor your Deployment YAML and surface the exact misconfiguration without sending your config to a third-party server.

The Incident (What Does the Error Mean?)

Raw event output from kubectl describe pod -n ingress-nginx <controller-pod>:

Warning  Unhealthy  3s    kubelet  Readiness probe failed:
HTTP probe failed with statuscode: 503
GET http://10.244.1.47:10254/healthz

Kubernetes hit the readiness probe endpoint and got back 503 Service Unavailable instead of 200 OK. The pod is alive (liveness hasn't fired yet) but Kubernetes will not route traffic to it. If this affects all replicas simultaneously during a rolling update, your ingress layer goes completely dark — every Ingress resource in the cluster stops serving traffic. This is not a graceful degradation. It is a full ingress outage.


The Attack Vector / Blast Radius

This failure cascades fast:

  1. Rollout deadlock: With maxUnavailable: 0 (default in many hardened configs), the rollout waits indefinitely for the new pod to become Ready. It never does. The old pods are still running but the rollout is frozen — your CI/CD pipeline hangs, alerts fire, and engineers start force-killing things.
  2. Full ingress blackout: If old pods are already terminated (aggressive terminationGracePeriodSeconds or manual drain), and new pods are stuck in 0/1 Ready, zero ingress-nginx pods are in the Service endpoints slice. Every HTTP/HTTPS request to your cluster returns a connection refused or LB 502.
  3. Cascading cert-manager failures: cert-manager HTTP-01 ACME challenges route through ingress-nginx. A 503 readiness failure during certificate renewal silently breaks TLS renewal — you find out 30 days later when certs expire.
  4. Root causes ranked by frequency:
    • Port mismatch between probe definition and --healthz-port arg (most common after Helm chart version bumps)
    • Upstream backend not ready — the controller is healthy but a default backend Service has no Endpoints
    • externalTrafficPolicy: Local — node has no local endpoints, kube-proxy drops the probe
    • Resource starvation — OOMKilled nginx worker, controller returns 503 on /healthz while restarting
    • Webhook misconfigurationValidatingWebhookConfiguration rejecting ingress objects, controller enters error state

How to Fix It

Basic Fix — Verify and Align Probe Port

The most common cause: the Helm chart or manual YAML has the readiness probe hitting the wrong port after an upgrade changed --healthz-port.

# ingress-nginx controller Deployment
spec:
  containers:
  - name: controller
    args:
      - /nginx-ingress-controller
-     - --healthz-port=10254
+     - --healthz-port=10254  # confirm this matches probe below
    readinessProbe:
      httpGet:
        path: /healthz
-       port: 80
+       port: 10254
      initialDelaySeconds: 10
-     periodSeconds: 5
+     periodSeconds: 10
      failureThreshold: 3

Verify what port the controller is actually binding:

kubectl exec -n ingress-nginx <pod> -- ss -tlnp | grep nginx
# or
kubectl exec -n ingress-nginx <pod> -- curl -sv http://localhost:10254/healthz

If curl returns 200, the probe port is wrong. If it returns 503, the controller itself is unhealthy — check logs:

kubectl logs -n ingress-nginx <pod> --previous
kubectl logs -n ingress-nginx <pod> | grep -E 'error|FATAL|backend'

Enterprise Best Practice — Hardened Probe Config + Default Backend Validation

# values.yaml (Helm ingress-nginx)
controller:
  readinessProbe:
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
-   initialDelaySeconds: 10
+   initialDelaySeconds: 30   # give nginx workers time to fork on large nodes
-   periodSeconds: 5
+   periodSeconds: 10
-   failureThreshold: 3
+   failureThreshold: 6       # tolerate transient upstream blips during rollout
    successThreshold: 1
    timeoutSeconds: 5

  livenessProbe:
    httpGet:
      path: /healthz
      port: 10254
      scheme: HTTP
    initialDelaySeconds: 30
    periodSeconds: 10

# Ensure default backend Deployment is healthy BEFORE controller rolls out
defaultBackend:
  enabled: true
  replicaCount: 2           # was 1 — single replica = SPOF during its own rollout

# Prevent simultaneous replica loss
  minReadySeconds: 15
  strategy:
    rollingUpdate:
-     maxUnavailable: 1
+     maxUnavailable: 0
-     maxSurge: 1
+     maxSurge: 1

If using externalTrafficPolicy: Local, the readiness probe from kubelet originates from the node IP. If no ingress-nginx pod is scheduled on that node, kube-proxy has no local endpoint and drops the probe. Fix:

# Service for ingress-nginx controller
spec:
- externalTrafficPolicy: Local
+ externalTrafficPolicy: Cluster   # unless you specifically need src IP preservation

Or if you must keep Local, ensure podAntiAffinity is NOT preventing scheduling on probe-originating nodes.


💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Helm Chart Lint + Conftest Policy

Add an OPA/Conftest policy that enforces readiness probe port matches --healthz-port arg:

# policy/ingress_probe.rego
package ingress_nginx

deny[msg] {
  container := input.spec.template.spec.containers[_]
  container.name == "controller"
  probe_port := container.readinessProbe.httpGet.port
  healthz_arg := [a | a := container.args[_]; startswith(a, "--healthz-port=")][0]
  expected_port := to_number(split(healthz_arg, "=")[1])
  probe_port != expected_port
  msg := sprintf("Readiness probe port %v does not match --healthz-port %v", [probe_port, expected_port])
}
# In CI pipeline
helm template ingress-nginx ingress-nginx/ingress-nginx -f values.yaml | \
  conftest test - --policy policy/

2. Checkov Static Analysis

checkov -f controller-deployment.yaml \
  --check CKV_K8S_8  # readiness probe defined
  --check CKV_K8S_9  # liveness probe defined

3. Pre-Rollout Smoke Test in ArgoCD / Flux

Add a PreSync Job or ArgoCD Resource Hook that validates the default backend is healthy before the controller rolls:

apiVersion: batch/v1
kind: Job
metadata:
  name: pre-rollout-backend-check
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
      - name: check
        image: curlimages/curl:8.7.1
        command:
          - sh
          - -c
          - |
            curl -sf http://ingress-nginx-defaultbackend.ingress-nginx.svc.cluster.local:8080/healthz \
              || (echo "Default backend not ready. Blocking rollout."; exit 1)
      restartPolicy: Never

4. Monitor with Prometheus Alert

# PrometheusRule
- alert: IngressNginxControllerNotReady
  expr: kube_pod_container_status_ready{namespace="ingress-nginx",container="controller"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "All ingress-nginx controller pods are NotReady — cluster ingress is down"

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →