Fixing Ingress-Nginx Readiness Probe 503 Failures After Deployment Rollout
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: Ingress-nginx pods are returning HTTP 503 on their readiness probe endpoint (
/healthzor/nginx-status), causing Kubernetes to mark themNotReadyand pull them from the load balancer — your rollout is deadlocked or your cluster has zero healthy ingress replicas. - How to fix it: Verify the probe port matches the controller's actual
--health-check-pathandhealthz-portargs; confirm upstream Services and Endpoints exist and are populated before the controller starts; check forexternalTrafficPolicy: Localwith missing node-level endpoints. - Sandbox: Use our Client-Side Sandbox below to auto-refactor your Deployment YAML and surface the exact misconfiguration without sending your config to a third-party server.
The Incident (What Does the Error Mean?)
Raw event output from kubectl describe pod -n ingress-nginx <controller-pod>:
Warning Unhealthy 3s kubelet Readiness probe failed:
HTTP probe failed with statuscode: 503
GET http://10.244.1.47:10254/healthz
Kubernetes hit the readiness probe endpoint and got back 503 Service Unavailable instead of 200 OK. The pod is alive (liveness hasn't fired yet) but Kubernetes will not route traffic to it. If this affects all replicas simultaneously during a rolling update, your ingress layer goes completely dark — every Ingress resource in the cluster stops serving traffic. This is not a graceful degradation. It is a full ingress outage.
The Attack Vector / Blast Radius
This failure cascades fast:
- Rollout deadlock: With
maxUnavailable: 0(default in many hardened configs), the rollout waits indefinitely for the new pod to become Ready. It never does. The old pods are still running but the rollout is frozen — your CI/CD pipeline hangs, alerts fire, and engineers start force-killing things. - Full ingress blackout: If old pods are already terminated (aggressive
terminationGracePeriodSecondsor manual drain), and new pods are stuck in0/1 Ready, zero ingress-nginx pods are in the Service endpoints slice. Every HTTP/HTTPS request to your cluster returns a connection refused or LB 502. - Cascading cert-manager failures: cert-manager HTTP-01 ACME challenges route through ingress-nginx. A 503 readiness failure during certificate renewal silently breaks TLS renewal — you find out 30 days later when certs expire.
- Root causes ranked by frequency:
- Port mismatch between probe definition and
--healthz-portarg (most common after Helm chart version bumps) - Upstream backend not ready — the controller is healthy but a default backend Service has no Endpoints
externalTrafficPolicy: Local— node has no local endpoints, kube-proxy drops the probe- Resource starvation — OOMKilled nginx worker, controller returns 503 on
/healthzwhile restarting - Webhook misconfiguration —
ValidatingWebhookConfigurationrejecting ingress objects, controller enters error state
- Port mismatch between probe definition and
How to Fix It
Basic Fix — Verify and Align Probe Port
The most common cause: the Helm chart or manual YAML has the readiness probe hitting the wrong port after an upgrade changed --healthz-port.
# ingress-nginx controller Deployment
spec:
containers:
- name: controller
args:
- /nginx-ingress-controller
- - --healthz-port=10254
+ - --healthz-port=10254 # confirm this matches probe below
readinessProbe:
httpGet:
path: /healthz
- port: 80
+ port: 10254
initialDelaySeconds: 10
- periodSeconds: 5
+ periodSeconds: 10
failureThreshold: 3
Verify what port the controller is actually binding:
kubectl exec -n ingress-nginx <pod> -- ss -tlnp | grep nginx
# or
kubectl exec -n ingress-nginx <pod> -- curl -sv http://localhost:10254/healthz
If curl returns 200, the probe port is wrong. If it returns 503, the controller itself is unhealthy — check logs:
kubectl logs -n ingress-nginx <pod> --previous
kubectl logs -n ingress-nginx <pod> | grep -E 'error|FATAL|backend'
Enterprise Best Practice — Hardened Probe Config + Default Backend Validation
# values.yaml (Helm ingress-nginx)
controller:
readinessProbe:
httpGet:
path: /healthz
port: 10254
scheme: HTTP
- initialDelaySeconds: 10
+ initialDelaySeconds: 30 # give nginx workers time to fork on large nodes
- periodSeconds: 5
+ periodSeconds: 10
- failureThreshold: 3
+ failureThreshold: 6 # tolerate transient upstream blips during rollout
successThreshold: 1
timeoutSeconds: 5
livenessProbe:
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 10
# Ensure default backend Deployment is healthy BEFORE controller rolls out
defaultBackend:
enabled: true
replicaCount: 2 # was 1 — single replica = SPOF during its own rollout
# Prevent simultaneous replica loss
minReadySeconds: 15
strategy:
rollingUpdate:
- maxUnavailable: 1
+ maxUnavailable: 0
- maxSurge: 1
+ maxSurge: 1
If using externalTrafficPolicy: Local, the readiness probe from kubelet originates from the node IP. If no ingress-nginx pod is scheduled on that node, kube-proxy has no local endpoint and drops the probe. Fix:
# Service for ingress-nginx controller
spec:
- externalTrafficPolicy: Local
+ externalTrafficPolicy: Cluster # unless you specifically need src IP preservation
Or if you must keep Local, ensure podAntiAffinity is NOT preventing scheduling on probe-originating nodes.
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Helm Chart Lint + Conftest Policy
Add an OPA/Conftest policy that enforces readiness probe port matches --healthz-port arg:
# policy/ingress_probe.rego
package ingress_nginx
deny[msg] {
container := input.spec.template.spec.containers[_]
container.name == "controller"
probe_port := container.readinessProbe.httpGet.port
healthz_arg := [a | a := container.args[_]; startswith(a, "--healthz-port=")][0]
expected_port := to_number(split(healthz_arg, "=")[1])
probe_port != expected_port
msg := sprintf("Readiness probe port %v does not match --healthz-port %v", [probe_port, expected_port])
}
# In CI pipeline
helm template ingress-nginx ingress-nginx/ingress-nginx -f values.yaml | \
conftest test - --policy policy/
2. Checkov Static Analysis
checkov -f controller-deployment.yaml \
--check CKV_K8S_8 # readiness probe defined
--check CKV_K8S_9 # liveness probe defined
3. Pre-Rollout Smoke Test in ArgoCD / Flux
Add a PreSync Job or ArgoCD Resource Hook that validates the default backend is healthy before the controller rolls:
apiVersion: batch/v1
kind: Job
metadata:
name: pre-rollout-backend-check
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
template:
spec:
containers:
- name: check
image: curlimages/curl:8.7.1
command:
- sh
- -c
- |
curl -sf http://ingress-nginx-defaultbackend.ingress-nginx.svc.cluster.local:8080/healthz \
|| (echo "Default backend not ready. Blocking rollout."; exit 1)
restartPolicy: Never
4. Monitor with Prometheus Alert
# PrometheusRule
- alert: IngressNginxControllerNotReady
expr: kube_pod_container_status_ready{namespace="ingress-nginx",container="controller"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "All ingress-nginx controller pods are NotReady — cluster ingress is down"