How to Fix a Stuck Kubernetes Rollout: Liveness Probe Killing Pods During Deployment Progressing
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: New pods in the rolling update are being killed by the liveness probe before the application finishes booting, so
minReadySecondsandavailableReplicasnever advance past 1. - How to fix it: Increase
initialDelaySecondson the liveness probe to exceed your application's actual startup time, and setfailureThresholdhigh enough to survive transient boot delays. - Fast path: Use our Client-Side Sandbox above to paste your Deployment YAML and auto-generate the corrected probe configuration without sending your config off-machine.
The Incident
Your kubectl rollout status is hanging:
Waiting for deployment "api-server" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "api-server" rollout to finish: 1 of 3 updated replicas are available...
And kubectl describe pod <new-pod> shows:
Warning Unhealthy 12s kubelet Liveness probe failed: HTTP probe failed with statuscode: 000
Warning Killing 10s kubelet Container api-server failed liveness probe, will be restarted
Immediate consequence: The rolling update's maxUnavailable budget is exhausted. Kubernetes will not terminate old pods because new pods never reach Available. Traffic continues hitting stale pods — or worse, if old pods were already partially terminated, you have reduced capacity serving production.
The Attack Vector / Blast Radius
This is a cascading availability failure, not a one-pod problem.
The kill chain:
- Rollout spawns a new pod. App starts booting (JVM warmup, DB connection pool init, cache hydration — pick your poison).
- Liveness probe fires at
initialDelaySeconds(e.g., 5s). App isn't ready. Probe returns non-2xx or times out. - After
failureThresholdconsecutive failures, kubelet kills and restarts the container. - The pod never exits the
Running/Not Readystate long enough to be counted asAvailable. - The Deployment controller sees
availableReplicas < desiredReplicas - maxUnavailableand freezes the rollout. - If your
progressDeadlineSeconds(default: 600s) expires, the Deployment is markedFalse/ProgressDeadlineExceeded— but by then you've already been paged.
Blast radius scales with your replica count and maxSurge config. On a 20-replica Deployment with maxSurge: 5, you can have 5 CrashLooping pods simultaneously hammering your database with reconnection storms while serving zero traffic.
How to Fix It
Basic Fix: Correct initialDelaySeconds and failureThreshold
First, measure your actual startup time:
# Watch pod events in real time
kubectl get events --field-selector involvedObject.name=<pod-name> --watch
# Or time the readiness gate directly
kubectl get pod <pod-name> -w
Then patch the probe:
livenessProbe:
httpGet:
path: /healthz
port: 8080
- initialDelaySeconds: 5
- periodSeconds: 5
- failureThreshold: 3
+ initialDelaySeconds: 45
+ periodSeconds: 10
+ failureThreshold: 6
timeoutSeconds: 3
Rule of thumb: initialDelaySeconds = (p99 cold-start time) + 15s buffer. Never guess. Measure.
Enterprise Best Practice: Separate Startup, Liveness, and Readiness Probes
Using a single liveness probe for both startup detection and runtime health is the root cause of 80% of these incidents. Kubernetes 1.16+ provides startupProbe explicitly for this.
containers:
- name: api-server
image: myapp:2.1.0
+ startupProbe:
+ httpGet:
+ path: /healthz/startup
+ port: 8080
+ failureThreshold: 30 # 30 * 10s = 5 min max startup window
+ periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz/live
port: 8080
- initialDelaySeconds: 5
- periodSeconds: 5
- failureThreshold: 3
+ initialDelaySeconds: 0 # startupProbe gates this; no delay needed
+ periodSeconds: 15
+ failureThreshold: 3
timeoutSeconds: 3
readinessProbe:
httpGet:
- path: /healthz
+ path: /healthz/ready # Separate endpoint: checks DB, cache, dependencies
port: 8080
- initialDelaySeconds: 10
+ initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 3
Why this matters: startupProbe disables the liveness probe until it succeeds. Your app gets the full failureThreshold * periodSeconds window to boot. After that, liveness takes over with tight thresholds for runtime deadlock detection. Readiness independently gates traffic. These are three different failure modes and must be three different probes.
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing Deployment YAML into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored probe configuration using your own API key.
Prevention in CI/CD
1. OPA/Gatekeeper Policy — Block deployments missing startupProbe
package kubernetes.liveness
violation[{"msg": msg}] {
container := input.review.object.spec.template.spec.containers[_]
container.livenessProbe
not container.startupProbe
msg := sprintf("Container '%v' has livenessProbe but no startupProbe. This will cause rollout failures on slow-starting apps.", [container.name])
}
2. Checkov — Scan YAML in your PR pipeline
checkov -f deployment.yaml --check CKV_K8S_8 # liveness probe check
checkov -f deployment.yaml --check CKV_K8S_9 # readiness probe check
3. kubectl rollout timeout gate in your CD pipeline
# In your GitHub Actions / ArgoCD post-sync hook
kubectl rollout status deployment/api-server --timeout=5m || {
echo "Rollout failed. Triggering automatic rollback."
kubectl rollout undo deployment/api-server
exit 1
}
4. Enforce progressDeadlineSeconds explicitly
Do not rely on the 600s default. Set it to match your SLO:
spec:
replicas: 3
+ progressDeadlineSeconds: 180
strategy:
type: RollingUpdate
This ensures your alerting fires before the outage drags on for 10 minutes silently.