Initializing Enclave...

How to Fix a Stuck Kubernetes Rollout: Liveness Probe Killing Pods During Deployment Progressing

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: New pods in the rolling update are being killed by the liveness probe before the application finishes booting, so minReadySeconds and availableReplicas never advance past 1.
  • How to fix it: Increase initialDelaySeconds on the liveness probe to exceed your application's actual startup time, and set failureThreshold high enough to survive transient boot delays.
  • Fast path: Use our Client-Side Sandbox above to paste your Deployment YAML and auto-generate the corrected probe configuration without sending your config off-machine.

The Incident

Your kubectl rollout status is hanging:

Waiting for deployment "api-server" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "api-server" rollout to finish: 1 of 3 updated replicas are available...

And kubectl describe pod <new-pod> shows:

Warning  Unhealthy  12s   kubelet  Liveness probe failed: HTTP probe failed with statuscode: 000
Warning  Killing    10s   kubelet  Container api-server failed liveness probe, will be restarted

Immediate consequence: The rolling update's maxUnavailable budget is exhausted. Kubernetes will not terminate old pods because new pods never reach Available. Traffic continues hitting stale pods — or worse, if old pods were already partially terminated, you have reduced capacity serving production.


The Attack Vector / Blast Radius

This is a cascading availability failure, not a one-pod problem.

The kill chain:

  1. Rollout spawns a new pod. App starts booting (JVM warmup, DB connection pool init, cache hydration — pick your poison).
  2. Liveness probe fires at initialDelaySeconds (e.g., 5s). App isn't ready. Probe returns non-2xx or times out.
  3. After failureThreshold consecutive failures, kubelet kills and restarts the container.
  4. The pod never exits the Running/Not Ready state long enough to be counted as Available.
  5. The Deployment controller sees availableReplicas < desiredReplicas - maxUnavailable and freezes the rollout.
  6. If your progressDeadlineSeconds (default: 600s) expires, the Deployment is marked False/ProgressDeadlineExceeded — but by then you've already been paged.

Blast radius scales with your replica count and maxSurge config. On a 20-replica Deployment with maxSurge: 5, you can have 5 CrashLooping pods simultaneously hammering your database with reconnection storms while serving zero traffic.


How to Fix It

Basic Fix: Correct initialDelaySeconds and failureThreshold

First, measure your actual startup time:

# Watch pod events in real time
kubectl get events --field-selector involvedObject.name=<pod-name> --watch

# Or time the readiness gate directly
kubectl get pod <pod-name> -w

Then patch the probe:

       livenessProbe:
         httpGet:
           path: /healthz
           port: 8080
-        initialDelaySeconds: 5
-        periodSeconds: 5
-        failureThreshold: 3
+        initialDelaySeconds: 45
+        periodSeconds: 10
+        failureThreshold: 6
         timeoutSeconds: 3

Rule of thumb: initialDelaySeconds = (p99 cold-start time) + 15s buffer. Never guess. Measure.


Enterprise Best Practice: Separate Startup, Liveness, and Readiness Probes

Using a single liveness probe for both startup detection and runtime health is the root cause of 80% of these incidents. Kubernetes 1.16+ provides startupProbe explicitly for this.

     containers:
     - name: api-server
       image: myapp:2.1.0
+      startupProbe:
+        httpGet:
+          path: /healthz/startup
+          port: 8080
+        failureThreshold: 30      # 30 * 10s = 5 min max startup window
+        periodSeconds: 10
       livenessProbe:
         httpGet:
           path: /healthz/live
           port: 8080
-        initialDelaySeconds: 5
-        periodSeconds: 5
-        failureThreshold: 3
+        initialDelaySeconds: 0    # startupProbe gates this; no delay needed
+        periodSeconds: 15
+        failureThreshold: 3
         timeoutSeconds: 3
       readinessProbe:
         httpGet:
-          path: /healthz
+          path: /healthz/ready    # Separate endpoint: checks DB, cache, dependencies
           port: 8080
-        initialDelaySeconds: 10
+        initialDelaySeconds: 0
         periodSeconds: 5
         failureThreshold: 3

Why this matters: startupProbe disables the liveness probe until it succeeds. Your app gets the full failureThreshold * periodSeconds window to boot. After that, liveness takes over with tight thresholds for runtime deadlock detection. Readiness independently gates traffic. These are three different failure modes and must be three different probes.

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing Deployment YAML into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored probe configuration using your own API key.


Prevention in CI/CD

1. OPA/Gatekeeper Policy — Block deployments missing startupProbe

package kubernetes.liveness

violation[{"msg": msg}] {
  container := input.review.object.spec.template.spec.containers[_]
  container.livenessProbe
  not container.startupProbe
  msg := sprintf("Container '%v' has livenessProbe but no startupProbe. This will cause rollout failures on slow-starting apps.", [container.name])
}

2. Checkov — Scan YAML in your PR pipeline

checkov -f deployment.yaml --check CKV_K8S_8  # liveness probe check
checkov -f deployment.yaml --check CKV_K8S_9  # readiness probe check

3. kubectl rollout timeout gate in your CD pipeline

# In your GitHub Actions / ArgoCD post-sync hook
kubectl rollout status deployment/api-server --timeout=5m || {
  echo "Rollout failed. Triggering automatic rollback."
  kubectl rollout undo deployment/api-server
  exit 1
}

4. Enforce progressDeadlineSeconds explicitly

Do not rely on the 600s default. Set it to match your SLO:

 spec:
   replicas: 3
+  progressDeadlineSeconds: 180
   strategy:
     type: RollingUpdate

This ensures your alerting fires before the outage drags on for 10 minutes silently.

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →