Why does my Kubernetes rollout get stuck at '1 of 3 updated replicas are available' even though the pod shows Running?

A pod in Running state does not mean it's Available. A pod is counted as Available only after it passes its readiness probe and remains healthy for minReadySeconds. If the liveness probe is killing and restarting the container before it passes readiness, the pod oscillates between Running and restarting — never reaching Available. The Deployment controller won't proceed with the rollout because doing so would violate the maxUnavailable budget.
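
To make the Ready versus Available distinction concrete, here is a minimal Python sketch of the rule described above. It is a simplified model, not the controller's actual code: the function name `is_available` and the `ready_history` input format are assumptions made for illustration.

```python
from typing import List, Tuple


def is_available(ready_history: List[Tuple[float, bool]], now: float,
                 min_ready_seconds: float = 0.0) -> bool:
    """Return True if the pod has been continuously Ready for min_ready_seconds.

    ready_history is a hypothetical, time-ordered list of (timestamp, ready)
    samples of the pod's Ready condition, in seconds.
    """
    streak_start = None
    # Walk backwards from the newest sample to find where the current
    # unbroken Ready streak began.
    for timestamp, ready in reversed(ready_history):
        if not ready:
            break
        streak_start = timestamp
    if streak_start is None:
        # Either no samples, or the newest sample is Not Ready.
        return False
    return now - streak_start >= min_ready_seconds
```

A pod whose container keeps getting restarted by the liveness probe never builds a long enough Ready streak, so under this model it stays unavailable no matter how long it has been Running.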

What is the difference between startupProbe, livenessProbe, and readinessProbe in Kubernetes?

startupProbe gates the other two probes — it gives a slow-starting container a fixed window to boot before liveness kicks in. livenessProbe detects runtime failures like deadlocks and restarts the container if it fails. readinessProbe controls whether the pod receives traffic from Services — a failing readiness probe removes the pod from load balancer rotation without restarting it. Using only a livenessProbe with a short initialDelaySeconds conflates all three concerns and causes exactly this rollout failure pattern.

How do I calculate the correct initialDelaySeconds for my liveness probe?

Run your container locally or in a staging environment and time how long it takes from container start until the /healthz endpoint returns 200 under cold conditions: no warm JVM, no pre-populated cache, cold DB connection pool. Repeat this 5–10 times and take the slowest run (with so few samples, a p99 is effectively the maximum). Add a 15–20 second buffer. That's your initialDelaySeconds. If your startup time is variable or exceeds 60 seconds, switch to a startupProbe whose failureThreshold * periodSeconds covers your worst-case startup window instead.
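
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb: the function name `recommend_probe`, the default 15-second buffer, the 60-second cutoff, and the 10-second period are parameters chosen here, not values mandated by Kubernetes. It does not try to detect "variable" startup times; that judgment is left to you.

```python
import math
from typing import Dict, List


def recommend_probe(cold_start_seconds: List[float], buffer_seconds: int = 15,
                    startup_probe_cutoff: int = 60,
                    period_seconds: int = 10) -> Dict[str, object]:
    """Turn measured cold-start times into a suggested probe setting."""
    if not cold_start_seconds:
        raise ValueError("need at least one cold-start measurement")
    # With only a handful of runs, use the slowest one as the worst case.
    worst = max(cold_start_seconds)
    target = math.ceil(worst) + buffer_seconds
    if worst > startup_probe_cutoff:
        # Slow starters: size a startupProbe window instead of a fixed delay.
        return {
            "probe": "startupProbe",
            "periodSeconds": period_seconds,
            "failureThreshold": math.ceil(target / period_seconds),
        }
    return {"probe": "livenessProbe", "initialDelaySeconds": target}
```

For example, five runs of 22.4, 25.1, 28.9, 24.0, and 26.3 seconds give an initialDelaySeconds of 44, while a worst case of 95.5 seconds tips the suggestion over to a startupProbe.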

How to Fix a Stuck Kubernetes Rollout: Liveness Probe Killing Pods While the Deployment Is Progressing

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: New pods in the rolling update are being killed by the liveness probe before the application finishes booting, so they never stay Ready for minReadySeconds and availableReplicas never advances past 1.
How to fix it: Increase initialDelaySeconds on the liveness probe to exceed your application's actual startup time, and set failureThreshold high enough to survive transient boot delays.

The Incident

Your kubectl rollout status is hanging:

Waiting for deployment "api-server" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "api-server" rollout to finish: 1 of 3 updated replicas are available...

And kubectl describe pod <new-pod> shows:

Warning  Unhealthy  12s   kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
Warning  Killing    10s   kubelet  Container api-server failed liveness probe, will be restarted

Immediate consequence: The rolling update's maxUnavailable budget is exhausted. Kubernetes will not terminate old pods because new pods never reach Available. Traffic continues hitting stale pods — or worse, if old pods were already partially terminated, you have reduced capacity serving production.

The Attack Vector / Blast Radius

This is a cascading availability failure, not a one-pod problem.

The kill chain:

1. The rollout spawns a new pod. The app starts booting (JVM warmup, DB connection pool init, cache hydration; pick your poison).
2. The liveness probe starts firing after initialDelaySeconds (e.g., 5s). The app isn't up yet, so the probe is refused, times out, or returns a status outside 200–399.
3. After failureThreshold consecutive failures, the kubelet kills and restarts the container.
4. The pod never stays Ready long enough to be counted as Available.
5. The Deployment controller cannot scale down more old pods without dropping availableReplicas below replicas - maxUnavailable, and maxSurge caps how many new pods it may add, so the rollout stalls.
6. If progressDeadlineSeconds (default: 600s) expires, the Deployment's Progressing condition flips to False with reason ProgressDeadlineExceeded, but by then you've already been paged.
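
The timing in this chain can be estimated with a rough Python sketch. It assumes probes run at initialDelaySeconds and then every periodSeconds, and it ignores timeoutSeconds and kubelet scheduling jitter, so treat the result as an approximation rather than an exact kubelet timeline. The function names are made up for this example.

```python
def seconds_until_first_kill(initial_delay_seconds: int, period_seconds: int,
                             failure_threshold: int) -> int:
    """Approximate time from container start to the first liveness kill."""
    if initial_delay_seconds < 0 or period_seconds < 1 or failure_threshold < 1:
        raise ValueError("invalid probe settings")
    # The first failing probe runs at initial_delay_seconds; each further
    # consecutive failure adds one period.
    return initial_delay_seconds + (failure_threshold - 1) * period_seconds


def survives_boot(startup_seconds: float, initial_delay_seconds: int,
                  period_seconds: int, failure_threshold: int) -> bool:
    """True if the app is healthy by the time the last allowed probe runs."""
    deadline = seconds_until_first_kill(
        initial_delay_seconds, period_seconds, failure_threshold)
    return startup_seconds <= deadline
```

With initialDelaySeconds: 5, periodSeconds: 5, and failureThreshold: 3, the container is killed roughly 15 seconds after it starts, so an app that needs 30 seconds to boot can never survive.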

Blast radius scales with your replica count and maxSurge config. On a 20-replica Deployment with maxSurge: 5, you can have 5 CrashLooping pods simultaneously hammering your database with reconnection storms while serving zero traffic.

How to Fix It

Basic Fix: Correct `initialDelaySeconds` and `failureThreshold`

First, measure your actual startup time:

# Watch pod events in real time
kubectl get events --field-selector involvedObject.name=<pod-name> --watch

# Or watch the pod and time how long it takes to become Ready
kubectl get pod <pod-name> -w

Then patch the probe:

       livenessProbe:
         httpGet:
           path: /healthz
           port: 8080
-        initialDelaySeconds: 5
-        periodSeconds: 5
-        failureThreshold: 3
+        initialDelaySeconds: 45
+        periodSeconds: 10
+        failureThreshold: 6
         timeoutSeconds: 3

Rule of thumb: initialDelaySeconds = (p99 cold-start time) + 15s buffer. Never guess. Measure.

Enterprise Best Practice: Separate Startup, Liveness, and Readiness Probes

Using a single liveness probe for both startup detection and runtime health is a common root cause of these incidents. Kubernetes provides startupProbe explicitly for this (alpha in 1.16, enabled by default since 1.18, stable in 1.20).

     containers:
     - name: api-server
       image: myapp:2.1.0
+      startupProbe:
+        httpGet:
+          path: /healthz/startup
+          port: 8080
+        failureThreshold: 30      # 30 * 10s = 5 min max startup window
+        periodSeconds: 10
       livenessProbe:
         httpGet:
           path: /healthz/live
           port: 8080
-        initialDelaySeconds: 5
-        periodSeconds: 5
-        failureThreshold: 3
+        initialDelaySeconds: 0    # startupProbe gates this; no delay needed
+        periodSeconds: 15
+        failureThreshold: 3
         timeoutSeconds: 3
       readinessProbe:
         httpGet:
-          path: /healthz
+          path: /healthz/ready    # Separate endpoint: checks DB, cache, dependencies
           port: 8080
-        initialDelaySeconds: 10
+        initialDelaySeconds: 0
         periodSeconds: 5
         failureThreshold: 3

Why this matters: startupProbe disables the liveness and readiness probes until it succeeds. Your app gets the full failureThreshold * periodSeconds window to boot. After that, liveness takes over with tight thresholds for runtime deadlock detection, and readiness independently gates traffic. These are three different failure modes and must be three different probes.
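
As a quick way to sanity-check the numbers in a probe block, the window arithmetic can be sketched in Python. The helper name `probe_window_seconds` is hypothetical; the fallback values of 10 and 3 are the Kubernetes defaults for periodSeconds and failureThreshold. The result ignores timeoutSeconds, so it is approximate.

```python
from typing import Any, Dict


def probe_window_seconds(probe: Dict[str, Any]) -> int:
    """Return failureThreshold * periodSeconds for a probe spec dict.

    For a startupProbe this is the maximum boot window; for a livenessProbe
    it is roughly how long a hung container can go undetected.
    """
    period = int(probe.get("periodSeconds", 10))        # Kubernetes default
    threshold = int(probe.get("failureThreshold", 3))   # Kubernetes default
    return period * threshold
```

Applied to the configuration above, the startupProbe (failureThreshold: 30, periodSeconds: 10) yields a 300-second boot window, and the livenessProbe (failureThreshold: 3, periodSeconds: 15) yields about 45 seconds to detect a deadlock at runtime.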

Prevention in CI/CD

1. OPA/Gatekeeper Policy — Block deployments missing `startupProbe`

package kubernetes.liveness

violation[{"msg": msg}] {
  container := input.review.object.spec.template.spec.containers[_]
  container.livenessProbe
  not container.startupProbe
  msg := sprintf("Container '%v' has livenessProbe but no startupProbe. This will cause rollout failures on slow-starting apps.", [container.name])
}
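
If you want the same check in a local script or unit test before the manifest ever reaches the cluster, the rule can be mirrored in Python. This is a sketch that assumes the Deployment has already been parsed into a plain dict (for example by a YAML loader of your choice); `find_probe_violations` is a name invented here, and it complements the Gatekeeper policy rather than replacing it.

```python
from typing import Any, Dict, List


def find_probe_violations(deployment: Dict[str, Any]) -> List[str]:
    """List containers that define a livenessProbe but no startupProbe."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    messages = []
    for container in containers:
        if "livenessProbe" in container and "startupProbe" not in container:
            name = container.get("name", "<unnamed>")
            messages.append(
                f"Container '{name}' has livenessProbe but no startupProbe."
            )
    return messages
```

An empty list means the manifest passes; any returned message can be used to fail the pipeline step.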

2. Checkov — Scan YAML in your PR pipeline

checkov -f deployment.yaml --check CKV_K8S_8  # liveness probe check
checkov -f deployment.yaml --check CKV_K8S_9  # readiness probe check

3. `kubectl rollout` timeout gate in your CD pipeline

# In your GitHub Actions / ArgoCD post-sync hook
kubectl rollout status deployment/api-server --timeout=5m || {
  echo "Rollout failed. Triggering automatic rollback."
  kubectl rollout undo deployment/api-server
  exit 1
}

4. Enforce `progressDeadlineSeconds` explicitly

Do not rely on the 600s default. Set it to match your SLO:

 spec:
   replicas: 3
+  progressDeadlineSeconds: 180
   strategy:
     type: RollingUpdate

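This ensures your alerting fires within 3 minutes instead of letting the outage drag on silently for 10. Keep the deadline longer than the worst-case time a healthy pod needs to become Available (the startupProbe window plus minReadySeconds), or slow but healthy rollouts will be reported as failed. Kubernetes only reports ProgressDeadlineExceeded; it does not roll back on its own, which is why the rollout gate above performs the undo.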
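
One check worth automating when you tighten progressDeadlineSeconds is whether it still leaves room for a legitimately slow start. The Python sketch below works under the assumption that the deadline should exceed the startupProbe window plus minReadySeconds; the function name and the strict comparison are choices made for this example, not Kubernetes behavior.

```python
def progress_deadline_is_safe(progress_deadline_seconds: int,
                              startup_failure_threshold: int,
                              startup_period_seconds: int,
                              min_ready_seconds: int = 0) -> bool:
    """True if the deadline exceeds the worst-case time to become Available."""
    # Longest a healthy pod may legitimately take: the full startupProbe
    # window, then minReadySeconds of continuous readiness.
    worst_case = (startup_failure_threshold * startup_period_seconds
                  + min_ready_seconds)
    return progress_deadline_seconds > worst_case
```

For instance, a 180-second deadline combined with a startupProbe of failureThreshold: 30 and periodSeconds: 10 (a 300-second window) fails this check, which suggests either shortening the startup window or raising the deadline.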