How to Fix Kubernetes Job Pod 'Back-off Restarting Failed Container' with Completed Status but Failed Exit Code
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: Your Kubernetes Job pod is reporting Back-off restarting failed container. The container exited with a non-zero code, but kubectl get pod shows Completed or Running status, masking the real failure from your monitoring stack.
- How to fix it: Audit the exit code via kubectl describe pod, correct restartPolicy to Never or OnFailure, set a sane backoffLimit, and fix the underlying container entrypoint or resource starvation causing the non-zero exit.
- Fast path: Use our Client-Side Sandbox above to paste your Job manifest and get auto-refactored YAML with the correct restart policy, backoff limit, and resource envelope. No data leaves your browser.
The Incident (What Does This Error Mean?)
Raw signal from kubectl describe pod <job-pod>:
Warning  BackOff  3m  kubelet  Back-off restarting failed container
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Mon, 10 Jun 2024 03:12:44 +0000
  Finished:     Mon, 10 Jun 2024 03:12:45 +0000
The kubelet is restarting the container on an exponential backoff (10s → 20s → 40s → … → 5min cap) because the container process exited non-zero. The status field lying as Completed is the real trap. It happens when the Job controller sees the pod phase as Succeeded prematurely, or when your tooling reads .status.phase instead of the per-container exit code under .status.containerStatuses[] (state.terminated.exitCode once the container has stopped, lastState.terminated.exitCode while it is waiting in CrashLoopBackOff). Your job silently failed. Downstream consumers got no data. No alert fired.
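To make the backoff arithmetic concrete, the sketch below reproduces the delay sequence described above. It assumes a 10-second base that doubles after every failed run and caps at 300 seconds; `backoff_delays` is an illustrative helper, not a Kubernetes API, and real kubelet timing also depends on how long each run lasts.

```python
def backoff_delays(restarts, base=10, cap=300):
    """Return the wait (in seconds) before each of the first `restarts` restarts.

    Assumes the delay doubles after every failed run and is capped at `cap`,
    matching the 10s -> 20s -> 40s -> ... -> 5min sequence described above.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


if __name__ == "__main__":
    delays = backoff_delays(7)
    print(delays)       # [10, 20, 40, 80, 160, 300, 300]
    print(sum(delays))  # 910 seconds spent waiting across 7 restarts
```

Under these assumptions, a container that crashes immediately spends about 15 minutes in backoff across its first seven restarts, which is time your downstream consumers spend without data.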
Exit code cheat sheet:
| Exit Code | Meaning |
|---|---|
| 1 | Generic application error / unhandled exception |
| 2 | Misuse of shell built-ins |
| 126 | Permission denied / not executable |
| 127 | Command not found (bad entrypoint/image) |
| 137 | SIGKILL (128 + 9); usually OOMKilled after breaching the memory limit |
| 139 | Segfault (SIGSEGV, 128 + 11) |
| 143 | SIGTERM (128 + 15), for example during pod deletion or eviction |
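The codes above 128 in the cheat sheet follow the shell convention of 128 plus the signal number. A minimal sketch of that decoding, using Python's standard `signal` module, is shown here; `describe_exit_code` is a hypothetical helper name. Note that 137 only tells you the process received SIGKILL; to confirm an OOM kill, check that the terminated container's reason field reads OOMKilled.

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its name on this platform.
            return f"killed by {signal.Signals(signum).name} (signal {signum})"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"application exited with status {code}"


if __name__ == "__main__":
    for code in (1, 137, 139, 143):
        print(code, "->", describe_exit_code(code))
```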
The Attack Vector / Blast Radius
This is not just a nuisance — it is a silent data pipeline failure with compounding blast radius:
- Ghost completions: kubectl get jobs shows 1/1 completions. Your SRE team marks the incident resolved. The ETL never ran. Database is stale.
- Runaway cost from backoff loops: Before the backoff cap hits 5 minutes, Kubernetes has already scheduled and torn down the container up to backoffLimit times. At backoffLimit: 6 (default), you burned roughly 6× the pod startup cost, including pulling images, mounting volumes, and acquiring node resources.
- restartPolicy: Always on a Job is a red flag: The API server rejects it in a Job template (only Never and OnFailure are valid), so a Deployment template copy-pasted into a Job spec fails at apply time and the Job never runs. The nearest valid choice, OnFailure, restarts the container in place on the same node, so a crash loop keeps consuming that node's CPU and memory until backoffLimit or activeDeadlineSeconds stops it.
- OOMKilled (exit 137) cascade: A memory-starved job pod getting OOMKilled repeatedly can trigger node-level memory pressure, causing the kubelet to evict other pods on the same node, including your stateful workloads.
- Missed activeDeadlineSeconds: Without a deadline, a stuck job holds its PVC, ServiceAccount token, and any external locks (DB advisory locks, distributed mutexes) indefinitely.
How to Fix It
Step 1: Get the Real Exit Code
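# Get the exit code; don't trust .status.phase.
# While the container is Waiting in CrashLoopBackOff, the code is under lastState:
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Once the container has stopped for good (e.g. restartPolicy: Never), it is under state:
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
# Full forensics
kubectl describe pod <pod-name> | grep -A 10 "Last State"
# Logs from the dead container (not the current one)
kubectl logs <pod-name> --previous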
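If you script this check, the same lookup can be done on the parsed output of `kubectl get pod <pod-name> -o json`. The sketch below is one way to do it, assuming the standard pod status layout; `last_exit_codes` is a hypothetical helper and the sample status is trimmed, invented data that mirrors the describe output shown earlier.

```python
import json


def last_exit_codes(pod):
    """Map container name -> exit code of its most recent terminated run.

    `pod` is the parsed output of `kubectl get pod <pod-name> -o json`.
    A container in CrashLoopBackOff is in the `waiting` state, so its exit
    code lives under `lastState.terminated`, not `state.terminated`.
    """
    codes = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = (
            status.get("state", {}).get("terminated")
            or status.get("lastState", {}).get("terminated")
        )
        if terminated is not None:
            codes[status["name"]] = terminated["exitCode"]
    return codes


if __name__ == "__main__":
    # Trimmed, hypothetical pod status for illustration only.
    sample = json.loads("""
    {"status": {"phase": "Running", "containerStatuses": [{
        "name": "processor",
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"exitCode": 1, "reason": "Error"}}
    }]}}
    """)
    print(last_exit_codes(sample))  # {'processor': 1}
```

The sample pod reports phase Running while its only container last exited with code 1, which is exactly the mismatch that hides the failure.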
Basic Fix — Correct restartPolicy and backoffLimit
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: data-processor
 spec:
-  backoffLimit: 6
+  backoffLimit: 2
+  activeDeadlineSeconds: 300
   template:
     spec:
-      restartPolicy: Always
+      restartPolicy: Never
       containers:
       - name: processor
         image: my-org/processor:1.4.2
-        # No resource limits set
+        resources:
+          requests:
+            memory: "256Mi"
+            cpu: "250m"
+          limits:
+            memory: "512Mi"
+            cpu: "500m"
Why restartPolicy: Never over OnFailure? With Never, each retry creates a new pod, giving you a full log history per attempt. With OnFailure, the pod is restarted in-place and --previous logs may be overwritten. For debugging production failures, Never is non-negotiable.
Enterprise Best Practice — Full Hardened Job Spec
 apiVersion: batch/v1
 kind: Job
 metadata:
   name: data-processor
+  annotations:
+    checkov.io/skip1: "CKV_K8S_28=acknowledged: no seccomp needed for this workload"
 spec:
-  backoffLimit: 6
+  backoffLimit: 2
+  activeDeadlineSeconds: 300
+  ttlSecondsAfterFinished: 3600
   template:
     spec:
-      restartPolicy: Always
+      restartPolicy: Never
+      automountServiceAccountToken: false
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 10001
+        seccompProfile:
+          type: RuntimeDefault
       containers:
       - name: processor
         image: my-org/processor:1.4.2
+        imagePullPolicy: IfNotPresent
+        securityContext:
+          allowPrivilegeEscalation: false
+          readOnlyRootFilesystem: true
+          capabilities:
+            drop: ["ALL"]
         resources:
-          # missing
+          requests:
+            memory: "256Mi"
+            cpu: "250m"
+          limits:
+            memory: "512Mi"
+            cpu: "500m"
+        # No livenessProbe: never add liveness probes to Job pods, the kubelet will kill them mid-run
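The fields changed in the spec above can also be checked mechanically before a manifest reaches the cluster. The following is a rough sketch of such a check, operating on a manifest that has already been parsed into a plain dict (for example with a YAML library of your choice); `lint_job` and its warning strings are hypothetical, not part of any existing tool.

```python
def lint_job(job):
    """Return a list of warnings for a Job manifest given as a plain dict."""
    warnings = []
    spec = job.get("spec") or {}
    pod_spec = (spec.get("template") or {}).get("spec") or {}

    if pod_spec.get("restartPolicy") not in ("Never", "OnFailure"):
        warnings.append("restartPolicy must be Never or OnFailure")
    for field in ("backoffLimit", "activeDeadlineSeconds", "ttlSecondsAfterFinished"):
        if field not in spec:
            warnings.append(f"spec.{field} is not set")

    for container in pod_spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        limits = (container.get("resources") or {}).get("limits") or {}
        if "memory" not in limits:
            warnings.append(f"container {name}: no memory limit")
        if container.get("livenessProbe"):
            warnings.append(f"container {name}: livenessProbe on a Job pod")
    return warnings


if __name__ == "__main__":
    # Hypothetical manifest resembling the "before" side of the diff above.
    risky = {
        "kind": "Job",
        "spec": {
            "backoffLimit": 6,
            "template": {"spec": {
                "restartPolicy": "Always",
                "containers": [{"name": "processor", "image": "my-org/processor:1.4.2"}],
            }},
        },
    }
    for warning in lint_job(risky):
        print("WARN:", warning)
```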
Critical notes:
- Never add livenessProbe to a Job pod. If the probe cannot pass (a batch process usually serves no health endpoint), the kubelet kills the container roughly initialDelaySeconds + failureThreshold × periodSeconds after it starts. This is a frequent cause of exit code 137 on jobs that "worked in staging."
- ttlSecondsAfterFinished prevents finished Jobs and their dead pods from accumulating in etcd at scale.
- activeDeadlineSeconds is your circuit breaker. Set it to 2× your p99 job runtime.
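Both numeric rules in these notes are simple arithmetic, sketched below under stated assumptions: the probe never succeeds, and p99 is taken with the nearest-rank method over runtimes you have observed yourself. The function names and the sample runtimes are hypothetical.

```python
import math


def liveness_kill_after(initial_delay, period, failure_threshold):
    """Approximate seconds until the kubelet kills a container whose
    liveness probe never succeeds."""
    return initial_delay + failure_threshold * period


def suggested_deadline(runtimes_seconds, factor=2):
    """Suggest activeDeadlineSeconds as `factor` times the p99 runtime.

    Uses the nearest-rank percentile on the observed runtimes.
    """
    if not runtimes_seconds:
        raise ValueError("need at least one observed runtime")
    ordered = sorted(runtimes_seconds)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based rank
    p99 = ordered[rank - 1]
    return math.ceil(factor * p99)


if __name__ == "__main__":
    # A probe with initialDelaySeconds=30, periodSeconds=10, failureThreshold=3
    print(liveness_kill_after(30, 10, 3))            # 60
    # Hypothetical observed runtimes, in seconds
    print(suggested_deadline([100, 120, 110, 150]))  # 300
```

With only a handful of samples the p99 is simply the slowest run, so treat the result as a starting point and revisit it as you collect more runtimes.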
Prevention in CI/CD
1. Checkov — Block Bad Job Specs at PR Time
# .checkov.yml
check:
  - CKV_K8S_6   # Do not admit root containers
  - CKV_K8S_14  # Image tag should not be latest
  - CKV_K8S_28  # Seccomp profile
  - CKV_K8S_43  # Image should use digest

# Run selected checks against a single manifest
checkov -f job.yaml --check CKV_K8S_6,CKV_K8S_14
2. OPA/Gatekeeper — Enforce restartPolicy: Never on All Jobs
package kubernetes.jobs

# Gatekeeper exposes the admission request under input.review
violation[{"msg": msg}] {
  input.review.kind.kind == "Job"
  policy := input.review.object.spec.template.spec.restartPolicy
  policy != "Never"
  msg := sprintf("Job '%v' must use restartPolicy: Never. Got: %v", [input.review.object.metadata.name, policy])
}
3. GitHub Actions — Validate Job Manifests Pre-Deploy
- name: Validate Job Manifests
  run: |
    kubeval --strict jobs/*.yaml
    checkov -d jobs/ --framework kubernetes --soft-fail-on MEDIUM
4. Alerting — Stop Trusting .status.phase
Prometheus alert that catches what kubectl get jobs hides:
- alert: KubernetesJobFailed
  expr: kube_job_status_failed > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Job {{ $labels.job_name }} has failed pods"
    description: "Check exit codes: kubectl get pod -l job-name={{ $labels.job_name }} -o jsonpath='{.items[*].status.containerStatuses[*].state.terminated.exitCode}'"
Do not alert on kube_job_status_succeeded alone. A job can show 1 success and 3 failures simultaneously if backoffLimit wasn't exhausted.
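The mixed case described above is easy to spot once you look at both counters together. As a small sketch, assuming the `succeeded` and `failed` counts from the `.status` block of `kubectl get job <name> -o json`, a report script could classify a Job like this; `classify_job` and its labels are hypothetical.

```python
def classify_job(status):
    """Classify a Job from the counters in its `.status` block."""
    succeeded = status.get("succeeded", 0)
    failed = status.get("failed", 0)
    if succeeded and failed:
        # Completed, but only after one or more failed attempts.
        return "succeeded-after-failures"
    if failed:
        return "failing"
    if succeeded:
        return "succeeded"
    return "pending"


if __name__ == "__main__":
    print(classify_job({"succeeded": 1, "failed": 3}))  # succeeded-after-failures
    print(classify_job({"succeeded": 1}))               # succeeded
```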