How to Fix Kubernetes Job Pod 'Back-off Restarting Failed Container' with Completed Status but Failed Exit Code

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins


TL;DR

  • What broke: Your Kubernetes Job pod is reporting Back-off restarting failed container — the container exited with a non-zero code, but kubectl get pod shows Completed or Running status, masking the real failure from your monitoring stack.
  • How to fix it: Audit the exit code via kubectl describe pod, correct restartPolicy to Never or OnFailure, set a sane backoffLimit, and fix the underlying container entrypoint or resource starvation causing the non-zero exit.

The Incident (What Does This Error Mean?)

Raw signal from kubectl describe pod <job-pod>:

Warning  BackOff    3m    kubelet  Back-off restarting failed container
State:   Waiting
  Reason: CrashLoopBackOff
Last State: Terminated
  Reason:    Error
  Exit Code: 1
  Started:   Mon, 10 Jun 2024 03:12:44 +0000
  Finished:  Mon, 10 Jun 2024 03:12:45 +0000

The kubelet restarts the container with exponential back-off (10s → 20s → 40s → … capped at 5 minutes) because the container process exited non-zero. The misleading status is the real trap: while the container crash-loops, the pod's .status.phase stays Running, and a pod only reaches Succeeded (shown as Completed) when every container's final exit code is 0, for example after a retry passes or a wrapper script swallows the error. Tooling that reads .status.phase instead of .status.containerStatuses[].state.terminated.exitCode and .lastState.terminated.exitCode therefore misses the failure. Your job silently failed. Downstream consumers got no data. No alert fired.
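To make the back-off arithmetic concrete, here is a minimal sketch that prints the delay before each restart, assuming the 10-second base, doubling, and 5-minute cap described above. The function name backoff_schedule is hypothetical, and the real kubelet also resets the timer once a container has run cleanly for a while, which this sketch ignores.

```shell
# Print the restart delay (in seconds) before each of the first N restarts.
# Assumption: 10s base delay, doubled each time, capped at 300s.
backoff_schedule() {
  n=$1
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$n" ]; do
    # Append the current delay, space-separated.
    out="${out}${out:+ }${delay}"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}
```

Calling backoff_schedule 7 prints 10 20 40 80 160 300 300, which shows the cap is reached on the sixth restart.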

Exit code cheat sheet:

Exit Code   Meaning
1           Generic application error / unhandled exception
2           Misuse of shell built-ins
126         Command found but not executable (permission denied)
127         Command not found (bad entrypoint/image)
137         SIGKILL (128 + 9); most often OOMKilled after breaching the memory limit
139         SIGSEGV (128 + 11); segmentation fault
143         SIGTERM (128 + 15); the container was asked to stop (pod deletion, eviction, deadline)
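Codes above 128 follow the shell convention of 128 plus the signal number, which is why 137, 139, and 143 map to signals 9, 11, and 15. The sketch below applies that rule; explain_exit_code is a hypothetical helper, not a kubectl feature, and an application can still choose to exit with one of these values on its own.

```shell
# Classify a container exit code using the 128 + signal convention.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # Only the signals from the cheat sheet are named here.
    case "$sig" in
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="unknown" ;;
    esac
    echo "killed by signal $sig ($name)"
  else
    echo "application exit $code"
  fi
}
```

For example, explain_exit_code 137 prints killed by signal 9 (SIGKILL).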

The Attack Vector / Blast Radius

This is not just a nuisance — it is a silent data pipeline failure with compounding blast radius:

  1. Ghost completions: kubectl get jobs shows 1/1 completions. Your SRE team marks the incident resolved. The ETL never ran. Database is stale.
  2. Runaway cost from backoff loops: With restartPolicy: Never and the default backoffLimit: 6, the Job controller can create up to 7 pods (the first attempt plus 6 retries) before it marks the Job failed. Each attempt pays the full pod startup cost, including pulling images, mounting volumes, and acquiring node resources.
  3. restartPolicy: Always does not belong on a Job: The API server rejects a Job whose pod template uses restartPolicy: Always (only Never and OnFailure are accepted), so a Deployment template copy-pasted into a Job spec fails at apply time and the Job never runs. The same template applied as a bare Pod restarts indefinitely, is never marked failed, and keeps consuming node CPU and memory until the pod is evicted or the namespace hits its resource quota.
  4. OOMKilled (exit 137) cascade: A memory-starved job pod getting OOMKilled repeatedly can trigger node-level memory pressure, causing the kubelet to evict other pods on the same node — including your stateful workloads.
  5. Missed activeDeadlineSeconds: Without a deadline, a stuck job holds its PVC, ServiceAccount token, and any external locks (DB advisory locks, distributed mutexes) indefinitely.

How to Fix It

Step 1: Get the Real Exit Code

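# Exit code of a container that has terminated for good (don't trust .status.phase)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'

# Exit code of the previous run while the container is waiting in back-off
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'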

# Full forensics
kubectl describe pod <pod-name> | grep -A 10 "Last State"

# Logs from the dead container (not the current one)
kubectl logs <pod-name> --previous
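A container that is waiting in back-off reports its last exit under lastState.terminated rather than state.terminated, so a single lookup can come back empty. The sketch below tries both fields in turn. It assumes a single-container pod (index 0) in the current namespace, and real_exit_code is a hypothetical wrapper name.

```shell
# Print the most relevant exit code for a pod's first container.
# Falls back to lastState when the container is not currently terminated.
real_exit_code() {
  pod=$1
  code=$(kubectl get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')
  if [ -z "$code" ]; then
    code=$(kubectl get pod "$pod" \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
  fi
  # "unknown" means the container has never terminated.
  printf '%s\n' "${code:-unknown}"
}
```

Usage: real_exit_code <pod-name>. For multi-container pods the index would need to be replaced with a filter on the container name.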

Basic Fix — Correct restartPolicy and backoffLimit

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
spec:
- backoffLimit: 6
+ backoffLimit: 2
+ activeDeadlineSeconds: 300
  template:
    spec:
-     restartPolicy: Always
+     restartPolicy: Never
      containers:
      - name: processor
        image: my-org/processor:1.4.2
-       # No resource limits set
+       resources:
+         requests:
+           memory: "256Mi"
+           cpu: "250m"
+         limits:
+           memory: "512Mi"
+           cpu: "500m"

Why restartPolicy: Never over OnFailure? With Never, each retry creates a new pod, giving you a full log history per attempt. With OnFailure, the container is restarted in place inside the same pod, and kubectl logs --previous only returns the one prior attempt, so older attempts are lost. For debugging production failures, Never is non-negotiable.
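Because Never leaves one pod per attempt, the logs of every attempt can be collected in a single pass. The following is a rough sketch, assuming the pods carry the job-name label that the Job controller sets and live in the current namespace; dump_attempt_logs is a hypothetical helper name.

```shell
# Print the logs of every pod created for a Job, oldest attempt first.
dump_attempt_logs() {
  job=$1
  kubectl get pods -l "job-name=$job" \
    --sort-by=.metadata.creationTimestamp -o name |
  while read -r pod; do
    echo "=== $pod ==="
    kubectl logs "$pod"
  done
}
```

Usage: dump_attempt_logs data-processor > attempts.log. Pods already removed by ttlSecondsAfterFinished or manual cleanup will not appear.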

Enterprise Best Practice — Full Hardened Job Spec

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
+ annotations:
+   checkov.io/skip1: "CKV_K8S_28=acknowledged: no seccomp needed for this workload"
spec:
- backoffLimit: 6
+ backoffLimit: 2
+ activeDeadlineSeconds: 300
+ ttlSecondsAfterFinished: 3600
  template:
    spec:
-     restartPolicy: Always
+     restartPolicy: Never
+     automountServiceAccountToken: false
+     securityContext:
+       runAsNonRoot: true
+       runAsUser: 10001
+       seccompProfile:
+         type: RuntimeDefault
      containers:
      - name: processor
        image: my-org/processor:1.4.2
+       imagePullPolicy: IfNotPresent
+       securityContext:
+         allowPrivilegeEscalation: false
+         readOnlyRootFilesystem: true
+         capabilities:
+           drop: ["ALL"]
        resources:
-         # missing
+         requests:
+           memory: "256Mi"
+           cpu: "250m"
+         limits:
+           memory: "512Mi"
+           cpu: "500m"
+       livenessProbe: null  # No liveness probe on Job pods: a failing probe makes the kubelet kill the container mid-run

Critical notes:

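  • Never add livenessProbe to a Job pod. The kubelet kills the container once the probe fails failureThreshold times in a row, which for a probe that never passes happens roughly initialDelaySeconds + failureThreshold × periodSeconds after start. A batch process that is too busy to answer, or that never served the probe endpoint at all, gets killed mid-run with exit code 143 or 137. This is a common reason jobs that "worked in staging" die in production.
  • ttlSecondsAfterFinished deletes the finished Job and its pods after the given delay, which keeps dead objects from piling up in etcd and slowing the API server at scale.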
  • activeDeadlineSeconds is your circuit breaker. Set it to 2× your p99 job runtime.
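The two timing rules in these notes are simple arithmetic, sketched here under stated assumptions: the deadline is twice a p99 runtime you have measured yourself, and the liveness window uses the approximate formula from the note above (exact timing also depends on probe timeouts). Both function names are hypothetical.

```shell
# Suggested activeDeadlineSeconds: twice the p99 runtime, in seconds.
active_deadline_seconds() {
  echo $(( $1 * 2 ))
}

# Approximate seconds after container start at which a liveness probe
# that never passes gets the container killed.
# Args: initialDelaySeconds periodSeconds failureThreshold
liveness_kill_after() {
  echo $(( $1 + $2 * $3 ))
}
```

A p99 runtime of 150 seconds gives active_deadline_seconds 150 = 300, matching the manifest above. With no initial delay, a 10-second period, and a threshold of 3, liveness_kill_after 0 10 3 gives 30, far shorter than most batch runs.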

Prevention in CI/CD

1. Checkov — Block Bad Job Specs at PR Time

# .checkov.yml
check:
  - CKV_K8S_6    # Do not admit root containers
  - CKV_K8S_14   # Image tag should not be latest
  - CKV_K8S_28   # Seccomp profile
  - CKV_K8S_43   # Image should use digest

# Or run a subset of checks ad hoc from the CLI
checkov -f job.yaml --check CKV_K8S_6,CKV_K8S_14

2. OPA/Gatekeeper — Enforce restartPolicy: Never on All Jobs

package kubernetes.jobs

# Gatekeeper exposes the admission request under input.review
# (a plain OPA admission webhook uses input.request instead).
violation[{"msg": msg}] {
  input.review.kind.kind == "Job"
  policy := input.review.object.spec.template.spec.restartPolicy
  policy != "Never"
  msg := sprintf("Job '%v' must use restartPolicy: Never. Got: %v", [input.review.object.metadata.name, policy])
}

3. GitHub Actions — Validate Job Manifests Pre-Deploy

- name: Validate Job Manifests
  run: |
    kubeval --strict jobs/*.yaml
    checkov -d jobs/ --framework kubernetes --soft-fail-on MEDIUM

4. Alerting — Stop Trusting .status.phase

Prometheus alert that catches what kubectl get jobs hides:

- alert: KubernetesJobFailed
  expr: kube_job_status_failed > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Job {{ $labels.job_name }} has failed pods"
    description: "Check exit codes: kubectl get pod -l job-name={{ $labels.job_name }} -o jsonpath='{.items[*].status.containerStatuses[*].state.terminated.exitCode}'"

Do not alert on kube_job_status_succeeded alone. A Job can report 1 succeeded pod and 3 failed pods at the same time when a retry eventually passed before backoffLimit was exhausted.
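The same check can be run by hand against the Job object, whose .status.failed field counts failed pods regardless of whether a later retry succeeded. This is a minimal sketch assuming the Job lives in the current namespace; job_failure_state is a hypothetical helper name.

```shell
# Report whether a Job recorded any failed pods, even if it also succeeded.
job_failure_state() {
  job=$1
  failed=$(kubectl get job "$job" -o jsonpath='{.status.failed}')
  # .status.failed is absent when no pod has failed, so default to 0.
  if [ "${failed:-0}" -gt 0 ]; then
    echo "failed ($failed)"
  else
    echo "clean"
  fi
}
```

Usage: job_failure_state data-processor. A result such as failed (3) on a Job that shows 1/1 completions is the case the alert above is meant to surface.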
