Why does my Kubernetes Job pod show 'Completed' status but still have a failed exit code?

This is a status reporting mismatch. The Kubernetes Job controller reads `.status.phase` at the pod level, and tooling built on that field (or on the STATUS column of `kubectl get pod`) can report 'Succeeded' or 'Completed' while a container inside exited non-zero. Always validate with `kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].state.terminated.exitCode}'`, which prints the exit code of every terminated container, not just the first one. Never trust the phase field alone when deciding whether a job succeeded.

What is the difference between restartPolicy: Never and restartPolicy: OnFailure for Kubernetes Jobs?

With `OnFailure`, the kubelet restarts the failed container inside the *same pod*. Only the most recent previous container's logs stay retrievable (via `kubectl logs --previous`), so earlier attempts are lost and forensics get harder. With `Never`, the Job controller creates a *new pod* for each retry attempt, preserving full log history per attempt up to `backoffLimit`. For production jobs where debugging matters, `Never` is the correct choice. `OnFailure` is only appropriate for simple, idempotent jobs where log history is irrelevant.

How do I prevent a Kubernetes Job from running forever due to a misconfigured restartPolicy?

Set two hard limits in your Job spec: `backoffLimit` (e.g., 2–3) to cap retry attempts, and `activeDeadlineSeconds` to enforce a wall-clock timeout regardless of retry state. Example: `activeDeadlineSeconds: 300` terminates the job after 5 minutes no matter what. Also confirm the pod template uses `restartPolicy: Never` or `OnFailure`; `Always` is not valid on a Job and the API server rejects it. Without `activeDeadlineSeconds`, a hung process will consume node resources indefinitely.

How to Fix Kubernetes Job Pod 'Back-off Restarting Failed Container' with Completed Status but Failed Exit Code

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

What broke: Your Kubernetes Job pod is reporting `Back-off restarting failed container`. The container exited with a non-zero code, but `kubectl get pod` shows `Completed` or `Running` status, masking the real failure from your monitoring stack.
How to fix it: Audit the exit code via `kubectl describe pod`, correct `restartPolicy` to `Never` or `OnFailure`, set a sane `backoffLimit`, and fix the underlying container entrypoint or resource starvation causing the non-zero exit.

The Incident (What Does This Error Mean?)

Raw signal from kubectl describe pod <job-pod>:

Warning  BackOff    3m    kubelet  Back-off restarting failed container
State:   Waiting
  Reason: CrashLoopBackOff
Last State: Terminated
  Reason:    Error
  Exit Code: 1
  Started:   Mon, 10 Jun 2024 03:12:44 +0000
  Finished:  Mon, 10 Jun 2024 03:12:45 +0000

The kubelet restarts the container with exponential backoff (10s → 20s → 40s → … → 5min cap) because the container process exited non-zero. The real trap is a status that reads `Completed`: it appears when the Job controller sees the pod phase as Succeeded prematurely, or when your tooling reads `.status.phase` instead of `.status.containerStatuses[].state.terminated.exitCode`. Your job silently failed. Downstream consumers got no data. No alert fired.
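
The backoff sequence described above can be sketched in a few lines. This is only an illustration, assuming the 10-second base delay, doubling, and 5-minute cap stated in the text; the function name `backoff_schedule` is hypothetical and not part of any Kubernetes API.

```python
def backoff_schedule(restarts, base=10, cap=300):
    """Return the delay in seconds before each of the first `restarts` restarts.

    Assumes the delay starts at `base`, doubles after every restart,
    and never exceeds `cap` (the values quoted in the text above).
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


print(backoff_schedule(7))  # prints [10, 20, 40, 80, 160, 300, 300]
```

Summing the list gives a rough idea of how long a crashing container can sit in `CrashLoopBackOff` before anyone notices.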

Exit code cheat sheet:

Exit Code	Meaning
`1`	Generic application error / unhandled exception
`2`	Misuse of shell built-ins
`126`	Permission denied / not executable
`127`	Command not found (bad entrypoint/image)
`137`	SIGKILL (128 + 9); most often OOMKilled after breaching the memory limit
`139`	Segfault (SIGSEGV)
`143`	SIGTERM (128 + 15); the process was asked to stop, e.g. on pod deletion or eviction
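
Codes above 128 in the table follow the common shell convention of 128 plus the signal number. A minimal sketch of decoding them, assuming a Linux host where Python's `signal` module knows the signal names (the helper `describe_exit_code` is a hypothetical name):

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform; report it numerically.
            name = f"signal {signum}"
        return f"killed by {name}"
    return "application or shell error"


for code in (1, 137, 139, 143):
    print(code, describe_exit_code(code))
```

The decoded signal tells you who stopped the process, not why. A 137, for instance, still needs `kubectl describe pod` to confirm whether the reason was `OOMKilled`.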

The Attack Vector / Blast Radius

This is not just a nuisance — it is a silent data pipeline failure with compounding blast radius:

Ghost completions: kubectl get jobs shows 1/1 completions. Your SRE team marks the incident resolved. The ETL never ran. Database is stale.
Runaway cost from backoff loops: Before the job is finally marked failed, Kubernetes has already scheduled and torn down the container once per retry. At `backoffLimit: 6` (the default), you burn the pod startup cost on every one of those retries, including image pulls, volume mounts, and node resource acquisition.
`restartPolicy: Always` does not belong on a Job: If someone copy-pasted a Deployment template into a Job spec, the API server rejects the manifest, because a Job's pod template only accepts `Never` or `OnFailure`. The lasting danger is the process that never exits: the Job controller never marks it failed, and it keeps consuming node CPU/memory until the pod is evicted or the namespace hits its resource quota.
OOMKilled (exit 137) cascade: A memory-starved job pod getting OOMKilled repeatedly can trigger node-level memory pressure, causing the kubelet to evict other pods on the same node — including your stateful workloads.
Missed activeDeadlineSeconds: Without a deadline, a stuck job holds its PVC, ServiceAccount token, and any external locks (DB advisory locks, distributed mutexes) indefinitely.

How to Fix It

Step 1: Get the Real Exit Code

# Get exit code — don't trust .status.phase
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'

# Full forensics
kubectl describe pod <pod-name> | grep -A 10 "Last State"

# Logs from the dead container (not the current one)
kubectl logs <pod-name> --previous
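
The same check can be scripted against the JSON form of the pod. The sketch below assumes the output of `kubectl get pod <pod-name> -o json` has already been loaded into a dict. The helper name `terminated_exit_codes` and the sample pod are hypothetical, while the field names follow the pod status paths used in the commands above.

```python
import json


def terminated_exit_codes(pod):
    """Map container name -> exit code of its most recent termination, if any."""
    codes = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        # A restarted container is Waiting or Running right now, so its
        # failure is recorded under lastState rather than state.
        terminated = (status.get("state", {}).get("terminated")
                      or status.get("lastState", {}).get("terminated"))
        if terminated is not None:
            codes[status["name"]] = terminated["exitCode"]
    return codes


# Hypothetical pod status resembling the describe output shown earlier.
SAMPLE_POD = json.loads("""
{
  "status": {
    "phase": "Running",
    "containerStatuses": [
      {
        "name": "processor",
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}
      }
    ]
  }
}
""")

print(terminated_exit_codes(SAMPLE_POD))  # prints {'processor': 1}
```

In practice the dict would come from `json.load(sys.stdin)` with the `kubectl` output piped in. Any non-zero value in the result means the run failed, whatever the phase says.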

Basic Fix — Correct `restartPolicy` and `backoffLimit`

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
spec:
- backoffLimit: 6
+ backoffLimit: 2
+ activeDeadlineSeconds: 300
  template:
    spec:
-     restartPolicy: Always
+     restartPolicy: Never
      containers:
      - name: processor
        image: my-org/processor:1.4.2
-       # No resource limits set
+       resources:
+         requests:
+           memory: "256Mi"
+           cpu: "250m"
+         limits:
+           memory: "512Mi"
+           cpu: "500m"

Why restartPolicy: Never over OnFailure? With Never, each retry creates a new pod, giving you a full log history per attempt. With OnFailure, the pod is restarted in-place and --previous logs may be overwritten. For debugging production failures, Never is non-negotiable.
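
The changes in the diff above can also be checked mechanically before a manifest is applied. This is a rough sketch rather than a replacement for a real policy tool: it assumes the Job manifest has been parsed into a dict (for example from JSON), and the function name `lint_job` is hypothetical.

```python
def lint_job(job):
    """Return a list of human-readable problems found in a Job manifest dict."""
    problems = []
    spec = job.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})

    policy = pod_spec.get("restartPolicy")
    if policy not in ("Never", "OnFailure"):
        problems.append(f"restartPolicy is {policy!r}; Jobs accept Never or OnFailure")
    if "backoffLimit" not in spec:
        problems.append("backoffLimit not set; the default of 6 applies")
    if "activeDeadlineSeconds" not in spec:
        problems.append("activeDeadlineSeconds not set; no wall-clock timeout")
    for container in pod_spec.get("containers", []):
        if "limits" not in container.get("resources", {}):
            problems.append(f"container {container.get('name')!r} has no resource limits")
    return problems


# The "before" side of the diff above, expressed as a dict.
BEFORE = {
    "spec": {
        "backoffLimit": 6,
        "template": {"spec": {
            "restartPolicy": "Always",
            "containers": [{"name": "processor", "image": "my-org/processor:1.4.2"}],
        }},
    }
}

for problem in lint_job(BEFORE):
    print(problem)
```

Run against the "after" side of the diff, the same function returns an empty list.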

Enterprise Best Practice — Full Hardened Job Spec

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
+ annotations:
+   checkov.io/skip1: "CKV_K8S_28=acknowledged: no seccomp needed for this workload"
spec:
- backoffLimit: 6
+ backoffLimit: 2
+ activeDeadlineSeconds: 300
+ ttlSecondsAfterFinished: 3600
  template:
    spec:
-     restartPolicy: Always
+     restartPolicy: Never
+     automountServiceAccountToken: false
+     securityContext:
+       runAsNonRoot: true
+       runAsUser: 10001
+       seccompProfile:
+         type: RuntimeDefault
      containers:
      - name: processor
        image: my-org/processor:1.4.2
+       imagePullPolicy: IfNotPresent
+       securityContext:
+         allowPrivilegeEscalation: false
+         readOnlyRootFilesystem: true
+         capabilities:
+           drop: ["ALL"]
        resources:
-         # missing
+         requests:
+           memory: "256Mi"
+           cpu: "250m"
+         limits:
+           memory: "512Mi"
+           cpu: "500m"
+       livenessProbe: null  # Never add liveness probes to Job pods — kubelet will kill them mid-run

Critical notes:

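Never add `livenessProbe` to a Job pod. A batch container rarely serves the endpoint a probe expects, so once the probe has failed `failureThreshold` times in a row (roughly `initialDelaySeconds + failureThreshold × periodSeconds` after start), the kubelet kills the container mid-run. This is a common cause of exit code 137 on jobs that "worked in staging."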
ttlSecondsAfterFinished prevents dead pod accumulation that exhausts etcd object count limits at scale.
activeDeadlineSeconds is your circuit breaker. Set it to 2× your p99 job runtime.
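
The 2× p99 rule for `activeDeadlineSeconds` can be worked out from past run times. A minimal sketch, assuming you have a list of historical job durations in seconds; the sample numbers and the function names `p99` and `suggested_deadline` are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of runtimes in seconds."""
    if not samples:
        raise ValueError("need at least one runtime sample")
    ordered = sorted(samples)
    # ceil(0.99 * n) with integer arithmetic to avoid float rounding surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggested_deadline(samples, factor=2):
    """activeDeadlineSeconds candidate: `factor` times the p99 runtime, rounded up."""
    return math.ceil(factor * p99(samples))


# Hypothetical runtimes (seconds) from five earlier runs of the job.
runtimes = [120, 130, 125, 150, 140]
print(suggested_deadline(runtimes))  # prints 300
```

With only a handful of samples the p99 is simply the slowest run, so revisit the value once more history has accumulated.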

Prevention in CI/CD

1. Checkov — Block Bad Job Specs at PR Time

# .checkov.yml
check:
  - CKV_K8S_6    # Do not admit root containers
  - CKV_K8S_14   # Image tag should not be latest
  - CKV_K8S_28   # Seccomp profile
  - CKV_K8S_43   # Image should use digest

checkov -f job.yaml --check CKV_K8S_6,CKV_K8S_14

2. OPA/Gatekeeper — Enforce `restartPolicy: Never` on All Jobs

package kubernetes.jobs

violation[{"msg": msg}] {
  # Gatekeeper exposes the admission request under input.review
  input.review.kind.kind == "Job"
  policy := input.review.object.spec.template.spec.restartPolicy
  policy != "Never"
  msg := sprintf("Job '%v' must use restartPolicy: Never. Got: %v", [input.review.object.metadata.name, policy])
}

3. GitHub Actions — Validate Job Manifests Pre-Deploy

- name: Validate Job Manifests
  run: |
    kubeval --strict jobs/*.yaml
    checkov -d jobs/ --framework kubernetes --soft-fail-on MEDIUM

4. Alerting — Stop Trusting `.status.phase`

Prometheus alert that catches what kubectl get jobs hides:

- alert: KubernetesJobFailed
  expr: kube_job_status_failed > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Job {{ $labels.job_name }} has failed pods"
    description: "Check exit codes: kubectl get pod -l job-name={{ $labels.job_name }} -o jsonpath='{.items[*].status.containerStatuses[*].state.terminated.exitCode}'"

Do not alert on kube_job_status_succeeded alone. A job can show 1 success and 3 failures simultaneously if backoffLimit wasn't exhausted.
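
To make the mixed success/failure case concrete, here is a small sketch that classifies a Job from the `succeeded` and `failed` counters in its `.status`. It assumes the status has been loaded into a dict, for example from `kubectl get job <job-name> -o json`; the function name `classify_job` and its labels are hypothetical.

```python
def classify_job(status):
    """Classify a Job from the pod counters in its .status dict."""
    succeeded = status.get("succeeded", 0)
    failed = status.get("failed", 0)
    if failed and succeeded:
        # Completed eventually, but only after failed attempts worth reviewing.
        return "succeeded-after-failures"
    if failed:
        return "failing"
    if succeeded:
        return "clean-success"
    return "pending"


print(classify_job({"succeeded": 1, "failed": 3}))  # prints succeeded-after-failures
```

A dashboard that only looks at the `succeeded` counter would report the example above as healthy, which is the gap the alert rule closes.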