Why does exit code 137 specifically indicate OOMKilled in Kubernetes?

Exit code 137 equals 128 plus signal number 9 (SIGKILL). When the Linux kernel's OOM killer terminates a container for exceeding its cgroup memory limit, it sends SIGKILL. Kubernetes surfaces this as exit code 137 and sets the pod's Last State reason to 'OOMKilled'. Any other SIGKILL source (a manual kill -9, or the kubelet force-killing a container that ignores SIGTERM past its termination grace period) also produces 137, so always cross-reference the 'Reason: OOMKilled' field in kubectl describe to confirm memory is the cause.

Why does OOMKilled happen more aggressively on EKS Fargate than on EC2 node groups?

On EC2-backed node groups, a container is still killed when it crosses its own cgroup limit, but pods with loose or missing limits can draw on spare node memory, and kubelet has some latitude in eviction ordering when the node itself comes under pressure. On Fargate, each pod runs in a dedicated microVM sized from the pod's resource requests, rounded up to the next valid Fargate vCPU/memory combination. There is no shared node memory pool and no swap. A container that crosses limits.memory, or a pod that exhausts the VM's memory, triggers an immediate kernel OOM kill with zero tolerance for burst.

How do I find the right memory limit value without repeatedly OOMKilling in production?

Use kubectl top pod to get current working-set memory, then query CloudWatch Container Insights for the historical peak over the last 7–30 days using the pod_memory_working_set metric (reported in bytes). Set resources.limits.memory to at least 150% of the observed peak. Deploy a VerticalPodAutoscaler in 'Off' (recommendation-only) mode for 48–72 hours under representative load, then read its recommendations via kubectl describe vpa. Never set limits equal to requests on Fargate; always leave headroom for JVM GC overhead, Node.js V8 heap, or glibc malloc arena growth.

Fixing Kubernetes OOMKilled Exit Code 137 CrashLoopBackOff on EKS Fargate Spot Instances

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: Your container exceeded its resources.limits.memory ceiling. The Linux OOM killer sent SIGKILL (signal 9), producing exit code 137. Fargate's per-pod VM boundary made memory overcommit impossible, so the kill was immediate.
How to fix it: Raise resources.limits.memory to match actual working-set usage (get this from kubectl top pod), align requests to avoid Fargate vCPU/memory ratio mismatches, and add a VPA in recommendation mode.

The Incident (What Does the Error Mean?)

Raw output from a live cluster:

$ kubectl get pods -n production
NAME                          READY   STATUS             RESTARTS   AGE
api-deployment-7d9f8b-xk2pq   0/1     CrashLoopBackOff   8          18m

$ kubectl describe pod api-deployment-7d9f8b-xk2pq -n production
...
Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
  Started:   Mon, 10 Jun 2024 03:12:44 +0000
  Finished:  Mon, 10 Jun 2024 03:12:51 +0000
Restart Count: 8
...
Limits:
  memory: 256Mi
Requests:
  memory: 128Mi

Exit code 137 = 128 + 9. Signal 9 is SIGKILL, issued by the kernel OOM killer the moment the container's RSS exceeded the cgroup memory limit. On EKS Fargate, each pod runs in an isolated microVM. There is no node-level memory to borrow. The moment your container's working set crosses limits.memory, the kill is unconditional and instantaneous — no grace period, no swap.
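
The 128 + signal arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of any Kubernetes tooling; the helper name decode_exit_code is hypothetical.

```python
import signal


def decode_exit_code(code: int):
    """Return (signal_number, signal_name) for a signal-terminated process,
    or None when the exit code is 128 or lower (a normal exit status)."""
    if code <= 128:
        return None
    signum = code - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The signal number is not defined on this platform.
        name = "UNKNOWN"
    return signum, name


if __name__ == "__main__":
    # On Linux, 137 decodes to signal 9 (SIGKILL).
    print(decode_exit_code(137))
```

The same helper maps 143 to signal 15 (SIGTERM), which is the code a container reports when it is stopped gracefully rather than killed.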

CrashLoopBackOff is Kubernetes applying exponential back-off (10s → 20s → 40s → … → 5min cap) because the container keeps dying on startup or under load. After 8 restarts, the back-off delay has already reached the 5-minute cap. The service is effectively down.
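
To see how quickly the delay reaches its ceiling, the schedule described above can be sketched in Python. The 10-second base and 300-second cap come from the sequence quoted in this article; the function name crashloop_backoff is hypothetical.

```python
def crashloop_backoff(restarts: int, base: int = 10, cap: int = 300):
    """Return the back-off delay in seconds before each of the first
    `restarts` restarts: doubling from `base`, never exceeding `cap`."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


if __name__ == "__main__":
    # Delays before restarts 1 through 8.
    print(crashloop_backoff(8))  # [10, 20, 40, 80, 160, 300, 300, 300]
```

Under these assumptions the cap is hit at the sixth restart, so every restart after that waits a full five minutes.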

The Attack Vector / Blast Radius

This is not just a single pod dying. Trace the cascade:

Pod restart storm: Each restart re-pulls config, re-establishes DB connections, and re-warms caches. Under load, the pod dies before warm-up completes, causing the next restart to OOMKill even faster.
Fargate Spot eviction amplifier: Spot capacity interruptions can coincide with OOM events. If your Fargate Spot profile has no on-demand fallback, the replacement pod may never schedule.
HPA death spiral: If a Horizontal Pod Autoscaler is attached, it sees low READY pod count and scales up. New pods also OOMKill. You burn Fargate vCPU/memory allocation costs at scale while serving zero traffic.
Fargate memory-to-vCPU ratio constraint: Fargate enforces specific vCPU/memory combinations. A limits.memory: 256Mi with requests.cpu: 1 is an invalid combination — Fargate silently rounds up memory, but your limit stays at 256Mi, so the container gets killed inside a larger-than-requested VM. This is the most common hidden cause of OOMKill on Fargate that engineers miss.
Downstream dependency timeouts: Any service calling this pod accumulates open connections during CrashLoopBackOff. If your upstream has a short connect timeout, you get cascading 503s across your mesh.

The blast radius on a production API with 8 restarts already logged: full service outage for all traffic hitting this deployment.

How to Fix It

Step 1: Get Actual Memory Usage First

# Requires metrics-server or CloudWatch Container Insights enabled
kubectl top pod -n production --sort-by=memory

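# For historical peak usage (CloudWatch Logs Insights query against the
# Container Insights performance log group; confirm field names in your own log events):
fields @timestamp, PodName, pod_memory_working_set
| filter ClusterName = 'your-cluster' and Type = 'Pod'
| stats max(pod_memory_working_set) by PodName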

Set limits.memory to at least 1.5× your observed peak working set. Never guess.
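
The 1.5× rule can be turned into a small calculation. The sketch below takes an observed peak in bytes and returns a limit in Mi; rounding up to a 64Mi step is an assumption added here for tidy manifest values, and the function name is hypothetical.

```python
import math

MIB = 1024 * 1024


def recommended_limit_mib(peak_bytes: int, headroom: float = 1.5,
                          step_mib: int = 64) -> int:
    """Return a memory limit in Mi: peak working set times `headroom`,
    rounded up to the next multiple of `step_mib`."""
    raw_mib = peak_bytes * headroom / MIB
    return math.ceil(raw_mib / step_mib) * step_mib


if __name__ == "__main__":
    # A 400Mi observed peak needs at least 600Mi, rounded up to 640Mi.
    print(f"{recommended_limit_mib(400 * MIB)}Mi")
```

Feed it the maximum from the historical query rather than a single kubectl top reading, since a point-in-time sample can miss the peak.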

Basic Fix — Corrected Resource Block

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: api-deployment
 spec:
   template:
     spec:
       containers:
       - name: api
         image: myrepo/api:v2.1.0
         resources:
           requests:
-            memory: "128Mi"
-            cpu: "250m"
+            memory: "512Mi"
+            cpu: "500m"
           limits:
-            memory: "256Mi"
-            cpu: "500m"
+            memory: "1Gi"
+            cpu: "1000m"

Why requests matter on Fargate: Fargate allocates the microVM based on requests, not limits. If requests.memory is 128Mi, Fargate provisions a 0.25 vCPU / 0.5GB VM tier. Your limit of 256Mi fits, but leaves zero headroom for JVM heap expansion, glibc malloc arenas, or Node.js V8 heap growth under load.

Enterprise Best Practice — VPA + LimitRange + Fargate Profile Alignment

+apiVersion: autoscaling.k8s.io/v1
+kind: VerticalPodAutoscaler
+metadata:
+  name: api-deployment-vpa
+  namespace: production
+spec:
+  targetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: api-deployment
+  updatePolicy:
+    updateMode: "Off"   # Recommendation mode only — apply manually first
+  resourcePolicy:
+    containerPolicies:
+    - containerName: api
+      minAllowed:
+        memory: 256Mi
+      maxAllowed:
+        memory: 4Gi
+      controlledResources: ["memory"]
---
+apiVersion: v1
+kind: LimitRange
+metadata:
+  name: fargate-memory-guardrails
+  namespace: production
+spec:
+  limits:
+  - type: Container
+    default:
+      memory: 1Gi
+    defaultRequest:
+      memory: 512Mi
+    max:
+      memory: 4Gi
+    min:
+      memory: 256Mi

Fargate profile — ensure your memory value is a valid Fargate tier:

 # eksctl cluster config or Terraform aws_eks_fargate_profile
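 # Valid Fargate memory values depend on the vCPU tier (see the combinations below);
 # Fargate also reserves 256MB per pod for Kubernetes components before rounding up
-# requests.memory: 128Mi  ← rounds up to the 0.25 vCPU / 0.5GB VM, limit of 256Mi OOMKills well inside that VM
+# requests.memory: 512Mi  ← with cpu: 500m this sizes a 0.5 vCPU / 1GB VM, and the 1Gi limit no longer caps the container at 256Mi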

Check valid combinations: 0.25 vCPU → 0.5–2GB, 0.5 vCPU → 1–4GB, 1 vCPU → 2–8GB.
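
The rounding behaviour can be approximated in Python. This is a sketch under stated assumptions: the table holds only the three tiers listed above, 1GB is treated as 1024Mi, the 256MB per-pod overhead for Kubernetes components is added before rounding, and the function name fargate_tier is hypothetical. Extend the table from the AWS documentation before relying on it.

```python
# vCPU -> valid memory sizes in GB (only the tiers named in this article).
FARGATE_TIERS = {
    0.25: [0.5, 1, 2],
    0.5: [1, 2, 3, 4],
    1.0: [2, 3, 4, 5, 6, 7, 8],
}

# Assumption: memory Fargate reserves per pod for Kubernetes components.
OVERHEAD_MIB = 256


def fargate_tier(cpu_request: float, memory_request_mib: int):
    """Return the smallest (vCPU, memory_GB) combination in the table that
    satisfies both requests, or None if the table has no match."""
    needed_gb = (memory_request_mib + OVERHEAD_MIB) / 1024
    for vcpu in sorted(FARGATE_TIERS):
        if vcpu < cpu_request:
            continue
        for mem_gb in FARGATE_TIERS[vcpu]:
            if mem_gb >= needed_gb:
                return vcpu, mem_gb
    return None


if __name__ == "__main__":
    print(fargate_tier(0.25, 128))  # original requests from the incident
    print(fargate_tier(0.5, 512))   # corrected requests
    print(fargate_tier(1.0, 256))   # small memory request paired with a full vCPU
```

Comparing the returned VM size with limits.memory shows whether the limit, not the VM, is the binding constraint, which is the mismatch described in the blast-radius section.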


Prevention in CI/CD

1. Checkov Policy — Block Deployments Without Memory Limits

# .checkov.yml
check:
  - CKV_K8S_10   # CPU requests set
  - CKV_K8S_11   # CPU limits set
  - CKV_K8S_12   # Memory requests set
  - CKV_K8S_13   # Memory limits set

Run in your pipeline:

checkov -d ./k8s-manifests --framework kubernetes --check CKV_K8S_11,CKV_K8S_13

2. OPA Gatekeeper ConstraintTemplate — Enforce Fargate-Valid Memory Ratios

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: fargatememoryratiocheck
spec:
  crd:
    spec:
      names:
        kind: FargateMemoryRatioCheck
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package fargatememoryratiocheck
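      violation[{"msg": msg}] {
        # Matches Pod objects; Deployments nest containers under spec.template.spec
        container := input.review.object.spec.containers[_]
        limit := container.resources.limits.memory
        request := container.resources.requests.memory
        # Limit must not exceed 4x request to prevent Fargate tier mismatch.
        # units.parse_bytes converts quantities such as "512Mi" or "1Gi" to bytes;
        # to_number cannot parse strings that carry a unit suffix.
        units.parse_bytes(limit) > units.parse_bytes(request) * 4
        msg := sprintf("Container %v: memory limit (%v) exceeds 4x request (%v). Fargate tier mismatch risk.", [container.name, limit, request])
      }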
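
Before the policy reaches a cluster, the same 4x ratio check can be run locally against values pulled from a manifest. The following Python sketch mirrors the rule's logic only; it handles plain byte counts and the Ki, Mi and Gi suffixes rather than the full Kubernetes quantity grammar, and both function names are hypothetical.

```python
import re

# Binary suffixes handled by this sketch (not the full quantity grammar).
_UNITS = {"": 1, "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1Gi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)?", quantity.strip())
    if not match:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2) or ""]


def exceeds_ratio(limit: str, request: str, max_ratio: int = 4) -> bool:
    """True when the limit is more than `max_ratio` times the request."""
    return parse_memory(limit) > parse_memory(request) * max_ratio


if __name__ == "__main__":
    print(exceeds_ratio("1Gi", "128Mi"))  # 1024Mi vs 4 * 128Mi
    print(exceeds_ratio("1Gi", "512Mi"))  # 1024Mi vs 4 * 512Mi
```

A limit of exactly four times the request passes, matching the strict greater-than comparison in the Rego rule.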

3. GitHub Actions Gate

- name: Validate K8s Manifests
  uses: instrumenta/kubeval-action@master
  with:
    files: k8s/

- name: Checkov Scan
  uses: bridgecrewio/checkov-action@master
  with:
    directory: k8s/
    framework: kubernetes
    soft_fail: false

4. CloudWatch Alarm — Alert Before OOMKill Hits

# Add --dimensions matching how the metric is published in your account
# (for example ClusterName, Namespace, PodName); without them the alarm matches no data.
aws cloudwatch put-metric-alarm \
  --alarm-name "EKS-Fargate-MemoryPressure-api" \
  --metric-name pod_memory_working_set \
  --namespace ContainerInsights \
  --statistic Average \
  --period 60 \
  --threshold 858993459 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:ACCOUNT:ops-alerts
# Threshold = 858993459 bytes, about 819Mi (80% of the 1Gi limit), so you are paged before the kernel kills

The rule: If your p95 memory usage is within 20% of limits.memory, you are one traffic spike away from this incident repeating.
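
The alarm threshold and the 20% rule are both simple percentages of the limit. A minimal Python sketch, using integer arithmetic so the byte values match the alarm exactly; the function names are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def alarm_threshold_bytes(limit_bytes: int, percent: int = 80) -> int:
    """Return the alarm threshold as `percent` of the memory limit, in bytes."""
    return limit_bytes * percent // 100


def too_close_to_limit(p95_bytes: int, limit_bytes: int,
                       margin_percent: int = 20) -> bool:
    """True when p95 usage sits within `margin_percent` of the limit."""
    return p95_bytes >= limit_bytes * (100 - margin_percent) // 100


if __name__ == "__main__":
    print(alarm_threshold_bytes(GIB))          # 858993459
    print(too_close_to_limit(900 * MIB, GIB))  # p95 of 900Mi against a 1Gi limit
    print(too_close_to_limit(512 * MIB, GIB))  # p95 of 512Mi against a 1Gi limit
```

Recompute the threshold whenever limits.memory changes, otherwise the alarm keeps guarding the old ceiling.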