
How to Fix Kubernetes CronJob 'Exceeds Quota' Error: Namespace ResourceQuota Troubleshooting Guide

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: The namespace ResourceQuota has no remaining headroom (CPU, memory, pod count, or object count). The Job controller cannot create the pod, so the run stalls. The only signal is a FailedCreate event, and nothing alerts you unless you are watching events.
  • How to fix it: Either raise the quota on the namespace, reduce the CronJob's resource requests, or clean up completed/failed jobs consuming quota slots.

The Incident (What Does the Error Mean?)

Raw event from kubectl describe job <job-name> -n <namespace> or kubectl get events -n <namespace> (the event is recorded on the Job that the CronJob created, not on the CronJob itself):

Warning  FailedCreate  Job/cronjob-report-1712345678
Error creating: pods "cronjob-report-1712345678-x9kzp" is forbidden: 
exceeds quota: namespace-quota, requested: requests.cpu=500m, 
used: requests.cpu=9800m, limited: requests.cpu=10000m

Immediate consequence: The CronJob schedule fires on time and the Job object is created, but the ResourceQuota admission plugin in the API server rejects the Pod creation request. The Job stays at 0 active pods while the Job controller retries pod creation with backoff. These creation failures do not count toward backoffLimit, so unless activeDeadlineSeconds is set the Job stalls rather than fails, until quota frees up or the Job is deleted. Your batch workload did not run at its scheduled time.
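To make the numbers in the event concrete, the sketch below parses a message of the shape shown above and reports how far over quota the request is. It is a minimal illustration: the function names are hypothetical, it assumes the message names a single CPU resource, and it only handles plain ("2", "0.5") and millicore ("500m") CPU quantities.

```python
import re


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_quota_event(message: str) -> dict:
    """Extract the requested/used/limited values from an 'exceeds quota' message."""
    fields = {}
    for key in ("requested", "used", "limited"):
        match = re.search(rf"{key}:\s*([\w./-]+)=([\w.]+)", message)
        if match is None:
            raise ValueError(f"no '{key}' field found in message")
        fields["resource"] = match.group(1)
        fields[key] = match.group(2)
    return fields


def cpu_shortfall_millicores(message: str) -> int:
    """Millicores by which used + requested exceeds the hard limit."""
    f = parse_quota_event(message)
    return (
        cpu_to_millicores(f["used"])
        + cpu_to_millicores(f["requested"])
        - cpu_to_millicores(f["limited"])
    )


# The event text from above, joined onto one line.
event = (
    'pods "cronjob-report-1712345678-x9kzp" is forbidden: '
    "exceeds quota: namespace-quota, requested: requests.cpu=500m, "
    "used: requests.cpu=9800m, limited: requests.cpu=10000m"
)
print(cpu_shortfall_millicores(event))  # 300
```

Here 9800m used plus 500m requested is 300m over the 10000m limit, so the pod is admitted only once at least 300m of requests is freed or the limit is raised.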


The Attack Vector / Blast Radius

This is not just a scheduling nuisance — it is a silent data pipeline failure:

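  • Cascading missed jobs: With the default concurrencyPolicy (Allow), each later cron tick creates another Job that stalls the same way. If Job creation itself is blocked (for example, by a count/jobs.batch quota), ticks count as missed. Per the Kubernetes documentation, once more than 100 schedules have been missed the CronJob controller stops starting Jobs and logs an error. Setting startingDeadlineSeconds narrows the window over which missed schedules are counted.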
  • No built-in alerting: Kubernetes does not emit a metric or fire a default alert for quota-blocked CronJobs. Without kube-state-metrics + Alertmanager rules on kube_resourcequota and kube_cronjob_status_active, this is invisible.
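  • Quota starvation from finished Jobs: Finished Jobs count against count/jobs.batch, and their pods against count/pods, until they are deleted. Pods in a terminal phase (Succeeded or Failed) no longer count against pods or requests.* quota. Without ttlSecondsAfterFinished or tight history limits (the defaults keep 3 successful and 1 failed Job), object-count quota fills up over time.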
  • Multi-tenant blast radius: In shared namespaces, one runaway Deployment or HPA scale-out can starve all CronJobs in the namespace simultaneously.
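The missed-schedule threshold is easier to reason about with numbers. The sketch below is a simplified model, not the controller's code: it assumes a fixed interval between ticks instead of parsing a cron expression, and the function name is hypothetical. The sample values follow the every-minute scenario used in the Kubernetes CronJob documentation.

```python
from datetime import datetime, timedelta
from typing import Optional


def missed_schedules(
    last_schedule: datetime,
    now: datetime,
    interval: timedelta,
    starting_deadline_seconds: Optional[int] = None,
) -> int:
    """Count ticks missed since last_schedule, assuming a fixed interval.

    When starting_deadline_seconds is set, only the most recent window of
    that many seconds is considered.
    """
    window_start = last_schedule
    if starting_deadline_seconds is not None:
        deadline_start = now - timedelta(seconds=starting_deadline_seconds)
        window_start = max(window_start, deadline_start)
    if now <= window_start:
        return 0
    return (now - window_start) // interval


last = datetime(2024, 1, 1, 8, 29)
now = datetime(2024, 1, 1, 10, 21)
every_minute = timedelta(minutes=1)

print(missed_schedules(last, now, every_minute))       # 112, above the 100 threshold
print(missed_schedules(last, now, every_minute, 200))  # 3, well below it
```

With no deadline the count covers the whole outage and crosses 100. With a 200-second deadline only the last 200 seconds are counted.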

How to Fix It

Step 1 — Diagnose the Exact Quota Dimension

# See current usage vs hard limits
kubectl describe quota -n <namespace>

# See CPU and memory requests per pod (one list entry per container)
kubectl get pods -n <namespace> -o json | \
  jq '.items[] | {name: .metadata.name, cpu: [.spec.containers[].resources.requests.cpu], mem: [.spec.containers[].resources.requests.memory]}'

# See finished pods left behind by old Jobs (they count against count/pods, not pods or requests.*)
kubectl get pods -n <namespace> --field-selector=status.phase=Succeeded
kubectl get pods -n <namespace> --field-selector=status.phase=Failed
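To rank pods by total CPU request rather than reading per-container values by eye, the output of kubectl get pods -o json can be post-processed. The sketch below is one hedged way to do that: the function names are hypothetical, it sums regular containers only (init containers and pod overhead are ignored), and it understands only plain and millicore CPU quantities.

```python
from typing import List, Tuple


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_requests_by_pod(pod_list: dict) -> List[Tuple[str, str, int]]:
    """Return (pod name, phase, total CPU request in millicores), largest first.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    """
    rows = []
    for pod in pod_list.get("items", []):
        total = 0
        for container in pod["spec"].get("containers", []):
            requests = (container.get("resources") or {}).get("requests") or {}
            total += cpu_to_millicores(requests.get("cpu", "0"))
        phase = pod.get("status", {}).get("phase", "Unknown")
        rows.append((pod["metadata"]["name"], phase, total))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample shaped like the kubectl JSON output.
sample = {
    "items": [
        {
            "metadata": {"name": "api-7d9f"},
            "spec": {"containers": [
                {"resources": {"requests": {"cpu": "500m"}}},
                {"resources": {"requests": {"cpu": "1"}}},
            ]},
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"name": "report-x9kzp"},
            "spec": {"containers": [{"resources": {}}]},
            "status": {"phase": "Succeeded"},
        },
    ]
}
for name, phase, millicores in cpu_requests_by_pod(sample):
    print(f"{name}\t{phase}\t{millicores}m")
```

In practice the dictionary would come from json.load on the piped kubectl output instead of the inline sample.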

Basic Fix — Clean Up + Reduce Requests

# Delete successfully completed Jobs; their pods are removed with them
kubectl delete jobs --field-selector status.successful=1 -n <namespace>

# Patch the ResourceQuota if you own the namespace
kubectl patch resourcequota namespace-quota -n <namespace> \
  --patch '{"spec":{"hard":{"requests.cpu":"20000m"}}}'

Enterprise Best Practice — Fix the CronJob Manifest

The root cause is often oversized resource requests. Right-size them from VPA recommendations or from observed usage in kubectl top pods.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-report
  namespace: production
spec:
  schedule: "0 2 * * *"
+  startingDeadlineSeconds: 300
+  concurrencyPolicy: Forbid
+  successfulJobsHistoryLimit: 1
+  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
+      ttlSecondsAfterFinished: 600
      template:
        spec:
          containers:
          - name: report-runner
            image: company/report:1.4.2
            resources:
-             requests:
-               cpu: "500m"
-               memory: "512Mi"
-             limits:
-               cpu: "2000m"
-               memory: "2Gi"
+             requests:
+               cpu: "100m"
+               memory: "128Mi"
+             limits:
+               cpu: "500m"
+               memory: "256Mi"

Why each change matters:

  • ttlSecondsAfterFinished: 600 deletes each finished Job and its pods 10 minutes after completion, which frees object-count quota (count/jobs.batch, count/pods) automatically.
  • successfulJobsHistoryLimit: 1 and failedJobsHistoryLimit: 1 prevent accumulation of finished Job objects.
  • concurrencyPolicy: Forbid prevents overlapping runs from double-consuming quota.
  • startingDeadlineSeconds: 300 limits the window for late starts and for counting missed schedules, which avoids the 100-missed-schedule lockout.
  • Right-sized requests matter because a requests.* quota is charged against requests, not limits. Oversized requests are a common cause of premature quota exhaustion.
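The effect of right-sizing can be checked with simple arithmetic against the quota figures from the event earlier in this guide (10000m hard, 9800m used). The helper below is a hypothetical sketch that works in millicores and models the admission check as used + requested <= hard.

```python
def runs_that_fit(hard_m: int, used_m: int, request_m: int) -> int:
    """Number of additional pods with the given CPU request that fit in the quota."""
    if request_m <= 0:
        raise ValueError("request_m must be positive")
    headroom = max(hard_m - used_m, 0)
    return headroom // request_m


print(runs_that_fit(10000, 9800, 500))  # 0: the original 500m request is rejected
print(runs_that_fit(10000, 9800, 100))  # 2: the right-sized 100m request fits twice
```

The quota and usage are unchanged between the two calls. Only the request size differs, which is why trimming requests can unblock a CronJob without touching the ResourceQuota.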

Enterprise Best Practice — Namespace Quota with Burst Headroom

apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: production
spec:
  hard:
-   requests.cpu: "10000m"
-   requests.memory: "20Gi"
-   pods: "20"
+   requests.cpu: "20000m"
+   requests.memory: "40Gi"
+   pods: "50"
+   count/jobs.batch: "10"
+   count/cronjobs.batch: "20"

Pair this with a LimitRange to enforce default requests on any container that omits them:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:
      cpu: "200m"
      memory: "256Mi"
    defaultRequest:
      cpu: "50m"
      memory: "64Mi"
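How these defaults land on a container depends on what the container already specifies. The sketch below is a simplified model of the documented defaulting behaviour, using the values from the LimitRange above: a container with neither field gets both defaults, a container with only a limit gets a request equal to that limit, and a container with only a request gets the default limit. The function name is hypothetical, and LimitRange min/max validation is not modelled.

```python
import copy

# Values taken from the LimitRange manifest above.
DEFAULT_LIMITS = {"cpu": "200m", "memory": "256Mi"}
DEFAULT_REQUESTS = {"cpu": "50m", "memory": "64Mi"}


def apply_limit_range_defaults(container: dict) -> dict:
    """Return a copy of the container spec with LimitRange-style defaults filled in."""
    result = copy.deepcopy(container)
    resources = result.setdefault("resources", {})
    requests = resources.setdefault("requests", {})
    limits = resources.setdefault("limits", {})
    for name in ("cpu", "memory"):
        had_own_limit = name in limits
        if not had_own_limit:
            limits[name] = DEFAULT_LIMITS[name]
        if name not in requests:
            # A container that sets only a limit gets request == limit,
            # not the LimitRange defaultRequest.
            requests[name] = limits[name] if had_own_limit else DEFAULT_REQUESTS[name]
    return result


bare = apply_limit_range_defaults({"name": "report-runner"})
print(bare["resources"]["requests"])  # {'cpu': '50m', 'memory': '64Mi'}

limit_only = apply_limit_range_defaults(
    {"name": "report-runner", "resources": {"limits": {"cpu": "1"}}}
)
print(limit_only["resources"]["requests"]["cpu"])  # 1
```

The second case is the one that surprises people: setting a large limit without a request quietly produces an equally large request, which is what the quota charges.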



Prevention in CI/CD

1. Conftest / OPA — Block Missing Resource Requests at PR Time

# policy/cronjob_resources.rego
package main

deny[msg] {
  input.kind == "CronJob"
  container := input.spec.jobTemplate.spec.template.spec.containers[_]
  not container.resources.requests.cpu
  msg := sprintf("CronJob '%v': container '%v' is missing resources.requests.cpu", [input.metadata.name, container.name])
}

deny[msg] {
  input.kind == "CronJob"
  # ttlSecondsAfterFinished is a Job spec field, so it lives under jobTemplate.spec
  not input.spec.jobTemplate.spec.ttlSecondsAfterFinished
  msg := sprintf("CronJob '%v': spec.jobTemplate.spec.ttlSecondsAfterFinished must be set to prevent quota leakage", [input.metadata.name])
}

# In your CI pipeline
conftest test cronjob.yaml --policy policy/
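Before wiring the policy into CI, it can help to pin down the expected behaviour with a plain-Python mirror of the two rules. The sketch below is not part of Conftest or OPA. The function name is hypothetical, it takes an already-parsed manifest as a dictionary, and it looks for ttlSecondsAfterFinished under spec.jobTemplate.spec, where the Job spec defines it.

```python
from typing import List


def cronjob_violations(manifest: dict) -> List[str]:
    """Return policy violations for a parsed CronJob manifest (empty list if none)."""
    if manifest.get("kind") != "CronJob":
        return []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    job_spec = manifest.get("spec", {}).get("jobTemplate", {}).get("spec", {})
    pod_spec = job_spec.get("template", {}).get("spec", {})

    problems = []
    for container in pod_spec.get("containers", []):
        requests = (container.get("resources") or {}).get("requests") or {}
        if "cpu" not in requests:
            problems.append(
                f"CronJob '{name}': container '{container.get('name')}' "
                "is missing resources.requests.cpu"
            )
    if "ttlSecondsAfterFinished" not in job_spec:
        problems.append(
            f"CronJob '{name}': ttlSecondsAfterFinished must be set "
            "to prevent quota leakage"
        )
    return problems


# Hypothetical manifest that breaks both rules.
bad = {
    "kind": "CronJob",
    "metadata": {"name": "cronjob-report"},
    "spec": {"jobTemplate": {"spec": {"template": {"spec": {
        "containers": [{"name": "report-runner"}]
    }}}}},
}
for line in cronjob_violations(bad):
    print(line)
```

Feeding the same manifests to both this function and conftest is a cheap way to confirm the Rego rules match the intended paths.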

2. Checkov — Static Analysis

checkov -f cronjob.yaml --check CKV_K8S_10,CKV_K8S_11,CKV_K8S_12,CKV_K8S_13
# CKV_K8S_10: CPU requests set
# CKV_K8S_11: CPU limits set
# CKV_K8S_12: Memory requests set
# CKV_K8S_13: Memory limits set

3. Alertmanager Rule — Fire Before Quota Hits 100%

- alert: NamespaceQuotaCPUUsageHigh
  expr: |
    kube_resourcequota{resource="requests.cpu", type="used"}
      / ignoring(type)
    kube_resourcequota{resource="requests.cpu", type="hard"} > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Namespace {{ $labels.namespace }} CPU quota above 85%"
    description: "CronJobs will start failing at 100%. Used: {{ $value | humanizePercentage }}"

4. Kyverno Policy — Enforce ttlSecondsAfterFinished

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-job-ttl
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-ttl
    match:
      resources:
        kinds: [CronJob]
    validate:
      message: "CronJob must set spec.jobTemplate.spec.ttlSecondsAfterFinished"
      pattern:
        spec:
          jobTemplate:
            spec:
              ttlSecondsAfterFinished: "?*"
