How to Fix Kubernetes CronJob 'Exceeds Quota' Error: Namespace ResourceQuota Troubleshooting Guide
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: The namespace ResourceQuota has no remaining headroom (CPU, memory, pod count, or object count). The Job controller cannot create the pod, so the run stalls behind FailedCreate events, with no alert unless you are watching events.
- How to fix it: Raise the quota on the namespace, reduce the CronJob's resource requests, or clean up finished Jobs that still hold object-count quota slots.
- Shortcut: Use our Client-Side Sandbox above to paste your CronJob YAML + kubectl describe quota output and auto-generate the refactored manifests.
The Incident (What Does the Error Mean?)
Raw event from kubectl describe job <job-name> -n <namespace> or kubectl get events -n <namespace> (the event is recorded on the Job that the CronJob created):
Warning FailedCreate Job/cronjob-report-1712345678
Error creating: pods "cronjob-report-1712345678-x9kzp" is forbidden:
exceeds quota: namespace-quota, requested: requests.cpu=500m,
used: requests.cpu=9800m, limited: requests.cpu=10000m
Immediate consequence: The CronJob schedule fires on time and the Job object is created, but the Pod creation is rejected by the API server's ResourceQuota admission plugin. The Job sits at 0 active pods while the Job controller retries with backoff and emits further FailedCreate events. If quota does not free up, your batch workload never runs, and the next cron tick only creates another Job that hits the same wall.
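The rejection in the event above is plain arithmetic: the request is denied when used plus requested exceeds the hard limit. A minimal sketch of that check follows, using the numbers from the event. The helper names are hypothetical, and the parser only handles plain and milli CPU quantities.

```python
from fractions import Fraction


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m", "2" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Fraction(quantity) * 1000)


def exceeds_quota(requested: str, used: str, hard: str) -> bool:
    """Return True when admitting `requested` would push usage past `hard`."""
    total = cpu_to_millicores(used) + cpu_to_millicores(requested)
    return total > cpu_to_millicores(hard)


# Numbers from the FailedCreate event: 9800m + 500m = 10300m > 10000m.
print(exceeds_quota("500m", "9800m", "10000m"))  # True
# A 100m request would still fit: 9800m + 100m = 9900m.
print(exceeds_quota("100m", "9800m", "10000m"))  # False
```

The same comparison applies per quota dimension (requests.memory, pods, and so on); a pod is rejected if any single dimension would be exceeded.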
The Attack Vector / Blast Radius
This is not just a scheduling nuisance — it is a silent data pipeline failure:
- Cascading missed jobs: If the quota isn't resolved before the next cron tick, startingDeadlineSeconds kicks in. Per the Kubernetes CronJob documentation, once more than 100 schedules are counted as missed, the controller stops starting Jobs for that CronJob and logs an error, and a manual intervention is needed to reset it.
- No built-in alerting: Kubernetes does not fire a default alert for quota-blocked CronJobs. Without kube-state-metrics + Alertmanager rules on kube_resourcequota and kube_cronjob_status_active, this is invisible.
- Quota starvation from zombie jobs: Finished Job objects count against count/jobs.batch, and their Succeeded or Failed pods count against count/pods, until they are garbage collected. (The pods and requests.* quotas only track pods in a non-terminal state.) A missing ttlSecondsAfterFinished or oversized successfulJobsHistoryLimit / failedJobsHistoryLimit values bleed object-count quota over time.
- Multi-tenant blast radius: In shared namespaces, one runaway Deployment or HPA scale-out can starve all CronJobs in the namespace simultaneously.
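Which finished pods still hold quota depends on the quota resource. According to the Kubernetes ResourceQuota documentation, pods and requests.* only track pods in a non-terminal state, while the object-count resource count/pods tracks every pod object that still exists. The sketch below models that distinction; the function name and the sample phases are illustrative assumptions.

```python
TERMINAL_PHASES = {"Succeeded", "Failed"}


def pod_quota_usage(pod_phases):
    """Return usage for the "pods" and "count/pods" quota resources."""
    # "pods" ignores pods that have reached a terminal phase.
    non_terminal = sum(1 for phase in pod_phases if phase not in TERMINAL_PHASES)
    # "count/pods" counts every stored pod object, whatever its phase.
    return {"pods": non_terminal, "count/pods": len(pod_phases)}


# Hypothetical namespace: three live pods and three leftover Job pods.
phases = ["Running", "Running", "Pending", "Succeeded", "Succeeded", "Failed"]
print(pod_quota_usage(phases))  # {'pods': 3, 'count/pods': 6}
```

So leftover Job pods only starve new runs when the namespace quota includes object-count resources such as count/pods or count/jobs.batch.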
How to Fix It
Step 1 — Diagnose the Exact Quota Dimension
# See current usage vs hard limits
kubectl describe quota -n <namespace>
# See which pods hold the largest requests (one array entry per container)
kubectl get pods -n <namespace> -o json | \
  jq '.items[] | {name: .metadata.name, cpu: [.spec.containers[].resources.requests.cpu], mem: [.spec.containers[].resources.requests.memory]}'
# List finished pods; they count against count/pods, not against pods or requests.*
kubectl get pods -n <namespace> --field-selector=status.phase=Succeeded
kubectl get pods -n <namespace> --field-selector=status.phase=Failed
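To rank pods by what they actually hold against requests.cpu, the JSON from kubectl get pods -o json can be summed per pod. The following is a rough sketch under stated assumptions: it skips terminal pods, ignores init containers and pod overhead, and runs on an inline sample with hypothetical pod names instead of live cluster output.

```python
import json
from fractions import Fraction


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Fraction(quantity) * 1000)


def cpu_requests_by_pod(pod_list_json: str) -> dict:
    """Sum container CPU requests per non-terminal pod, largest first."""
    totals = {}
    for pod in json.loads(pod_list_json)["items"]:
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            continue  # terminal pods do not count against requests.cpu
        total = 0
        for container in pod["spec"]["containers"]:
            cpu = container.get("resources", {}).get("requests", {}).get("cpu")
            if cpu:
                total += cpu_to_millicores(cpu)
        totals[pod["metadata"]["name"]] = total
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical sample shaped like `kubectl get pods -o json` output.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "api-7d9f"}, "status": {"phase": "Running"},
     "spec": {"containers": [{"resources": {"requests": {"cpu": "2"}}},
                             {"resources": {"requests": {"cpu": "250m"}}}]}},
    {"metadata": {"name": "report-old-abc12"}, "status": {"phase": "Succeeded"},
     "spec": {"containers": [{"resources": {"requests": {"cpu": "500m"}}}]}},
    {"metadata": {"name": "worker-5c8b"}, "status": {"phase": "Pending"},
     "spec": {"containers": [{"resources": {}}]}},
]})

print(cpu_requests_by_pod(SAMPLE))  # {'api-7d9f': 2250, 'worker-5c8b': 0}
```

In practice the input would come from piping the kubectl output into the script; the totals should roughly add up to the used value reported by kubectl describe quota.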
Basic Fix — Clean Up + Reduce Requests
# Delete Jobs that finished with one successful pod (their pods are removed with them)
kubectl delete jobs --field-selector status.successful=1 -n <namespace>
# Patch the ResourceQuota if you own the namespace
kubectl patch resourcequota namespace-quota -n <namespace> \
  --patch '{"spec":{"hard":{"requests.cpu":"20000m"}}}'
Enterprise Best Practice — Fix the CronJob Manifest
The root cause is almost always missing or oversized resource requests. Right-size them using VPA recommendations or kubectl top pods.
  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: cronjob-report
    namespace: production
  spec:
    schedule: "0 2 * * *"
+   startingDeadlineSeconds: 300
+   concurrencyPolicy: Forbid
+   successfulJobsHistoryLimit: 1
+   failedJobsHistoryLimit: 1
    jobTemplate:
      spec:
+       ttlSecondsAfterFinished: 600
        template:
          spec:
            containers:
            - name: report-runner
              image: company/report:1.4.2
              resources:
-               requests:
-                 cpu: "500m"
-                 memory: "512Mi"
-               limits:
-                 cpu: "2000m"
-                 memory: "2Gi"
+               requests:
+                 cpu: "100m"
+                 memory: "128Mi"
+               limits:
+                 cpu: "500m"
+                 memory: "256Mi"
Why each change matters:
- ttlSecondsAfterFinished: 600 deletes finished Jobs and their pods 10 minutes after completion, which frees object-count quota (count/jobs.batch, count/pods) automatically.
- successfulJobsHistoryLimit: 1 and failedJobsHistoryLimit: 1 prevent accumulation of dead Job objects.
- concurrencyPolicy: Forbid prevents overlapping runs from double-consuming quota.
- startingDeadlineSeconds: 300 limits the window for late starts and keeps the missed-schedule count below the 100-schedule lockout.
- Right-sized requests: This quota is enforced on requests, not limits, so oversized requests are a common cause of premature quota exhaustion.
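The effect of startingDeadlineSeconds on the 100-missed-schedule rule can be illustrated with a simplified model of the counting rule described in the Kubernetes CronJob documentation: without a deadline, misses are counted since the last scheduled time; with a deadline, only misses inside the deadline window are counted. The sketch assumes a fixed-interval schedule instead of real cron parsing, and the function name is hypothetical.

```python
def missed_schedules(last_schedule_s, now_s, interval_s, starting_deadline_s=None):
    """Count schedule times in (window_start, now] for a fixed-interval schedule.

    Times are seconds on a shared clock; schedule times are
    last_schedule_s + k * interval_s for k = 1, 2, ...
    """
    window_start = last_schedule_s
    if starting_deadline_s is not None:
        # Only look back as far as the starting deadline allows.
        window_start = max(window_start, now_s - starting_deadline_s)
    newest = (now_s - last_schedule_s) // interval_s
    oldest = (window_start - last_schedule_s) // interval_s + 1
    return max(0, newest - oldest + 1)


# Every-minute schedule blocked for 3 hours, no deadline: past the 100 limit.
print(missed_schedules(0, 3 * 3600, 60))  # 180
# Same outage with startingDeadlineSeconds: 300, only the last 5 minutes count.
print(missed_schedules(0, 3 * 3600, 60, starting_deadline_s=300))  # 5
```

For the daily schedule in the manifest above, a 300-second window can contain at most one schedule time, so the count stays far below 100.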
Enterprise Best Practice — Namespace Quota with Burst Headroom
  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: namespace-quota
    namespace: production
  spec:
    hard:
-     requests.cpu: "10000m"
-     requests.memory: "20Gi"
-     pods: "20"
+     requests.cpu: "20000m"
+     requests.memory: "40Gi"
+     pods: "50"
+     count/jobs.batch: "10"
+     count/cronjobs.batch: "20"
Pair this with a LimitRange to enforce default requests on any container that omits them:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:
      cpu: "200m"
      memory: "256Mi"
    defaultRequest:
      cpu: "50m"
      memory: "64Mi"
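What a container ends up with under this LimitRange can be modeled in a few lines. This is an illustrative sketch, not the admission plugin's code: it follows the behavior described in the Kubernetes LimitRange documentation, where a missing request falls back to the container's own limit if one is set and to defaultRequest otherwise, and a missing limit falls back to default. The function and variable names are assumptions.

```python
import copy

# Values copied from the LimitRange manifest above.
LIMIT_RANGE = {
    "default": {"cpu": "200m", "memory": "256Mi"},
    "defaultRequest": {"cpu": "50m", "memory": "64Mi"},
}


def apply_defaults(container: dict, limit_range: dict = LIMIT_RANGE) -> dict:
    """Return a copy of `container` with missing requests and limits filled in."""
    result = copy.deepcopy(container)
    resources = result.setdefault("resources", {})
    requests = resources.setdefault("requests", {})
    limits = resources.setdefault("limits", {})
    explicit_limits = dict(limits)  # limits the author actually wrote
    for name, value in limit_range["defaultRequest"].items():
        if name not in requests:
            # An explicit limit without a request becomes the request.
            requests[name] = explicit_limits.get(name, value)
    for name, value in limit_range["default"].items():
        limits.setdefault(name, value)
    return result


bare = apply_defaults({"name": "sidecar"})
print(bare["resources"]["requests"])  # {'cpu': '50m', 'memory': '64Mi'}
print(bare["resources"]["limits"])    # {'cpu': '200m', 'memory': '256Mi'}
```

With these defaults, every container that omits its resources still consumes 50m of requests.cpu quota, which is worth including when sizing the namespace quota.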
Prevention in CI/CD
1. Conftest / OPA — Block Missing Resource Requests at PR Time
# policy/cronjob_resources.rego
package main

import rego.v1

# Every container in the CronJob's pod template must declare a CPU request.
deny contains msg if {
	input.kind == "CronJob"
	container := input.spec.jobTemplate.spec.template.spec.containers[_]
	not container.resources.requests.cpu
	msg := sprintf("CronJob '%v': container '%v' is missing resources.requests.cpu", [input.metadata.name, container.name])
}

# ttlSecondsAfterFinished belongs to the Job template spec, not the CronJob spec.
deny contains msg if {
	input.kind == "CronJob"
	not input.spec.jobTemplate.spec.ttlSecondsAfterFinished
	msg := sprintf("CronJob '%v': ttlSecondsAfterFinished must be set to prevent quota leakage", [input.metadata.name])
}
# In your CI pipeline
conftest test cronjob.yaml --policy policy/
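For teams that want the same two checks in a pre-commit hook without OPA, the policy can be mirrored in plain Python. This is a rough equivalent, not a replacement for Conftest: the function name is hypothetical, and the manifest is passed in as an already-parsed dict so the sketch needs no YAML library.

```python
def cronjob_violations(manifest: dict) -> list:
    """Return policy violation messages for a parsed CronJob manifest."""
    if manifest.get("kind") != "CronJob":
        return []
    name = manifest.get("metadata", {}).get("name", "<unnamed>")
    job_spec = manifest.get("spec", {}).get("jobTemplate", {}).get("spec", {})
    containers = job_spec.get("template", {}).get("spec", {}).get("containers", [])
    violations = []
    for container in containers:
        cpu = container.get("resources", {}).get("requests", {}).get("cpu")
        if not cpu:
            violations.append(
                f"CronJob '{name}': container '{container.get('name')}' "
                "is missing resources.requests.cpu"
            )
    # The TTL is a field of the Job template spec (spec.jobTemplate.spec).
    if "ttlSecondsAfterFinished" not in job_spec:
        violations.append(
            f"CronJob '{name}': ttlSecondsAfterFinished must be set "
            "to prevent quota leakage"
        )
    return violations


# Hypothetical manifest that violates both rules.
bad = {
    "kind": "CronJob",
    "metadata": {"name": "cronjob-report"},
    "spec": {"jobTemplate": {"spec": {"template": {"spec": {
        "containers": [{"name": "report-runner"}]}}}}},
}
for message in cronjob_violations(bad):
    print(message)
```

A CI step could load each manifest, call the function, and fail the build when the returned list is non-empty.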
2. Checkov — Static Analysis
checkov -f cronjob.yaml --check CKV_K8S_10,CKV_K8S_11,CKV_K8S_12,CKV_K8S_13
# CKV_K8S_10: CPU requests set
# CKV_K8S_11: CPU limits set
# CKV_K8S_12: Memory requests set
# CKV_K8S_13: Memory limits set
3. Alertmanager Rule — Fire Before Quota Hits 100%
- alert: NamespaceQuotaCPUUsageHigh
  # ignoring(type) is required: the "used" and "hard" series differ only in the type label.
  expr: |
    kube_resourcequota{resource="requests.cpu", type="used"}
      / ignoring(type)
    kube_resourcequota{resource="requests.cpu", type="hard"} > 0.85
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Namespace {{ $labels.namespace }} CPU quota above 85%"
    description: "CronJobs will start failing at 100%. Used: {{ $value | humanizePercentage }}"
4. Kyverno Policy — Enforce ttlSecondsAfterFinished
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-job-ttl
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-ttl
    match:
      resources:
        kinds: [CronJob]
    validate:
      message: "CronJob must set spec.jobTemplate.spec.ttlSecondsAfterFinished"
      pattern:
        spec:
          jobTemplate:
            spec:
              ttlSecondsAfterFinished: "?*"