Fixing Karpenter 'No Instance Types Available' Spot Capacity Errors: A Production Debugging Guide
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Karpenter cannot find any EC2 Spot capacity matching your NodePool's instance type constraints in the requested Availability Zone(s), leaving pods in Pending indefinitely.
- How to fix it: Broaden instance family diversity, add multi-AZ spread, and configure a weighted On-Demand fallback so Karpenter has an escape hatch when Spot pools dry up.
- Shortcut: Use our Client-Side Sandbox above to paste your NodePool YAML and controller logs — it will auto-refactor the instance requirements and capacity fallback config without sending your data anywhere.
The Incident (What Does the Error Mean?)
You'll see this in kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter:
ERROR controller.provisioner no instance types available {"commit": "abc1234", "provisioner": "default"}
ERROR controller.nodeclaim failed to launch nodeclaim {"error": "no capacity available for node request"}
And pods sit here forever:
kubectl get pods -A | grep Pending
my-namespace worker-7d9f6 0/1 Pending 0 18m
Immediate consequence: Karpenter's reconciler loop keeps retrying, burning controller CPU, while your workload is completely unscheduled. If this is a job queue or autoscaling event triggered by real traffic, you are dropping requests right now.
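To put a number on "forever", the sketch below filters the output of kubectl get pods -A -o json for pods that have been Pending longer than a threshold. It relies only on the standard Pod fields status.phase and metadata.creationTimestamp. The function name and the 5-minute default are assumptions chosen for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def long_pending_pods(pod_list_json, now, threshold=timedelta(minutes=5)):
    """Return (namespace, name, age) for Pending pods older than `threshold`.

    `pod_list_json` is the text printed by `kubectl get pods -A -o json`.
    `now` must be a timezone-aware datetime.
    """
    stuck = []
    for pod in json.loads(pod_list_json).get("items", []):
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        # Kubernetes serializes timestamps as RFC 3339 in UTC, e.g. 2024-01-01T12:00:00Z
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        age = now - created
        if age > threshold:
            stuck.append((pod["metadata"]["namespace"], pod["metadata"]["name"], age))
    return stuck
```

Pod age is only a proxy for time spent Pending, since a pod can be Pending for other reasons (image pulls, unbound volumes), so treat the result as a starting point for kubectl describe pod rather than as proof of a capacity problem.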
The Attack Vector / Blast Radius
Spot capacity is pooled per instance type per Availability Zone, not per Region. AWS can reclaim capacity from a pool, or simply never offer it, at any moment. When your NodePool is too narrow (e.g., pinned to m5.xlarge only in us-east-1b), it depends on exactly one pool, and you've created a single point of failure against AWS's own capacity scheduler.
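A rough way to see the difference is to count the Spot pools a NodePool can draw from. The sketch below models each (instance type, AZ) pair as one independent pool. The helper name and the broadened instance list are illustrative assumptions, not Karpenter internals.

```python
from itertools import product


def spot_pools(instance_types, zones):
    """Model each (instance type, AZ) pair as one independent Spot pool."""
    return sorted(product(instance_types, zones))


# The pinned configuration from the example above: exactly one pool.
pinned = spot_pools(["m5.xlarge"], ["us-east-1b"])

# A hypothetical broadened configuration: three types across three zones.
diverse = spot_pools(
    ["m5.xlarge", "c5.xlarge", "r5.xlarge"],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)

print(len(pinned), len(diverse))  # 1 9
```

With one pool, a single capacity shortfall stops provisioning entirely. With nine, all nine would have to be unavailable at the same time.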
Cascading failure chain:
- Spot pool for m5.xlarge in us-east-1b dries up (common during AZ-level demand spikes).
- Karpenter finds zero valid instance types. Emits the error. Does nothing.
- HPA has already scaled your Deployment replicas up. All new pods are Pending.
- Your application's readiness probe starts failing on existing pods under increased load.
- ALB target group health drops. 502s begin.
- If you have a PodDisruptionBudget, Karpenter may also be blocked from consolidating existing nodes, compounding cost AND availability impact simultaneously.
The secondary blast: Engineers start manually launching On-Demand nodes or disabling Karpenter consolidation as a hotfix, leaving orphaned over-provisioned nodes that silently inflate your EC2 bill for weeks.
How to Fix It
Basic Fix — Broaden Instance Diversity
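The minimum viable fix is to stop pinning to a single instance type and a single AZ. Instead, let Karpenter choose across several categories with karpenter.k8s.aws/instance-category, restrict to newer generations with karpenter.k8s.aws/instance-generation, and spread across multiple zones. When both spot and on-demand are allowed, Karpenter prioritizes Spot and uses On-Demand only when no Spot capacity is available.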
  apiVersion: karpenter.sh/v1beta1
  kind: NodePool
  metadata:
    name: default
  spec:
    template:
      spec:
        requirements:
-         - key: node.kubernetes.io/instance-type
-           operator: In
-           values: ["m5.xlarge"]
-         - key: topology.kubernetes.io/zone
-           operator: In
-           values: ["us-east-1b"]
-         # The old spot-only requirement must go too; otherwise it intersects
-         # with the new one and the pool stays spot-only.
-         - key: karpenter.sh/capacity-type
-           operator: In
-           values: ["spot"]
+         - key: karpenter.k8s.aws/instance-category
+           operator: In
+           values: ["m", "c", "r"]
+         - key: karpenter.k8s.aws/instance-generation
+           operator: Gt
+           values: ["4"]
+         - key: topology.kubernetes.io/zone
+           operator: In
+           values: ["us-east-1a", "us-east-1b", "us-east-1c"]
+         - key: karpenter.sh/capacity-type
+           operator: In
+           values: ["spot", "on-demand"]
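To see what these requirements do to the candidate set, here is a minimal sketch of requirement matching against a small, hypothetical instance catalog. The catalog entries, field names, and helper functions are assumptions for illustration. Karpenter's real scheduler also considers pod resource requests, offerings, and pricing.

```python
# Hypothetical catalog: a handful of instance types with the two labels
# that the requirements above select on.
CATALOG = [
    {"name": "m4.xlarge", "category": "m", "generation": 4},
    {"name": "m5.xlarge", "category": "m", "generation": 5},
    {"name": "c6i.xlarge", "category": "c", "generation": 6},
    {"name": "r5.xlarge", "category": "r", "generation": 5},
    {"name": "t3.xlarge", "category": "t", "generation": 3},
]


def matches(value, operator, values):
    """Evaluate one requirement using Kubernetes-style operator semantics."""
    if operator == "In":
        return str(value) in values
    if operator == "NotIn":
        return str(value) not in values
    if operator == "Gt":  # Gt/Lt compare against a single integer value
        return int(value) > int(values[0])
    if operator == "Lt":
        return int(value) < int(values[0])
    raise ValueError(f"unsupported operator: {operator}")


def candidates(catalog, requirements):
    """Names of instance types that satisfy every (field, operator, values) requirement."""
    return [
        item["name"]
        for item in catalog
        if all(matches(item[field], op, vals) for field, op, vals in requirements)
    ]


broadened = [("category", "In", ["m", "c", "r"]), ("generation", "Gt", ["4"])]
print(candidates(CATALOG, broadened))  # ['m5.xlarge', 'c6i.xlarge', 'r5.xlarge']
```

Note that Gt is strict: generation 4 instances such as m4.xlarge are excluded by values: ["4"].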
Enterprise Best Practice — Weighted On-Demand Fallback + Disruption Budgets
For production, you need two NodePools: one Spot-optimized with a high weight, and one On-Demand fallback with a lower weight. Karpenter considers higher-weight NodePools first, so it prefers Spot but can always fall back. Combine this with disruption budgets to limit how many nodes consolidation can disrupt at once while Spot capacity is churning.
# NodePool 1: Spot (preferred)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-workers
spec:
  weight: 100  # Higher weight = considered first
  disruption:
    # In v1beta1, consolidateAfter is only valid with consolidationPolicy: WhenEmpty,
    # so it is omitted here.
    consolidationPolicy: WhenUnderutilized
    budgets:
      - nodes: "20%"  # Never consolidate more than 20% of nodes at once
  template:
    spec:
      nodeClassRef:
        name: default  # The EC2NodeClass to launch from
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
---
# NodePool 2: On-Demand fallback
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ondemand-fallback
spec:
  weight: 1  # Only used when the spot-workers NodePool cannot schedule
  template:
    spec:
      nodeClassRef:
        name: default  # The EC2NodeClass to launch from
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
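The weighting behaviour can be pictured as an ordered fallback. The sketch below is a simplified model, not Karpenter's scheduler: it tries pools from highest weight to lowest and returns the first one that reports capacity. The has_capacity callback is a hypothetical stand-in for "EC2 can currently fulfil a launch from this pool".

```python
def pick_nodepool(nodepools, has_capacity):
    """Try pools from highest weight to lowest; return the first usable name, else None."""
    for pool in sorted(nodepools, key=lambda p: p["weight"], reverse=True):
        if has_capacity(pool):
            return pool["name"]
    return None


# The two pools from the manifests above, reduced to the fields this model needs.
POOLS = [
    {"name": "ondemand-fallback", "weight": 1, "capacity_type": "on-demand"},
    {"name": "spot-workers", "weight": 100, "capacity_type": "spot"},
]

# Normal day: Spot has capacity, so the weight-100 pool wins.
print(pick_nodepool(POOLS, lambda pool: True))
# Spot pools are dry: the weight-1 On-Demand pool catches the launch.
print(pick_nodepool(POOLS, lambda pool: pool["capacity_type"] != "spot"))
```

If the weights were swapped, every launch would land on On-Demand and the Spot pool would never be used, which is why the fallback must carry the lower weight.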
Also verify that your EC2NodeClass has a correct amiFamily (or amiSelectorTerms), subnetSelectorTerms, and securityGroupSelectorTerms. Subnet selection determines which zones Karpenter can launch into, so a misconfigured NodeClass silently eliminates valid instance types before Karpenter even evaluates capacity:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-my-cluster"  # Assumption: replace with your node IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"  # Must match actual subnet tags
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
Prevention in CI/CD
1. OPA/Gatekeeper Policy — Enforce Instance Diversity
Block any NodePool that pins fewer than 3 instance types or fewer than 2 AZs:
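package karpenter.nodepool

# Written for plain OPA / conftest, where the manifest is `input`.
# In a Gatekeeper ConstraintTemplate, the rule is named `violation[{"msg": msg}]`
# and the object lives under `input.review.object`.

# Rule 1: a zone requirement must list at least 2 AZs.
deny[msg] {
    input.kind == "NodePool"
    reqs := input.spec.template.spec.requirements
    zone_req := reqs[_]
    zone_req.key == "topology.kubernetes.io/zone"
    zone_req.operator == "In"
    count(zone_req.values) < 2
    msg := "NodePool must span at least 2 Availability Zones to tolerate Spot capacity outages."
}

# Rule 2: an explicit instance-type pin must list at least 3 types.
deny[msg] {
    input.kind == "NodePool"
    reqs := input.spec.template.spec.requirements
    type_req := reqs[_]
    type_req.key == "node.kubernetes.io/instance-type"
    type_req.operator == "In"
    count(type_req.values) < 3
    msg := "NodePool pinned to fewer than 3 instance types. Spot interruptions will cause scheduling failures."
}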
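For a quick local check before the policy engine is wired up, the same two rules can be mirrored in a few lines of Python. This is a sketch that assumes the manifest has already been parsed into a dict (for example with a YAML library). The function name is hypothetical and the messages are shortened versions of the ones above.

```python
ZONE_KEY = "topology.kubernetes.io/zone"
TYPE_KEY = "node.kubernetes.io/instance-type"


def nodepool_violations(manifest):
    """Mirror the two deny rules: fewer than 2 zones, or fewer than 3 pinned instance types."""
    if manifest.get("kind") != "NodePool":
        return []
    requirements = (
        manifest.get("spec", {}).get("template", {}).get("spec", {}).get("requirements", [])
    )
    messages = []
    for req in requirements:
        if req.get("operator") != "In":
            continue  # like the Rego rules, only explicit "In" lists are checked
        count = len(req.get("values", []))
        if req.get("key") == ZONE_KEY and count < 2:
            messages.append("NodePool must span at least 2 Availability Zones.")
        if req.get("key") == TYPE_KEY and count < 3:
            messages.append("NodePool pinned to fewer than 3 instance types.")
    return messages
```

Like the Rego version, this only flags explicit pins. A NodePool with no zone requirement at all passes, because nothing restricts it to a single AZ.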
2. Checkov Custom Check
Add to your Terraform/Helm pipeline:
# checkov custom check: CKV_KARPENTER_001
# Validates NodePool has capacity-type including on-demand fallback OR weight-based fallback pool exists
Run in CI:
checkov -d ./helm/karpenter --framework kubernetes --check CKV_KARPENTER_001
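The check above is described only in a comment, so here is one way its logic might look as a plain Python function. It deliberately does not use Checkov's custom-check API, and the function names are hypothetical. It assumes the NodePool manifests are already parsed into dicts, and it treats a pool with no explicit capacity-type requirement as unknown rather than guessing.

```python
CAPACITY_KEY = "karpenter.sh/capacity-type"


def capacity_types(nodepool):
    """Capacity types listed by an explicit 'In' requirement; empty set if none is declared."""
    requirements = nodepool["spec"]["template"]["spec"].get("requirements", [])
    for req in requirements:
        if req["key"] == CAPACITY_KEY and req["operator"] == "In":
            return set(req["values"])
    return set()


def spot_only_without_fallback(nodepools):
    """Names of spot-only pools that have no equal- or lower-weight on-demand pool behind them."""
    failing = []
    for pool in nodepools:
        if capacity_types(pool) != {"spot"}:
            continue  # the pool already allows on-demand, or declares nothing
        weight = pool["spec"].get("weight", 0)
        has_fallback = any(
            "on-demand" in capacity_types(other)
            and other["spec"].get("weight", 0) <= weight
            for other in nodepools
            if other is not pool
        )
        if not has_fallback:
            failing.append(pool["metadata"]["name"])
    return failing
```

An empty result means every spot-only pool has an On-Demand escape hatch. A CI step could fail the build whenever the list is non-empty.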
3. CloudWatch Alarm on Pending Pod Duration
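Don't wait for user reports. Alert when pods have been Pending for more than 5 minutes. The metric and dimension below assume Container Insights with enhanced observability for EKS, which publishes pod_status_pending; substitute whichever pending-pod metric your cluster actually emits, along with your own cluster name and SNS topic ARN:
aws cloudwatch put-metric-alarm \
  --alarm-name "karpenter-pending-pods-breach" \
  --metric-name "pod_status_pending" \
  --namespace "ContainerInsights" \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-pagerduty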
4. Karpenter Spot Interruption Queue
Ensure you have the SQS interruption queue configured. Without it, Karpenter has no advance warning of Spot reclamation and cannot proactively drain nodes before the 2-minute termination window:
# In your Karpenter Helm values (chart v0.32 and later; older charts use a different key)
settings:
  interruptionQueue: "my-cluster-karpenter-interruption-queue"
This alone won't prevent the no instance types available error, but it dramatically reduces the blast radius when Spot IS reclaimed by giving Karpenter time to reschedule before node death.
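For context on what arrives on that queue: EventBridge delivers Spot interruption warnings as JSON events with detail-type "EC2 Spot Instance Interruption Warning". The sketch below shows how a consumer might pull the instance ID out of one message body. It is an illustration with a hypothetical function name, not Karpenter's implementation, and it assumes the SQS message body is the raw EventBridge event.

```python
import json

SPOT_WARNING = "EC2 Spot Instance Interruption Warning"


def interrupted_instance(message_body):
    """Return the instance ID from a Spot interruption warning, or None for any other event."""
    event = json.loads(message_body)
    if event.get("source") != "aws.ec2" or event.get("detail-type") != SPOT_WARNING:
        return None
    return event.get("detail", {}).get("instance-id")


# A sample event with placeholder IDs.
sample = json.dumps({
    "version": "0",
    "detail-type": SPOT_WARNING,
    "source": "aws.ec2",
    "region": "us-east-1",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
})
print(interrupted_instance(sample))  # i-0123456789abcdef0
```

The instance ID is what lets a controller map the warning back to a node and start draining it inside the 2-minute window.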