Fixing KubeVirt 'VMI Not Ready' Virt-Launcher Pod Crashes: Root Cause & Recovery Guide
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause
TL;DR
- What broke: The virt-launcher pod for your KubeVirt VirtualMachineInstance crashed or was OOMKilled, leaving the VMI in a perpetual Scheduling or Pending phase with "VMI not ready" surfaced by the virt-controller.
- How to fix it: Identify the crash reason via kubectl logs and kubectl describe on the virt-launcher pod. The fix is one of: bump resource limits, patch the VMI spec's CPU/memory, correct the PVC binding, or resolve a libvirt/QEMU socket permission issue.
- Shortcut: Use our Client-Side Sandbox below to auto-refactor your failing VMI YAML. Paste it in and get the corrected spec back without sending your config to a third-party server.
The Incident (What Does the Error Mean?)
Raw signal you'll see across the stack:
# kubectl get vmi <vmi-name> -n <namespace>
NAME      AGE   PHASE        IP    NODENAME   READY
my-vm-0   4m    Scheduling                    False
# kubectl describe vmi my-vm-0 -n <namespace>
...
Warning SyncFailed 3m virt-controller Error creating virt-launcher pod: ...
Warning VMINotReady 2m virt-handler VMI not ready: virt-launcher pod is not running
# kubectl describe pod virt-launcher-my-vm-0-<hash> -n <namespace>
    Last State:   Terminated
      Reason:     OOMKilled (or Error, CrashLoopBackOff)
      Exit Code:  137
Immediate consequence: The VM never boots. The VMI object exists but no QEMU process was ever spawned. If this is a StatefulSet-backed VM or a KubeVirt VirtualMachine with runStrategy: Always, the controller will keep respawning the pod, hammering the node with failed pod launches until you intervene.
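For reference, the exit code 137 shown above decodes as 128 + 9: the container's main process was terminated by signal 9 (SIGKILL), which is the signal the kernel OOM killer sends. SIGKILL can come from other sources too, so the Reason field is what confirms an OOM kill. A minimal sketch of the arithmetic, where decode_exit_code is a hypothetical helper name and not part of kubectl or KubeVirt:

```shell
#!/bin/sh
# decode_exit_code: hypothetical helper, not part of kubectl or KubeVirt.
# Container exit codes above 128 mean "terminated by signal (code - 128)".
decode_exit_code() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    local sig=$(( code - 128 ))
    # kill -l <number> prints the name of that signal
    echo "killed by signal $sig ($(kill -l "$sig"))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

decode_exit_code 137
```

An exit code of 1 or 2 instead points at an error raised by the launcher process itself, in which case the previous container logs are the place to look.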
The Attack Vector / Blast Radius
This is not a single-pod problem. Here's the cascade:
- CrashLoopBackOff storm: virt-controller respawns virt-launcher on each reconcile loop. Each failed pod leaves ephemeral storage debris and consumes API server write bandwidth.
- Node resource starvation: OOMKilled launchers that die mid-QEMU-init can leave orphaned /dev/kvm file descriptors and leaked tap interfaces (k6t-eth0) on the node. Confirmed via ip link show | grep k6t.
- Noisy neighbor impact: On a shared node, repeated pod churn from a single broken VMI elevates kubelet reconcile pressure for every other workload on that node.
- Data integrity risk: If the VMI was attached to a ReadWriteOnce PVC that didn't cleanly detach before the crash, the PVC can remain in a locked state, blocking any future VMI from mounting it and effectively bricking the volume until manual intervention.
Primary root causes ranked by frequency:
| Cause | Signal |
|---|---|
| OOMKill (memory limit too low) | Exit Code 137, Reason: OOMKilled |
| Missing or unbound PVC | FailedMount, PVC in Pending |
| libvirt socket permission / securityContext | permission denied in virt-launcher logs |
| Node doesn't support KVM (nested virt disabled) | /dev/kvm no such file |
| CPU model incompatibility | QEMU unsupported machine type |
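The table can be turned into a rough first-pass triage step. The sketch below is a hypothetical helper (classify_crash is an invented name) that maps the signal strings from the table to a short cause label. The match strings are taken from the table as written; actual log wording may differ between KubeVirt versions, so treat the result as a hint rather than a diagnosis.

```shell
#!/bin/sh
# classify_crash: hypothetical triage helper. It matches the signal strings
# from the root-cause table against a blob of text (describe output or logs).
# Patterns are checked top to bottom; the first match wins.
classify_crash() {
  case "$1" in
    *OOMKilled*)                  echo "oom" ;;
    *FailedMount*)                echo "pvc" ;;
    *"permission denied"*)        echo "permissions" ;;
    */dev/kvm*)                   echo "kvm" ;;
    *"unsupported machine type"*) echo "cpu-model" ;;
    *)                            echo "unknown" ;;
  esac
}

# Example with a line copied from the pod description
classify_crash "Reason: OOMKilled"
```

In practice the argument would be the captured output of kubectl describe pod or the previous container logs for the failing virt-launcher pod.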
How to Fix It (The Solution)
Step 1 — Pull the actual crash reason
# Get the terminated container logs (not the running one)
kubectl logs virt-launcher-my-vm-0-<hash> -n <namespace> --previous -c compute
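# Also check virt-handler on the affected node. kubectl logs has no --field-selector
# flag, so resolve the pod name first and pass it to kubectl logs.
kubectl logs -n kubevirt "$(kubectl get pods -n kubevirt -l kubevirt.io=virt-handler --field-selector spec.nodeName=<node> -o name)"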
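If you only need the termination reason rather than the full logs, it can be read straight from the pod status. A minimal sketch, assuming the container is named compute as in the command above; last_crash_reason and the KUBECTL override variable are hypothetical names introduced here so the function can be exercised without a cluster.

```shell
#!/bin/sh
# last_crash_reason: hypothetical helper. KUBECTL is an assumed override hook
# (defaults to kubectl) and is not a variable that kubectl itself reads.
last_crash_reason() {
  local pod=$1 ns=$2
  # Select the "compute" container and print why its last instance terminated
  "${KUBECTL:-kubectl}" get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="compute")].lastState.terminated.reason}'
}

# Usage (placeholders as in the commands above):
#   last_crash_reason virt-launcher-my-vm-0-<hash> <namespace>
```

An empty result means the container has no recorded previous termination, so the failure is more likely at scheduling or volume-mount time than inside the launcher.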
Basic Fix — OOMKill: Raise Memory Limits on the VMI
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: my-vm-0
spec:
  domain:
    resources:
      requests:
-       memory: 64M
+       memory: 2Gi
      limits:
-       memory: 128M
+       memory: 2Gi
    cpu:
      cores: 1
Rule: virt-launcher itself needs overhead on top of the guest RAM. KubeVirt adds an overhead buffer automatically, but if you hard-cap limits.memory below requests.memory + overhead, the pod is killed before QEMU initializes. Never set memory limits below 1Gi for any non-trivial guest.
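To see how far the original values in the diff were from a workable size, the quantities can be converted to bytes. The sketch below is a hypothetical helper (to_bytes is not a kubectl or KubeVirt command) that handles only integer quantities with the k, M, G, Ki, Mi and Gi suffixes. It compares each value against the 512Mi floor used in this guide's policy examples.

```shell
#!/bin/sh
# to_bytes: hypothetical helper converting a Kubernetes-style memory quantity
# to bytes. Only integer values with k/M/G (decimal) and Ki/Mi/Gi (binary)
# suffixes are handled; anything else is rejected.
to_bytes() {
  local q=$1 num unit
  num=${q%%[A-Za-z]*}     # leading digits
  unit=${q#"$num"}        # whatever follows the digits
  case "$unit" in
    "") echo "$num" ;;
    k)  echo $(( num * 1000 )) ;;
    M)  echo $(( num * 1000 * 1000 )) ;;
    G)  echo $(( num * 1000 * 1000 * 1000 )) ;;
    Ki) echo $(( num * 1024 )) ;;
    Mi) echo $(( num * 1024 * 1024 )) ;;
    Gi) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)  echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

floor=$(to_bytes 512Mi)
for q in 64M 128M 2Gi; do
  if [ "$(to_bytes "$q")" -lt "$floor" ]; then
    echo "$q: below the 512Mi floor"
  else
    echo "$q: ok"
  fi
done
```

Note that 64M and 128M are decimal megabytes (64,000,000 and 128,000,000 bytes), so both fall well short of the floor even before any launcher overhead is considered.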
Basic Fix — Unbound PVC
kubectl get pvc -n <namespace>
# If STATUS = Pending, your StorageClass is not provisioning
kubectl describe pvc <pvc-name> -n <namespace>
volumes:
  - name: rootdisk
    persistentVolumeClaim:
-     claimName: my-vm-pvc-typo
+     claimName: my-vm-rootdisk-pvc
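Before recreating the VMI, it may help to confirm that the corrected claim is actually Bound. A minimal sketch, assuming the claim name from the diff above; pvc_is_bound and the KUBECTL override variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# pvc_is_bound: hypothetical helper. Returns success only when the PVC's
# status.phase is "Bound" (the other phases are Pending and Lost).
# KUBECTL is an assumed override hook that defaults to kubectl.
pvc_is_bound() {
  local pvc=$1 ns=$2 phase
  phase=$("${KUBECTL:-kubectl}" get pvc "$pvc" -n "$ns" -o jsonpath='{.status.phase}')
  [ "$phase" = "Bound" ]
}

# Usage:
#   pvc_is_bound my-vm-rootdisk-pvc <namespace> || echo "PVC not bound yet"
```

A claim that stays in Pending usually needs the kubectl describe pvc output above to find out why the StorageClass is not provisioning.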
Enterprise Best Practice — Enforce Resource Floors with LimitRange + VMI Validation Webhook
Problem: Individual teams submit VMI specs with dangerously low memory. No guardrails exist at admission time.
Solution: Deploy a LimitRange in each VM namespace AND enforce a KubeVirt ValidatingWebhookConfiguration (or OPA/Gatekeeper policy) that rejects VMIs below safe thresholds.
# LimitRange for VM namespaces
apiVersion: v1
kind: LimitRange
metadata:
  name: kubevirt-vm-limits
  namespace: vm-workloads
spec:
  limits:
-   # No limits defined; any value accepted
+   - type: Container
+     min:
+       memory: 512Mi
+       cpu: "250m"
+     default:
+       memory: 2Gi
+       cpu: "1"
+     defaultRequest:
+       memory: 2Gi
+       cpu: "1"
# OPA Rego, Gatekeeper ConstraintTemplate excerpt
- # No policy: virt-launcher OOMKills silently
+ violation[{"msg": msg}] {
+   input.review.object.kind == "VirtualMachineInstance"
+   mem := input.review.object.spec.domain.resources.requests.memory
+   # units.parse_bytes understands M, G, Mi and Gi; stripping only "Mi" would let "64M" through
+   units.parse_bytes(mem) < 512 * 1024 * 1024
+   msg := "VMI memory request below 512Mi: virt-launcher will OOMKill"
+ }
Fix — KVM Not Available on Node
# On the suspect node
ls -la /dev/kvm
# If missing: nested virtualization is disabled in the hypervisor or BIOS
# On AWS, use bare-metal (.metal) instance types, which expose /dev/kvm directly
# For GKE, use n2 or c2 node pools with --enable-nested-virtualization
# Verify KubeVirt node labels
kubectl get nodes -l kubevirt.io/schedulable=true
# If your VMI was scheduled to a non-KVM node due to missing nodeSelector:
spec:
+ nodeSelector:
+   kubevirt.io/schedulable: "true"
  domain:
    ...
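The node-side check can be taken one step further by separating "KVM device missing" from "CPU has no virtualization extensions exposed". The sketch below is a hypothetical helper for x86 nodes (node_kvm_status and its labels are invented here). It looks for the vmx (Intel VT-x) or svm (AMD-V) flag in /proc/cpuinfo; both paths are parameters only so the function can be tried against sample files.

```shell
#!/bin/sh
# node_kvm_status: hypothetical helper for x86 nodes.
# $1 = KVM device path (default /dev/kvm), $2 = cpuinfo path (default /proc/cpuinfo)
node_kvm_status() {
  local dev=${1:-/dev/kvm} cpuinfo=${2:-/proc/cpuinfo}
  if [ -e "$dev" ]; then
    echo "kvm-available"
  elif grep -Eqw 'vmx|svm' "$cpuinfo"; then
    # CPU advertises VT-x/AMD-V but the device is absent:
    # the kvm_intel or kvm_amd kernel module is probably not loaded
    echo "cpu-capable-but-no-kvm-device"
  else
    # Extensions not exposed: disabled in firmware, or nested virt is off
    echo "no-virt-extensions"
  fi
}

# Run on the suspect node
node_kvm_status
```

The second case is typically fixable on the node itself, while the third needs a change at the hypervisor, firmware, or instance-type level.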
Prevention in CI/CD
Gate broken VMI specs before they hit the cluster.
1. Checkov — Static VMI YAML Scanning
pip install checkov
checkov -f vmi-manifest.yaml --framework kubernetes
# Custom check: flag any VMI with memory request < 512Mi
2. Kyverno Policy — Enforce KVM Node Affinity
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-kvm-node-selector
spec:
  validationFailureAction: enforce
  rules:
    - name: check-kvm-nodeselector
      match:
        resources:
          kinds: [VirtualMachineInstance]
      validate:
        message: "VMI must target kubevirt.io/schedulable=true nodes"
        pattern:
          spec:
            nodeSelector:
              kubevirt.io/schedulable: "true"
3. Pre-commit Hook — Validate PVC References Exist
#!/bin/bash
# pre-commit: validate PVC names in VMI specs match existing PVCs
NAMESPACE="${NAMESPACE:-default}"
PVC_NAMES=$(kubectl get pvc -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}')
# A VMI can reference several PVCs; volumes without one yield null, so filter those out
REFS=$(yq e '.spec.volumes[].persistentVolumeClaim.claimName | select(. != null)' vmi-manifest.yaml)
for REF in $REFS; do
  if [[ ! " $PVC_NAMES " =~ " $REF " ]]; then
    echo "ERROR: PVC '$REF' not found in namespace $NAMESPACE"
    exit 1
  fi
done
4. Alerting — PagerDuty/AlertManager Rule
- alert: KubeVirtVMINotReady
  expr: kubevirt_vmi_phase_count{phase="Scheduling"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "VMI stuck in Scheduling phase for >5min, likely virt-launcher crash"
    runbook: "https://your-wiki/kubevirt-virt-launcher-crash"