Fixing KubeVirt 'VMI Not Ready' Virt-Launcher Pod Crashes: Root Cause & Recovery Guide

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause

TL;DR

  • What broke: The virt-launcher pod for your KubeVirt VirtualMachineInstance crashed or was OOMKilled, so the VMI stays stuck in the Scheduling or Pending phase and its events report VMI not ready.
  • How to fix it: Identify the crash reason via kubectl logs and kubectl describe on the virt-launcher pod — fix is one of: bump resource limits, patch the VMI spec's CPU/memory, correct the PVC binding, or resolve a libvirt/QEMU socket permission issue.

The Incident (What Does the Error Mean?)

Raw signal you'll see across the stack:

# kubectl get vmi <vmi-name> -n <namespace>
NAME         AGE   PHASE     IP    NODENAME   READY
my-vm-0      4m    Scheduling                  False

# kubectl describe vmi my-vm-0 -n <namespace>
...
Warning  SyncFailed   3m   virt-controller  Error creating virt-launcher pod: ...
Warning  VMINotReady  2m   virt-handler     VMI not ready: virt-launcher pod is not running

# kubectl describe pod virt-launcher-my-vm-0-<hash> -n <namespace>
  Last State: Terminated
    Reason:   OOMKilled (or Error, CrashLoopBackOff)
    Exit Code: 137

Immediate consequence: The VM never boots. The VMI object exists, but QEMU never gets far enough to run the guest. If the VMI is owned by a VirtualMachineInstanceReplicaSet or by a KubeVirt VirtualMachine with runStrategy: Always, the controller keeps respawning the launcher pod, hammering the node with failed launches until you intervene.
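Exit code 137 in the output above is 128 + 9, meaning the container was terminated by signal 9 (SIGKILL). A minimal sketch of that arithmetic follows; decode_exit_code is a hypothetical helper name, not part of kubectl or KubeVirt. SIGKILL alone does not prove an OOM kill, so treat Reason: OOMKilled as the confirming signal.

```shell
# Decode a container exit code as shown by `kubectl describe pod`.
# Codes above 128 mean "terminated by signal (code - 128)".
decode_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  else
    echo "exited with application error $code"
  fi
}

decode_exit_code 137   # prints: killed by signal 9
```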


The Attack Vector / Blast Radius

This is not a single-pod problem. Here's the cascade:

  1. CrashLoopBackOff storm: virt-controller respawns virt-launcher on each reconcile loop. Each failed pod leaves ephemeral storage debris and consumes API server write bandwidth.
  2. Node resource starvation: OOMKilled launchers that die mid-QEMU-init can leave orphaned /dev/kvm file descriptors and leaked tap interfaces (k6t-eth0) on the node. Check for leftovers with ip link show | grep k6t.
  3. Noisy neighbor impact: On a shared node, repeated pod churn from a single broken VMI elevates kubelet reconcile pressure for every other workload on that node.
  4. Data integrity risk: If the VMI was attached to a ReadWriteOnce PVC that didn't detach cleanly before the crash, the volume can stay attached to the old node. That blocks any future VMI from mounting it until someone clears the stale attachment manually.
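To check item 2 on a node, the one-line form of ip output is easier to filter than the default multi-line form. The sketch below assumes the leaked devices keep the k6t prefix mentioned above; list_k6t_links is a hypothetical helper name.

```shell
# Print interface names that start with "k6t".
# Expects `ip -o link show` output on stdin, e.g. "5: k6t-eth0: <...> mtu 1500 ...".
list_k6t_links() {
  awk -F': ' '{ name = $2; sub(/@.*/, "", name); if (name ~ /^k6t/) print name }'
}

# Usage on the suspect node:
#   ip -o link show | list_k6t_links
```

The sub() call strips the "@ifN" peer suffix that ip appends to veth-style names, so the printed names can be passed straight to ip link delete if cleanup is needed.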

Primary root causes ranked by frequency:

Cause                                            | Signal
OOMKill (memory limit too low)                   | Exit Code 137, Reason: OOMKilled
Missing or unbound PVC                           | FailedMount, PVC in Pending
libvirt socket permission / securityContext      | permission denied in virt-launcher logs
Node doesn't support KVM (nested virt disabled)  | /dev/kvm no such file
CPU model incompatibility                        | QEMU unsupported machine type
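The signals above can be turned into a rough first-pass triage. This is only a sketch: classify_launcher_failure is a hypothetical helper, and the match strings are taken from the list above, so real log lines may be worded differently and still need a human read.

```shell
# Map text from `kubectl describe pod` or `kubectl logs` (on stdin) to a likely cause.
# Patterns are checked top to bottom; the first match wins.
classify_launcher_failure() {
  text=$(cat)
  case "$text" in
    *OOMKilled*)                  echo "oom: raise memory limits" ;;
    *FailedMount*)                echo "pvc: missing or unbound PVC" ;;
    */dev/kvm*)                   echo "kvm: node has no usable /dev/kvm" ;;
    *"ermission denied"*)         echo "perm: libvirt socket or securityContext" ;;
    *"unsupported machine type"*) echo "cpu: CPU model or machine type incompatibility" ;;
    *)                            echo "unknown: read the full logs" ;;
  esac
}

# Usage:
#   kubectl describe pod virt-launcher-my-vm-0-<hash> -n <namespace> | classify_launcher_failure
```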

How to Fix It (The Solution)

Step 1 — Pull the actual crash reason

# Get the terminated container logs (not the running one)
kubectl logs virt-launcher-my-vm-0-<hash> -n <namespace> --previous -c compute

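# Also check virt-handler on the affected node.
# kubectl logs has no --field-selector flag, so find the pod first:
kubectl get pods -n kubevirt -l kubevirt.io=virt-handler --field-selector spec.nodeName=<node> -o name
kubectl logs -n kubevirt <virt-handler-pod>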

Basic Fix — OOMKill: Raise Memory Limits on the VMI

 apiVersion: kubevirt.io/v1
 kind: VirtualMachineInstance
 metadata:
   name: my-vm-0
 spec:
   domain:
     resources:
       requests:
-        memory: 64M
+        memory: 2Gi
       limits:
-        memory: 128M
+        memory: 2Gi
     cpu:
       cores: 1

Rule: virt-launcher needs overhead on top of the guest RAM for QEMU, libvirt, and the launcher process itself. KubeVirt adds an overhead buffer to the pod automatically, but if limits.memory is too small for the guest plus that overhead, the pod is killed before QEMU finishes initializing. As a rule of thumb, do not set memory limits below 1Gi for any non-trivial guest.
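Part of the trap in the broken spec is the unit: 64M and 128M are decimal megabytes, which are smaller than the same number of Mi. The sketch below converts Kubernetes memory quantities to bytes and compares them against a floor. mem_to_bytes and below_floor are hypothetical helper names, only whole-number quantities are handled, and the 512Mi floor is an assumption borrowed from the policy examples in this guide.

```shell
# Convert a Kubernetes memory quantity (plain bytes, k/M/G, or Ki/Mi/Gi) to bytes.
# Only integer values are supported; "1.5Gi" is out of scope for this sketch.
mem_to_bytes() {
  q="$1"
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *k)  echo $(( ${q%k} * 1000 )) ;;
    *M)  echo $(( ${q%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${q%G} * 1000 * 1000 * 1000 )) ;;
    *)   echo "$q" ;;
  esac
}

# Succeeds (exit status 0) when quantity $1 is smaller than floor $2.
below_floor() {
  [ "$(mem_to_bytes "$1")" -lt "$(mem_to_bytes "$2")" ]
}

below_floor 128M 512Mi && echo "128M is below the 512Mi floor"
```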


Basic Fix — Unbound PVC

kubectl get pvc -n <namespace>
# If STATUS = Pending, your StorageClass is not provisioning
kubectl describe pvc <pvc-name> -n <namespace>

# If the PVC exists under a different name, fix the claimName in the VMI spec:
 volumes:
   - name: rootdisk
     persistentVolumeClaim:
-      claimName: my-vm-pvc-typo
+      claimName: my-vm-rootdisk-pvc
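In a namespace with many claims, filtering the listing saves scanning the STATUS column by eye. A minimal sketch, assuming the default kubectl get pvc column order (NAME first, STATUS second); unbound_pvcs is a hypothetical helper name.

```shell
# Print the names of PVCs whose STATUS is anything other than "Bound".
# Expects `kubectl get pvc --no-headers` output on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}

# Usage:
#   kubectl get pvc -n <namespace> --no-headers | unbound_pvcs
```

An empty result means every claim in the namespace is bound, which points back at a wrong claimName in the VMI spec rather than a provisioning problem.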

Enterprise Best Practice — Enforce Resource Floors with LimitRange + VMI Validation Webhook

Problem: Individual teams submit VMI specs with dangerously low memory. No guardrails exist at admission time.

Solution: Deploy a LimitRange in each VM namespace AND enforce a KubeVirt ValidatingWebhookConfiguration (or OPA/Gatekeeper policy) that rejects VMIs below safe thresholds.

# LimitRange for VM namespaces
 apiVersion: v1
 kind: LimitRange
 metadata:
   name: kubevirt-vm-limits
   namespace: vm-workloads
 spec:
   limits:
-  # No limits defined — any value accepted
+  - type: Container
+    min:
+      memory: 512Mi
+      cpu: "250m"
+    default:
+      memory: 2Gi
+      cpu: "1"
+    defaultRequest:
+      memory: 2Gi
+      cpu: "1"

# OPA Rego: Gatekeeper ConstraintTemplate excerpt
# Gatekeeper expects a rule named violation; units.parse_bytes handles M, Mi, Gi, etc.
- # No policy: virt-launcher OOMKills silently
+ violation[{"msg": msg}] {
+   input.review.object.kind == "VirtualMachineInstance"
+   mem := input.review.object.spec.domain.resources.requests.memory
+   units.parse_bytes(mem) < 512 * 1024 * 1024
+   msg := "VMI memory request below 512Mi: virt-launcher will OOMKill"
+ }

Fix — KVM Not Available on Node

# On the suspect node
ls -la /dev/kvm
# If missing: nested virtualization is disabled in the hypervisor or BIOS

# AWS: use bare-metal (.metal) instance types; nested virt is not enabled elsewhere
# GKE: use n2 or c2 node pools created with --enable-nested-virtualization

# Verify KubeVirt node labels
kubectl get nodes -l kubevirt.io/schedulable=true
# If your VMI was scheduled to a non-KVM node due to missing nodeSelector:
 spec:
+  nodeSelector:
+    kubevirt.io/schedulable: "true"
   domain:
     ...
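The manual ls check can be wrapped into two small probes to run on the suspect node. This is a sketch for x86 nodes only: kvm_device_status and cpu_virt_flag are hypothetical helper names, and vmx (Intel) and svm (AMD) are the CPU flags that indicate hardware virtualization is exposed to the machine.

```shell
# Report whether a KVM device node is usable by the current user.
# The path is a parameter so the function is easy to test; it defaults to /dev/kvm.
kvm_device_status() {
  dev="${1:-/dev/kvm}"
  if [ ! -e "$dev" ]; then
    echo "missing"
  elif [ -r "$dev" ] && [ -w "$dev" ]; then
    echo "usable"
  else
    echo "present but not accessible"
  fi
}

# Print the first virtualization CPU flag found in /proc/cpuinfo-style text
# on stdin (vmx for Intel, svm for AMD), or "none" if neither is present.
cpu_virt_flag() {
  flag=$(grep -o -w -E 'vmx|svm' | head -n 1) || true
  echo "${flag:-none}"
}

# Usage on the node:
#   kvm_device_status
#   cpu_virt_flag < /proc/cpuinfo
```

A result of "missing" together with "none" suggests virtualization is not exposed by the hypervisor or BIOS, while "missing" with a flag present more likely means the kvm kernel module is not loaded.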


Prevention in CI/CD

Gate broken VMI specs before they hit the cluster.

1. Checkov — Static VMI YAML Scanning

pip install checkov
checkov -f vmi-manifest.yaml --framework kubernetes
# Custom check: flag any VMI with memory request < 512Mi

2. Kyverno Policy — Enforce KVM Node Affinity

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-kvm-node-selector
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-kvm-nodeselector
      match:
        resources:
          kinds: [VirtualMachineInstance]
      validate:
        message: "VMI must target kubevirt.io/schedulable=true nodes"
        pattern:
          spec:
            nodeSelector:
              kubevirt.io/schedulable: "true"

3. Pre-commit Hook — Validate PVC References Exist

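#!/bin/bash
# pre-commit: validate PVC names in VMI specs match existing PVCs
# Fail early if the target namespace was not provided
NAMESPACE="${NAMESPACE:?set NAMESPACE to the target namespace}"
PVC_NAMES=$(kubectl get pvc -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}')
# Check every PVC-backed volume; volumes of other types yield null and are skipped
for REF in $(yq e '.spec.volumes[].persistentVolumeClaim.claimName | select(. != null)' vmi-manifest.yaml); do
  if [[ ! " $PVC_NAMES " =~ " $REF " ]]; then
    echo "ERROR: PVC '$REF' not found in namespace $NAMESPACE"
    exit 1
  fi
done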

4. Alerting — PagerDuty/AlertManager Rule

- alert: KubeVirtVMINotReady
  expr: kubevirt_vmi_phase_count{phase=~"(?i)scheduling"} > 0  # (?i) matches the phase label regardless of case
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "VMI stuck in Scheduling phase for >5min — likely virt-launcher crash"
    runbook: "https://your-wiki/kubevirt-virt-launcher-crash"
