Why does the virt-launcher pod get OOMKilled even when my VM's guest memory is small?

The virt-launcher pod's memory limit must cover both the guest RAM and the overhead of the QEMU process. KubeVirt calculates an overhead buffer automatically (roughly 100–350Mi depending on version and configuration), but if you manually set a hard `limits.memory` in the VMI spec that is lower than guest RAM + overhead, the kernel OOMKills the pod before the guest even boots. Set `limits.memory` equal to `requests.memory`, let KubeVirt add its overhead on top, and never cap the limit below the guest size.

How do I recover a PVC that got stuck in a locked state after a virt-launcher crash?

First, confirm the VMI and its pod are fully deleted: run `kubectl delete vmi <vmi-name> -n <namespace>` and verify the virt-launcher pod is gone. Then inspect the PVC with `kubectl describe pvc <pvc-name> -n <namespace>`. If a stale VolumeAttachment still references its volume (list them with `kubectl get volumeattachment`), delete it manually: `kubectl delete volumeattachment <volumeattachment-name>`. For RWO volumes on CSI drivers, you may also need to force-detach at the storage layer (e.g., AWS EBS detach via CLI). After cleanup, recreate the VMI; the PVC should mount cleanly.
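
The VolumeAttachment lookup can be narrowed to the one volume behind the stuck PVC. What follows is a minimal sketch rather than a definitive procedure: `attachments_for_pv` is a hypothetical helper name, and `<pv-name>` stands for the PV bound to your PVC. The helper only filters text, so it can be tried without a cluster.

```shell
# Hypothetical helper: print the names of VolumeAttachments that reference one PV.
# Reads lines of the form "NAME PV NODE" on stdin.
attachments_for_pv() {
  pv="$1"
  awk -v pv="$pv" '$2 == pv { print $1 }'
}

# Against a live cluster (not run here), feed it kubectl output:
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName \
#     | attachments_for_pv <pv-name>
```

Review the printed names before deleting anything; an attachment for a volume that is legitimately mounted elsewhere is not stale.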

Fixing KubeVirt 'VMI Not Ready' Virt-Launcher Pod Crashes: Root Cause & Recovery Guide

How can I tell which node is causing repeated virt-launcher crashes across multiple VMIs?

Run `kubectl get pods -n <namespace> --field-selector=status.phase=Failed -o wide | grep virt-launcher` to correlate failed pods to a specific node. Then check the virt-handler logs on that node. `kubectl logs` does not accept `--field-selector`, so first find the pod with `kubectl get pods -n kubevirt -l kubevirt.io=virt-handler --field-selector spec.nodeName=<node> -o name` and pass the result to `kubectl logs -n kubevirt`. Common node-level culprits are: KVM device unavailable (`/dev/kvm` missing), a full ephemeral disk from previous crash debris, or a kernel version incompatible with the QEMU build in your KubeVirt version.
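
When many pods have failed, counting them per node makes the outlier easier to spot than reading `-o wide` output. A small sketch, assuming input lines of the form "POD NODE"; `count_by_node` is a hypothetical helper name, not a kubectl feature.

```shell
# Hypothetical helper: count failed virt-launcher pods per node.
# Reads "POD NODE" lines on stdin, prints "COUNT NODE", highest count first.
count_by_node() {
  awk '$1 ~ /^virt-launcher-/ { n[$2]++ } END { for (k in n) print n[k], k }' \
    | sort -k1,1nr -k2,2
}

# Against a live cluster (not run here):
#   kubectl get pods -A --field-selector=status.phase=Failed --no-headers \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName | count_by_node
```

A node that dominates the list is the first place to look for the node-level culprits listed above.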

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause

TL;DR

What broke: The virt-launcher pod for your KubeVirt VirtualMachineInstance crashed or was OOMKilled, leaving the VMI stuck in the Scheduling or Pending phase while its events report VMI not ready.
How to fix it: Identify the crash reason with kubectl logs and kubectl describe on the virt-launcher pod. The fix is one of: raise resource limits, patch the VMI spec's CPU/memory, correct the PVC binding, or resolve a libvirt/QEMU socket permission issue.

The Incident (What Does the Error Mean?)

Raw signal you'll see across the stack:

# kubectl get vmi <vmi-name> -n <namespace>
NAME         AGE   PHASE     IP    NODENAME   READY
my-vm-0      4m    Scheduling                  False

# kubectl describe vmi my-vm-0 -n <namespace>
...
Warning  SyncFailed   3m   virt-controller  Error creating virt-launcher pod: ...
Warning  VMINotReady  2m   virt-handler     VMI not ready: virt-launcher pod is not running

# kubectl describe pod virt-launcher-my-vm-0-<hash> -n <namespace>
  Last State: Terminated
    Reason:   OOMKilled (or Error, CrashLoopBackOff)
    Exit Code: 137

Immediate consequence: The VM never boots. The VMI object exists, but the guest never reaches a running state. If the VMI is managed by a VirtualMachineInstanceReplicaSet or by a KubeVirt VirtualMachine with runStrategy: Always, the controller keeps respawning the pod, hammering the node with failed pod launches until you intervene.
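
The Exit Code: 137 in the pod description follows the usual convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. The arithmetic can be sketched as below; `signal_from_exit_code` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: derive the terminating signal from a container exit code.
# Exit codes above 128 mean "killed by signal (code - 128)"; otherwise print 0.
signal_from_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    echo 0
  fi
}

# Example: signal_from_exit_code 137   prints 9 (SIGKILL)
```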

The Attack Vector / Blast Radius

This is not a single-pod problem. Here's the cascade:

CrashLoopBackOff storm: virt-controller respawns virt-launcher on each reconcile loop. Each failed pod leaves ephemeral storage debris and consumes API server write bandwidth.
Node resource starvation: OOMKilled launchers that die mid-QEMU-init can leave orphaned /dev/kvm file descriptors and leaked tap interfaces (k6t-eth0) on the node. Check for leftovers with ip link show | grep k6t.
Noisy neighbor impact: On a shared node, repeated pod churn from a single broken VMI elevates kubelet reconcile pressure for every other workload on that node.
Data integrity risk: If the VMI was attached to a ReadWriteOnce PVC that didn't cleanly detach before the crash, the PVC can remain in a locked state, blocking any future VMI from mounting it — effectively bricking the volume until manual intervention.

Primary root causes ranked by frequency:

| Cause | Signal |
| --- | --- |
| OOMKill (memory limit too low) | Exit Code 137, `Reason: OOMKilled` |
| Missing or unbound PVC | `FailedMount`, PVC in `Pending` |
| libvirt socket permission / securityContext | `permission denied` in virt-launcher logs |
| Node doesn't support KVM (nested virt disabled) | `/dev/kvm no such file` |
| CPU model incompatibility | QEMU `unsupported machine type` |
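
The cause-to-signal mapping above can be applied mechanically to captured logs or `kubectl describe` output. This is a rough sketch under stated assumptions: `classify_crash` and its output labels are hypothetical, the patterns are only the signal strings listed above, and the first match wins.

```shell
# Hypothetical helper: map log or describe text (stdin) to a likely root cause.
# Patterns mirror the signals listed in the table above; labels are illustrative.
classify_crash() {
  text="$(cat)"
  case "$text" in
    *OOMKilled*|*"Exit Code: 137"*) echo "oomkill" ;;
    *FailedMount*)                  echo "pvc" ;;
    *"permission denied"*)          echo "permissions" ;;
    *"/dev/kvm"*)                   echo "kvm-missing" ;;
    *"unsupported machine type"*)   echo "cpu-model" ;;
    *)                              echo "unknown" ;;
  esac
}

# Against a live cluster (not run here):
#   kubectl describe pod virt-launcher-my-vm-0-<hash> -n <namespace> | classify_crash
```

Treat the result as a starting hint only; a pod can show more than one of these signals at once.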

How to Fix It (The Solution)

Step 1 — Pull the actual crash reason

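# Get the logs of the terminated container (not the running one)
kubectl logs virt-launcher-my-vm-0-<hash> -n <namespace> --previous -c compute

# Also check virt-handler on the affected node.
# kubectl logs has no --field-selector flag, so resolve the pod name first.
HANDLER=$(kubectl get pods -n kubevirt -l kubevirt.io=virt-handler \
  --field-selector spec.nodeName=<node> -o name)
kubectl logs -n kubevirt "$HANDLER"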

Basic Fix — OOMKill: Raise Memory Limits on the VMI

 apiVersion: kubevirt.io/v1
 kind: VirtualMachineInstance
 metadata:
   name: my-vm-0
 spec:
   domain:
     resources:
       requests:
-        memory: 64M
+        memory: 2Gi
       limits:
-        memory: 128M
+        memory: 2Gi
     cpu:
       cores: 1

Rule: virt-launcher itself needs overhead on top of the guest RAM. KubeVirt adds an overhead buffer automatically, but if you hard-cap limits.memory below requests.memory + overhead, the pod is killed before QEMU initializes. Never set memory limits below 1Gi for any non-trivial guest.
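
The rule can be checked with simple arithmetic before a manifest is applied. The sketch below makes several assumptions: `to_bytes` and `limit_covers_guest` are hypothetical helpers, only whole-number quantities with the suffixes k, M, G, Ki, Mi, and Gi are handled, and the overhead value is something you supply (350Mi here is just the upper end of the range quoted in this guide, not a figure computed by KubeVirt).

```shell
# Hypothetical helper: convert a whole-number Kubernetes quantity to bytes.
to_bytes() {
  q="$1"
  num="${q%%[!0-9]*}"    # leading digits
  unit="${q#"$num"}"     # remaining suffix
  [ -n "$num" ] || { echo "unsupported quantity: $q" >&2; return 1; }
  case "$unit" in
    "") echo "$num" ;;
    k)  echo $(( num * 1000 )) ;;
    M)  echo $(( num * 1000 * 1000 )) ;;
    G)  echo $(( num * 1000 * 1000 * 1000 )) ;;
    Ki) echo $(( num * 1024 )) ;;
    Mi) echo $(( num * 1024 * 1024 )) ;;
    Gi) echo $(( num * 1024 * 1024 * 1024 )) ;;
    *)  echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# Succeeds when limit >= guest + overhead (all given as quantities).
limit_covers_guest() {
  limit="$(to_bytes "$1")" || return 2
  guest="$(to_bytes "$2")" || return 2
  overhead="$(to_bytes "$3")" || return 2
  [ "$limit" -ge $(( guest + overhead )) ]
}

# Example: limit_covers_guest 128M 64M 350Mi   fails, matching the broken spec above
```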

Basic Fix — Unbound PVC

kubectl get pvc -n <namespace>
# If STATUS = Pending, your StorageClass is not provisioning
kubectl describe pvc <pvc-name> -n <namespace>

 volumes:
   - name: rootdisk
     persistentVolumeClaim:
-      claimName: my-vm-pvc-typo
+      claimName: my-vm-rootdisk-pvc
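
In a namespace with many claims, filtering for anything that is not Bound is quicker than scanning the full list. A minimal sketch, assuming the default `kubectl get pvc --no-headers` column order (NAME first, STATUS second); `unbound_pvcs` is a hypothetical helper name.

```shell
# Hypothetical helper: print "NAME STATUS" for every PVC that is not Bound.
# Reads `kubectl get pvc --no-headers` style lines on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Against a live cluster (not run here):
#   kubectl get pvc -n <namespace> --no-headers | unbound_pvcs
```

An empty result combined with a FailedMount event points at a wrong claimName rather than a provisioning problem.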

Enterprise Best Practice — Enforce Resource Floors with LimitRange + VMI Validation Webhook

Problem: Individual teams submit VMI specs with dangerously low memory. No guardrails exist at admission time.

Solution: Deploy a LimitRange in each VM namespace AND enforce a KubeVirt ValidatingWebhookConfiguration (or OPA/Gatekeeper policy) that rejects VMIs below safe thresholds.

# LimitRange for VM namespaces
 apiVersion: v1
 kind: LimitRange
 metadata:
   name: kubevirt-vm-limits
   namespace: vm-workloads
 spec:
   limits:
-  # No limits defined — any value accepted
+  - type: Container
+    min:
+      memory: 512Mi
+      cpu: "250m"
+    default:
+      memory: 2Gi
+      cpu: "1"
+    defaultRequest:
+      memory: 2Gi
+      cpu: "1"

# OPA Rego — Gatekeeper ConstraintTemplate excerpt
- # No policy — virt-launcher OOMKills silently
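+ violation[{"msg": msg}] {
+   input.review.object.kind == "VirtualMachineInstance"
+   mem := input.review.object.spec.domain.resources.requests.memory
+   # units.parse_bytes understands M, Mi, Gi, and similar suffixes;
+   # stripping "Mi" by hand would silently skip values such as 64M
+   units.parse_bytes(mem) < 512 * 1024 * 1024
+   msg := "VMI memory request below 512Mi; virt-launcher will OOMKill"
+ }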

Fix — KVM Not Available on Node

# On the suspect node
ls -la /dev/kvm
# If missing: nested virtualization is disabled in the hypervisor or BIOS

# AWS: use bare-metal (.metal) instance types, which expose /dev/kvm
# GKE: use n2 or c2 node pools created with --enable-nested-virtualization

# Verify KubeVirt node labels
kubectl get nodes -l kubevirt.io/schedulable=true

# If your VMI was scheduled to a non-KVM node due to missing nodeSelector:
 spec:
+  nodeSelector:
+    kubevirt.io/schedulable: "true"
   domain:
     ...
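
The manual `ls -la /dev/kvm` check can be wrapped so that it also reports a permissions problem, which looks different from a missing device. A small sketch to run on the node itself; `kvm_status` and its three labels are hypothetical.

```shell
# Hypothetical helper: report whether a KVM device node exists and is accessible.
# Defaults to /dev/kvm; pass another path to check a different device node.
kvm_status() {
  dev="${1:-/dev/kvm}"
  if [ ! -c "$dev" ]; then
    echo "missing"        # no character device at this path
  elif [ -r "$dev" ] && [ -w "$dev" ]; then
    echo "usable"         # present and readable/writable by the current user
  else
    echo "no-access"      # present, but permissions block the current user
  fi
}

# Example: kvm_status
```

Note that the result reflects the user running the check, which may differ from the user the virt-launcher container runs as.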

Prevention in CI/CD

Gate broken VMI specs before they hit the cluster.

1. Checkov — Static VMI YAML Scanning

pip install checkov
checkov -f vmi-manifest.yaml --framework kubernetes
# Custom check: flag any VMI with memory request < 512Mi

2. Kyverno Policy — Enforce KVM Node Affinity

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-kvm-node-selector
spec:
  validationFailureAction: enforce
  rules:
    - name: check-kvm-nodeselector
      match:
        resources:
          kinds: [VirtualMachineInstance]
      validate:
        message: "VMI must target kubevirt.io/schedulable=true nodes"
        pattern:
          spec:
            nodeSelector:
              kubevirt.io/schedulable: "true"

3. Pre-commit Hook — Validate PVC References Exist

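#!/bin/bash
# pre-commit: validate that every PVC referenced in the VMI spec exists
set -euo pipefail
NAMESPACE="${NAMESPACE:?set NAMESPACE to the target namespace}"
PVC_NAMES=$(kubectl get pvc -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}')
# Only look at volumes that reference a PVC (skips containerDisk, cloudInit, etc.)
REFS=$(yq e '.spec.volumes[] | select(has("persistentVolumeClaim")) | .persistentVolumeClaim.claimName' vmi-manifest.yaml)
for REF in $REFS; do
  if [[ ! " $PVC_NAMES " =~ " $REF " ]]; then
    echo "ERROR: PVC '$REF' not found in namespace $NAMESPACE" >&2
    exit 1
  fi
done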

4. Alerting — PagerDuty/AlertManager Rule

- alert: KubeVirtVMINotReady
  expr: kubevirt_vmi_phase_count{phase="Scheduling"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "VMI stuck in Scheduling phase for >5min — likely virt-launcher crash"
    runbook: "https://your-wiki/kubevirt-virt-launcher-crash"