
Fixing AKS NodePressure Memory Eviction with Azure Disk CSI: Root Cause & Recovery Guide

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on node pool size

TL;DR

  • What broke: Kubelet on an AKS node hit the hard memory eviction threshold (memory.available<100Mi by default), evicting your pod. Azure Disk CSI VolumeAttachment objects remained orphaned, blocking rescheduling on a new node.
  • How to fix it: Set explicit resources.requests and resources.limits on every container, tune kubelet eviction thresholds via a custom KubeletConfig, and patch the stuck VolumeAttachment if the CSI disk won't detach.

The Incident (What Does the Error Mean?)

You will see this in kubectl describe pod <pod-name>:

Status:     Failed
Reason:     Evicted
Message:    The node was low on resource: memory.
            Threshold quantity: 100Mi, available: 42Mi.
            Container <name> was using 1.9Gi, request is 0.

And on the node:

kubectl describe node <node-name>

Conditions:
  Type                 Status
  MemoryPressure       True     # <-- kubelet has declared war
  DiskPressure         False
  PIDPressure          False
  Ready                False

Immediate consequence: Kubelet ranks pods for eviction by whether their memory usage exceeds their requests, then by pod priority, then by how far usage exceeds requests. In practice BestEffort pods (no requests or limits set) go first, then Burstable pods running above their requests, and Guaranteed pods last. If your pod had no resources.requests, it is BestEffort and is among the first to die. The Azure Disk CSI driver can then fail to detach the PersistentVolume cleanly because the pod was force-killed, leaving a VolumeAttachment object that is marked for deletion but still held by its finalizer, which blocks the pod from starting on a new node.
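To make the QoS ordering above concrete, here is a minimal Python sketch of how a pod's QoS class follows from its container resources. The function name qos_class and the input shape (a list of dicts mirroring spec.containers) are assumptions for illustration. The sketch ignores init containers and compares quantities as plain strings, so "1Gi" and "1024Mi" would count as different.

```python
def qos_class(containers):
    """Approximate the QoS class Kubernetes assigns to a pod.

    `containers` is a list of dicts shaped like the entries of a pod's
    spec.containers. Only cpu and memory are considered.
    """
    any_set = False
    all_guaranteed = True
    for c in containers:
        res = c.get("resources") or {}
        requests = dict(res.get("requests") or {})
        limits = res.get("limits") or {}
        # When only a limit is given, the request defaults to the limit.
        for name, value in limits.items():
            requests.setdefault(name, value)
        if requests or limits:
            any_set = True
        # Guaranteed needs cpu AND memory limits that equal the requests.
        for name in ("cpu", "memory"):
            if name not in limits or requests.get(name) != limits[name]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


evicted_pod = [{"name": "app"}]  # no resources block at all
print(qos_class(evicted_pod))
```

A container with no resources block, like the one in the eviction message above ("request is 0"), comes out as BestEffort.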


The Attack Vector / Blast Radius

This is not a single-pod problem. The cascade goes:

  1. One noisy-neighbor pod (no memory limit) balloons and consumes node memory.
  2. Kubelet evicts BestEffort pods — likely your stateful workloads with CSI disks, since those are often misconfigured.
  3. Azure Disk CSI VolumeAttachment gets stuck: the external-attacher sidecar in the CSI controller can't complete ControllerUnpublishVolume (the CSI call that makes the driver detach the disk through the Azure API) before the node is marked NotReady.
  4. Pod reschedule is blocked: the attach/detach controller still sees the PV attached to the dead node. The replacement pod gets scheduled but sits in ContainerCreating indefinitely with a Multi-Attach error for the volume.
  5. A PodDisruptionBudget does not help here: kubelet node-pressure eviction ignores PDBs, so if every replica of a Deployment sits on a node under pressure, they can all be evicted at once. PDBs only guard against voluntary disruptions such as drains and upgrades.
  6. Node pool autoscaler may spin up a new node, but the stuck VolumeAttachment prevents the disk from attaching to it, so you now have a new node and still a broken pod.

Blast radius: Full service outage for any stateful workload (databases, message queues, file processors) running on the affected node pool.


How to Fix It

Step 1: Unblock the Stuck VolumeAttachment (Immediate)

If your pod is stuck in ContainerCreating on a new node:

# Identify the stuck attachment
kubectl get volumeattachment

# Force-delete the finalizer — do this ONLY if the source node is confirmed dead/drained
kubectl patch volumeattachment <va-name> \
  -p '{"metadata":{"finalizers":null}}' \
  --type=merge

Warning: Only remove the finalizer if you have confirmed the Azure Disk is not actually attached to any running VM. Check in the Azure Portal under the disk's "Managed by" field.
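When many attachments are listed, picking out the stuck ones by eye is error-prone. The following is a rough Python sketch, not part of any official tooling, that filters the parsed output of kubectl get volumeattachment -o json. The function name stuck_attachments, the live_nodes argument, and the sample data are all made up for illustration. Only the field names (metadata.deletionTimestamp, spec.nodeName, spec.source.persistentVolumeName) come from the VolumeAttachment API.

```python
import json


def stuck_attachments(va_list, live_nodes):
    """Return (name, node, pv) for VolumeAttachments that look stuck.

    va_list: parsed output of `kubectl get volumeattachment -o json`.
    live_nodes: set of node names that are currently Ready.
    """
    stuck = []
    for item in va_list.get("items", []):
        meta = item.get("metadata", {})
        spec = item.get("spec", {})
        node = spec.get("nodeName")
        # Marked for deletion but still present: a finalizer is holding it.
        deleting = "deletionTimestamp" in meta
        # Bound to a node that no longer exists or is not Ready.
        orphaned = node not in live_nodes
        if deleting or orphaned:
            pv = spec.get("source", {}).get("persistentVolumeName")
            stuck.append((meta.get("name"), node, pv))
    return stuck


# Hypothetical sample data standing in for real kubectl output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "csi-aaa", "deletionTimestamp": "2024-01-01T00:00:00Z"},
   "spec": {"nodeName": "node-dead", "source": {"persistentVolumeName": "pvc-1"}}},
  {"metadata": {"name": "csi-bbb"},
   "spec": {"nodeName": "node-ok", "source": {"persistentVolumeName": "pvc-2"}}}
]}
""")
print(stuck_attachments(sample, live_nodes={"node-ok"}))
# prints [('csi-aaa', 'node-dead', 'pvc-1')]
```

The result is only a list of candidates. The warning above still applies: confirm the disk is detached on the Azure side before removing any finalizer.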

Step 2: Fix the Root Cause — Resource Requests & Limits

The BestEffort QoS class is the direct cause of priority eviction. Every container needs explicit requests.

 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: my-stateful-app
 spec:
   template:
     spec:
       containers:
       - name: app
         image: myrepo/app:1.4.2
-        # No resources defined — BestEffort QoS, evicted first
+        resources:
+          requests:
+            memory: "512Mi"
+            cpu: "250m"
+          limits:
+            memory: "1Gi"
+            cpu: "1000m"

Enterprise Best Practice: Use Guaranteed QoS by setting requests == limits for both CPU and memory on every container in the pod. Kubelet then evicts the pod only as a last resort.

         resources:
           requests:
-            memory: "512Mi"
-            cpu: "250m"
+            memory: "1Gi"
+            cpu: "500m"
           limits:
             memory: "1Gi"
-            cpu: "1000m"
+            cpu: "500m"
+        # requests == limits => Guaranteed QoS class

Step 3: Tune Kubelet Eviction Thresholds via AKS KubeletConfig

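The upstream default hard eviction threshold, memory.available<100Mi, leaves very little headroom on nodes running CSI workloads. Add a soft threshold above it so kubelet starts reclaiming memory gracefully before hitting the hard wall. AKS accepts only a limited set of kubelet settings through custom node configuration, so confirm that the eviction fields below are supported for your AKS version and API version before relying on them.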

 # Fragment of the node pool resource in an ARM template or Bicep file,
 # shown here in YAML form for readability:
 type: Microsoft.ContainerService/managedClusters/agentPools
 apiVersion: 2023-01-01
 properties:
   kubeletConfig:
-    # Default: no custom eviction thresholds
+    evictionSoft:
+      memory.available: "300Mi"
+      nodefs.available: "10%"
+    evictionSoftGracePeriod:
+      memory.available: "2m"
+      nodefs.available: "2m"
+    evictionHard:
+      memory.available: "150Mi"
+      nodefs.available: "5%"
+    evictionMaxPodGracePeriod: 90

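Apply via Azure CLI. The JSON file holds the same kubeletConfig keys. Depending on your CLI version, --kubelet-config may be accepted only when the node pool is created (az aks nodepool add), in which case you need to create a replacement pool: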

az aks nodepool update \
  --resource-group <rg> \
  --cluster-name <cluster> \
  --name <nodepool> \
  --kubelet-config kubelet-config.json
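How the soft and hard thresholds in this step interact is easier to see with numbers. Below is a simplified Python model, assuming kubelet evicts immediately once memory.available drops below the hard threshold, and evicts on the soft threshold only after memory.available has stayed below it for the whole grace period. The function first_eviction and the sample readings are hypothetical. Real kubelet behaviour also depends on its housekeeping interval and pod grace periods, which this sketch leaves out.

```python
def first_eviction(samples, soft_mi=300, hard_mi=150, soft_grace_s=120):
    """Return ("hard" | "soft", time_s) for the first eviction, or None.

    samples: list of (time_s, available_mi) readings in time order.
    Defaults mirror the values used in this step: soft 300Mi with a
    2 minute grace period, hard 150Mi.
    """
    below_soft_since = None
    for t, available_mi in samples:
        if available_mi < hard_mi:
            # Hard threshold: no grace period.
            return ("hard", t)
        if available_mi < soft_mi:
            if below_soft_since is None:
                below_soft_since = t
            if t - below_soft_since >= soft_grace_s:
                return ("soft", t)
        else:
            # Recovered above the soft threshold: the grace timer resets.
            below_soft_since = None
    return None


# Memory dips, recovers at t=90, then stays low from t=120 onwards.
readings = [(0, 400), (30, 280), (60, 260), (90, 310),
            (120, 290), (180, 250), (240, 200)]
print(first_eviction(readings))  # prints ('soft', 240)
```

In this trace the brief dip between t=30 and t=60 triggers nothing, because available memory recovers before the grace period elapses. That is the graceful behaviour the soft threshold is meant to buy.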

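Step 4: Add a PodDisruptionBudget

A PDB limits voluntary disruptions such as node drains and upgrades, which matters when you cordon and drain a pressured node during recovery. It does not stop kubelet node-pressure evictions.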

+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: my-stateful-app-pdb
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: my-stateful-app



Prevention in CI/CD

OPA/Gatekeeper Policy — Block BestEffort QoS at Admission

Deploy this ConstraintTemplate, plus a RequireResourceLimits constraint that matches Pods, to reject any pod whose containers lack a memory request:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requireresourcelimits
spec:
  crd:
    spec:
      names:
        kind: RequireResourceLimits
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requireresourcelimits
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.requests.memory
          msg := sprintf("Container '%v' missing memory request. BestEffort QoS is prohibited.", [container.name])
        }

Checkov — Scan IaC Before Apply

# In your CI pipeline (GitHub Actions, Azure DevOps)
# CKV_K8S_10: CPU requests, CKV_K8S_11: CPU limits,
# CKV_K8S_12: memory requests, CKV_K8S_13: memory limits
checkov -d ./k8s-manifests \
  --check CKV_K8S_10,CKV_K8S_11,CKV_K8S_12,CKV_K8S_13

Node Pool Sizing — Right-Size Before You Tune Thresholds

# Check memory requests vs allocatable (use kubectl top node for live usage)
kubectl describe node <node> | grep -A5 "Allocated resources"

# If memory requests exceed about 70% of allocatable memory, scale the node pool
az aks nodepool scale \
  --resource-group <rg> \
  --cluster-name <cluster> \
  --name <nodepool> \
  --node-count <n+2>

Rule of thumb: Keep total pod memory requests below 70% of node allocatable memory to leave headroom for kubelet, system daemons, and CSI driver sidecars (which themselves consume 50–150Mi per node).
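The 70% rule of thumb translates into a quick sizing calculation. This is a back-of-the-envelope Python sketch. The helper names to_bytes, request_ratio and nodes_needed are invented for this example, and the quantity parser handles only the common Ki/Mi/Gi/Ti and k/M/G/T suffixes. It assumes all nodes in the pool have the same allocatable memory.

```python
import math

_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as "512Mi" to bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def request_ratio(requested, allocatable):
    """Fraction of a node's allocatable memory already requested."""
    return to_bytes(requested) / to_bytes(allocatable)


def nodes_needed(total_requests, allocatable_per_node, target=0.70):
    """Smallest node count that keeps requests at or below the target."""
    usable = to_bytes(allocatable_per_node) * target
    return math.ceil(to_bytes(total_requests) / usable)


print(request_ratio("9Gi", "12Gi"))   # prints 0.75
print(nodes_needed("40Gi", "12Gi"))   # prints 5
```

With 40Gi of total requests and 12Gi allocatable per node, each node can carry 8.4Gi at the 70% target, so the pool needs five nodes.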
