Fixing AKS NodePressure Memory Eviction with Azure Disk CSI: Root Cause & Recovery Guide
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on node pool size
TL;DR
- What broke: Kubelet on an AKS node hit the hard memory eviction threshold (memory.available<100Mi by default), evicting your pod. Azure Disk CSI VolumeAttachment objects remained orphaned, blocking rescheduling on a new node.
- How to fix it: Set explicit resources.requests and resources.limits on every container, tune kubelet eviction thresholds via a custom KubeletConfig, and patch the stuck VolumeAttachment if the CSI disk won't detach.
- Shortcut: Use our Client-Side Sandbox below to auto-refactor your pod spec and node pool config without leaking your cluster credentials.
The Incident (What Does the Error Mean?)
You will see this in kubectl describe pod <pod-name>:
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory.
Threshold quantity: 100Mi, available: 42Mi.
Container <name> was using 1.9Gi, request is 0.
And on the node:
kubectl describe node <node-name>
Conditions:
Type Status
MemoryPressure True # <-- kubelet has declared war
DiskPressure False
PIDPressure False
Ready False
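Immediate consequence: Kubelet ranks pods for eviction by whether their memory usage exceeds their requests, then by pod priority, then by how far usage exceeds requests. In practice BestEffort pods (no requests or limits set) go first, then Burstable pods running above their requests, and Guaranteed pods last. If your pod had no resources.requests, it is BestEffort: any memory it uses exceeds its request of zero, so it is always among the first candidates. The Azure Disk CSI driver can then fail to detach the PersistentVolume cleanly because the pod was force-killed, leaving a VolumeAttachment object stuck in deletion with its finalizer still present, which blocks the pod from starting on a new node.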
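To see which pods on the pressured node sit in the first eviction tier, you can list them together with their QoS class. This is a minimal sketch: pods_by_qos and best_effort_only are hypothetical helper names (not kubectl features), and it assumes your kubeconfig is allowed to list pods in all namespaces.

```shell
# Hypothetical helper: list every pod scheduled on a node with its QoS class.
# Usage: pods_by_qos <node-name>
pods_by_qos() {
  node="$1"
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=${node}" \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass'
}

# Hypothetical helper: keep the header row plus only the BestEffort rows.
best_effort_only() {
  awk 'NR == 1 || $3 == "BestEffort"'
}
```

Run it as pods_by_qos <node-name> | best_effort_only. Any stateful workload that shows up in the output is a candidate for the fix in Step 2.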
The Attack Vector / Blast Radius
This is not a single-pod problem. The cascade goes:
- One noisy-neighbor pod (no memory limit) balloons and consumes node memory.
- Kubelet evicts BestEffort pods, which are likely your stateful workloads with CSI disks, since those are often misconfigured.
- The Azure Disk CSI VolumeAttachment gets stuck: the external-attacher sidecar in the CSI controller can't issue a ControllerUnpublishVolume to the Azure API fast enough before the node is marked NotReady.
- Pod reschedule is blocked: Kubernetes still sees the PV attached to the dead node. The pod sits in ContainerCreating or Pending indefinitely with a Multi-Attach error for the volume.
- If you have no PodDisruptionBudget, a rolling eviction can take down your entire Deployment simultaneously.
- The node pool autoscaler may spin up a new node, but the stuck VolumeAttachment prevents the disk from attaching to it, so you now have a new node and still a broken pod.
Blast radius: Full service outage for any stateful workload (databases, message queues, file processors) running on the affected node pool.
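To check whether the cascade above has reached the stuck-attachment stage, one option is to list VolumeAttachments together with the node they point at and whether they are already marked for deletion. The helper names below are hypothetical; the column paths are standard VolumeAttachment fields.

```shell
# Hypothetical helper: show each VolumeAttachment with its node, PV,
# attach status, and deletion timestamp (<none> when not being deleted).
list_volume_attachments() {
  kubectl get volumeattachment \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached,DELETING:.metadata.deletionTimestamp'
}

# Hypothetical helper: keep the header row plus the rows for one node.
# Usage: list_volume_attachments | attachments_on_node <node-name>
attachments_on_node() {
  awk -v node="$1" 'NR == 1 || $2 == node'
}
```

A row for the failed node that still shows ATTACHED as true, or a DELETING timestamp that keeps aging, points to the attachment that is blocking the reschedule.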
How to Fix It
Step 1: Unblock the Stuck VolumeAttachment (Immediate)
If your pod is stuck in ContainerCreating on a new node:
# Identify the stuck attachment
kubectl get volumeattachment
# Force-delete the finalizer — do this ONLY if the source node is confirmed dead/drained
kubectl patch volumeattachment <va-name> \
-p '{"metadata":{"finalizers":null}}' \
--type=merge
Warning: Only remove the finalizer if you have confirmed the Azure Disk is not actually attached to any running VM. Check in the Azure Portal under the disk's "Managed by" field.
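The same "Managed by" check can be done from the command line, which is handy during an incident. A minimal sketch, assuming you already know the disk's resource ID (for an Azure Disk CSI volume this is typically the PV's volume handle); disk_owner is a hypothetical helper name.

```shell
# Hypothetical helper: print the VM resource ID a managed disk is attached to,
# or "unattached" when the managedBy property is empty.
# Usage: disk_owner <disk-resource-id>
disk_owner() {
  owner="$(az disk show --ids "$1" --query managedBy --output tsv)"
  case "$owner" in
    # An empty result means no VM owns the disk; "None" is handled defensively
    # in case the CLI prints a literal null marker.
    ''|None) echo "unattached" ;;
    *) echo "$owner" ;;
  esac
}
```

Remove the finalizer only when this prints unattached, or when the VM it names is confirmed deleted or deallocated.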
Step 2: Fix the Root Cause — Resource Requests & Limits
The BestEffort QoS class is the direct cause of priority eviction. Every container needs explicit requests.
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: my-stateful-app
  spec:
    template:
      spec:
        containers:
        - name: app
          image: myrepo/app:1.4.2
-         # No resources defined: BestEffort QoS, evicted first
+         resources:
+           requests:
+             memory: "512Mi"
+             cpu: "250m"
+           limits:
+             memory: "1Gi"
+             cpu: "1000m"
Enterprise Best Practice: Use Guaranteed QoS by setting requests equal to limits for both CPU and memory on every container in the pod. This makes kubelet treat your pod as the last resort for eviction.
  resources:
    requests:
-     memory: "512Mi"
-     cpu: "250m"
+     memory: "1Gi"
+     cpu: "500m"
    limits:
      memory: "1Gi"
-     cpu: "1000m"
+     cpu: "500m"
+     # requests == limits => Guaranteed QoS class
Step 3: Tune Kubelet Eviction Thresholds via AKS KubeletConfig
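The default memory.available<100Mi hard eviction threshold is dangerously low for nodes running CSI workloads. Kubelet defines no soft threshold by default, so add one: kubelet then starts reclaiming memory gracefully before hitting the hard wall. Before applying this, confirm that your AKS version accepts eviction settings in kubeletConfig; the AKS custom node configuration documents only a fixed list of supported kubelet parameters.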
  # Apply via AKS node pool KubeletConfig in the ARM template or Bicep:
  apiVersion: "2023-01-01"
  type: Microsoft.ContainerService/managedClusters/agentPools
  properties:
    kubeletConfig:
-     # Default: no custom eviction thresholds
+     evictionSoft:
+       memory.available: "300Mi"
+       nodefs.available: "10%"
+     evictionSoftGracePeriod:
+       memory.available: "2m"
+       nodefs.available: "2m"
+     evictionHard:
+       memory.available: "150Mi"
+       nodefs.available: "5%"
+     evictionMaxPodGracePeriod: 90
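Apply via Azure CLI. The --kubelet-config flag takes a JSON file, so express the settings as JSON first. The AKS documentation describes this flag for az aks create and az aks nodepool add; check az aks nodepool update --help for your CLI version before relying on update:
az aks nodepool update \
  --resource-group <rg> \
  --cluster-name <cluster> \
  --name <nodepool> \
  --kubelet-config kubelet-config.json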
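Whichever way the thresholds are set, it is worth confirming what the kubelet on a node is actually running with, since the effective values can differ from what you expect. One way is the kubelet's configz endpoint, reached through the API server's node proxy. This sketch assumes your account is permitted to proxy to nodes and that jq is installed; both helper names are hypothetical.

```shell
# Hypothetical helper: build the API server path for a node's kubelet configz.
configz_path() {
  printf '/api/v1/nodes/%s/proxy/configz' "$1"
}

# Hypothetical helper: print the hard and soft eviction thresholds the
# kubelet on the given node reports. Requires jq.
# Usage: effective_eviction <node-name>
effective_eviction() {
  kubectl get --raw "$(configz_path "$1")" \
    | jq '.kubeletconfig | {evictionHard, evictionSoft}'
}
```

If evictionSoft comes back null, no soft threshold is active on that node.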
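Step 4: Add a PodDisruptionBudget
A PodDisruptionBudget limits voluntary disruptions such as node drains and autoscaler scale-downs. Kubelet node-pressure evictions bypass it, so treat it as a complement to Steps 2 and 3, not a substitute.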
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: my-stateful-app-pdb
+spec:
+  minAvailable: 1
+  selector:
+    matchLabels:
+      app: my-stateful-app
Prevention in CI/CD
OPA/Gatekeeper Policy — Block BestEffort QoS at Admission
Deploy this ConstraintTemplate to reject any pod whose containers lack a memory request. Gatekeeper enforces it only once a Constraint of kind RequireResourceLimits exists:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requireresourcelimits
spec:
  crd:
    spec:
      names:
        kind: RequireResourceLimits
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requireresourcelimits
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not container.resources.requests.memory
          msg := sprintf("Container '%v' missing memory request. BestEffort QoS is prohibited.", [container.name])
        }
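A ConstraintTemplate only defines the policy; Gatekeeper starts rejecting pods once a Constraint of the generated kind is created. The sketch below prints one such Constraint. The name pods-must-set-memory-request and the helper name constraint_manifest are hypothetical choices, and the match block assumes you want the rule applied to every Pod in the cluster.

```shell
# Hypothetical helper: print a Constraint for the RequireResourceLimits template.
constraint_manifest() {
  cat <<'EOF'
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RequireResourceLimits
metadata:
  name: pods-must-set-memory-request   # hypothetical name
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
EOF
}
```

Apply it with constraint_manifest | kubectl apply -f - after the template has been created. Switching enforcementAction to dryrun first lets you audit existing violations without blocking deployments.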
Checkov — Scan IaC Before Apply
# In your CI pipeline (GitHub Actions, Azure DevOps)
# CKV_K8S_10: CPU requests, CKV_K8S_11: CPU limits,
# CKV_K8S_12: memory requests, CKV_K8S_13: memory limits
# (comments must not follow a trailing backslash, or the command is cut short)
checkov -d ./k8s-manifests \
  --check CKV_K8S_10,CKV_K8S_11,CKV_K8S_12,CKV_K8S_13
Node Pool Sizing — Right-Size Before You Tune Thresholds
# Check actual memory consumption vs allocatable
kubectl describe node <node> | grep -A5 "Allocated resources"
# If requests > 80% of allocatable memory, scale the node pool
az aks nodepool scale \
--resource-group <rg> \
--cluster-name <cluster> \
--name <nodepool> \
--node-count <n+2>
Rule of thumb: Keep total pod memory requests below 70% of node allocatable memory to leave headroom for kubelet, system daemons, and CSI driver sidecars (which themselves consume 50–150Mi per node).
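The rule of thumb is simple integer arithmetic, so it can be turned into a check for a script or pipeline. A minimal sketch: within_headroom is a hypothetical helper that takes total memory requests and node allocatable memory in whole MiB (you would read both from the kubectl describe node output above), with the 70% ceiling as an overridable default.

```shell
# Hypothetical helper: succeed when requested memory is at or below a given
# percentage of allocatable memory. All sizes are whole MiB.
# Usage: within_headroom <requests_mib> <allocatable_mib> [max_percent]
within_headroom() {
  requests="$1"
  allocatable="$2"
  max_percent="${3:-70}"
  # Compare requests/allocatable <= max_percent/100 without floating point.
  [ $((requests * 100)) -le $((allocatable * max_percent)) ]
}
```

For example, within_headroom 9000 12000 fails because 9000 MiB is 75% of 12000 MiB, while within_headroom 8000 12000 succeeds at roughly 67%.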