
How to Fix EBS CSI PVC Stuck in Pending: 'Waiting for a Volume to Be Created by External Provisioner'

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–30 mins depending on root cause

TL;DR

  • What broke: Your PVC is stuck in Pending because the EBS CSI external provisioner cannot fulfill the volume request — caused by a missing/misconfigured CSI driver, wrong provisioner field in the StorageClass, IAM role missing ec2:CreateVolume permissions, or an AZ topology mismatch.
  • How to fix it: Verify the ebs.csi.aws.com driver is running, confirm IRSA/node IAM permissions, and ensure your StorageClass references the correct provisioner and AZ topology constraints.

The Incident (What Does the Error Mean?)

You run kubectl describe pvc <your-pvc> and see:

Events:
  Type    Reason                Age                From                         Message
  ----    ------                ----               ----                         -------
  Normal  ExternalProvisioning  2m (x12 over 10m)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator

Immediate consequence: Every pod that mounts this PVC is stuck in Pending, so the Deployments, StatefulSets, and Jobs that own those pods are blocked. If this is a database StatefulSet (Postgres, MySQL, Kafka), your application is down. The control plane is not broken: the provisioner either never received a valid, actionable request or lacked the authority to execute it.


The Attack Vector / Blast Radius

This is a silent availability failure. The cluster appears healthy — nodes are Ready, the API server responds — but workloads never start. The blast radius depends on what was deploying:

  • StatefulSets: With the default OrderedReady pod management policy, the first replica stays Pending and the remaining replicas are never created. Quorum-based systems (etcd, Kafka, ZooKeeper) cannot elect a leader or accept writes.
  • Rollouts: A deployment rollout stalls. The old ReplicaSet keeps serving, but if it was already scaled down or this is a fresh deploy, you have zero running pods.
  • Cascading HPA failure: The HPA can raise the replica count, but every new pod that needs a volume also stays Pending. Traffic spikes meet no extra capacity.

The five most common root causes:

  1. CSI driver not installed or pods are CrashLooping: the ebs-csi-controller pods in kube-system are not Running.
  2. Wrong provisioner name in StorageClass — legacy kubernetes.io/aws-ebs instead of ebs.csi.aws.com.
  3. IRSA not configured / IAM policy missing — the CSI controller pod cannot call ec2:CreateVolume, ec2:DescribeVolumes, ec2:AttachVolume.
  4. Availability Zone topology mismatch — PVC requests a volume in us-east-1a but no schedulable node exists there.
  5. StorageClass does not exist or is named incorrectly in the PVC spec.
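
Root cause 4 has no dedicated fix section below, so here is one way to check it. This is a minimal sketch that assumes your nodes carry the standard topology.kubernetes.io/zone label. The helper names zone_has_node and list_node_zones are ours, not part of kubectl, and the check only confirms that a node exists in the zone, not that it is schedulable.

```shell
#!/usr/bin/env bash
# zone_has_node ZONE: reads one zone name per line on stdin and succeeds
# if ZONE appears in that list. (Hypothetical helper, not part of kubectl.)
zone_has_node() {
  local wanted="$1"
  # -x: match the whole line, -F: fixed string, -q: no output
  grep -qxF -- "$wanted"
}

# list_node_zones: prints the zone label of every node, one per line.
list_node_zones() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
}
```

Usage: list_node_zones | zone_has_node us-east-1a || echo "no node in us-east-1a". If the zone is empty, add a node group there or relax the topology constraint on the StorageClass.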

How to Fix It (The Solution)

Step 0 — Triage in 60 Seconds

# Is the CSI driver running?
kubectl get pods -n kube-system -l app=ebs-csi-controller

# What is the provisioner on your StorageClass?
kubectl get sc <your-storage-class> -o jsonpath='{.provisioner}'

# Full event log on the stuck PVC
kubectl describe pvc <pvc-name> -n <namespace>

# CSI controller logs
kubectl logs -n kube-system -l app=ebs-csi-controller -c csi-provisioner --tail=50
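
The output of those commands can be mapped to a likely root cause with a few string matches. The sketch below is a heuristic, not an exhaustive list of driver messages. The function name and the category labels are our own, and the patterns are message fragments that Kubernetes and AWS commonly emit in these situations.

```shell
#!/usr/bin/env bash
# classify_pvc_failure: reads `kubectl describe pvc` output and/or
# csi-provisioner logs on stdin and prints a best-guess root cause.
# (Hypothetical helper; the patterns are heuristics only.)
classify_pvc_failure() {
  local text
  text="$(cat)"
  case "$text" in
    *'storageclass.storage.k8s.io'*'not found'*)
      echo "storageclass-missing" ;;
    *UnauthorizedOperation*|*AccessDenied*)
      echo "iam-permissions" ;;
    *'waiting for first consumer to be created before binding'*)
      echo "no-consumer-pod-yet" ;;
    *'waiting for a volume to be created'*)
      # Only the generic message: the driver is probably absent or not running.
      echo "provisioner-not-responding" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Usage: kubectl describe pvc <pvc-name> -n <namespace> | classify_pvc_failure. The more specific patterns are tested first because the generic "waiting for a volume" event usually appears alongside them.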

Fix 1 — Wrong Provisioner in StorageClass (Most Common)

 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: gp3-ebs
- provisioner: kubernetes.io/aws-ebs
+ provisioner: ebs.csi.aws.com
 parameters:
-  type: gp2
+  type: gp3
+  encrypted: "true"
 reclaimPolicy: Retain
 volumeBindingMode: WaitForFirstConsumer
 allowVolumeExpansion: true

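Critical: volumeBindingMode: WaitForFirstConsumer is strongly recommended for EBS. With the default Immediate mode, the provisioner creates the volume before the scheduler has picked a node, so the volume can land in an AZ where the pod cannot run. With WaitForFirstConsumer, a PVC stays Pending until a pod that uses it is scheduled; that is expected behavior, not a fault.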
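
One practical catch: provisioner, parameters, and volumeBindingMode are immutable on an existing StorageClass, so applying the corrected manifest over the old object is rejected. The class has to be deleted and recreated. A minimal sketch, where the function name and the manifest path are placeholders:

```shell
#!/usr/bin/env bash
# recreate_storageclass NAME MANIFEST: delete the class if it exists, then
# apply the corrected manifest. (Hypothetical helper name.)
recreate_storageclass() {
  local name="$1" manifest="$2"
  kubectl delete storageclass "$name" --ignore-not-found &&
    kubectl apply -f "$manifest"
}
```

Usage: recreate_storageclass gp3-ebs gp3-ebs.yaml. Deleting the StorageClass object does not delete PersistentVolumes that were already provisioned from it.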


Fix 2 — Missing IAM Permissions (IRSA)

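The CSI controller pod must assume a role that allows, at minimum, the actions below (add the EC2 snapshot actions if you use VolumeSnapshots). AWS also publishes a managed policy, AmazonEBSCSIDriverPolicy, that can be attached instead of a hand-written one: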

 # IAM Policy attached to the CSI controller IRSA role
 {
   "Version": "2012-10-17",
   "Statement": [
     {
       "Effect": "Allow",
       "Action": [
-        "ec2:*"
+        "ec2:CreateVolume",
+        "ec2:DeleteVolume",
+        "ec2:AttachVolume",
+        "ec2:DetachVolume",
+        "ec2:DescribeVolumes",
+        "ec2:DescribeVolumeStatus",
+        "ec2:DescribeInstances",
+        "ec2:DescribeAvailabilityZones",
+        "ec2:CreateTags",
+        "ec2:ModifyVolume",
+        "ec2:DescribeVolumesModifications"
       ],
       "Resource": "*"
     }
   ]
 }

Annotate the CSI service account:

kubectl annotate serviceaccount ebs-csi-controller-sa \
  -n kube-system \
  eks.amazonaws.com/role-arn=arn:aws:iam::<ACCOUNT_ID>:role/AmazonEKS_EBS_CSI_DriverRole

Then restart the controller:

kubectl rollout restart deployment ebs-csi-controller -n kube-system
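
The annotation only takes effect if the role's trust policy lets that service account assume it through the cluster's OIDC provider. The sketch below prints such a trust policy. It assumes the default ebs-csi-controller-sa service account in kube-system; the function name is ours, and the account ID and issuer are inputs you supply.

```shell
#!/usr/bin/env bash
# render_irsa_trust_policy ACCOUNT_ID OIDC_PROVIDER
# OIDC_PROVIDER is the cluster's OIDC issuer URL without the https:// prefix.
# (Hypothetical helper; prints JSON to stdout.)
render_irsa_trust_policy() {
  local account_id="$1" oidc="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${account_id}:oidc-provider/${oidc}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${oidc}:aud": "sts.amazonaws.com",
          "${oidc}:sub": "system:serviceaccount:kube-system:ebs-csi-controller-sa"
        }
      }
    }
  ]
}
EOF
}
```

The issuer can be read with aws eks describe-cluster --name <cluster> --query cluster.identity.oidc.issuer --output text. Strip the https:// prefix before passing it in.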

Fix 3 — Install the EBS CSI Driver (If Missing)

Enterprise Best Practice — EKS Add-on (not Helm):

 # Terraform
 resource "aws_eks_addon" "ebs_csi" {
   cluster_name             = aws_eks_cluster.main.name
   addon_name               = "aws-ebs-csi-driver"
-  # Not configured — driver missing entirely
+  addon_version            = "v1.30.0-eksbuild.1"
+  service_account_role_arn = aws_iam_role.ebs_csi_irsa.arn
+  resolve_conflicts_on_update = "OVERWRITE"
 }
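
If the cluster is not managed with Terraform, the same add-on can be created from the AWS CLI. This is a sketch, not a full install procedure. The wrapper function name is ours; the subcommands and flags it calls are standard aws eks operations, and the cluster name and role ARN are values you supply.

```shell
#!/usr/bin/env bash
# install_ebs_csi_addon CLUSTER ROLE_ARN: create the managed add-on, then
# block until EKS reports it active. (Hypothetical wrapper name.)
install_ebs_csi_addon() {
  local cluster="$1" role_arn="$2"
  aws eks create-addon \
    --cluster-name "$cluster" \
    --addon-name aws-ebs-csi-driver \
    --service-account-role-arn "$role_arn" &&
  aws eks wait addon-active \
    --cluster-name "$cluster" \
    --addon-name aws-ebs-csi-driver
}
```

Usage: install_ebs_csi_addon <cluster-name> arn:aws:iam::<ACCOUNT_ID>:role/AmazonEKS_EBS_CSI_DriverRole. Without --addon-version, EKS selects a default version for the cluster's Kubernetes release.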

Fix 4 — PVC Referencing Non-Existent StorageClass

 apiVersion: v1
 kind: PersistentVolumeClaim
 metadata:
   name: postgres-data
 spec:
   accessModes:
     - ReadWriteOnce
   resources:
     requests:
       storage: 20Gi
-  storageClassName: standard
+  storageClassName: gp3-ebs
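
Before editing the PVC, it helps to confirm which StorageClass names exist in the cluster. A minimal sketch follows. The helper name is ours, and it expects the output of kubectl get storageclass -o name on stdin.

```shell
#!/usr/bin/env bash
# storageclass_exists NAME: reads `kubectl get storageclass -o name` output
# on stdin and succeeds if NAME is among the listed classes.
# (Hypothetical helper, not part of kubectl.)
storageclass_exists() {
  local name="$1"
  # Each input line looks like: storageclass.storage.k8s.io/<name>
  grep -qxF -- "storageclass.storage.k8s.io/${name}"
}
```

Usage: kubectl get storageclass -o name | storageclass_exists gp3-ebs || echo "gp3-ebs is missing". The whole-line match avoids false positives on names that merely share a prefix.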



Prevention in CI/CD

OPA/Gatekeeper Policy — Enforce Correct Provisioner

package k8s.storageclass

violation[{"msg": msg}] {
  input.review.object.kind == "StorageClass"
  input.review.object.provisioner != "ebs.csi.aws.com"
  msg := sprintf("StorageClass '%v' must use provisioner 'ebs.csi.aws.com', got '%v'",
    [input.review.object.metadata.name, input.review.object.provisioner])
}

violation[{"msg": msg}] {
  input.review.object.kind == "StorageClass"
  input.review.object.volumeBindingMode != "WaitForFirstConsumer"
  msg := "StorageClass must set volumeBindingMode: WaitForFirstConsumer for EBS"
}
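
For repositories without an admission controller in the loop, the same two rules can be approximated with a plain shell check in a pre-commit hook. This is a rough sketch, not a YAML parser: it assumes one StorageClass per file with unquoted top-level keys, and the function name is ours.

```shell
#!/usr/bin/env bash
# lint_storageclass FILE: approximates the two policy rules for a
# single-document StorageClass manifest. Prints each violation and returns
# non-zero if any rule fails. (Hypothetical helper; line-based matching only.)
lint_storageclass() {
  local file="$1" failed=0
  if ! grep -qE '^provisioner:[[:space:]]*ebs\.csi\.aws\.com[[:space:]]*$' "$file"; then
    echo "$file: provisioner must be ebs.csi.aws.com"
    failed=1
  fi
  if ! grep -qE '^volumeBindingMode:[[:space:]]*WaitForFirstConsumer[[:space:]]*$' "$file"; then
    echo "$file: volumeBindingMode must be WaitForFirstConsumer"
    failed=1
  fi
  return "$failed"
}
```

Usage: lint_storageclass k8s/storageclass.yaml || exit 1. Treat it as a fast first filter; the Gatekeeper policy remains the enforcement point.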

Checkov in CI Pipeline

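# .github/workflows/checkov.yml (step fragment)
- name: Scan Kubernetes manifests
  uses: bridgecrewio/checkov-action@master
  with:
    directory: k8s/
    framework: kubernetes
    # CKV_K8S_28 is Checkov's NET_RAW capability check, not a StorageClass
    # check. Enforcing the provisioner needs a custom policy; the directory
    # and check ID below are placeholders for your own policy.
    external_checks_dirs: .checkov/custom
    check: CKV2_CUSTOM_1
    soft_fail: false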

Terraform Pre-Commit

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: <pinned-release-tag>  # required by pre-commit; pin to a released tag
    hooks:
      - id: terraform_validate
      - id: terraform_tflint

This needs a custom tflint rule (or a plan-time policy check): assert that an aws_eks_addon for aws-ebs-csi-driver exists, and give every StatefulSet workload module a depends_on pointing at the add-on so it is applied first.

Smoke Test Post-Deploy

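#!/bin/bash
# ebs-smoke-test.sh: run this in your CD pipeline after cluster provisioning
# gp3-ebs uses WaitForFirstConsumer, so the claim binds only once a pod that
# mounts it is scheduled; the test therefore creates a consumer pod too.
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-smoke-pvc
  namespace: default
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3-ebs
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: ebs-smoke-pod
  namespace: default
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      volumeMounts: [{name: data, mountPath: /data}]
  volumes:
    - name: data
      persistentVolumeClaim: {claimName: ebs-smoke-pvc}
EOF

# Poll for up to 120s instead of a fixed sleep (needs kubectl 1.23 or newer).
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/ebs-smoke-pvc -n default --timeout=120s
STATUS=$(kubectl get pvc ebs-smoke-pvc -n default -o jsonpath='{.status.phase}')
if [ "$STATUS" != "Bound" ]; then
  echo "FATAL: EBS CSI smoke test failed. PVC status: $STATUS"
  kubectl describe pvc ebs-smoke-pvc -n default
  exit 1
fi
# Delete the pod first; a claim cannot be removed while a pod still uses it.
# With reclaimPolicy: Retain, the released PV and its EBS volume remain and
# must be cleaned up separately.
kubectl delete pod ebs-smoke-pod -n default
kubectl delete pvc ebs-smoke-pvc -n default
echo "EBS CSI provisioner: OK"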
