Fixing EFS CSI Driver Mount Timeout Errors in EKS: Connection Timed Out Troubleshooting Guide
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: The EFS CSI driver cannot establish an NFS v4.1 connection to the EFS mount target. The pod stays in ContainerCreating indefinitely, blocking all stateful workloads on that node.
- How to fix it: Verify security group rules allow TCP 2049 between node and EFS mount target, confirm the correct fileSystemId and subnet/AZ alignment, and ensure the node IAM role carries elasticfilesystem:ClientMount.
- Fast path: Drop your StorageClass, PV, PVC, and node IAM policy into the Client-Side Sandbox above to auto-diagnose and refactor the config without leaking credentials.
The Incident (What does the error mean?)
Raw error surfaced in pod events and EFS CSI node driver logs:
Warning FailedMount 3m kubelet
MountVolume.SetUp failed for volume "efs-pv" :
rpc error: code = Internal desc =
Could not mount "fs-0abc1234" at "/var/lib/kubelet/pods/<uid>/volumes/kubernetes.io~csi/efs-pv/mount":
mounting failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o tls fs-0abc1234:/ /var/lib/kubelet/...
Output: mount.nfs4: Connection timed out
Immediate consequence: The pod never reaches Running. If this is a Deployment, all replicas stall. If this is a StatefulSet (databases, message queues, logging agents), data ingestion stops entirely. The EFS volume is not corrupted — the data is safe — but it is completely inaccessible to the cluster until the network path is restored.
The Attack Vector / Blast Radius
This is a network reachability failure, not an EFS service outage. The blast radius depends on how many workloads share the PVC or StorageClass:
- Single PVC bound to multiple pods (ReadWriteMany): Every pod referencing that PV fails to start. A rolling deployment will stall mid-rollout, leaving you with zero healthy replicas if the old ReplicaSet was already scaled down.
- Node-level failure: If efs-utils is missing or the node's security group is misconfigured, every EFS-backed pod scheduled to that node will fail, not just one workload.
- Cascading scheduler pressure: Pods stuck in ContainerCreating are already bound to a node and keep holding their CPU and memory requests there, which can leave unrelated pods Pending.
- Silent CI/CD breakage: Helm/ArgoCD deploys report success (manifests applied) but the application is non-functional. Alerting gaps on ContainerCreating duration mean this can go undetected for hours.
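To gauge how far the blast radius currently reaches, the following sketch scans the JSON produced by kubectl get pods -A -o json for pods that have been waiting in ContainerCreating past a threshold. The function name stuck_creating_pods and the 5-minute default are illustrative assumptions; the fields read (status.containerStatuses[].state.waiting.reason and status.startTime) are standard Pod API fields.

```python
from datetime import datetime, timedelta, timezone


def stuck_creating_pods(pod_list: dict, now: datetime,
                        threshold: timedelta = timedelta(minutes=5)) -> list:
    """Return "namespace/name" for pods waiting in ContainerCreating longer than threshold.

    pod_list is the parsed output of `kubectl get pods -A -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # A pod counts as waiting if any container reports reason ContainerCreating.
        waiting = any(
            cs.get("state", {}).get("waiting", {}).get("reason") == "ContainerCreating"
            for cs in status.get("containerStatuses", [])
        )
        start = status.get("startTime")
        if not waiting or start is None:
            continue
        # startTime is an RFC 3339 UTC timestamp such as 2024-01-01T00:00:00Z.
        started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - started > threshold:
            meta = pod["metadata"]
            stuck.append(f"{meta['namespace']}/{meta['name']}")
    return stuck
```

One way to use it would be to pipe the kubectl output into a small script that calls stuck_creating_pods(json.load(sys.stdin), datetime.now(timezone.utc)) and prints the result.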
The four primary root causes, in order of frequency:
| Root Cause | Signal |
|---|---|
| Security group missing TCP 2049 inbound on EFS mount target SG | Connection timed out in mount output |
| EFS mount target not in same AZ/subnet as node | Timeout or No route to host |
| Missing IAM permission elasticfilesystem:ClientMount | Permission denied or TLS handshake failure |
| amazon-efs-utils not installed on node AMI | mount: unknown filesystem type 'efs' |
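The table can be turned into a first-pass triage helper. The sketch below matches the Signal strings against raw mount output and returns the corresponding root cause. The function name likely_root_cause and the pattern list are assumptions made for illustration; this is a heuristic over log text, not something the EFS CSI driver reports itself. Because a timeout appears in two rows of the table, that case returns both candidates.

```python
import re

# (pattern, root cause) pairs derived from the Signal column of the table.
# More specific signals are checked before the generic timeout.
ROOT_CAUSE_SIGNALS = [
    (re.compile(r"unknown filesystem type 'efs'", re.I),
     "amazon-efs-utils not installed on node AMI"),
    (re.compile(r"permission denied|tls handshake", re.I),
     "missing IAM permission elasticfilesystem:ClientMount"),
    (re.compile(r"no route to host", re.I),
     "EFS mount target not in same AZ/subnet as node"),
    (re.compile(r"connection timed out", re.I),
     "security group missing TCP 2049 inbound, or mount target AZ/subnet mismatch"),
]


def likely_root_cause(mount_output: str) -> str:
    """Map raw mount output to the most likely root cause from the table."""
    for pattern, cause in ROOT_CAUSE_SIGNALS:
        if pattern.search(mount_output):
            return cause
    return "unrecognized; inspect the EFS CSI node driver logs"
```

Feeding it the Output line from the incident above, likely_root_cause("Output: mount.nfs4: Connection timed out"), points at the security group and AZ checks first.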
How to Fix It (The Solution)
Basic Fix — Security Group Rule (Most Common Cause)
The EFS mount target's security group must allow inbound NFS from the node security group.
# Terraform: aws_security_group_rule for EFS mount target
- # No inbound rule defined (default deny)
+ resource "aws_security_group_rule" "efs_nfs_inbound" {
+   type                     = "ingress"
+   from_port                = 2049
+   to_port                  = 2049
+   protocol                 = "tcp"
+   security_group_id        = aws_security_group.efs_mt_sg.id
+   source_security_group_id = aws_security_group.eks_node_sg.id # nodes only, not 0.0.0.0/0
+   description              = "Allow NFS from EKS nodes to EFS mount target"
+ }
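Before changing Terraform, it can help to confirm that the rule really is missing. The sketch below inspects one security group entry shaped like the output of aws ec2 describe-security-groups (the IpPermissions, FromPort, ToPort, IpProtocol, and UserIdGroupPairs fields) and reports whether the node security group is allowed in on TCP 2049. The function name allows_nfs_from is hypothetical, and the check deliberately ignores CIDR-based rules, so a False result only means no security-group-referencing rule was found.

```python
NFS_PORT = 2049


def allows_nfs_from(efs_sg: dict, node_sg_id: str) -> bool:
    """Return True if efs_sg has an inbound rule admitting node_sg_id on TCP 2049.

    efs_sg is one element of SecurityGroups from `aws ec2 describe-security-groups`.
    Only rules that reference a source security group are considered.
    """
    for perm in efs_sg.get("IpPermissions", []):
        proto = perm.get("IpProtocol")
        if proto == "-1":
            # "-1" means all protocols and all ports.
            port_ok = True
        elif proto == "tcp":
            port_ok = perm.get("FromPort", 0) <= NFS_PORT <= perm.get("ToPort", -1)
        else:
            port_ok = False
        if not port_ok:
            continue
        if any(pair.get("GroupId") == node_sg_id
               for pair in perm.get("UserIdGroupPairs", [])):
            return True
    return False
```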
Basic Fix — StorageClass fileSystemId and AZ Binding
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
-   provisioningMode: efs-ap
-   fileSystemId: fs-WRONG123
-   directoryPerms: "700"
+   provisioningMode: efs-ap
+   fileSystemId: fs-0abc1234 # must match your actual EFS FS ID
+   directoryPerms: "700"
+   uid: "1000"
+   gid: "1000"
+   basePath: "/dynamic_provisioning" # explicit base path avoids root permission issues
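The AZ half of this fix can be checked mechanically. The sketch below takes the MountTargets list from aws efs describe-mount-targets and the set of zones your nodes run in (for example, collected from the topology.kubernetes.io/zone node label) and returns the zones that have no available mount target for the file system named in the StorageClass. The function name zones_without_mount_target is an assumption for illustration; FileSystemId, AvailabilityZoneName, and LifeCycleState are fields of the describe-mount-targets response.

```python
def zones_without_mount_target(mount_targets: list, file_system_id: str,
                               node_zones: set) -> set:
    """Return node zones that lack an available mount target for file_system_id."""
    covered = {
        mt.get("AvailabilityZoneName")
        for mt in mount_targets
        if mt.get("FileSystemId") == file_system_id
        and mt.get("LifeCycleState") == "available"
    }
    return set(node_zones) - covered
```

An empty result means every node zone has a local mount target. A result equal to the full set of node zones usually suggests the fileSystemId itself is wrong rather than an AZ gap.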
Enterprise Best Practice — IAM Role with Least-Privilege EFS Policy
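# IAM policy attached to EKS node instance role or IRSA role
# The AccessedViaMountTarget condition restricts client access to mount targets only.
# DescribeMountTargets is a control-plane call that does not carry that condition key,
# so it sits in its own statement without the condition.
{
  "Version": "2012-10-17",
  "Statement": [
-     {
-       "Effect": "Allow",
-       "Action": "elasticfilesystem:*",
-       "Resource": "*"
-     }
+     {
+       "Effect": "Allow",
+       "Action": [
+         "elasticfilesystem:ClientMount",
+         "elasticfilesystem:ClientWrite"
+       ],
+       "Resource": "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-0abc1234",
+       "Condition": {
+         "Bool": {
+           "elasticfilesystem:AccessedViaMountTarget": "true"
+         }
+       }
+     },
+     {
+       "Effect": "Allow",
+       "Action": "elasticfilesystem:DescribeMountTargets",
+       "Resource": "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-0abc1234"
+     }
  ]
}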
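To catch the wildcard pattern before it ships, a policy document can be linted as plain JSON. The sketch below returns the indexes of Allow statements that grant elasticfilesystem:* (or *) as an action, or that grant any EFS action on Resource "*". The function name overbroad_efs_statements is hypothetical, and the check is a simplification: it does not expand partial wildcards such as elasticfilesystem:Client*, nor does it evaluate NotAction or conditions.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action/Resource/Statement."""
    return value if isinstance(value, list) else [value]


def overbroad_efs_statements(policy: dict) -> list:
    """Return indexes of Allow statements with wildcard EFS actions or resources."""
    flagged = []
    for index, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        # IAM action names are case-insensitive, so compare in lowercase.
        actions = [a.lower() for a in _as_list(stmt.get("Action", []))]
        resources = _as_list(stmt.get("Resource", []))
        wildcard_action = any(a in ("*", "elasticfilesystem:*") for a in actions)
        efs_related = any(a == "*" or a.startswith("elasticfilesystem:") for a in actions)
        if wildcard_action or (efs_related and "*" in resources):
            flagged.append(index)
    return flagged
```

Run against the policy being replaced above, it flags statement 0. A policy scoped to specific actions and a specific file system ARN returns an empty list.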
Enterprise Best Practice — Managed Node Group with efs-utils Pre-installed
# EKS Managed Node Group launch template user data
- # No custom AMI or bootstrap — relies on default Amazon Linux 2
+ #!/bin/bash
+ yum install -y amazon-efs-utils # required for 'mount -t efs' and TLS support
+ /etc/eks/bootstrap.sh ${cluster_name} \
+ --container-runtime containerd \
+ --kubelet-extra-args '--node-labels=node.kubernetes.io/efs-ready=true'
Prevention in CI/CD
1. Checkov policy — block StorageClass without explicit fileSystemId:
# .checkov/custom_checks/efs_filesystem_id.yaml
metadata:
  name: "EFS StorageClass must declare explicit fileSystemId"
  id: "CKV_CUSTOM_EFS_001"
  category: "KUBERNETES"
definition:
  cond_type: "attribute"
  resource_types:
    - "StorageClass"
  attribute: "parameters.fileSystemId"
  operator: "exists"
2. OPA/Gatekeeper ConstraintTemplate — enforce EFS mount target AZ alignment:
package efsmountvalidation
violation[{"msg": msg}] {
  input.review.object.kind == "PersistentVolume"
  input.review.object.spec.csi.driver == "efs.csi.aws.com"
  not input.review.object.metadata.annotations["efs.csi.aws.com/mount-target-az"]
  msg := "EFS PV must annotate mount-target-az to enforce AZ-local mount target binding"
}
3. Terraform precondition block — validate SG rule existence before EFS mount target creation:
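resource "aws_efs_mount_target" "main" {
  file_system_id  = aws_efs_file_system.main.id
  subnet_id       = var.subnet_id
  security_groups = [aws_security_group.efs_mt_sg.id]
  lifecycle {
    precondition {
      # Referencing the rule makes Terraform create it before the mount target;
      # the port-range check fails the plan if the rule no longer covers 2049.
      condition = (
        aws_security_group_rule.efs_nfs_inbound.from_port <= 2049 &&
        aws_security_group_rule.efs_nfs_inbound.to_port >= 2049
      )
      error_message = "EFS mount target requires explicit TCP 2049 inbound rule from node SG before provisioning."
    }
  }
}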
4. GitHub Actions step — smoke-test NFS reachability before deploying EFS-dependent workloads:
- name: Verify EFS mount target reachability
  run: |
    nc -zv -w5 ${{ secrets.EFS_MOUNT_TARGET_IP }} 2049 || \
      (echo "FATAL: EFS port 2049 unreachable from runner VPC. Check SG rules." && exit 1)
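Where nc is not available on the runner, the same reachability probe can be sketched with Python's standard library. The function below attempts a plain TCP connection to port 2049 with a 5-second timeout, mirroring nc -zv -w5. The function name nfs_port_reachable is an assumption for illustration, and as with the nc step, the result is only meaningful when it runs from a host inside the VPC (for example, a self-hosted runner).

```python
import socket


def nfs_port_reachable(host: str, port: int = 2049, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

A successful connection only shows that the network path and security groups are open; it does not exercise IAM authorization or the TLS mount itself.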
Add a CloudWatch alarm on the AWS/EFS ClientConnections metric for your EFS file system (for example, alarming when it drops to 0 while workloads should be mounted) and a Kubernetes alert rule that fires when any pod stays in ContainerCreating for more than 5 minutes. That combination catches this class of failure before it becomes a full outage.