
Fixing EFS CSI Driver Mount Timeout Errors in EKS: Connection Timed Out Troubleshooting Guide

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: The EFS CSI driver cannot establish an NFS v4.1 connection to the EFS mount target — pod stays in ContainerCreating indefinitely, blocking all stateful workloads on that node.
  • How to fix it: Verify security group rules allow TCP 2049 between node and EFS mount target, confirm the correct fileSystemId and subnet/AZ alignment, and ensure the node IAM role carries elasticfilesystem:ClientMount.

The Incident (What does the error mean?)

Raw error surfaced in pod events and EFS CSI node driver logs:

Warning  FailedMount  3m  kubelet  
  MountVolume.SetUp failed for volume "efs-pv" :
  rpc error: code = Internal desc = 
  Could not mount "fs-0abc1234" at "/var/lib/kubelet/pods/<uid>/volumes/kubernetes.io~csi/efs-pv/mount":
  mounting failed: exit status 32
  Mounting command: mount
  Mounting arguments: -t efs -o tls fs-0abc1234:/ /var/lib/kubelet/...
  Output: mount.nfs4: Connection timed out

Immediate consequence: The pod never reaches Running. If this is a Deployment, all replicas stall. If this is a StatefulSet (databases, message queues, logging agents), data ingestion stops entirely. The EFS volume is not corrupted — the data is safe — but it is completely inaccessible to the cluster until the network path is restored.


The Attack Vector / Blast Radius

This is a network reachability failure, not an EFS service outage. The blast radius depends on how many workloads share the PVC or StorageClass:

  • Single PVC bound to multiple pods (ReadWriteMany): Every pod referencing that PV fails to start. A rolling deployment will stall mid-rollout, leaving you with zero healthy replicas if the old ReplicaSet was already scaled down.
  • Node-level failure: If efs-utils is missing or the node's security group is misconfigured, every EFS-backed pod scheduled to that node will fail, not just one workload.
  • Cascading resource pressure: Pods stuck in ContainerCreating are already bound to a node, so their CPU and memory requests stay reserved there. On a tightly packed cluster, that can leave unrelated pods Pending.
  • Silent CI/CD breakage: Helm/ArgoCD deploys report success (manifests applied) but the application is non-functional. Alerting gaps on ContainerCreating duration mean this can go undetected for hours.

The four primary root causes, in order of frequency:

Root Cause | Signal
Security group missing TCP 2049 inbound on EFS mount target SG | Connection timed out in mount output
EFS mount target not in same AZ/subnet as node | Timeout or No route to host
Missing IAM permission elasticfilesystem:ClientMount | Permission denied or TLS handshake failure
amazon-efs-utils not installed on node AMI | mount: unknown filesystem type 'efs'
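
As a rough triage aid, the signals in the table can be matched against the mount output pulled from pod events. The sketch below is illustrative only: the function name and the substring list are assumptions built from the table above, not part of the EFS CSI driver, and an ambiguous timeout still needs manual checking of both the security group and the AZ.

```python
# Substrings come from the root-cause table; matching is case-insensitive.
# More specific signals are listed first so they win over the generic timeout.
SIGNALS = [
    ("unknown filesystem type 'efs'", "amazon-efs-utils not installed on the node AMI"),
    ("no route to host", "EFS mount target not in the same AZ/subnet as the node"),
    ("permission denied", "missing IAM permission elasticfilesystem:ClientMount"),
    ("connection timed out", "security group missing TCP 2049 inbound, or AZ/subnet mismatch"),
]


def likely_cause(mount_output: str) -> str:
    """Return the most likely root cause for a failed EFS mount message."""
    text = mount_output.lower()
    for needle, cause in SIGNALS:
        if needle in text:
            return cause
    return "unrecognized: inspect the EFS CSI node driver logs"
```

Feeding it the Output line from the incident above points at the security group first, which matches the ordering by frequency.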

How to Fix It (The Solution)

Basic Fix — Security Group Rule (Most Common Cause)

The EFS mount target's security group must allow inbound NFS from the node security group.

# Terraform: aws_security_group_rule for EFS mount target
- # No inbound rule defined — default deny
+ resource "aws_security_group_rule" "efs_nfs_inbound" {
+   type                     = "ingress"
+   from_port                = 2049
+   to_port                  = 2049
+   protocol                 = "tcp"
+   security_group_id        = aws_security_group.efs_mt_sg.id
+   source_security_group_id = aws_security_group.eks_node_sg.id  # nodes only, not 0.0.0.0/0
+   description              = "Allow NFS from EKS nodes to EFS mount target"
+ }
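
Before changing Terraform, it can help to confirm that the rule really is missing. The sketch below assumes you already have the IpPermissions list for the mount target's security group (the shape returned by EC2 DescribeSecurityGroups) and checks whether the node security group is allowed on TCP 2049. The function name and the sample group ID are hypothetical, and CIDR-based rules are deliberately ignored here.

```python
NFS_PORT = 2049


def allows_nfs_from(ip_permissions: list, node_sg_id: str) -> bool:
    """True if any ingress rule admits TCP 2049 from the given security group."""
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            # "-1" means all protocols and all ports.
            port_ok = True
        elif proto == "tcp":
            port_ok = perm.get("FromPort", 0) <= NFS_PORT <= perm.get("ToPort", 65535)
        else:
            port_ok = False
        if not port_ok:
            continue
        # Only security-group sources are considered; CIDR ranges are skipped.
        if any(p.get("GroupId") == node_sg_id for p in perm.get("UserIdGroupPairs", [])):
            return True
    return False
```

A False result for the node security group is consistent with the Connection timed out signal and means the ingress rule above is the fix to apply.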

Basic Fix — StorageClass fileSystemId and AZ Binding

 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: efs-sc
 provisioner: efs.csi.aws.com
 parameters:
   provisioningMode: efs-ap
-  fileSystemId: fs-WRONG123
+  fileSystemId: fs-0abc1234          # must match your actual EFS FS ID
   directoryPerms: "700"
+  uid: "1000"
+  gid: "1000"
+  basePath: "/dynamic_provisioning"  # explicit base path avoids root permission issues
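
The AZ half of this check can be sketched as a filter over the mount targets of the file system. The example assumes input in the shape returned by EFS DescribeMountTargets (MountTargetId, AvailabilityZoneName, LifeCycleState); the function name and the sample IDs are hypothetical.

```python
def usable_mount_targets(mount_targets: list, node_az: str) -> list:
    """Return IDs of mount targets that are available in the node's AZ."""
    return [
        mt["MountTargetId"]
        for mt in mount_targets
        if mt.get("AvailabilityZoneName") == node_az
        and mt.get("LifeCycleState") == "available"
    ]
```

An empty result for a node's AZ means pods scheduled there have no local mount target, so either add a mount target in that subnet or constrain scheduling to AZs that have one.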

Enterprise Best Practice — IAM Role with Least-Privilege EFS Policy

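 # IAM policy attached to EKS node instance role or IRSA role.
 # The AccessedViaMountTarget condition enforces mount-target-only access. It is
 # scoped to the client actions and kept off DescribeMountTargets, which is a
 # control-plane API call rather than a mount.
 {
   "Version": "2012-10-17",
   "Statement": [
-    {
-      "Effect": "Allow",
-      "Action": "elasticfilesystem:*",
-      "Resource": "*"
-    }
+    {
+      "Effect": "Allow",
+      "Action": [
+        "elasticfilesystem:ClientMount",
+        "elasticfilesystem:ClientWrite"
+      ],
+      "Resource": "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-0abc1234",
+      "Condition": {
+        "Bool": {
+          "elasticfilesystem:AccessedViaMountTarget": "true"
+        }
+      }
+    },
+    {
+      "Effect": "Allow",
+      "Action": "elasticfilesystem:DescribeMountTargets",
+      "Resource": "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-0abc1234"
+    }
   ]
 }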
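
To catch the wildcard form before it ships, a small lint over the policy document is usually enough. The sketch below is a minimal example under the assumption that the policy is available as a plain JSON string (no inline comments); the function name is hypothetical and it only flags Allow statements that grant every action or every EFS action.

```python
import json


def overly_broad_efs_actions(policy_json: str) -> list:
    """Return wildcard actions granted by Allow statements in an IAM policy."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # IAM accepts a single statement object as well as a list.
        statements = [statements]
    findings = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            if action in ("*", "elasticfilesystem:*"):
                findings.append(action)
    return findings
```

A non-empty list is a reason to fail the pipeline and switch to the explicit action list shown in the diff.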

Enterprise Best Practice — Managed Node Group with efs-utils Pre-installed

 # EKS Managed Node Group launch template user data
- # No custom AMI or bootstrap — relies on default Amazon Linux 2
+ #!/bin/bash
+ yum install -y amazon-efs-utils   # required for 'mount -t efs' and TLS support
+ /etc/eks/bootstrap.sh ${cluster_name} \
+   --container-runtime containerd \
+   --kubelet-extra-args '--node-labels=node.kubernetes.io/efs-ready=true'



Prevention in CI/CD

1. Checkov policy — block StorageClass without explicit fileSystemId:

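# .checkov/custom_checks/efs_filesystem_id.yaml
# Uses Checkov's YAML custom-policy layout (metadata + definition). Confirm that
# your Checkov version evaluates YAML policies against Kubernetes manifests.
metadata:
  name: EFS StorageClass must declare explicit fileSystemId
  id: CKV_CUSTOM_EFS_001
  category: KUBERNETES
definition:
  cond_type: attribute
  resource_types:
    - StorageClass
  attribute: parameters.fileSystemId
  operator: exists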

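2. OPA/Gatekeeper ConstraintTemplate (Rego body): require a mount-target AZ annotation on EFS PVs. The annotation is treated here as a team convention for review purposes; the rule only checks that it is present, not that the AZ is correct: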

package efsmountvalidation

violation[{"msg": msg}] {
  input.review.object.kind == "PersistentVolume"
  input.review.object.spec.csi.driver == "efs.csi.aws.com"
  not input.review.object.metadata.annotations["efs.csi.aws.com/mount-target-az"]
  msg := "EFS PV must annotate mount-target-az to enforce AZ-local mount target binding"
}

3. Terraform precondition block: validate that the SG rule covers TCP 2049 before EFS mount target creation:

resource "aws_efs_mount_target" "main" {
  file_system_id  = aws_efs_file_system.main.id
  subnet_id       = var.subnet_id
  security_groups = [aws_security_group.efs_mt_sg.id]

  lifecycle {
    # Requires Terraform 1.2+. Referencing the rule here also orders its
    # creation before the mount target.
    precondition {
      condition = (
        aws_security_group_rule.efs_nfs_inbound.from_port <= 2049 &&
        aws_security_group_rule.efs_nfs_inbound.to_port >= 2049
      )
      error_message = "EFS mount target requires explicit TCP 2049 inbound rule from node SG before provisioning."
    }
  }
}

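4. GitHub Actions step: smoke-test NFS reachability before deploying EFS-dependent workloads. This needs a self-hosted runner inside the VPC, because GitHub-hosted runners cannot reach a private mount target IP: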

- name: Verify EFS mount target reachability
  run: |
    nc -zv -w5 ${{ secrets.EFS_MOUNT_TARGET_IP }} 2049 || \
      (echo "FATAL: EFS port 2049 unreachable from runner VPC. Check SG rules." && exit 1)
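
Where nc is not installed on the runner, the same TCP check can be sketched with the Python standard library. The function name and defaults below are assumptions for illustration; like the nc step, it only proves that a TCP connection to port 2049 opens, not that an NFS mount will succeed.

```python
import socket


def port_reachable(host: str, port: int = 2049, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

Call it with the mount target IP and fail the job when it returns False.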

Add a CloudWatch alarm on the EFS ClientConnections metric (for example, when it drops to zero for a file system that should have active mounts) and a Kubernetes alert rule that fires when any pod stays in ContainerCreating for more than 5 minutes. That combination catches this class of failure before it becomes a full outage.
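
If no alerting stack is in place yet, the ContainerCreating check can be approximated with a script over the output of kubectl get pods -A -o json. The sketch below is a stopgap under stated assumptions: the function name is hypothetical, and it uses pod creation time as a proxy for how long the pod has been waiting, which overstates the duration for pods that were slow to schedule.

```python
from datetime import datetime, timedelta, timezone


def stuck_pods(pod_list: dict, now: datetime,
               threshold: timedelta = timedelta(minutes=5)) -> list:
    """Return names of pods waiting in ContainerCreating longer than threshold."""
    stuck = []
    for pod in pod_list.get("items", []):
        # Kubernetes timestamps are RFC 3339 in UTC, e.g. 2024-01-01T00:00:00Z.
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        waiting = any(
            cs.get("state", {}).get("waiting", {}).get("reason") == "ContainerCreating"
            for cs in pod.get("status", {}).get("containerStatuses", [])
        )
        if waiting and now - created > threshold:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

Pass datetime.now(timezone.utc) as now when running it against a live cluster.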
