Why does EFS CSI mount fail with 'Connection timed out' even though the EFS file system exists?

Whether the file system exists in the console says nothing about whether a node can reach it. The timeout means the NFS client on the node cannot reach the EFS mount target on TCP port 2049. The most common cause is a missing inbound rule on the EFS mount target's security group that allows traffic from the EKS node security group. Verify with `aws ec2 describe-security-groups --group-ids <efs-mount-target-sg-id>` and confirm that an inbound TCP 2049 rule referencing the node SG exists.
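The check described above can be automated against the JSON that `aws ec2 describe-security-groups --output json` prints. The following is a minimal sketch, not an official tool: the function name `allows_nfs_from` is hypothetical, and it only inspects security-group-to-security-group references (`UserIdGroupPairs`), so CIDR-based rules are deliberately ignored.

```python
from typing import Any

NFS_PORT = 2049


def allows_nfs_from(sg_response: dict[str, Any], node_sg_id: str) -> bool:
    """Return True if any listed group allows inbound TCP 2049 from node_sg_id.

    sg_response is the parsed JSON of `aws ec2 describe-security-groups`.
    Only SG-to-SG references are checked; CIDR rules are out of scope here.
    """
    for group in sg_response.get("SecurityGroups", []):
        for perm in group.get("IpPermissions", []):
            protocol = perm.get("IpProtocol")
            if protocol == "-1":
                port_ok = True  # "-1" means all protocols and all ports
            elif protocol == "tcp":
                low, high = perm.get("FromPort"), perm.get("ToPort")
                port_ok = low is not None and high is not None and low <= NFS_PORT <= high
            else:
                continue  # udp, icmp, etc. cannot carry the NFS mount
            sources = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
            if port_ok and node_sg_id in sources:
                return True
    return False
```

Feed it the mount target SG's description and the node SG ID; a `False` result points at the missing rule before you start digging into routing or DNS.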

Does the EFS mount target need to be in the same Availability Zone as the EKS node?

Not strictly required, but strongly recommended. Cross-AZ NFS traffic adds latency and incurs inter-AZ data transfer charges. More critically, the file system's DNS name resolves to the mount target in the client's own AZ, so a node in an AZ without a mount target either fails to resolve the name or, if the mount helper falls back to a mount target IP in another AZ, pays the cross-AZ penalty on every request. Create one EFS mount target in each AZ used by your node groups. If an AZ cannot have one, keep EFS-backed pods out of it with node affinity on `topology.kubernetes.io/zone`; `topologySpreadConstraints` spreads pods across zones but does not pin them to a zone that has a mount target.
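To see which node AZs lack a usable mount target, you can compare the zones of your nodes with the output of `aws efs describe-mount-targets --output json`. This is a small sketch under stated assumptions: the helper name `uncovered_zones` is hypothetical, the node zone list is something you collect yourself (for example from the `topology.kubernetes.io/zone` node label), and only mount targets in the `available` lifecycle state count as coverage.

```python
from typing import Any, Iterable


def uncovered_zones(node_zones: Iterable[str], mount_targets: list[dict[str, Any]]) -> list[str]:
    """Return the node AZs that have no mount target in the 'available' state.

    mount_targets is the "MountTargets" list from `aws efs describe-mount-targets`.
    """
    covered = {
        target["AvailabilityZoneName"]
        for target in mount_targets
        if target.get("LifeCycleState") == "available"
    }
    return sorted(set(node_zones) - covered)
```

An empty list means every node group AZ has a local mount target; anything else names the zones where mounts will fail or go cross-AZ.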

Is amazon-efs-utils required when using the EFS CSI driver with TLS?

Yes. The EFS CSI driver delegates the actual mount operation to the `mount.efs` helper from the `amazon-efs-utils` package when TLS (`-o tls`) is specified. Without the helper, the mount fails with `mount: unknown filesystem type 'efs'`. For managed node groups using the default Amazon Linux 2 EKS-optimized AMI, `amazon-efs-utils` is pre-installed. For Bottlerocket or custom AMIs, you must install it via user data or bake it into your AMI. Alternatively, set `mountOptions: [nfsvers=4.1]` in your StorageClass to bypass efs-utils and mount directly through the kernel NFS client, at the cost of losing TLS encryption in transit.

Fixing EFS CSI Driver Mount Timeout Errors in EKS: Connection Timed Out Troubleshooting Guide

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: The EFS CSI driver cannot establish an NFS v4.1 connection to the EFS mount target — pod stays in ContainerCreating indefinitely, blocking all stateful workloads on that node.
How to fix it: Verify security group rules allow TCP 2049 between node and EFS mount target, confirm the correct fileSystemId and subnet/AZ alignment, and ensure the node IAM role carries elasticfilesystem:ClientMount.

The Incident (What does the error mean?)

Raw error surfaced in pod events and EFS CSI node driver logs:

Warning  FailedMount  3m  kubelet  
  MountVolume.SetUp failed for volume "efs-pv" :
  rpc error: code = Internal desc = 
  Could not mount "fs-0abc1234" at "/var/lib/kubelet/pods/<uid>/volumes/kubernetes.io~csi/efs-pv/mount":
  mounting failed: exit status 32
  Mounting command: mount
  Mounting arguments: -t efs -o tls fs-0abc1234:/ /var/lib/kubelet/...
  Output: mount.nfs4: Connection timed out

Immediate consequence: The pod never reaches Running. If this is a Deployment, all replicas stall. If this is a StatefulSet (databases, message queues, logging agents), data ingestion stops entirely. The EFS volume is not corrupted — the data is safe — but it is completely inaccessible to the cluster until the network path is restored.

The Attack Vector / Blast Radius

This is a network reachability failure, not an EFS service outage. The blast radius depends on how many workloads share the PVC or StorageClass:

Single PVC bound to multiple pods (ReadWriteMany): Every pod referencing that PV fails to start. A rolling deployment will stall mid-rollout, leaving you with zero healthy replicas if the old ReplicaSet was already scaled down.
Node-level failure: If efs-utils is missing or the node's security group is misconfigured, every EFS-backed pod scheduled to that node will fail, not just one workload.
Cascading capacity pressure: Pods stuck in ContainerCreating are already scheduled, so they keep holding their CPU and memory requests on the node. Unrelated pods can then stay Pending because the node has no allocatable capacity left for them.
Silent CI/CD breakage: Helm/ArgoCD deploys report success (manifests applied) but the application is non-functional. Alerting gaps on ContainerCreating duration mean this can go undetected for hours.

The four primary root causes, in order of frequency:

Root Cause	Signal
Security group missing TCP 2049 inbound on EFS mount target SG	`Connection timed out` in mount output
EFS mount target not in same AZ/subnet as node	Timeout or `No route to host`
Missing IAM permission `elasticfilesystem:ClientMount`	`Permission denied` or TLS handshake failure
`amazon-efs-utils` not installed on node AMI	`mount: unknown filesystem type 'efs'`
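The table above can be turned into a small triage helper that maps the text of a failed mount to the candidate root causes. This is an illustrative sketch only: the function name `likely_causes` and the substring matching are assumptions, and the signal strings are taken directly from the table rather than from any driver API.

```python
# (signal substring, root cause) pairs, kept in the table's frequency order.
SIGNALS = (
    ("connection timed out", "security group missing TCP 2049 inbound on the EFS mount target SG"),
    ("connection timed out", "EFS mount target not in the same AZ/subnet as the node"),
    ("no route to host", "EFS mount target not in the same AZ/subnet as the node"),
    ("permission denied", "missing IAM permission elasticfilesystem:ClientMount"),
    ("unknown filesystem type", "amazon-efs-utils not installed on the node AMI"),
)


def likely_causes(mount_output: str) -> list[str]:
    """Return candidate root causes for a mount error, most frequent first."""
    text = mount_output.lower()
    causes: list[str] = []
    for needle, cause in SIGNALS:
        if needle in text and cause not in causes:
            causes.append(cause)
    return causes
```

Passing the `Output:` line from the pod event gives an ordered checklist; an empty result means the error does not match any of the four signals and needs manual inspection.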

How to Fix It (The Solution)

Basic Fix — Security Group Rule (Most Common Cause)

The EFS mount target's security group must allow inbound NFS from the node security group.

# Terraform: aws_security_group_rule for EFS mount target
- # No inbound rule defined — default deny
+ resource "aws_security_group_rule" "efs_nfs_inbound" {
+   type                     = "ingress"
+   from_port                = 2049
+   to_port                  = 2049
+   protocol                 = "tcp"
+   security_group_id        = aws_security_group.efs_mt_sg.id
+   source_security_group_id = aws_security_group.eks_node_sg.id  # nodes only, not 0.0.0.0/0
+   description              = "Allow NFS from EKS nodes to EFS mount target"
+ }

Basic Fix — StorageClass fileSystemId and AZ Binding

 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: efs-sc
 provisioner: efs.csi.aws.com
 parameters:
-  provisioningMode: efs-ap
-  fileSystemId: fs-WRONG123
-  directoryPerms: "700"
+  provisioningMode: efs-ap
+  fileSystemId: fs-0abc1234          # must match your actual EFS FS ID
+  directoryPerms: "700"
+  uid: "1000"
+  gid: "1000"
+  basePath: "/dynamic_provisioning"  # explicit base path avoids root permission issues
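A wrong `fileSystemId` like the one in the diff above can be caught before the manifest is applied. The sketch below is a hedged example, not part of the driver: `storage_class_problems` is a hypothetical helper, the ID pattern `fs-` plus 8 to 40 lowercase hex characters is an assumption based on the pattern published in the EFS API reference, and `known_fs_ids` is a set you supply (for example from `aws efs describe-file-systems`).

```python
import re
from typing import Any

# Assumed ID shape: "fs-" followed by 8 to 40 lowercase hex characters.
FS_ID_PATTERN = re.compile(r"fs-[0-9a-f]{8,40}")


def storage_class_problems(storage_class: dict[str, Any], known_fs_ids: set[str]) -> list[str]:
    """Return human-readable problems with an EFS StorageClass manifest (parsed YAML)."""
    if storage_class.get("provisioner") != "efs.csi.aws.com":
        return []  # not an EFS StorageClass, nothing to check
    params = storage_class.get("parameters") or {}
    fs_id = params.get("fileSystemId")
    if not fs_id:
        return ["parameters.fileSystemId is missing"]
    if not FS_ID_PATTERN.fullmatch(fs_id):
        return [f"fileSystemId {fs_id!r} is not a well-formed EFS file system ID"]
    if fs_id not in known_fs_ids:
        return [f"fileSystemId {fs_id!r} does not match any known file system"]
    return []
```

Run against the two halves of the diff, `fs-WRONG123` is rejected on shape alone, while `fs-0abc1234` passes only if it is in the set of IDs that actually exist in the account.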

Enterprise Best Practice — IAM Role with Least-Privilege EFS Policy

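 # IAM policy attached to EKS node instance role or IRSA role
 # JSON does not allow inline comments, so the notes live here:
 #  - the Condition restricts client access to connections made through a mount target
 #  - DescribeMountTargets is a management API call that never carries that condition key,
 #    so it needs its own statement without the Condition
 {
   "Version": "2012-10-17",
   "Statement": [
-    {
-      "Effect": "Allow",
-      "Action": "elasticfilesystem:*",
-      "Resource": "*"
-    }
+    {
+      "Effect": "Allow",
+      "Action": [
+        "elasticfilesystem:ClientMount",
+        "elasticfilesystem:ClientWrite"
+      ],
+      "Resource": "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-0abc1234",
+      "Condition": {
+        "Bool": {
+          "elasticfilesystem:AccessedViaMountTarget": "true"
+        }
+      }
+    },
+    {
+      "Effect": "Allow",
+      "Action": "elasticfilesystem:DescribeMountTargets",
+      "Resource": "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-0abc1234"
+    }
   ]
 }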
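The over-broad `elasticfilesystem:*` grant shown in the "before" half of the diff is easy to detect mechanically. Here is a minimal sketch of such a lint: the function name `wildcard_efs_statements` is hypothetical, and it only flags `Allow` statements whose action is `*` or `elasticfilesystem:*`, which is narrower than a full IAM policy analysis.

```python
from typing import Any

BROAD_ACTIONS = {"*", "elasticfilesystem:*"}


def wildcard_efs_statements(policy: dict[str, Any]) -> list[int]:
    """Return the indexes of Allow statements that grant every EFS action."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM accepts a single statement object
        statements = [statements]
    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM accepts a single action string
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lower case.
        if any(action.lower() in BROAD_ACTIONS for action in actions):
            flagged.append(index)
    return flagged
```

A non-empty result in CI is a cue to replace the wildcard with the specific client actions the workload needs.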

Enterprise Best Practice — Managed Node Group with efs-utils Pre-installed

 # EKS Managed Node Group launch template user data
- # No custom AMI or bootstrap — relies on default Amazon Linux 2
+ #!/bin/bash
+ yum install -y amazon-efs-utils   # required for 'mount -t efs' and TLS support
+ /etc/eks/bootstrap.sh ${cluster_name} \
+   --container-runtime containerd \
+   --kubelet-extra-args '--node-labels=node.kubernetes.io/efs-ready=true'


Prevention in CI/CD

1. Checkov policy — block StorageClass without explicit fileSystemId:

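# .checkov/custom_checks/efs_filesystem_id.yaml
# Written against Checkov's YAML custom-policy schema (metadata + definition);
# verify that your Checkov version evaluates YAML policies for Kubernetes manifests.
metadata:
  name: "EFS StorageClass must declare explicit fileSystemId"
  id: "CKV_CUSTOM_EFS_001"
  category: "KUBERNETES"
definition:
  cond_type: attribute
  resource_types:
    - StorageClass
  attribute: parameters.fileSystemId
  operator: exists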

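2. OPA/Gatekeeper ConstraintTemplate: require an AZ annotation on every EFS PV. This is the Rego body only, and the `efs.csi.aws.com/mount-target-az` annotation is a team convention that the driver itself does not read: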

package efsmountvalidation

violation[{"msg": msg}] {
  input.review.object.kind == "PersistentVolume"
  input.review.object.spec.csi.driver == "efs.csi.aws.com"
  not input.review.object.metadata.annotations["efs.csi.aws.com/mount-target-az"]
  msg := "EFS PV must annotate mount-target-az to enforce AZ-local mount target binding"
}

3. Terraform precondition block — validate SG rule existence before EFS mount target creation:

resource "aws_efs_mount_target" "main" {
  file_system_id  = aws_efs_file_system.main.id
  subnet_id       = var.subnet_id
  security_groups = [aws_security_group.efs_mt_sg.id]

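  lifecycle {
    precondition {
      # efs_nfs_inbound is a single resource (no count/for_each), so length() on it
      # is not a meaningful check; assert that the rule's port range covers 2049 instead.
      condition = (
        aws_security_group_rule.efs_nfs_inbound.from_port <= 2049 &&
        aws_security_group_rule.efs_nfs_inbound.to_port >= 2049
      )
      error_message = "EFS mount target requires explicit TCP 2049 inbound rule from node SG before provisioning."
    }
  }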
}

4. GitHub Actions step: smoke-test NFS reachability before deploying EFS-dependent workloads. Mount target IPs are private, so this only works from a self-hosted runner inside the VPC:

- name: Verify EFS mount target reachability
  run: |
    nc -zv -w5 ${{ secrets.EFS_MOUNT_TARGET_IP }} 2049 || \
      (echo "FATAL: EFS port 2049 unreachable from runner VPC. Check SG rules." && exit 1)
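Where `nc` is not available on the runner image, the same reachability probe can be written with the Python standard library. This is a minimal sketch equivalent in spirit to the step above; the function name `nfs_port_reachable` is hypothetical, and the host you pass in is your own mount target IP or DNS name.

```python
import socket


def nfs_port_reachable(host: str, port: int = 2049, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection performs the TCP handshake; closing it right away is enough
        # to prove the security groups and routing allow NFS traffic.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unresolvable host names.
        return False
```

A `False` result here reproduces the same condition that surfaces as `Connection timed out` in the pod events, but minutes earlier and without scheduling a pod.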

Add a CloudWatch alarm on the EFS `ClientConnections` metric for your file system (for example, alarming when it drops to zero on a file system that should always be mounted) and a Kubernetes alert rule that fires when any pod stays in ContainerCreating for more than 5 minutes. Together they catch this class of failure before it becomes a full outage.
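If you do not yet have an alerting stack for the ContainerCreating condition, the same rule can be approximated by a script over `kubectl get pods -A -o json`. The sketch below is one possible reading of "stuck for more than 5 minutes", not a standard definition: the function name `stuck_in_container_creating` is hypothetical, and it measures the time since the pod's `PodScheduled` condition became true.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def stuck_in_container_creating(
    pod_list: dict[str, Any],
    now: datetime,
    threshold: timedelta = timedelta(minutes=5),
) -> list[str]:
    """Return namespace/name of pods waiting in ContainerCreating longer than threshold."""
    stuck = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        waiting = any(
            container.get("state", {}).get("waiting", {}).get("reason") == "ContainerCreating"
            for container in status.get("containerStatuses", [])
        )
        if not waiting:
            continue
        scheduled_at = next(
            (
                condition.get("lastTransitionTime")
                for condition in status.get("conditions", [])
                if condition.get("type") == "PodScheduled" and condition.get("status") == "True"
            ),
            None,
        )
        if scheduled_at is None:
            continue  # never scheduled, so it is Pending rather than ContainerCreating
        # Kubernetes serializes timestamps as RFC 3339 in UTC, e.g. 2024-05-01T12:00:00Z.
        started = datetime.strptime(scheduled_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - started > threshold:
            metadata = pod.get("metadata", {})
            stuck.append(f"{metadata.get('namespace')}/{metadata.get('name')}")
    return stuck
```

Call it with `now=datetime.now(timezone.utc)` from a cron job or pipeline step and page on a non-empty result.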