Fixing Terraform EKS Cluster Active Timeout: Why Your Control Plane Never Becomes Ready
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Terraform's default 30-minute create waiter for aws_eks_cluster expired before the EKS control plane reached the ACTIVE state. The apply failed or hung with the cluster still in CREATING, blocking all node groups, Fargate profiles, and add-ons.
- How to fix it: Audit the IAM role trust policy, verify subnet tags, confirm the AWSServiceRoleForAmazonEKS service-linked role exists, and explicitly extend the Terraform timeouts block.
The Incident (What Does the Error Mean?)
Raw Terraform output:
Error: waiting for EKS Cluster (prod-cluster) to become active: timeout while waiting for state to become 'ACTIVE' (last state: 'CREATING', timeout: 30m0s)
with aws_eks_cluster.main,
on eks.tf line 12, in resource "aws_eks_cluster" "main":
12: resource "aws_eks_cluster" "main" {
AWS provisions the EKS control plane (API server, etcd, and networking fabric) asynchronously. Terraform polls the cluster status and waits for it to reach ACTIVE. If that does not happen within the configured create timeout (default: 30 minutes), Terraform aborts with the error above and marks the resource as tainted, so the next apply plans to destroy and recreate it. Downstream resources such as aws_eks_node_group, aws_eks_fargate_profile, and the CoreDNS and kube-proxy add-ons are never provisioned. Your pipeline is dead.
The Attack Vector / Blast Radius
This is rarely just a flaky AWS delay. In most production cases, the cluster is stuck in CREATING because of a hard blocker that AWS does not surface in the Terraform error:
- IAM role trust policy misconfiguration: The eks.amazonaws.com principal is missing from the trust policy, or the role lacks AmazonEKSClusterPolicy. AWS cannot assume the role to bootstrap the control plane, and the cluster never leaves CREATING.
- Missing service-linked role: AWSServiceRoleForAmazonEKS must exist in the account. Accounts creating their first EKS cluster often hit this. AWS will not create the role automatically during cluster provisioning if certain SCP restrictions are in place.
- Subnet/VPC misconfiguration: Subnets missing the expected tags (kubernetes.io/cluster/<name> set to shared or owned) or without sufficient free IP addresses cause the VPC resource controller to stall. On Kubernetes 1.19 and later, EKS itself no longer needs the cluster tag to create the control plane (it mainly drives load balancer subnet discovery), so check free IP addresses first.
- Service Control Policy (SCP) restrictions: An SCP that denies eks:CreateCluster or iam:PassRole to the caller usually fails the request immediately with an access-denied error. The harder case is when the API call succeeds (returns 200) but the backend provisioning is denied later, leaving the cluster in CREATING until the timeout.
Blast radius: Every environment that depends on this cluster — staging, prod, DR — is offline. If this is a terraform apply in a CI/CD pipeline, the state lock may persist, blocking all subsequent runs.
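The subnet blocker above can be caught before the cluster is ever created. The following is a minimal sketch, assuming Terraform 1.5 or later (for check blocks) and subnets that already exist when the plan runs. The check name eks_subnet_capacity is illustrative, and the 16-address threshold follows the AWS guidance that EKS needs at least six free addresses per subnet and recommends 16.

```hcl
# Sketch only. Assumes Terraform >= 1.5 and subnets that already exist.
variable "subnet_ids" {
  description = "Subnets handed to the EKS cluster (omit if already declared in your module)"
  type        = list(string)
}

# Look up each subnet so its free-IP count and AZ are available at plan time.
data "aws_subnet" "eks" {
  for_each = toset(var.subnet_ids)
  id       = each.value
}

# check blocks report failures as warnings; they do not stop the apply.
check "eks_subnet_capacity" {
  assert {
    condition     = alltrue([for s in data.aws_subnet.eks : s.available_ip_address_count >= 16])
    error_message = "At least one EKS subnet has fewer than 16 free IP addresses."
  }

  assert {
    condition     = length(distinct([for s in data.aws_subnet.eks : s.availability_zone])) >= 2
    error_message = "EKS cluster subnets must span at least two Availability Zones."
  }
}
```

If a warning is not strict enough, the same conditions can be moved into a lifecycle precondition on the aws_eks_cluster resource, which fails the plan instead.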
How to Fix It
Step 1: Check the Real Stuck Reason First
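# Check actual cluster status and any health issues EKS reports
aws eks describe-cluster --name prod-cluster \
  --query 'cluster.{Status:status,Issues:health.issues}'
# iam:PassRole is a permission check, not an API call, so CloudTrail has no PassRole event.
# Look at the CreateCluster events from the last hour and check errorCode/errorMessage instead.
# (date -d is GNU date syntax; on macOS/BSD use: date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateCluster \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)"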
Basic Fix — Extend the Timeout
If AWS is genuinely slow (large regions, capacity events), extend the waiter:
  resource "aws_eks_cluster" "main" {
    name     = var.cluster_name
    role_arn = aws_iam_role.eks_cluster.arn

+   timeouts {
+     create = "60m"
+     delete = "30m"
+   }

    vpc_config {
      subnet_ids = var.subnet_ids
    }
  }
Enterprise Best Practice — Fix the Root Cause (IAM + Subnet Tags + SLR)
  # IAM role trust policy: most common root cause
  data "aws_iam_policy_document" "eks_assume_role" {
    statement {
      effect = "Allow"
      principals {
        type        = "Service"
-       identifiers = ["ec2.amazonaws.com"]
+       identifiers = ["eks.amazonaws.com"]
      }
      actions = ["sts:AssumeRole"]
    }
  }

  resource "aws_iam_role" "eks_cluster" {
    name               = "eks-cluster-role"
    assume_role_policy = data.aws_iam_policy_document.eks_assume_role.json
  }

+ resource "aws_iam_role_policy_attachment" "eks_cluster_policy" {
+   role       = aws_iam_role.eks_cluster.name
+   policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
+ }

+ # Ensure the service-linked role exists. Creation fails if the role is already
+ # present in the account; in that case import it or drop this resource.
+ resource "aws_iam_service_linked_role" "eks" {
+   aws_service_name = "eks.amazonaws.com"
+   lifecycle {
+     ignore_changes = [description]
+   }
+ }

  # Subnet tags required for EKS VPC resource controller
  resource "aws_subnet" "private" {
    for_each   = var.private_subnets
    vpc_id     = aws_vpc.main.id
    cidr_block = each.value
    tags = {
-     Name = "private-${each.key}"
+     Name                                        = "private-${each.key}"
+     "kubernetes.io/cluster/${var.cluster_name}" = "shared"
+     "kubernetes.io/role/internal-elb"           = "1"
    }
  }

  resource "aws_eks_cluster" "main" {
    name     = var.cluster_name
    role_arn = aws_iam_role.eks_cluster.arn

+   depends_on = [
+     aws_iam_role_policy_attachment.eks_cluster_policy,
+     aws_iam_service_linked_role.eks,
+   ]

+   timeouts {
+     create = "60m"
+     delete = "30m"
+   }

    vpc_config {
      subnet_ids = [for s in aws_subnet.private : s.id]
+     endpoint_private_access = true
+     endpoint_public_access  = false
    }
  }
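One caveat with aws_iam_service_linked_role: creating it fails when AWSServiceRoleForAmazonEKS already exists in the account, which is the normal state once any cluster has been created there. A hedged sketch of one way to handle that is to gate the resource behind a flag. The variable name create_eks_service_linked_role is hypothetical and not part of the original configuration.

```hcl
# Hypothetical toggle: set to true only in accounts that have never had an EKS cluster.
variable "create_eks_service_linked_role" {
  description = "Create AWSServiceRoleForAmazonEKS (leave false if it already exists)"
  type        = bool
  default     = false
}

resource "aws_iam_service_linked_role" "eks" {
  # count = 0 skips creation when the role is already present in the account.
  count            = var.create_eks_service_linked_role ? 1 : 0
  aws_service_name = "eks.amazonaws.com"
}
```

A depends_on entry that names aws_iam_service_linked_role.eks stays valid with count, because depends_on refers to the resource as a whole rather than to a single instance.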
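Critical: The depends_on argument makes Terraform finish creating the IAM policy attachment and the service-linked role before it calls eks:CreateCluster. Without it, Terraform may fire the EKS API call before the role has its permissions, which can leave the cluster stalled in CREATING. Note that depends_on only orders the operations; it does not by itself wait out IAM's propagation delay (roughly 10 seconds) after the attachment is created.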
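Because resource ordering alone does not guarantee that a freshly attached policy has propagated, some teams add a short explicit pause. This is a sketch, assuming the hashicorp/time provider is acceptable in your module. The resource name wait_for_iam and the 15-second duration are illustrative choices, not AWS-documented values.

```hcl
terraform {
  required_providers {
    time = {
      source = "hashicorp/time"
    }
  }
}

# Pause after the policy attachment so the IAM change has time to propagate.
# The pause runs only when this resource is created, not on every apply.
resource "time_sleep" "wait_for_iam" {
  depends_on      = [aws_iam_role_policy_attachment.eks_cluster_policy]
  create_duration = "15s"
}
```

To take effect, time_sleep.wait_for_iam would need to be listed in the cluster's depends_on so that eks:CreateCluster is only called after the pause.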
Prevention in CI/CD
1. Checkov — Catch Missing IAM Attachments Pre-Apply
# .checkov.yml
# Note: these are cluster hardening checks; none of them verifies IAM policy attachments.
check:
  - CKV_AWS_58 # EKS cluster has secrets encryption enabled
  - CKV_AWS_37 # EKS control plane logging enabled for all log types
  - CKV_AWS_39 # EKS public endpoint disabled

checkov -d . --framework terraform --check CKV_AWS_58,CKV_AWS_37,CKV_AWS_39
2. OPA/Conftest Policy — Block Missing depends_on for EKS
# policy/eks_depends_on.rego
package terraform.eks

deny[msg] {
  resource := input.resource.aws_eks_cluster[_]
  not resource.depends_on
  msg := "aws_eks_cluster must declare depends_on including IAM role policy attachments."
}
3. terraform validate + aws iam simulate-principal-policy in Pipeline
# Simulate PassRole before apply to catch SCP blocks early
aws iam simulate-principal-policy \
  --policy-source-arn "arn:aws:iam::${ACCOUNT_ID}:role/ci-deploy-role" \
  --action-names iam:PassRole \
  --resource-arns "arn:aws:iam::${ACCOUNT_ID}:role/eks-cluster-role" \
  --query 'EvaluationResults[0].EvalDecision'
If this returns implicitDeny or explicitDeny, the CI role cannot pass the cluster role to EKS, and cluster creation will fail on every run. Fix the SCP or CI role permissions before terraform apply ever runs.
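The same simulation can also live inside the Terraform configuration, so a denied PassRole fails at plan time. This sketch assumes an AWS provider version that ships the aws_iam_principal_policy_simulation data source (check your provider's documentation before relying on it). It reuses the ci-deploy-role and eks-cluster-role names from the command above; the label ci_can_pass_role is illustrative.

```hcl
# Resolve the account ID instead of hard-coding it.
data "aws_caller_identity" "current" {}

# Assumes the provider version in use includes this data source.
data "aws_iam_principal_policy_simulation" "ci_can_pass_role" {
  action_names      = ["iam:PassRole"]
  policy_source_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/ci-deploy-role"
  resource_arns     = ["arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/eks-cluster-role"]

  lifecycle {
    # Fail the run if any simulated action is not allowed.
    postcondition {
      condition     = self.all_allowed
      error_message = "ci-deploy-role is not allowed to pass eks-cluster-role."
    }
  }
}
```

As with the CLI call, the identity running Terraform needs permission to call the IAM policy simulator for this to work.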