How to Fix Terraform 'Still Creating...' Hang and Creation Timeout Errors
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: Terraform's apply is blocked waiting for a cloud provider API to confirm resource creation. The provider never receives a COMPLETE status, so Terraform polls until the default timeout (usually 10–30 min) kills the run.
- How to fix it: Explicitly set timeouts blocks on the offending resource, audit IAM/VPC/dependency prerequisites that are silently blocking the provider API, and use TF_LOG=DEBUG to identify the exact stuck API call.
- Shortcut: Use our Client-Side Sandbox above to paste your failing .tf config. It will auto-diagnose the stalled resource and generate the corrected timeouts block and dependency chain.
The Incident (What Does the Error Mean?)
You see this in your terminal and it never moves:
aws_eks_cluster.main: Still creating... [10m0s elapsed]
aws_eks_cluster.main: Still creating... [20m0s elapsed]
╷
│ Error: waiting for EKS Cluster (prod-cluster) creation: timeout while waiting for state to become 'ACTIVE' (last state: 'CREATING', timeout: 30m0s)
│
│ with aws_eks_cluster.main,
│ on eks.tf line 12, in resource "aws_eks_cluster" "main":
│ 12: resource "aws_eks_cluster" "main" {
Immediate consequence: The Terraform state is now partially written. The resource may or may not exist in your cloud account. If you re-run apply, you risk a duplicate resource error or a state drift that requires manual terraform state rm surgery. In CI/CD pipelines, this kills the deployment job and leaves infrastructure in an unknown state — your next deploy will fail on a different error entirely.
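In a CI log it can help to pull the key facts out of this output automatically. The following is a minimal sketch, assuming the messages are worded exactly like the sample above; the regular expressions and the function names (duration_to_seconds, summarize_hang) are illustrative and not part of Terraform.

```python
import re
from typing import Dict, Iterable, Optional

# These patterns are assumptions modelled on the sample output above;
# other providers and resource types may word their messages differently.
PROGRESS_RE = re.compile(
    r"^(?P<address>\S+): Still creating\.\.\. \[(?P<elapsed>[0-9hms]+) elapsed\]"
)
TIMEOUT_RE = re.compile(
    r"timeout while waiting for state to become '(?P<target>[^']+)' "
    r"\(last state: '(?P<last>[^']+)', timeout: (?P<timeout>[0-9hms]+)\)"
)
DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def duration_to_seconds(text: str) -> int:
    """Convert a duration such as '20m0s' or '1h2m3s' into seconds."""
    match = DURATION_RE.match(text)
    if not text or match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def summarize_hang(lines: Iterable[str]) -> Dict[str, Optional[object]]:
    """Collect the stalled address, last elapsed time, and timeout details."""
    summary: Dict[str, Optional[object]] = {
        "address": None,
        "last_elapsed_s": None,
        "target_state": None,
        "last_state": None,
        "timeout_s": None,
    }
    for line in lines:
        progress = PROGRESS_RE.match(line.strip())
        if progress:
            summary["address"] = progress.group("address")
            summary["last_elapsed_s"] = duration_to_seconds(progress.group("elapsed"))
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            summary["target_state"] = timeout.group("target")
            summary["last_state"] = timeout.group("last")
            summary["timeout_s"] = duration_to_seconds(timeout.group("timeout"))
    return summary
```

Feeding it the apply output line by line yields the resource address, the state it stalled in, and the timeout that fired, which is usually enough to route the failure to the right owner.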
The Attack Vector / Blast Radius
This is not just a slow run. The blast radius is significant:
1. State drift risk. When the provider records the resource ID before the wait finishes, a timeout leaves the resource in state marked as tainted, even though the cloud resource may be half-provisioned or may finish creating a few minutes later. The next plan then proposes destroying and recreating it, which does not reflect reality.
2. Cascading dependency failures. Any resource with depends_on or an implicit reference to the timed-out resource is skipped. An EKS timeout blocks node groups, Helm releases, and DNS records downstream, so your entire stack is stalled.
3. Cloud-side resource limbo. AWS, GCP, and Azure do not roll back on Terraform timeout. The cloud resource continues its creation attempt. If the ID never reached state, you now have a resource being provisioned that Terraform does not track. This is unmanaged infrastructure: it incurs cost, bypasses your IaC governance, and is invisible to future terraform plan runs.
4. Root causes that make this worse:
- Missing VPC endpoints causing EKS/RDS API calls to route over the public internet and hit throttling
- IAM role propagation delay (the infamous AWS eventual consistency window — typically 10–15s, but can spike under load)
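- Security group or network rules that keep the resource from ever becoming healthy (for example, EKS nodes that cannot reach the control plane), so the status the provider polls never changes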
- Service quota exhaustion (vCPU limits, EIP limits) causing the cloud API to queue the request silently
- Incorrect depends_on ordering forcing Terraform to create resources before their prerequisites exist
How to Fix It (The Solution)
Basic Fix: Add Explicit timeouts Blocks and Increase Limits
The default timeouts in most providers are conservative. For complex resources (EKS, RDS Multi-AZ, GKE clusters), you must override them explicitly.
  resource "aws_eks_cluster" "main" {
    name     = "prod-cluster"
    role_arn = aws_iam_role.eks_cluster.arn

    vpc_config {
      subnet_ids = aws_subnet.private[*].id
    }

-   # No timeouts block: using provider default of 30 minutes
+   timeouts {
+     create = "60m"
+     update = "60m"
+     delete = "30m"
+   }
  }
But increasing the timeout is a band-aid. If the resource is genuinely stuck, waiting longer just delays the failure. You must identify why it's stuck.
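To see why a longer timeout only helps resources that are slow rather than stuck, here is a minimal sketch of a state waiter. It is not the provider's actual implementation; wait_for_state, WaitTimeout, FakeClock, and state_after are hypothetical names, and the clock is simulated so the example runs instantly.

```python
from typing import Callable


class WaitTimeout(Exception):
    """Raised when the target state is not reached before the deadline."""


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None],
    now: Callable[[], float],
) -> str:
    """Poll get_state until it returns target or timeout_s has elapsed."""
    deadline = now() + timeout_s
    while True:
        last = get_state()
        if last == target:
            return last
        if now() >= deadline:
            raise WaitTimeout(
                f"timeout while waiting for state to become '{target}' "
                f"(last state: '{last}')"
            )
        sleep(interval_s)


class FakeClock:
    """Simulated clock: sleeping advances time without real waiting."""

    def __init__(self) -> None:
        self.t = 0.0

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.t += seconds


def state_after(clock: FakeClock, ready_at_s: float) -> Callable[[], str]:
    """Fake cloud API: reports CREATING until ready_at_s, then ACTIVE."""
    return lambda: "ACTIVE" if clock.now() >= ready_at_s else "CREATING"
```

A resource that becomes ACTIVE after 40 minutes succeeds with a 60-minute timeout and fails with a 30-minute one. A resource that never leaves CREATING (ready_at_s set to float("inf")) raises WaitTimeout under both, just later.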
Diagnosing the Actual Cause
Run with full debug logging and filter for the API call that's looping:
# Enable full provider debug output
export TF_LOG=DEBUG
export TF_LOG_PATH="./terraform-debug.log"
terraform apply -auto-approve 2>&1 | tee apply-output.log
# Find the stuck polling loop
grep -E "(Retrying|Waiting|polling|CREATING|pending)" ./terraform-debug.log | tail -50
# Identify the exact AWS API call that's not returning COMPLETE
grep -E "(DescribeCluster|DescribeDBInstances|GetOperation)" ./terraform-debug.log | tail -20
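When the debug log is large, counting how often each polling call appears is a quick way to spot the loop. This is a rough sketch that only looks for the operation names from the grep above as whole words. It makes no assumption about the rest of the log line format, which varies between provider versions, and count_operations is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable, Sequence

# Operation names taken from the grep example above; extend as needed.
OPERATIONS = ("DescribeCluster", "DescribeDBInstances", "GetOperation")


def count_operations(
    lines: Iterable[str], operations: Sequence[str] = OPERATIONS
) -> Counter:
    """Count log lines that mention each operation name as a whole word."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, operations)) + r")\b")
    counts: Counter = Counter()
    for line in lines:
        # A line mentioning the same operation twice still counts once.
        for operation in set(pattern.findall(line)):
            counts[operation] += 1
    return counts


def most_polled(lines: Iterable[str]) -> str:
    """Return the operation that appears on the most lines, or '' if none."""
    ranked = count_operations(lines).most_common(1)
    return ranked[0][0] if ranked else ""
```

Typical usage is to open terraform-debug.log and pass the file object straight in, for example most_polled(open("terraform-debug.log", errors="replace")). The call that dominates the count is the one to investigate on the cloud side.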
Enterprise Best Practice: Fix the Root Cause — Dependency Ordering and IAM Propagation
The most common production cause is a race condition between IAM role creation and its first use. AWS IAM is eventually consistent. Terraform creates the role, immediately tries to use it, and the EKS/RDS service returns a validation error that causes silent retry loops.
  resource "aws_iam_role" "eks_cluster" {
    name               = "eks-cluster-role"
    assume_role_policy = data.aws_iam_policy_document.eks_assume_role.json
  }

  resource "aws_iam_role_policy_attachment" "eks_cluster_policy" {
    policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
    role       = aws_iam_role.eks_cluster.name
  }

+ # For persistent IAM propagation race conditions in automated pipelines.
+ # time_sleep comes from the hashicorp/time provider.
+ resource "time_sleep" "iam_propagation" {
+   depends_on      = [aws_iam_role_policy_attachment.eks_cluster_policy]
+   create_duration = "30s"
+ }
+
  resource "aws_eks_cluster" "main" {
    name     = "prod-cluster"
    role_arn = aws_iam_role.eks_cluster.arn

    vpc_config {
      subnet_ids              = aws_subnet.private[*].id
+     endpoint_private_access = true
+     endpoint_public_access  = false
    }

-   depends_on = [aws_iam_role_policy_attachment.eks_cluster_policy]
+   # Wait for the policy attachment and the IAM propagation delay
+   # before the first EKS API call
+   depends_on = [
+     aws_iam_role_policy_attachment.eks_cluster_policy,
+     time_sleep.iam_propagation,
+   ]
+   timeouts {
+     create = "60m"
+     update = "60m"
+     delete = "30m"
+   }
  }
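For RDS Multi-AZ or Aurora Global clusters, the subnet group and parameter group must be fully created before the instance. Terraform infers ordering only from references in the configuration, so a name passed as a hard-coded string creates no dependency edge. An explicit depends_on is redundant while the reference exists, but it protects the ordering if someone later replaces the reference with a literal: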
  resource "aws_db_instance" "primary" {
    identifier           = "prod-db"
    engine               = "postgres"
    instance_class       = "db.r6g.xlarge"
    multi_az             = true
    db_subnet_group_name = aws_db_subnet_group.main.name

+   depends_on = [
+     aws_db_subnet_group.main,
+     aws_db_parameter_group.main,
+     aws_security_group.rds
+   ]
+   timeouts {
+     create = "90m"
+     update = "80m"
+     delete = "60m"
+   }
  }
Recovering From a Timed-Out State
If the resource actually got created in AWS but Terraform timed out:
# Check whether the resource exists in AWS and what status it reports
aws eks describe-cluster --name prod-cluster --query 'cluster.status'

# Case 1: it exists and is ACTIVE
# If Terraform did not record it, import it into state
terraform import aws_eks_cluster.main prod-cluster
# If it is already in state but marked tainted, clear the taint instead
terraform untaint aws_eks_cluster.main

# Case 2: it's stuck in CREATING. You must delete it from AWS first
aws eks delete-cluster --name prod-cluster
# Then remove the ghost entry from state
terraform state rm aws_eks_cluster.main

# Now re-apply with the fixed configuration
terraform apply
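The commands above are branches, not one script to run top to bottom. The sketch below encodes that decision as a small function. It is an illustration under assumptions: recovery_steps is a hypothetical helper, cloud_status is whatever describe-cluster reported (None when the cluster does not exist), and the untaint branch assumes the timed-out resource was saved to state as tainted.

```python
from typing import List, Optional


def recovery_steps(
    cloud_status: Optional[str],
    tracked_in_state: bool,
    address: str = "aws_eks_cluster.main",
    name: str = "prod-cluster",
) -> List[str]:
    """Return the commands to run, in order, for one recovery scenario."""
    if cloud_status == "ACTIVE":
        if tracked_in_state:
            # Assumption: the entry is in state but tainted after the timeout.
            return [f"terraform untaint {address}"]
        return [f"terraform import {address} {name}"]

    steps: List[str] = []
    if cloud_status is not None:
        # The resource exists but never became usable: remove it cloud-side.
        steps.append(f"aws eks delete-cluster --name {name}")
    if tracked_in_state:
        # Only remove the state entry if there is one; state rm errors otherwise.
        steps.append(f"terraform state rm {address}")
    steps.append("terraform apply")
    return steps
```

For example, recovery_steps("CREATING", True) returns the delete, state rm, and apply commands in that order, while recovery_steps("ACTIVE", False) returns only the import. Running terraform plan afterwards is a cheap way to confirm that state and reality agree.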
Prevention in CI/CD
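1. Enforce timeouts blocks with a custom Checkov policy. The built-in CKV_TF_1 check targets module source pinning, not timeouts, so load a custom check alongside your usual scans:
# Add to your CI pipeline
checkov -d . --external-checks-dir checkov/custom --check CKV_CUSTOM_TIMEOUT_001 --compact
tfsec . --minimum-severity HIGH
The custom Checkov policy below fails pipelines on resources missing timeouts blocks: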
# checkov/custom/check_timeouts.py
# The directory also needs an (empty) __init__.py so Checkov can load it.
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories

RESOURCES_REQUIRING_TIMEOUTS = [
    "aws_eks_cluster", "aws_rds_cluster", "aws_db_instance",
    "google_container_cluster", "azurerm_kubernetes_cluster"
]


class TerraformTimeoutCheck(BaseResourceCheck):
    def __init__(self):
        super().__init__(
            name="Ensure long-running resources define explicit timeouts",
            id="CKV_CUSTOM_TIMEOUT_001",
            categories=[CheckCategories.GENERAL_SECURITY],
            supported_resources=RESOURCES_REQUIRING_TIMEOUTS
        )

    def scan_resource_conf(self, conf):
        # Fail when the timeouts block is absent or empty
        if "timeouts" not in conf or not conf["timeouts"]:
            return CheckResult.FAILED
        return CheckResult.PASSED


# Checkov registers a check when the module instantiates it
check = TerraformTimeoutCheck()
2. Set Terraform apply timeouts at the CI job level — never let a hung apply block your pipeline runner indefinitely:
# .github/workflows/terraform.yml
- name: Terraform Apply
  timeout-minutes: 90  # Hard kill at the CI level
  run: terraform apply -auto-approve
  env:
    TF_LOG: WARN  # Capture warnings without full debug noise in CI
3. Use OPA/Conftest to validate dependency ordering before apply:
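# policy/terraform_deps.rego
# Input: terraform show -json plan.tfplan > plan.json
# Run:   conftest test plan.json --namespace terraform.eks
package terraform.eks

# depends_on is recorded in the plan's configuration section, not in
# resource_changes. This rule covers root-module resources only.
deny[msg] {
    resource := input.configuration.root_module.resources[_]
    resource.type == "aws_eks_cluster"
    not has_iam_dependency(resource)
    msg := sprintf("EKS cluster '%v' must declare depends_on for IAM role policy attachments", [resource.address])
}

has_iam_dependency(resource) {
    dep := resource.depends_on[_]
    contains(dep, "aws_iam_role_policy_attachment")
}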
4. Monitor apply duration in observability tooling. If your EKS cluster normally creates in 12 minutes and your alert fires at 25 minutes, you catch the hang before the 30-minute timeout kills the job and corrupts state.
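One way to pick that alert threshold is to derive it from the normal creation time and the configured timeout instead of hard-coding it. The sketch below is only an illustration: alert_threshold_s, should_alert, the 2x factor, and the 5-minute headroom are assumptions to tune, not values from any monitoring tool.

```python
def alert_threshold_s(
    baseline_s: float,
    timeout_s: float,
    factor: float = 2.0,
    headroom_s: float = 300.0,
) -> float:
    """Alert at factor x the normal duration, but always before the timeout.

    baseline_s: how long the resource normally takes to create.
    timeout_s:  the create timeout configured on the resource.
    headroom_s: minimum warning time wanted before the timeout fires.
    """
    if baseline_s <= 0 or timeout_s <= 0:
        raise ValueError("baseline_s and timeout_s must be positive")
    return min(baseline_s * factor, timeout_s - headroom_s)


def should_alert(elapsed_s: float, baseline_s: float, timeout_s: float) -> bool:
    """True once the apply has run longer than the derived threshold."""
    return elapsed_s >= alert_threshold_s(baseline_s, timeout_s)
```

With a 12-minute baseline and a 30-minute timeout, the threshold comes out at 24 minutes, which leaves time to inspect the cloud console before the run is killed.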