How to Fix Terraform 'Still Creating...' Hang and Creation Timeout Errors

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins


TL;DR

  • What broke: Terraform's apply is blocked waiting for a cloud provider API to confirm resource creation — the provider never receives a COMPLETE status, so Terraform polls until the default timeout (usually 10–30 min) kills the run.
  • How to fix it: Explicitly set timeouts blocks on the offending resource, audit IAM/VPC/dependency prerequisites that are silently blocking the provider API, and use TF_LOG=DEBUG to identify the exact stuck API call.

The Incident (What Does the Error Mean?)

You see this in your terminal and it never moves:

aws_eks_cluster.main: Still creating... [10m0s elapsed]
aws_eks_cluster.main: Still creating... [20m0s elapsed]
╷
│ Error: waiting for EKS Cluster (prod-cluster) creation: timeout while waiting for state to become 'ACTIVE' (last state: 'CREATING', timeout: 30m0s)
│
│   with aws_eks_cluster.main,
│   on eks.tf line 12, in resource "aws_eks_cluster" "main":
│  12: resource "aws_eks_cluster" "main" {

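Immediate consequence: the outcome of the run is now uncertain. If the create call reached the cloud API, Terraform has usually recorded the resource ID and marked the resource as tainted, so the next apply will try to destroy and recreate it. If the ID was never recorded, the resource may exist in your cloud account with no state entry at all, and re-running apply risks a duplicate-resource error or drift that needs manual terraform import or terraform state rm surgery. In CI/CD pipelines, the timeout kills the deployment job and leaves infrastructure in an unknown state, so your next deploy may fail on a different error entirely.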
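The error message above comes from a wait loop: the provider repeatedly asks the cloud API for the resource's status and gives up once the timeout elapses. The following is a minimal Python sketch of that pattern, not the provider's actual code (the AWS provider is written in Go). The names wait_for_state and WaitTimeout are hypothetical, and the clock and sleep parameters exist only so the loop can be tested without real waiting.

```python
import time
from typing import Callable


class WaitTimeout(Exception):
    """Raised when the target state is not reached before the deadline."""


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout_s: float,
    poll_interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns target or timeout_s elapses."""
    deadline = clock() + timeout_s
    last = get_state()
    while last != target:
        if clock() >= deadline:
            # Mirrors the shape of the Terraform error: target plus last seen state.
            raise WaitTimeout(
                f"timeout while waiting for state to become '{target}' "
                f"(last state: '{last}')"
            )
        sleep(poll_interval_s)
        last = get_state()
    return last
```

The point of the sketch is that the loop only observes status. It cannot tell a slow creation from one that will never finish, which is why raising the timeout alone does not fix a stuck resource.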


The Attack Vector / Blast Radius

This is not just a slow run. The blast radius is significant:

1. Stale state risk. Most providers record the resource ID in state as soon as the create call is accepted, before creation is confirmed complete. A timeout therefore leaves a ghost entry (normally marked as tainted): a resource ID in state that may point to a half-provisioned or non-existent cloud resource. Subsequent plans will show a diff that doesn't reflect reality until that entry is reconciled.

2. Cascading dependency failures. Any resource with depends_on or an implicit reference to the timed-out resource (subnets, node groups, IAM role attachments) will also fail. An EKS timeout cascades into failed node groups, failed Helm releases, and failed DNS records — your entire stack is blocked.

3. Cloud-side resource limbo. AWS, GCP, and Azure do not roll back on Terraform timeout. The cloud resource continues its creation attempt. If the state entry is missing, or is later removed with terraform state rm, you have a resource being provisioned that Terraform no longer tracks. This creates unmanaged infrastructure: it incurs cost, bypasses your IaC governance, and is invisible to future terraform plan runs.

4. Root causes that make this worse:

  • Missing VPC endpoints causing EKS/RDS API calls to route over the public internet and hit throttling
  • IAM role propagation delay (the infamous AWS eventual consistency window — typically 10–15s, but can spike under load)
  • Security group rules blocking the traffic the service needs before it reports the resource as ready (for example, health checks or nodes joining a cluster)
  • Service quota exhaustion (vCPU limits, EIP limits) causing the cloud API to queue the request silently
  • Incorrect depends_on ordering forcing Terraform to create resources before their prerequisites exist

How to Fix It (The Solution)

Basic Fix: Add Explicit timeouts Blocks and Increase Limits

The default timeouts in most providers are conservative. For complex resources (EKS, RDS Multi-AZ, GKE clusters), you must override them explicitly.

 resource "aws_eks_cluster" "main" {
   name     = "prod-cluster"
   role_arn = aws_iam_role.eks_cluster.arn

   vpc_config {
     subnet_ids = aws_subnet.private[*].id
   }

-  # No timeouts block — using provider default of 30 minutes
+  timeouts {
+    create = "60m"
+    update = "60m"
+    delete = "30m"
+  }
 }

But increasing the timeout is a band-aid. If the resource is genuinely stuck, waiting longer just delays the failure. You must identify why it's stuck.

Diagnosing the Actual Cause

Run with full debug logging and filter for the API call that's looping:

# Enable full provider debug output
export TF_LOG=DEBUG
export TF_LOG_PATH="./terraform-debug.log"
terraform apply -auto-approve 2>&1 | tee apply-output.log

# Find the stuck polling loop
grep -E "(Retrying|Waiting|polling|CREATING|pending)" ./terraform-debug.log | tail -50

# Identify the exact cloud API call that never reports completion
# (DescribeCluster and DescribeDBInstances are AWS calls; GetOperation is GCP)
grep -E "(DescribeCluster|DescribeDBInstances|GetOperation)" ./terraform-debug.log | tail -20
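When the debug log is large, counting how often each polling call appears can show which one is looping. The following is a small Python sketch of that idea. The operation names are the ones from the grep above; the function name count_poll_calls is hypothetical, and the sketch only assumes that the operation name appears verbatim somewhere in the log line.

```python
import re
from collections import Counter
from typing import Iterable, Sequence

# Operation names taken from the grep above; extend the tuple for other providers.
POLL_OPERATIONS = ("DescribeCluster", "DescribeDBInstances", "GetOperation")


def count_poll_calls(
    lines: Iterable[str], operations: Sequence[str] = POLL_OPERATIONS
) -> Counter:
    """Count how many log lines mention each polling operation."""
    pattern = re.compile("|".join(re.escape(op) for op in operations))
    counts: Counter = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Usage (assumes the TF_LOG_PATH file from the commands above exists):
#   with open("terraform-debug.log", encoding="utf-8", errors="replace") as fh:
#       for op, n in count_poll_calls(fh).most_common():
#           print(n, op)
```

An operation that appears hundreds of times while the others appear a handful of times is a reasonable first suspect for the stuck wait.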

Enterprise Best Practice: Fix the Root Cause — Dependency Ordering and IAM Propagation

The most common production cause is a race condition between IAM role creation and its first use. AWS IAM is eventually consistent. Terraform creates the role, immediately tries to use it, and the EKS/RDS service returns a validation error that causes silent retry loops.

 resource "aws_iam_role" "eks_cluster" {
   name               = "eks-cluster-role"
   assume_role_policy = data.aws_iam_policy_document.eks_assume_role.json
 }

 resource "aws_iam_role_policy_attachment" "eks_cluster_policy" {
   policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
   role       = aws_iam_role.eks_cluster.name
 }

 resource "aws_eks_cluster" "main" {
   name     = "prod-cluster"
   role_arn = aws_iam_role.eks_cluster.arn

   vpc_config {
     subnet_ids              = aws_subnet.private[*].id
+    endpoint_private_access = true
+    endpoint_public_access  = false
   }

-  depends_on = [aws_iam_role_policy_attachment.eks_cluster_policy]
+  depends_on = [
+    aws_iam_role_policy_attachment.eks_cluster_policy,
+    # If IAM propagation races persist, also list time_sleep.iam_propagation
+    # (defined below) here so EKS API calls wait for the role to propagate
+  ]

+  timeouts {
+    create = "60m"
+    update = "60m"
+    delete = "30m"
+  }
 }

+# For persistent IAM propagation race conditions in automated pipelines:
+resource "time_sleep" "iam_propagation" {
+  depends_on      = [aws_iam_role_policy_attachment.eks_cluster_policy]
+  create_duration = "30s"
+}
+
+# Then add time_sleep.iam_propagation to eks_cluster depends_on

For RDS Multi-AZ or Aurora Global clusters, the subnet group and parameter group must be fully created before the instance. Terraform infers ordering only from attributes the resource actually references, so a prerequisite that is never referenced directly (or is wired in indirectly across modules) is not ordered automatically. Add explicit depends_on for those cases:

 resource "aws_db_instance" "primary" {
   identifier        = "prod-db"
   engine            = "postgres"
   instance_class    = "db.r6g.xlarge"
   multi_az          = true
   db_subnet_group_name = aws_db_subnet_group.main.name

+  depends_on = [
+    aws_db_subnet_group.main,
+    aws_db_parameter_group.main,
+    aws_security_group.rds
+  ]

+  timeouts {
+    create = "90m"
+    update = "80m"
+    delete = "60m"
+  }
 }

Recovering From a Timed-Out State

If the resource actually got created in AWS but Terraform timed out:

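# Check whether the resource exists in AWS and what status it reports
aws eks describe-cluster --name prod-cluster --query 'cluster.status'

# Check whether Terraform already tracks it (a timed-out create is usually recorded as tainted)
terraform state list | grep aws_eks_cluster.main

# Case A: the cluster is ACTIVE but missing from state, so import it
terraform import aws_eks_cluster.main prod-cluster

# Case B: the cluster is stuck in CREATING, so delete it from AWS first
aws eks delete-cluster --name prod-cluster

# Then remove the ghost entry from state (skip if state list showed nothing)
terraform state rm aws_eks_cluster.main

# Now re-apply with the fixed configuration
terraform apply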
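The commands above are alternatives rather than one script to run top to bottom. The following Python sketch spells out the branching, using only the commands already shown plus terraform plan. The function name recovery_steps is hypothetical, and the handling of an ACTIVE cluster that is already in state is an assumption that goes beyond the cases listed above.

```python
from typing import List, Optional


def recovery_steps(cluster_status: Optional[str], in_state: bool) -> List[str]:
    """Suggest recovery commands after a timed-out EKS create.

    cluster_status: the 'cluster.status' value from describe-cluster,
                    or None if the cluster does not exist in AWS.
    in_state:       True if aws_eks_cluster.main is listed by `terraform state list`.
    """
    if cluster_status == "ACTIVE":
        if in_state:
            # Assumption: already tracked, so review the plan before changing anything
            # (a tainted entry will be planned for replacement).
            return ["terraform plan"]
        return ["terraform import aws_eks_cluster.main prod-cluster"]

    steps: List[str] = []
    if cluster_status is not None:
        # Exists but is not ACTIVE (for example stuck in CREATING): delete it first.
        steps.append("aws eks delete-cluster --name prod-cluster")
    if in_state:
        # Remove the ghost entry only if there is one to remove.
        steps.append("terraform state rm aws_eks_cluster.main")
    steps.append("terraform apply")
    return steps
```

For example, a cluster stuck in CREATING with a state entry yields the delete, state rm, apply sequence, while an ACTIVE cluster that Terraform does not track yields only the import.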


Prevention in CI/CD

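1. Enforce timeout blocks with Checkov or tfsec:

# Add to your CI pipeline (general static analysis)
checkov -d . --compact
tfsec . --minimum-severity HIGH

The built-in checks do not inspect timeouts blocks (CKV_TF_1, for example, checks that module sources are pinned to a commit hash). Write a custom Checkov policy to fail pipelines on resources missing timeouts blocks, and load it with --external-checks-dir checkov/custom: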

# checkov/custom/check_timeouts.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories

RESOURCES_REQUIRING_TIMEOUTS = [
    "aws_eks_cluster", "aws_rds_cluster", "aws_db_instance",
    "google_container_cluster", "azurerm_kubernetes_cluster"
]

class TerraformTimeoutCheck(BaseResourceCheck):
    def __init__(self):
        super().__init__(
            name="Ensure long-running resources define explicit timeouts",
            id="CKV_CUSTOM_TIMEOUT_001",
            categories=[CheckCategories.GENERAL_SECURITY],
            supported_resources=RESOURCES_REQUIRING_TIMEOUTS
        )

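    def scan_resource_conf(self, conf):
        # conf is the parsed resource body; a missing or empty timeouts block fails
        if "timeouts" not in conf or not conf["timeouts"]:
            return CheckResult.FAILED
        return CheckResult.PASSED

# Instantiating the class registers the check with Checkov
check = TerraformTimeoutCheck()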
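The presence test in the policy above can be tightened and unit-tested without installing Checkov. The following is a sketch of a stricter rule that requires a create timeout specifically. The function name has_create_timeout is hypothetical, and the assumed shape of conf (nested blocks as a list of dicts, attribute values wrapped in lists) should be verified against what your Checkov version actually passes in.

```python
from typing import Any, Dict


def has_create_timeout(conf: Dict[str, Any]) -> bool:
    """Return True if the resource body defines a non-empty timeouts.create."""
    timeouts = conf.get("timeouts")
    # Assumed shape: nested blocks arrive as a list of dicts; accept a bare dict too.
    if isinstance(timeouts, dict):
        timeouts = [timeouts]
    if not isinstance(timeouts, list):
        return False
    return any(isinstance(block, dict) and bool(block.get("create")) for block in timeouts)
```

Inside scan_resource_conf, the same helper could decide between CheckResult.PASSED and CheckResult.FAILED, which keeps the rule itself testable with plain dictionaries.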

2. Set Terraform apply timeouts at the CI job level — never let a hung apply block your pipeline runner indefinitely:

# .github/workflows/terraform.yml
- name: Terraform Apply
  timeout-minutes: 90  # Hard kill at the CI level
  run: terraform apply -auto-approve
  env:
    TF_LOG: WARN  # Capture warnings without full debug noise in CI
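One thing worth checking is that the CI hard kill is longer than the longest Terraform timeout in the configuration. Otherwise the runner kills the job before Terraform can report its own timeout and record the result. The following Python sketch makes that comparison. The function names and the 10-minute margin are assumptions, and the parser covers only the h, m, and s units used in this article.

```python
import re
from typing import Iterable

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_DURATION_RE = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> int:
    """Convert strings such as '90m' or '1h30m' to seconds."""
    parts = _DURATION_RE.findall(text)
    # Reject input with anything left over besides number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def ci_timeout_is_safe(
    ci_timeout_minutes: int,
    resource_timeouts: Iterable[str],
    margin_minutes: int = 10,
) -> bool:
    """True if the CI kill leaves margin beyond the longest resource timeout."""
    longest = max(parse_duration(t) for t in resource_timeouts)
    return ci_timeout_minutes * 60 >= longest + margin_minutes * 60
```

With the values used in this article, ci_timeout_is_safe(90, ["60m", "90m"]) is False, because the 90-minute CI limit leaves no room beyond the 90-minute RDS create timeout.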

3. Use OPA/Conftest to validate dependency ordering before apply:

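# policy/terraform_deps.rego
# Run against the output of: terraform show -json plan.tfplan
# depends_on is listed under configuration, not under resource_changes;
# this rule only inspects resources declared in the root module.
package terraform.eks

deny[msg] {
  resource := input.configuration.root_module.resources[_]
  resource.type == "aws_eks_cluster"
  not has_iam_dependency(resource)
  msg := sprintf("EKS cluster '%v' must declare depends_on for IAM role policy attachments", [resource.address])
}

has_iam_dependency(resource) {
  dep := resource.depends_on[_]
  contains(dep, "aws_iam_role_policy_attachment")
}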
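The same dependency rule can be prototyped in Python before committing to a policy engine. The following is a sketch that assumes the plan JSON produced by terraform show -json lists depends_on under configuration.root_module.resources. The function name is hypothetical, and resources declared in child modules are not inspected.

```python
from typing import Any, Dict, List


def eks_clusters_missing_iam_dependency(plan: Dict[str, Any]) -> List[str]:
    """Return addresses of root-module EKS clusters whose depends_on
    does not mention an IAM role policy attachment."""
    resources = (
        plan.get("configuration", {}).get("root_module", {}).get("resources", [])
    )
    missing: List[str] = []
    for res in resources:
        if res.get("type") != "aws_eks_cluster":
            continue
        deps = res.get("depends_on", [])
        if not any("aws_iam_role_policy_attachment" in dep for dep in deps):
            missing.append(res.get("address", "<unknown>"))
    return missing
```

A non-empty return value would fail the pipeline step. Loading the input is a matter of json.load on the file written by terraform show -json.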

4. Monitor apply duration in observability tooling. If your EKS cluster normally creates in 12 minutes and your alert fires at 25 minutes, you catch the hang before the 30-minute timeout kills the job and corrupts state.
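One way to feed such an alert is to parse the "Still creating..." progress lines that Terraform prints during apply. The following Python sketch extracts the resource address and elapsed seconds from lines in the format shown at the top of this article. The function names are hypothetical, and the threshold is whatever your normal creation time suggests.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "aws_eks_cluster.main: Still creating... [10m0s elapsed]"
_ELAPSED_RE = re.compile(
    r"^(?P<address>\S+): Still creating\.\.\. "
    r"\[(?:(?P<h>\d+)h)?(?:(?P<m>\d+)m)?(?P<s>\d+)s elapsed\]"
)


def parse_still_creating(line: str) -> Optional[Tuple[str, int]]:
    """Return (resource address, elapsed seconds), or None for other lines."""
    match = _ELAPSED_RE.match(line.strip())
    if match is None:
        return None
    hours = int(match.group("h") or 0)
    minutes = int(match.group("m") or 0)
    seconds = int(match.group("s"))
    return match.group("address"), hours * 3600 + minutes * 60 + seconds


def should_alert(line: str, threshold_s: int) -> bool:
    """True when a progress line reports at least threshold_s seconds elapsed."""
    parsed = parse_still_creating(line)
    return parsed is not None and parsed[1] >= threshold_s
```

Piping apply output through a check like this lets the alert fire at the 25-minute mark from the example above, while the run is still alive and can be investigated.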
