Why does Terraform not automatically wait for IAM role propagation?

Terraform considers an `aws_iam_role` resource complete the moment the AWS API returns a successful `CreateRole` response. Terraform has no visibility into AWS's internal replication state for IAM; that is an AWS control plane concern. A `200 OK` from the IAM API does not guarantee that the role is already visible to every AWS service that needs to validate it. This is a known gap, and `time_sleep` is the common workaround. Note that the hashicorp/time provider documentation itself frames `time_sleep` as a workaround for delays that are ideally handled by retry logic inside the provider.

How long should the time_sleep duration be for IAM propagation?

For most workloads, 10–15 seconds covers the typical propagation window. For EKS node groups and complex multi-policy role configurations, use 20 seconds. Avoid going below 10 seconds in CI/CD pipelines, where AWS control plane latency is unpredictable. Going above 30 seconds yields diminishing returns and noticeably slows pipelines. A reasonable starting point for production pipelines is `create_duration = "15s"` for Lambda/EC2 and `create_duration = "20s"` for EKS.
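
One way to keep these durations consistent across a module is to define the delay once as an input variable. The sketch below is illustrative only: the variable name `iam_propagation_delay` and the role `aws_iam_role.lambda_exec` are assumptions, and the validation rule (whole seconds only) is a deliberate simplification rather than a constraint imposed by the time provider.

```hcl
# Hypothetical variable: a single place to tune the IAM propagation delay.
variable "iam_propagation_delay" {
  description = "How long to wait after the last IAM write before creating consumers."
  type        = string
  default     = "15s" # use "20s" for EKS node group roles

  validation {
    # Illustrative rule: accept only whole seconds such as "15s".
    condition     = can(regex("^[0-9]+s$", var.iam_propagation_delay))
    error_message = "Use a whole number of seconds, for example \"15s\"."
  }
}

resource "time_sleep" "iam_propagation" {
  depends_on      = [aws_iam_role.lambda_exec] # assumed role name
  create_duration = var.iam_propagation_delay
}
```

A pipeline that sees slower propagation can then raise the value with `-var 'iam_propagation_delay=20s'` without touching the module code.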

Does this IAM eventual consistency issue affect Terraform destroy operations too?

Rarely, but yes. During `terraform destroy`, if a service resource (such as a Lambda function) is destroyed before the IAM role detachment has fully propagated, you may see lingering permission errors. More commonly, destroy ordering issues surface as `DeleteConflict` errors when Terraform tries to delete a role that AWS still believes is in use. Setting `destroy_duration = "10s"` on the `time_sleep` resource handles this edge case.
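
As a minimal sketch of that destroy-side gate, assuming a role named `aws_iam_role.lambda_exec` and a consumer that depends on the sleep resource, both delays can live on the same `time_sleep`:

```hcl
resource "time_sleep" "iam_propagation" {
  depends_on = [aws_iam_role.lambda_exec] # assumed role name

  # Wait after the role is created, before consumers are created.
  create_duration = "15s"

  # On destroy, consumers are removed first, then this delay runs,
  # and only afterwards is the role itself deleted.
  destroy_duration = "10s"
}
```

Because Terraform destroys resources in reverse dependency order, the 10-second pause sits between the removal of the consuming resource and the deletion of the role.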

Fixing Terraform IAM Role Eventual Consistency: NoSuchEntity Error After Role Creation

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10 mins

TL;DR

What broke: Terraform created an IAM role and immediately tried to attach it to a Lambda, EC2 instance profile, or EKS node group. AWS IAM's global replication hadn't propagated the role yet — the downstream resource API returned NoSuchEntity or InvalidParameterValue: arn:aws:iam::123456789:role/my-role is not authorized.
How to fix it: Inject an explicit propagation delay using the hashicorp/time provider's time_sleep resource as a depends_on gate between the role and any resource that consumes it.

The Incident (What Does the Error Mean?)

The raw error surfaces in terraform apply output like this:

Error: error creating Lambda Function (my-function): InvalidParameterValueException:
  The role defined for the function cannot be assumed by Lambda.
  {RespMetadata: {StatusCode:400},
   Message_: "The role defined for the function cannot be assumed by Lambda."
  }

  with aws_lambda_function.my_function,
  on lambda.tf line 12, in resource "aws_lambda_function" "my_function":
  12: resource "aws_lambda_function" "my_function" {

Or for EC2/EKS:

Error: Error creating IAM instance profile: EntityAlreadyExists / 
Error attaching policy: NoSuchEntityException: Role my-role does not exist.

The immediate consequence: Your terraform apply fails mid-run. Depending on your state, you may be left with partially created infrastructure: the IAM role exists in AWS, but the Lambda, EKS node group, or instance profile that depends on it was never created. Re-running apply immediately will often fail again because the role still hasn't finished propagating.

IAM is a globally replicated service. When you call CreateRole, the API returns 200 OK the moment the role is written to the authoritative regional endpoint. However, the role hasn't yet replicated to all global IAM endpoints that services like Lambda, EC2, and EKS use to validate trust policies. This window is typically 5–15 seconds but can spike to 60+ seconds under AWS control plane load.

Terraform's dependency graph treats the dependency as satisfied the instant the aws_iam_role resource returns. Terraform core has no built-in propagation waiter; any waiting has to come from retry logic in the provider or from your own configuration.

The Attack Vector / Blast Radius

This isn't a security misconfiguration — it's a distributed systems race condition baked into AWS's IAM architecture. The blast radius:

CI/CD pipelines: A terraform apply in a GitHub Actions or Jenkins pipeline fails non-deterministically. It passes locally (where you waited 30 seconds between commands) and fails in automation (which runs at machine speed). This burns engineer time chasing a flaky pipeline.
Cascading partial state: If the role is created but the Lambda isn't, subsequent apply runs may attempt to re-create resources that partially exist, causing EntityAlreadyExists errors on top of the original failure.
EKS node group outages: If this race hits an EKS managed node group role attachment during a cluster upgrade, nodes fail to register and workloads go unscheduled. This is a production outage vector.
Retry storms: Naive terraform apply || terraform apply retry loops in pipelines can compound state drift.

How to Fix It (The Solution)

Basic Fix — `time_sleep` Propagation Gate

Add the hashicorp/time provider and insert a time_sleep resource as an explicit dependency barrier.

 terraform {
   required_providers {
+    time = {
+      source  = "hashicorp/time"
+      version = "~> 0.9"
+    }
   }
 }

 resource "aws_iam_role" "lambda_exec" {
   name               = "my-lambda-exec-role"
   assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
 }

+resource "time_sleep" "iam_propagation" {
+  depends_on      = [aws_iam_role.lambda_exec]
+  create_duration = "15s"
+}

 resource "aws_lambda_function" "my_function" {
   function_name = "my-function"
   role          = aws_iam_role.lambda_exec.arn
+  depends_on    = [time_sleep.iam_propagation]
   # ... rest of config
 }

Why 15 seconds? AWS describes IAM as eventually consistent and does not guarantee a fixed upper bound. The typical window is 5–15 seconds, with rarer spikes beyond that under control plane load. 15s is a pragmatic default for CI/CD: it covers the typical window without meaningfully impacting pipeline duration, though it will not absorb the rare long spike.
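
An alternative to adding `depends_on` on every consumer is to pass the role ARN through the sleep resource using its `triggers` argument. This is a sketch under the same assumptions as the example above (role `aws_iam_role.lambda_exec`, function `my_function`); the map key `role_arn` is an arbitrary name chosen for illustration.

```hcl
resource "time_sleep" "iam_propagation" {
  create_duration = "15s"

  # Referencing the ARN here makes the sleep depend on the role implicitly.
  triggers = {
    role_arn = aws_iam_role.lambda_exec.arn
  }
}

resource "aws_lambda_function" "my_function" {
  function_name = "my-function"

  # Reading the ARN back from the sleep resource orders the function after
  # the delay, with no explicit depends_on needed.
  role = time_sleep.iam_propagation.triggers["role_arn"]
  # ... rest of config
}
```

With this layout the delay also runs again if the role is replaced, because a changed value in `triggers` recreates the `time_sleep` resource.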

Enterprise Best Practice — Scoped Sleep + Policy Attachment Ordering

In real infrastructure, you're attaching policies AND creating the role. Both operations feed into the propagation window. Gate on the last IAM mutation, not just role creation.

 resource "aws_iam_role" "eks_node" {
   name               = "eks-node-role"
   assume_role_policy = data.aws_iam_policy_document.ec2_assume.json
 }

 resource "aws_iam_role_policy_attachment" "eks_worker" {
-  role       = aws_iam_role.eks_node.name
+  role       = aws_iam_role.eks_node.name  # attachment must complete before sleep starts
   policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
 }

 resource "aws_iam_role_policy_attachment" "eks_cni" {
   role       = aws_iam_role.eks_node.name
   policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
 }

+resource "time_sleep" "iam_full_propagation" {
+  depends_on = [
+    aws_iam_role_policy_attachment.eks_worker,
+    aws_iam_role_policy_attachment.eks_cni,
+  ]
+  create_duration = "20s"
+}

 resource "aws_eks_node_group" "primary" {
   cluster_name    = aws_eks_cluster.main.name
   node_group_name = "primary"
   node_role_arn   = aws_iam_role.eks_node.arn
+  depends_on      = [time_sleep.iam_full_propagation]
   # ... rest of config
 }

Key principle: The time_sleep depends on the last IAM write operation in the chain (policy attachment), not just the role. Every CreateRole, AttachRolePolicy, and PutRolePolicy call resets the propagation clock.
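
The same principle extends to inline policies, which map to PutRolePolicy. The sketch below assumes a hypothetical inline policy named `eks_node_inline` and a hypothetical policy document `data.aws_iam_policy_document.node_inline`. It shows the gate listing every IAM write on the role, so the delay starts only after the last one finishes.

```hcl
# Hypothetical inline policy on the same role (a PutRolePolicy call).
resource "aws_iam_role_policy" "eks_node_inline" {
  name   = "eks-node-inline"
  role   = aws_iam_role.eks_node.id
  policy = data.aws_iam_policy_document.node_inline.json # assumed data source
}

resource "time_sleep" "iam_full_propagation" {
  # List every IAM write on the role; the sleep starts after all complete.
  depends_on = [
    aws_iam_role_policy_attachment.eks_worker,
    aws_iam_role_policy_attachment.eks_cni,
    aws_iam_role_policy.eks_node_inline,
  ]
  create_duration = "20s"
}
```

Any IAM resource added to the role later should be added to this `depends_on` list as well; otherwise it can complete after the sleep has already started.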

Prevention in CI/CD

1. Enforce time_sleep gates with Checkov custom policy

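Write a Checkov custom check that fails any aws_lambda_function, aws_eks_node_group, or aws_iam_instance_profile resource that has no time_sleep resource in its depends_on. The YAML sketch below inspects only each resource's own depends_on list. It does not verify that an aws_iam_role exists in the same module, and it assumes your Checkov version exposes depends_on to attribute checks, so test it against a known-bad config before enforcing it.

# .checkov/iam_propagation_gate.yaml
metadata:
  name: "IAM role must have propagation gate before consumption"
  id: "CKV_CUSTOM_IAM_01"
  category: "IAM"
  severity: HIGH
definition:
  # Attribute condition applied to every listed resource type
  cond_type: attribute
  resource_types:
    - aws_lambda_function
    - aws_eks_node_group
    - aws_iam_instance_profile
  attribute: depends_on
  operator: contains
  value: time_sleep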

2. OPA/Conftest policy for Terraform plan JSON

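package terraform.iam

import rego.v1

# depends_on is recorded under "configuration" in the plan JSON,
# not under resource_changes[_].change.after.
# This rule checks root-module resources only.
deny contains msg if {
  resource := input.configuration.root_module.resources[_]
  resource.type == "aws_lambda_function"
  not has_sleep_dependency(resource)
  msg := sprintf("Lambda '%v' consumes IAM role without time_sleep propagation gate", [resource.name])
}

# True when at least one depends_on entry points at a time_sleep resource.
has_sleep_dependency(resource) if {
  dep := resource.depends_on[_]
  startswith(dep, "time_sleep.")
}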

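Run this in your pipeline. Conftest evaluates only the main namespace by default, so pass the policy's package explicitly: terraform show -json tfplan.binary | conftest test --namespace terraform.iam -p policies/ -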

3. terraform apply retry with state awareness (last resort)

If you cannot modify the Terraform code (legacy modules), wrap the apply with a bounded retry that checks for the specific error string:

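set -o pipefail  # make the pipeline report terraform's exit status, not tee's

for i in {1..3}; do
  # Capture the apply output; terraform show prints state, not the last error
  terraform apply -auto-approve 2>&1 | tee apply.log && exit 0
  if grep -Eq "NoSuchEntity|cannot be assumed" apply.log; then
    echo "IAM propagation race detected, sleeping 20s before retry $i..."
    sleep 20
  else
    exit 1  # Different error, don't retry blindly
  fi
done

echo "Apply still failing after 3 attempts" >&2
exit 1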

4. Terraform Cloud / Atlantis: Set TF_CLI_ARGS_apply=-parallelism=5 (the default is 10) to slow the apply graph traversal and reduce the probability of the race. This is a band-aid, not a fix: it does not change the order in which a role and its consumer are created.