Fixing Terraform IAM Role Eventual Consistency: NoSuchEntity Error After Role Creation
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10 mins
TL;DR
- What broke: Terraform created an IAM role and immediately tried to attach it to a Lambda, EC2 instance profile, or EKS node group. AWS IAM's global replication hadn't propagated the role yet, so the downstream resource API returned NoSuchEntity or InvalidParameterValue: arn:aws:iam::123456789:role/my-role is not authorized.
- How to fix it: Inject an explicit propagation delay using the hashicorp/time provider's time_sleep resource as a depends_on gate between the role and any resource that consumes it.
- Fast path: Use our Client-Side Sandbox below to auto-refactor this: paste your failing .tf block and get the corrected dependency graph with the sleep gate injected.
The Incident (What Does the Error Mean?)
The raw error surfaces in terraform apply output like this:
Error: error creating Lambda Function (my-function): InvalidParameterValueException:
The role defined for the function cannot be assumed by Lambda.
{RespMetadata: {StatusCode:400},
Message_: "The role defined for the function cannot be assumed by Lambda."
}
with aws_lambda_function.my_function,
on lambda.tf line 12, in resource "aws_lambda_function" "my_function":
12: resource "aws_lambda_function" "my_function" {
Or for EC2/EKS:
Error: Error creating IAM instance profile: EntityAlreadyExists /
Error attaching policy: NoSuchEntityException: Role my-role does not exist.
The immediate consequence: Your terraform apply fails mid-run. Depending on your state, you may have a partially-created infrastructure — the IAM role exists in AWS but the Lambda, EKS node group, or instance profile that depends on it was never created. Re-running apply immediately will often fail again because the propagation window hasn't closed.
IAM is a globally replicated, eventually consistent service. When you call CreateRole, the API returns 200 OK as soon as the write is accepted, but the role has not necessarily replicated to every endpoint that services like Lambda, EC2, and EKS consult when validating trust policies. This window is typically 5–15 seconds but can spike to 60+ seconds under AWS control plane load.
Terraform core treats the dependency as satisfied the instant the aws_iam_role resource returns; it has no generic propagation waiter. The AWS provider retries some known propagation errors on some resources, but coverage varies by resource and provider version, so the race still surfaces.
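Conceptually, a propagation waiter is just a bounded poll. The sketch below is a minimal shell version, assuming you have some command whose success means the role is visible. The name poll_until and its arguments are hypothetical, introduced here for illustration; they are not part of Terraform or the AWS CLI. A successful IAM-side lookup such as aws iam get-role does not guarantee that Lambda or EKS can already see the role, so read this as an illustration of the waiting logic rather than a guaranteed fix.

```shell
#!/bin/sh
# poll_until MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or the attempt budget
# is exhausted. Returns 0 on success, 1 if every attempt failed.
poll_until() {
  max_attempts=$1
  interval=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Example (not executed here; the role name is a placeholder):
#   poll_until 12 5 aws iam get-role --role-name my-lambda-exec-role
```

Counting attempts instead of wall-clock time keeps the helper deterministic and easy to test; the total wait is bounded by (MAX_ATTEMPTS - 1) × INTERVAL_SECONDS plus the time the command itself takes.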
The Attack Vector / Blast Radius
This isn't a security misconfiguration — it's a distributed systems race condition baked into AWS's IAM architecture. The blast radius:
- CI/CD pipelines: A terraform apply in a GitHub Actions or Jenkins pipeline fails non-deterministically. It passes locally (where you waited 30 seconds between commands) and fails in automation (which runs at machine speed). This burns engineer time chasing a flaky pipeline.
- Cascading partial state: If the role is created but the Lambda isn't, subsequent apply runs may attempt to re-create resources that partially exist, causing EntityAlreadyExists errors on top of the original failure.
- EKS node group outages: If this race hits an EKS managed node group role attachment during a cluster upgrade, nodes fail to register and workloads go unscheduled. This is a production outage vector.
- Retry storms: Naive terraform apply || terraform apply retry loops in pipelines can compound state drift.
How to Fix It (The Solution)
Basic Fix — time_sleep Propagation Gate
Add the hashicorp/time provider and insert a time_sleep resource as an explicit dependency barrier.
  terraform {
    required_providers {
+     time = {
+       source  = "hashicorp/time"
+       version = "~> 0.9"
+     }
    }
  }
  resource "aws_iam_role" "lambda_exec" {
    name               = "my-lambda-exec-role"
    assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
  }
+ resource "time_sleep" "iam_propagation" {
+   depends_on      = [aws_iam_role.lambda_exec]
+   create_duration = "15s"
+ }
  resource "aws_lambda_function" "my_function" {
    function_name = "my-function"
    role          = aws_iam_role.lambda_exec.arn
+   depends_on    = [time_sleep.iam_propagation]
    # ... rest of config
  }
Why 15 seconds? AWS documents IAM as eventually consistent but does not commit to an upper bound. 15s covers the typical 5–15 second window described above and adds little to pipeline duration. It will not absorb the rarer 60+ second spikes, so raise create_duration or pair the gate with a bounded retry if failures persist.
Enterprise Best Practice — Scoped Sleep + Policy Attachment Ordering
In real infrastructure, you're attaching policies AND creating the role. Both operations feed into the propagation window. Gate on the last IAM mutation, not just role creation.
  resource "aws_iam_role" "eks_node" {
    name               = "eks-node-role"
    assume_role_policy = data.aws_iam_policy_document.ec2_assume.json
  }
  resource "aws_iam_role_policy_attachment" "eks_worker" {
-   role       = aws_iam_role.eks_node.name
+   role       = aws_iam_role.eks_node.name # attachment must complete before sleep starts
    policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  }
  resource "aws_iam_role_policy_attachment" "eks_cni" {
    role       = aws_iam_role.eks_node.name
    policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  }
+ resource "time_sleep" "iam_full_propagation" {
+   depends_on = [
+     aws_iam_role_policy_attachment.eks_worker,
+     aws_iam_role_policy_attachment.eks_cni,
+   ]
+   create_duration = "20s"
+ }
  resource "aws_eks_node_group" "primary" {
    cluster_name    = aws_eks_cluster.main.name
    node_group_name = "primary"
    node_role_arn   = aws_iam_role.eks_node.arn
+   depends_on      = [time_sleep.iam_full_propagation]
    # ... rest of config
  }
Key principle: The time_sleep depends on the last IAM write operation in the chain (policy attachment), not just the role. Every CreateRole, AttachRolePolicy, and PutRolePolicy call resets the propagation clock.
Prevention in CI/CD
1. Enforce time_sleep gates with Checkov custom policy
Write a Checkov custom check that fails any aws_lambda_function, aws_eks_node_group, or aws_iam_instance_profile resource that doesn't have a time_sleep resource in its depends_on chain when an aws_iam_role is present in the same module.
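# .checkov/iam_propagation_gate.yaml
# Simplified: flags every listed resource without a time_sleep dependency,
# whether or not the module also defines an aws_iam_role.
# Verify how your Checkov version exposes depends_on before enforcing this.
metadata:
  name: "IAM role must have propagation gate before consumption"
  id: "CKV_CUSTOM_IAM_01"
  category: "IAM"
  severity: "HIGH"
definition:
  cond_type: "attribute"
  resource_types:
    - "aws_lambda_function"
    - "aws_eks_node_group"
    - "aws_iam_instance_profile"
  attribute: "depends_on"
  operator: "contains"
  value: "time_sleep"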
2. OPA/Conftest policy for Terraform plan JSON
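package terraform.iam
# depends_on is a meta-argument: in plan JSON it appears under
# configuration, not under resource_changes[_].change.after.
# This covers the root module only; child modules are nested under
# configuration.root_module.module_calls.
deny[msg] {
    resource := input.configuration.root_module.resources[_]
    resource.type == "aws_lambda_function"
    not has_sleep_dependency(resource)
    msg := sprintf("Lambda '%v' consumes IAM role without time_sleep propagation gate", [resource.name])
}
has_sleep_dependency(resource) {
    dep := resource.depends_on[_]
    startswith(dep, "time_sleep.")
}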
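Run this in your pipeline (Conftest evaluates only the main namespace by default, so name the policy's package explicitly): terraform show -json tfplan.binary | conftest test -p policies/ --namespace terraform.iam -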
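For a pipeline step, the plan, JSON render, and policy check can be wrapped in one function. This is a minimal sketch under assumptions: check_iam_gates is a hypothetical name, and the file names tfplan.binary and tfplan.json are placeholders. Writing the JSON to a file instead of piping it means a failing terraform show is not hidden behind the exit status of the last command in a pipe.

```shell
#!/bin/sh
# check_iam_gates POLICY_DIR
# Hypothetical wrapper: plans, renders the plan as JSON, and feeds it to
# Conftest. Returns non-zero if any step fails.
check_iam_gates() {
  policy_dir=$1
  plan_file=tfplan.binary
  plan_json=tfplan.json

  terraform plan -out="$plan_file" || return 1
  # Write JSON to a file first so a failing `terraform show` is not
  # masked by the exit status of a pipeline.
  terraform show -json "$plan_file" > "$plan_json" || return 1
  # The policy above declares `package terraform.iam`; Conftest evaluates
  # only the `main` namespace unless told otherwise.
  conftest test -p "$policy_dir" --namespace terraform.iam "$plan_json"
}

# Example (not executed here):
#   check_iam_gates policies/
```

Plan JSON can contain sensitive values, so treat tfplan.json like the plan file itself and clean it up after the step.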
3. terraform apply retry with state awareness (last resort)
If you cannot modify the Terraform code (legacy modules), wrap the apply with a bounded retry that checks for the specific error string:
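set -o pipefail  # so a failed apply is not masked by tee's exit status
status=1
for i in {1..3}; do
  # Capture the apply output: terraform show prints state, not the last error
  if terraform apply -auto-approve 2>&1 | tee apply.log; then
    status=0
    break
  fi
  if grep -Eq "NoSuchEntity|cannot be assumed" apply.log; then
    echo "IAM propagation race detected, sleeping 20s before retry $i..."
    sleep 20
  else
    break # Different error, don't retry blindly
  fi
done
exit "$status"  # non-zero if retries were exhausted or another error occurred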
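The error-string check in the loop above can be pulled into small functions so it can be exercised without running Terraform. This is a sketch, not a definitive classifier: is_iam_propagation_error and backoff_seconds are hypothetical helper names, the patterns are only the two signatures quoted in this article, and the growing delay is an assumption motivated by the longer propagation spikes mentioned earlier.

```shell
#!/bin/sh
# is_iam_propagation_error LOGFILE
# Hypothetical helper: exits 0 if LOGFILE contains one of the IAM
# propagation signatures quoted in this article, non-zero otherwise.
is_iam_propagation_error() {
  grep -Eq 'NoSuchEntity|cannot be assumed' "$1"
}

# backoff_seconds ATTEMPT
# Prints a linearly growing delay (20, 40, 60, ...) for a 1-based ATTEMPT.
# The 20s step mirrors the fixed sleep used in the retry loop.
backoff_seconds() {
  echo $(( $1 * 20 ))
}

# Example (not executed here; apply.log is a placeholder file name):
#   is_iam_propagation_error apply.log && sleep "$(backoff_seconds 2)"
```

Keeping the classifier separate from the loop makes it easy to add or remove error signatures as you observe them, without touching the retry logic.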
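4. Terraform Cloud / Atlantis: Set TF_CLI_ARGS_apply=-parallelism=5 (the default is 10) to slow the apply graph traversal and reduce the probability of the race. This is a band-aid, not a fix: the consuming resource still starts as soon as the role returns, so the race can still occur.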