Fixing Terraform 'Error: timeout while waiting for state' During EC2 Destroy
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: Terraform's EC2 destroy operation hit the default 20-minute state transition timeout because the instance is stuck in stopping or shutting-down, or is blocked by an attached resource (ENI, ELB, SSM session) that Terraform doesn't know to tear down first.
- How to fix it: Force-detach blocking resources, increase the timeouts block values, and add explicit depends_on or lifecycle ordering to your Terraform config.
- Use our Client-Side Sandbox above to paste your failing aws_instance or aws_network_interface block and auto-generate the refactored HCL with correct timeout and dependency ordering.
The Incident (What Does the Error Mean?)
Error: timeout while waiting for state to become 'destroyed' (last state: 'stopping', timeout: 20m0s)
with aws_instance.app_server,
on main.tf line 12, in resource "aws_instance" "app_server":
12: resource "aws_instance" "app_server" {
Terraform polls the AWS EC2 API, waiting for the instance to reach terminated. When it doesn't get there within the configured window (default: 20 minutes), the provider throws this error and exits non-zero. The instance is NOT destroyed, and your state file still holds the resource. Re-running terraform destroy attempts the same operation and will likely time out again unless the root cause is resolved. This is a hard blocker for environment teardown, blue/green rotation, and CI pipeline cleanup jobs.
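The poll-until-deadline behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the provider's actual implementation (which is written in Go); the names wait_for_state and StateTimeoutError are hypothetical, and the clock and sleep parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


class StateTimeoutError(Exception):
    """Raised when the target state is not reached before the deadline."""


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns `target` or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    last = get_state()
    while last != target:
        if clock() >= deadline:
            # Mirrors the shape of the error: the last observed state is reported
            raise StateTimeoutError(
                f"timeout while waiting for state to become '{target}' "
                f"(last state: '{last}')"
            )
        sleep(interval_s)
        last = get_state()
    return last
```

The point of the sketch is that the waiter only observes state. If nothing changes on the AWS side between runs, a second run hits the same deadline.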
The Attack Vector / Blast Radius
This isn't just a slow destroy. The cascading failure profile is severe:
- State file corruption risk: If you run terraform state rm on the instance to unblock the pipeline without actually terminating it in AWS, you now have a ghost instance (running, billing, and potentially network-accessible) with no IaC ownership.
- Dependent resource deadlock: VPCs, subnets, and security groups attached to the stuck instance cannot be deleted. A single stuck EC2 can orphan an entire VPC stack.
- ENI leak: Elastic Network Interfaces attached by ECS, EKS node groups, or Lambda (yes, Lambda) frequently outlive the EC2 lifecycle. AWS will not terminate the instance until all ENIs are detached. Terraform's default resource graph does not always sequence this correctly without explicit depends_on.
- Active SSM/ELB sessions: An active AWS Systems Manager session or a registered ELB target keeps the instance in stopping indefinitely. Terraform has no visibility into these.
- Billing: A stopping instance still accrues EBS and data transfer costs. In an automated teardown pipeline, this compounds across multiple environments.
How to Fix It (The Solution)
Step 1: Identify the actual blocker
# Check instance state directly
aws ec2 describe-instances --instance-ids i-0abc123def456 \
--query 'Reservations[].Instances[].[InstanceId,State.Name]'
# Find attached ENIs that aren't being cleaned up
aws ec2 describe-network-interfaces \
--filters Name=attachment.instance-id,Values=i-0abc123def456
# Check for active SSM sessions
aws ssm describe-sessions --state Active \
--filters key=Target,value=i-0abc123def456
# Force terminate if stuck in stopping (nuclear option — confirm first)
aws ec2 terminate-instances --instance-ids i-0abc123def456
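When these checks run inside a script, it can help to reduce the CLI output to a short summary. The sketch below assumes the full JSON output of describe-instances and describe-network-interfaces (that is, without the --query projection used above); summarize_blockers is a hypothetical helper name, not part of any AWS tooling.

```python
import json


def summarize_blockers(instances_json: str, enis_json: str) -> dict:
    """Summarize instance states and attached ENIs from AWS CLI JSON output."""
    instances = json.loads(instances_json)
    enis = json.loads(enis_json)

    # Map each instance ID to its current lifecycle state
    states = {
        inst["InstanceId"]: inst["State"]["Name"]
        for res in instances.get("Reservations", [])
        for inst in res.get("Instances", [])
    }

    attached = []
    surviving = []
    for eni in enis.get("NetworkInterfaces", []):
        attachment = eni.get("Attachment", {})
        attached.append(eni["NetworkInterfaceId"])
        # ENIs with DeleteOnTermination=false are not removed with the instance
        if not attachment.get("DeleteOnTermination", False):
            surviving.append(eni["NetworkInterfaceId"])

    return {"states": states, "attached_enis": attached, "surviving_enis": surviving}
```

Any ID listed under surviving_enis is a candidate for the manual cleanup described in the break-glass steps later in this article.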
Basic Fix — Increase the Timeout Block
If the instance is genuinely slow to terminate (large EBS volumes, Windows shutdown scripts), the minimal fix is extending the timeout:
 resource "aws_instance" "app_server" {
   ami           = var.ami_id
   instance_type = "t3.large"
+  timeouts {
+    create = "10m"
+    update = "10m"
+    delete = "40m"
+  }
 }
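A longer delete timeout only helps if the CI job wrapping terraform destroy is allowed to run at least as long. As a rough sanity check, the sketch below parses duration strings of the kind shown in the timeouts block and compares them with a job-level limit in minutes. It handles only the h, m, and s units, and both function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> int:
    """Convert strings such as '40m', '1h30m' or '20m0s' to seconds."""
    pos = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    # Reject empty input and trailing characters that are not number+unit pairs
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def job_timeout_covers(delete_timeout: str, job_timeout_minutes: int) -> bool:
    """True if the CI job limit is strictly longer than the delete timeout."""
    return job_timeout_minutes * 60 > parse_duration(delete_timeout)
```

With the values used in this article, a "40m" delete timeout fits inside a 45-minute job, while a 30-minute job would kill the destroy before Terraform gives up on its own.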
Enterprise Best Practice — Fix Dependency Ordering and ENI Lifecycle
The real fix is ensuring Terraform destroys in the correct order and explicitly manages ENI detachment:
 resource "aws_network_interface" "app_eni" {
   subnet_id       = aws_subnet.app.id
   security_groups = [aws_security_group.app.id]
+
+  # create_before_destroy = false is Terraform's default; set explicitly so the
+  # existing ENI is destroyed before any replacement is created
+  lifecycle {
+    create_before_destroy = false
+  }
 }
 resource "aws_instance" "app_server" {
   ami           = var.ami_id
   instance_type = "t3.large"
   subnet_id     = aws_subnet.app.id
+  # Explicit dependency: the ENI is created before the instance, and on destroy
+  # Terraform removes the instance before it touches the ENI
+  depends_on = [aws_network_interface.app_eni]
+  timeouts {
+    delete = "40m"
+  }
-  # Missing: no lifecycle or dependency management
 }
 resource "aws_lb_target_group_attachment" "app" {
   target_group_arn = aws_lb_target_group.app.arn
   target_id        = aws_instance.app_server.id
   port             = 8080
 }
+# Explicit destroy-time dependency: deregister from LB before instance termination
+# Use a null_resource with destroy provisioner if aws provider doesn't sequence this
+resource "null_resource" "deregister_target" {
+  triggers = {
+    instance_id      = aws_instance.app_server.id
+    target_group_arn = aws_lb_target_group.app.arn
+  }
+
+  provisioner "local-exec" {
+    when    = destroy
+    command = <<-EOT
+      aws elbv2 deregister-targets \
+        --target-group-arn ${self.triggers.target_group_arn} \
+        --targets Id=${self.triggers.instance_id}
+      # Wait for deregistration to complete
+      sleep 30
+    EOT
+  }
+
+  depends_on = [aws_lb_target_group_attachment.app]
+}
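To reason about what this configuration does at destroy time, it helps to remember that Terraform destroys resources in the reverse of their dependency order. The sketch below models the four resources from the config above as a small graph (subnets, security groups, and the target group are left out for brevity) and derives both orders using Python's standard graphlib module. It is a simplified model for illustration, not Terraform's actual graph walker.

```python
from graphlib import TopologicalSorter

# resource -> set of resources it depends on, mirroring the references
# and depends_on entries in the config above
DEPENDS_ON = {
    "aws_network_interface.app_eni": set(),
    "aws_instance.app_server": {"aws_network_interface.app_eni"},
    "aws_lb_target_group_attachment.app": {"aws_instance.app_server"},
    "null_resource.deregister_target": {
        "aws_instance.app_server",
        "aws_lb_target_group_attachment.app",
    },
}


def create_order(graph: dict) -> list:
    """Dependencies first: the order in which resources are created."""
    return list(TopologicalSorter(graph).static_order())


def destroy_order(graph: dict) -> list:
    """Dependents first: the reverse of the create order."""
    return list(reversed(create_order(graph)))
```

Under this model, destroy_order(DEPENDS_ON) starts with null_resource.deregister_target, so the deregistration provisioner runs before the attachment and the instance are removed, and the ENI comes last because the instance depends on it.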
If the Instance is Truly Stuck in AWS (Break-Glass)
# 1. Remove from Terraform state to unblock the pipeline
terraform state rm aws_instance.app_server
# 2. Manually terminate in AWS
aws ec2 terminate-instances --instance-ids i-0abc123def456
# 3. Clean up orphaned ENIs
aws ec2 delete-network-interface --network-interface-id eni-0xyz789
# 4. Re-import if needed or let next apply recreate
⚠️ Never skip step 2. A terraform state rm without actual AWS termination creates a ghost instance.
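One way to check for ghost instances after a break-glass run is to compare what Terraform still tracks with what AWS reports as live. The sketch below assumes the JSON produced by terraform show -json for the current state and the full JSON output of aws ec2 describe-instances; the function names are hypothetical.

```python
def managed_instance_ids(state: dict) -> set:
    """Collect IDs of aws_instance resources tracked in Terraform state JSON."""
    ids = set()

    def walk(module: dict) -> None:
        for res in module.get("resources", []):
            if res.get("type") == "aws_instance" and res.get("mode", "managed") == "managed":
                instance_id = (res.get("values") or {}).get("id")
                if instance_id:
                    ids.add(instance_id)
        # Resources declared in modules live under child_modules
        for child in module.get("child_modules", []):
            walk(child)

    walk((state.get("values") or {}).get("root_module", {}))
    return ids


def ghost_instances(state: dict, describe_instances: dict) -> list:
    """Return live instance IDs in AWS that Terraform no longer manages."""
    live = {
        inst["InstanceId"]
        for res in describe_instances.get("Reservations", [])
        for inst in res.get("Instances", [])
        if inst["State"]["Name"] not in ("terminated", "shutting-down")
    }
    return sorted(live - managed_instance_ids(state))
```

In practice the describe-instances call would need to be scoped (for example by tag) to the environment in question; otherwise every unrelated instance in the account shows up as a ghost.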
Prevention in CI/CD
1. Checkov — Catch Missing Timeout Blocks Pre-Merge
# .checkov.yml
external-checks-dir:
  - checkov/custom
check:
  - CKV_CUSTOM_EC2_TIMEOUT # Custom check: aws_instance must define a delete timeout
Checkov has no built-in rule for timeout blocks, so the config above loads a custom check from checkov/custom. Write it like this:
# checkov/custom/check_ec2_timeout.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories


class EC2DestroyTimeout(BaseResourceCheck):
    def __init__(self):
        super().__init__(
            name="Ensure aws_instance has delete timeout defined",
            id="CKV_CUSTOM_EC2_TIMEOUT",
            categories=[CheckCategories.GENERAL_SECURITY],
            supported_resources=["aws_instance"],
        )

    def scan_resource_conf(self, conf):
        # Checkov passes nested blocks as lists, so read the first timeouts block
        timeouts = conf.get("timeouts", [{}])
        if timeouts and timeouts[0].get("delete"):
            return CheckResult.PASSED
        return CheckResult.FAILED


# Instantiate the check so Checkov registers it when loading external checks
check = EC2DestroyTimeout()
2. OPA/Conftest Policy — Block Plans Without Dependency Ordering
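# policies/ec2_destroy_safety.rego
package terraform.ec2
deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  resource.change.actions[_] == "delete"
  # Plan JSON can report an unset timeouts block as null, which a bare `not`
  # would treat as present, so compare against null explicitly
  object.get(resource.change.before, "timeouts", null) == null
  msg := sprintf("EC2 instance '%v' has no timeout block. Destroy may hang.", [resource.address])
}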
# .github/workflows/tf-plan.yml
- name: Conftest Policy Check
  run: |
    terraform show -json tfplan.binary > tfplan.json
    conftest test tfplan.json --policy policies/
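For teams that want to unit-test the intent of this policy outside OPA, the same rule can be expressed in a few lines of Python over the plan JSON written by terraform show -json. This is a sketch of the equivalent logic rather than a replacement for Conftest, and the function name is hypothetical. It treats a missing, null, or empty timeouts value as "no timeout block".

```python
def instances_deleted_without_timeouts(plan: dict) -> list:
    """Return addresses of aws_instance resources planned for delete
    whose prior state carries no timeouts block."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_instance":
            continue
        change = rc.get("change", {})
        if "delete" not in change.get("actions", []):
            continue
        # "before" holds the prior state of the resource being destroyed
        before = change.get("before") or {}
        if not before.get("timeouts"):
            flagged.append(rc["address"])
    return flagged
```

Feeding it a handful of saved tfplan.json fixtures is a cheap way to confirm which instances the policy is expected to flag before wiring it into the pipeline.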
3. Pipeline-Level Destroy Guard
# Add a pre-destroy step to deregister targets and terminate SSM sessions
- name: Pre-Destroy Cleanup
  run: |
    INSTANCE_ID=$(terraform output -raw instance_id)
    # Terminate active SSM sessions
    aws ssm terminate-session \
      --session-id $(aws ssm describe-sessions --state Active \
        --filters key=Target,value=$INSTANCE_ID \
        --query 'Sessions[0].SessionId' --output text) 2>/dev/null || true
    # Deregister from all target groups
    TG_ARN=$(terraform output -raw target_group_arn)
    aws elbv2 deregister-targets \
      --target-group-arn $TG_ARN \
      --targets Id=$INSTANCE_ID 2>/dev/null || true
    sleep 30
- name: Terraform Destroy
  run: terraform destroy -auto-approve
  timeout-minutes: 45
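The cleanup step above only looks at the first active session (Sessions[0]), and when there is no session it passes a placeholder value to terminate-session and relies on || true to swallow the failure. A possible refinement is sketched below: it takes the full JSON output of aws ssm describe-sessions and builds one terminate-session command per active session on the instance. The helper names are hypothetical, and the commands are returned as argument lists rather than executed.

```python
import json


def active_session_ids(describe_sessions_json: str, instance_id: str) -> list:
    """Extract every session ID targeting the given instance."""
    payload = json.loads(describe_sessions_json)
    return [
        session["SessionId"]
        for session in payload.get("Sessions", [])
        if session.get("Target") == instance_id and session.get("SessionId")
    ]


def terminate_commands(session_ids: list) -> list:
    """Build one `aws ssm terminate-session` argument list per session."""
    return [
        ["aws", "ssm", "terminate-session", "--session-id", session_id]
        for session_id in session_ids
    ]
```

Each returned list can be handed to subprocess.run in a small wrapper script. An empty list simply means there is nothing to terminate, so no failing call has to be masked.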
4. tflint Rule
# .tflint.hcl
rule "terraform_required_providers" {
  enabled = true
}
# Enable the AWS ruleset; timeout enforcement itself would need a custom rule or plugin
plugin "aws" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}
Bottom line: The timeout error is a symptom. The disease is missing destroy-time dependency ordering and no pre-destroy deregistration hooks. Fix the graph, not just the clock.