Why does 'context deadline exceeded' only happen on terraform destroy and not apply?

Delete operations are often slower than create operations for stateful cloud resources. Deleting an RDS instance, for example, can involve taking a final snapshot, tearing down the Multi-AZ standby, deallocating storage, and cleaning up automated backups before the API reports the instance as gone. A default timeout that is adequate for most operations is frequently insufficient for deletion of large, multi-AZ, or high-storage resources. Defaults also vary by resource type, so check the provider documentation for the resource named in the error. The provider sets the context deadline when the operation starts and does not extend it based on what the API reports while polling.

My terraform state is now corrupted after the timeout. How do I safely recover without data loss?

First, take a state backup with `terraform state pull > backup.tfstate`; do this before any manual state manipulation. If a stale lock remains, run `terraform force-unlock <LOCK_ID>` to release it (with the S3 backend, the lock is an entry in the DynamoDB lock table). Then run `terraform state list` to see which resources Terraform still tracks, and compare that list against what actually exists in the cloud account. If the cloud-side deletion has since completed, re-running `terraform destroy` usually reconciles the state on refresh. Otherwise, use `terraform state rm` to remove zombie entries from the state file; this does NOT delete the actual cloud resource. Then manually delete the resource via the AWS/GCP/Azure CLI using the appropriate `wait` command to confirm deletion. Finally, verify with `terraform plan` that the state matches reality before attempting any further apply or destroy operations.
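
The backup step is the one most easily skipped under pressure, so it can help to wrap it. A minimal shell sketch, assuming `terraform` is on the PATH and the working directory is already initialised; `backup_state` and the default file name are hypothetical helpers for this example, not part of Terraform:

```shell
# Hypothetical helper: snapshot the current state before any manual state surgery.
# Refuses to continue if the pull fails or the file is empty, so a failed pull
# is never mistaken for a usable backup.
backup_state() (
  dest="${1:-backup-$(date +%Y%m%dT%H%M%S).tfstate}"
  terraform state pull > "$dest" || {
    echo "state pull failed" >&2
    rm -f "$dest"
    exit 1
  }
  [ -s "$dest" ] || {
    echo "backup is empty: $dest" >&2
    rm -f "$dest"
    exit 1
  }
  echo "$dest"   # print the backup path for the caller
)
```

A typical call would be `backup=$(backup_state) && terraform state rm 'aws_eks_cluster.prod'`, so the state is only modified once a non-empty backup exists.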

What is the maximum timeout value I can set in a Terraform resource timeouts block?

Terraform accepts Go duration strings (e.g., '90m', '2h', '1h30m'). There is no hard upper limit enforced by Terraform itself; you can set '24h' if needed. However, your CI/CD runner job timeout, your IAM role session duration (for assume_role), and any proxy timeout between the runner and the cloud provider API endpoint are all independent ceilings that can still kill the operation. For very long operations, ensure your `assume_role` session duration (`duration` in current AWS provider versions, `duration_seconds` in older ones) exceeds your longest expected timeout, and set your CI job `timeout-minutes` to at least 30 minutes beyond your largest resource delete timeout.
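
To make that arithmetic concrete, here is a small shell sketch that turns timeout strings into a CI job limit using the 30-minute margin suggested above. The function names `duration_to_minutes` and `ci_timeout_minutes` are made up for this example, and the parser only handles plain whole-number h/m/s parts, not the full Go duration syntax:

```shell
# Convert a simple Go-style duration ("90m", "2h", "1h30m") to whole minutes,
# rounding up. Assumption: only integer h/m/s parts without leading zeros.
duration_to_minutes() (
  rest="$1"
  total=0
  [ -n "$rest" ] || { echo "empty duration" >&2; exit 1; }
  while [ -n "$rest" ]; do
    num="${rest%%[hms]*}"                      # digits before the first unit
    case "$num" in
      ''|*[!0-9]*) echo "unsupported duration: $1" >&2; exit 1 ;;
    esac
    tail="${rest#"$num"}"                      # unit plus whatever follows
    unit="${tail%"${tail#?}"}"                 # first character of tail
    rest="${tail#?}"
    case "$unit" in
      h) total=$((total + num * 3600)) ;;
      m) total=$((total + num * 60)) ;;
      s) total=$((total + num)) ;;
      *) echo "unsupported duration: $1" >&2; exit 1 ;;
    esac
  done
  echo $(( (total + 59) / 60 ))
)

# Largest timeout among the arguments plus a 30-minute safety margin.
ci_timeout_minutes() (
  max=0
  for d in "$@"; do
    m=$(duration_to_minutes "$d") || exit 1
    if [ "$m" -gt "$max" ]; then max=$m; fi
  done
  echo $((max + 30))
)
```

For example, `ci_timeout_minutes 60m 90m` prints `120`, which would be the value to use for `timeout-minutes` when the largest delete timeout is 90 minutes.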

How to Fix Terraform Destroy 'context deadline exceeded' Timeout Error

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

What broke: terraform destroy issued a DELETE call to the cloud provider API; the operation (RDS multi-AZ deletion, EKS cluster teardown, large S3 bucket drain, etc.) was still running when the provider's timeout for that resource expired, so Terraform stopped waiting. The cloud-side deletion may still be in progress, and Terraform's state may no longer match reality.
How to fix it: Raise the delete value in a timeouts block inside the offending resource, make sure CI job and credential session limits exceed it, and, if the state has already diverged from reality, back it up, terraform state rm the zombie resources, and re-import or manually delete them.

The Incident (What Does the Error Mean?)

Raw terminal output from a failed destroy run:

╷
│ Error: context deadline exceeded
│
│   with aws_eks_cluster.prod,
│   on main.tf line 12, in resource "aws_eks_cluster" "prod":
│   12: resource "aws_eks_cluster" "prod" {
│
│ timeout while waiting for state to become 'DELETED'
│ (last state: 'DELETING', timeout: 15m0s)
╵

╷
│ Error: context deadline exceeded
│
│   with module.database.aws_db_instance.primary,
│   on modules/database/main.tf line 5, in resource "aws_db_instance" "primary":
│    5: resource "aws_db_instance" "primary" {
╵

Immediate consequence: Terraform exits non-zero. The cloud-side deletion may still be running, may have finished after Terraform stopped waiting, or may have failed, so the .tfstate file can no longer be trusted to match reality. Resources that still exist keep billing you. A subsequent terraform apply or terraform destroy may succeed once the deletion completes, but it can also fail with dependency errors or act on stale state. Verify what actually exists before re-running anything.

The Attack Vector / Blast Radius

This is not just an inconvenience — a partial destroy creates a split-brain infrastructure state:

State file drift: Resources deleted from state but still alive in the cloud become unmanaged. They accumulate costs, retain their security group rules, and hold onto IAM role bindings that your team believes are gone.
Dependency deadlocks on re-run: If an EKS cluster timed out mid-deletion, its VPC subnets and security groups are still attached. A re-run of terraform destroy will hit ENI attachment errors and cascade into additional timeouts.
Data loss risk on retry: Blindly re-running destroy on an aws_db_instance with skip_final_snapshot = true deletes the database without a final snapshot. If the first delete call did not go through, the pause after the timeout is your last chance to take a snapshot before retrying.
CI/CD pipeline poisoning: If this runs in a GitHub Actions or Atlantis pipeline, the pipeline hangs until its own job timeout kills it, leaving locks in your DynamoDB state lock table. Every subsequent Terraform operation is blocked until you manually release the lock with terraform force-unlock <LOCK_ID>.
Cross-resource blast radius: Terraform destroys in reverse dependency order, so everything the timed-out resource depends on is never reached in that run. You end up with IAM roles, KMS keys, and Route53 records that stay orphaned in the account until a later run or a manual cleanup removes them.

How to Fix It (The Solution)

Step 0: Release the State Lock First

If the process was killed mid-run, the DynamoDB lock table entry is stale. Get the lock ID from the error output and release it:

terraform force-unlock <LOCK_ID>

Verify no other process holds the lock before proceeding.
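
Copying the lock ID by hand is error-prone, so a small filter can pull it out of the error text. This is a sketch under two assumptions: the lock error contains a `Lock Info:` block with an `ID:` line, and the ID is a 36-character UUID, as with the S3/DynamoDB backend. Other backends may format lock IDs differently. `extract_lock_id` is a hypothetical helper name:

```shell
# Read Terraform's lock error text on stdin and print the first lock ID found.
# Matches a line of the form "ID:  <36-character UUID>", with or without the
# box-drawing prefix that newer Terraform versions put in front of error lines.
extract_lock_id() {
  sed -n 's/.*ID:[[:space:]]*\([0-9a-fA-F-]\{36\}\).*/\1/p' | head -n 1
}
```

It could be used as `terraform plan -input=false 2>&1 | extract_lock_id`, with the printed ID then passed to `terraform force-unlock` after confirming that no other run is active.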

Basic Fix: Add Resource-Level Timeout Blocks

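The most common root cause is that the default delete timeout baked into the AWS/GCP/AzureRM provider is too short for large or complex resources. Defaults vary by resource (the EKS error above shows 15m), so check the documentation for the resource in question and override the value at the resource level.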

 resource "aws_eks_cluster" "prod" {
   name     = var.cluster_name
   role_arn = aws_iam_role.eks.arn

   vpc_config {
     subnet_ids = var.subnet_ids
   }

+  timeouts {
+    create = "30m"
+    update = "60m"
+    delete = "60m"
+  }
 }

 resource "aws_db_instance" "primary" {
   identifier                = "prod-postgres"
   engine                    = "postgres"
   instance_class            = "db.r6g.2xlarge"
   multi_az                  = true
   skip_final_snapshot       = false
   # A fixed name avoids timestamp(), which changes on every plan and causes a perpetual diff
   final_snapshot_identifier = "prod-postgres-final"

+  timeouts {
+    create = "40m"
+    update = "80m"
+    delete = "90m"
+  }
 }

Why delete needs the most headroom: Multi-AZ RDS teardown takes the final snapshot (unless skipped), removes the standby replica, deletes the primary, then cleans up automated backups. On a db.r6g.2xlarge or bigger instance with a lot of storage, this can take 45–75 minutes.

Enterprise Best Practice: Provider-Level Tuning + Partial State Recovery

For teams managing large environments, the fix must be systematic, not per-resource.

1. Increase provider-level retries and the assumed-role session duration:

 provider "aws" {
   region = var.aws_region

+  # Retry throttled or transient API errors more times (this is not an HTTP timeout)
+  max_retries = 10
+
+  # Use assume_role with a longer session for destroy pipelines
+  assume_role {
+    role_arn     = var.deploy_role_arn
+    session_name = "terraform-destroy-${var.environment}"
+    duration     = "2h"  # older provider versions: duration_seconds = 7200; the role's max session duration must allow it
+  }
 }

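2. Handle inconsistent state: back up the state, remove zombie resources from it, and destroy them manually:

# Back up the state before any manual change
terraform state pull > backup.tfstate

# List all resources currently in state to identify zombies
terraform state list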

# Remove the timed-out resource from state tracking
# (does NOT delete the cloud resource — you handle that manually or via CLI)
terraform state rm 'aws_eks_cluster.prod'
terraform state rm 'module.database.aws_db_instance.primary'

# Manually delete via AWS CLI with explicit wait
aws eks delete-cluster --name prod-cluster --region us-east-1
aws eks wait cluster-deleted --name prod-cluster --region us-east-1

aws rds delete-db-instance \
  --db-instance-identifier prod-postgres \
  --final-db-snapshot-identifier prod-postgres-final-manual \
  --region us-east-1
aws rds wait db-instance-deleted \
  --db-instance-identifier prod-postgres \
  --region us-east-1
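
The `aws ... wait` commands give up after a fixed number of polls and exit non-zero, which can happen before a very slow deletion finishes. One way to cover that is to retry the waiter until an overall deadline. A minimal sketch; `wait_with_deadline` is a hypothetical helper, and the deadline and pause values are whatever suits the resource:

```shell
# Re-run a command until it succeeds or an overall deadline (in seconds) passes.
# Usage: wait_with_deadline <deadline_seconds> <pause_seconds> <command...>
wait_with_deadline() (
  deadline=$1
  pause=$2
  shift 2
  start=$(date +%s)
  while ! "$@"; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline" ]; then
      echo "gave up after ${deadline}s: $*" >&2
      exit 1
    fi
    sleep "$pause"
  done
)
```

For the RDS example above, this might be called as `wait_with_deadline 5400 30 aws rds wait db-instance-deleted --db-instance-identifier prod-postgres --region us-east-1`, giving the deletion up to 90 minutes.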

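3. For S3 buckets with millions of objects (a common timeout culprit), use a null_resource with a destroy-time provisioner to drain the bucket before Terraform deletes it. The null_resource references the bucket through its triggers, so Terraform destroys it (and runs the drain) before the bucket itself; adding depends_on from the bucket back to the null_resource would create a dependency cycle. Destroy-time provisioners may only reference self, so the region is stored in triggers as well. Note that aws s3 rm does not remove old object versions or delete markers on a versioned bucket.

+resource "null_resource" "drain_bucket" {
+  triggers = {
+    bucket_name = aws_s3_bucket.data_lake.bucket
+    region      = var.aws_region
+  }
+
+  provisioner "local-exec" {
+    when    = destroy
+    command = <<-EOT
+      aws s3 rm s3://${self.triggers.bucket_name} \
+        --recursive \
+        --region ${self.triggers.region}
+    EOT
+  }
+}
+
 resource "aws_s3_bucket" "data_lake" {
   bucket        = var.bucket_name
   force_destroy = true
 }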


Prevention in CI/CD

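1. OPA/Conftest policy: enforce delete timeouts on slow-to-delete resources.

Gate PRs with a Conftest policy evaluated against the plan JSON (terraform show -json); a custom Checkov check is an alternative:

# conftest policy: enforce delete timeouts on EKS and RDS
package terraform.timeouts

# Plan JSON typically renders an unset timeouts block as null, so test the delete value itself
has_delete_timeout(resource) {
  resource.change.after.timeouts.delete != null
}

violation[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_eks_cluster"
  resource.change.after != null  # skip resources the plan deletes
  not has_delete_timeout(resource)
  msg := sprintf("Resource %v is missing a delete timeout. EKS deletion can exceed the 15 min default.", [resource.address])
}

violation[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  resource.change.after != null  # skip resources the plan deletes
  not has_delete_timeout(resource)
  msg := sprintf("Resource %v is missing a delete timeout. Multi-AZ RDS deletion can exceed 60 min.", [resource.address])
}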

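2. GitHub Actions: set an explicit job timeout AND handle state lock cleanup. Terraform releases the lock itself when it exits normally, even on error; a stale lock remains only if the process was killed. The cleanup step therefore probes for a lock and force-unlocks only if one is reported. Use this only on pipelines where no other Terraform run can legitimately hold the lock.

jobs:
  terraform-destroy:
    runs-on: ubuntu-latest
    timeout-minutes: 120  # Never let the runner die silently
    env:
      TF_VAR_environment: ${{ vars.ENVIRONMENT }}
    steps:
      - name: Terraform Destroy
        id: destroy
        run: terraform destroy -auto-approve
        continue-on-error: true

      - name: Release stale state lock on failure
        if: steps.destroy.outcome == 'failure'
        run: |
          # A command that cannot acquire the lock prints a "Lock Info" block with an "ID:" line
          OUT=$(terraform plan -input=false -lock-timeout=10s 2>&1) || \
            LOCK_ID=$(printf '%s\n' "$OUT" | sed -n 's/.*ID:[[:space:]]*\([0-9a-fA-F-]\{36\}\).*/\1/p' | head -n 1)
          if [ -n "${LOCK_ID:-}" ]; then
            terraform force-unlock -force "$LOCK_ID"
          fi
          exit 1  # keep the job red; continue-on-error would otherwise hide the failed destroy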

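3. tflint: flag missing timeouts on known slow resources.

Enable the AWS ruleset in .tflint.hcl as shown. To our knowledge that ruleset does not ship a rule requiring a timeouts block, so the check itself needs a custom ruleset plugin; the rule name below is a placeholder for such a custom rule, not a built-in:

plugin "aws" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Placeholder for a custom rule: require timeouts block on EKS, RDS, ElastiCache
rule "aws_eks_cluster_invalid_timeouts" {
  enabled = true
}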
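
Until a proper lint rule or policy is in place, a plain text scan can act as a stopgap. The sketch below is a crude heuristic, not an HCL parser: it assumes `terraform fmt` style, where each top-level block closes with `}` in the first column, and `missing_timeouts` is a hypothetical helper name:

```shell
# Print "type.name" for every resource block of the given type that contains
# no timeouts block. Usage: missing_timeouts <resource_type> <file.tf>...
missing_timeouts() (
  type=$1
  shift
  awk -v type="$type" '
    # Start of a matching resource block: remember its name
    $1 == "resource" && $2 == ("\"" type "\"") {
      in_block = 1; has = 0; name = $3; gsub(/"/, "", name); next
    }
    # Any timeouts block inside the resource counts
    in_block && $1 == "timeouts" { has = 1 }
    # A closing brace in column 1 ends the top-level block
    in_block && /^}/ { if (!has) print type "." name; in_block = 0 }
  ' "$@"
)
```

Running `missing_timeouts aws_eks_cluster *.tf` in a pre-commit hook would list the clusters that still rely on the provider default.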

4. For long-running environments, use -parallelism tuning to avoid API throttling that causes cascading timeouts:

# Default is 10 concurrent operations — reduce for accounts with tight API rate limits
terraform destroy -parallelism=3 -auto-approve

Reducing parallelism lowers the rate of ThrottlingException responses from the AWS API. Throttled calls are retried with backoff, and that retry time counts against the same resource timeout, so heavy throttling can push an otherwise healthy delete past its deadline.