How do I recover a stuck Terraform state lock after context canceled?

Run `terraform force-unlock <LOCK_ID>` using the Lock ID printed in the error output. If the ID is not visible, retrieve it from your backend directly: for S3/DynamoDB backends, query the DynamoDB lock table with `aws dynamodb scan --table-name <LOCK_TABLE>`. After unlocking, run `terraform refresh` (on recent Terraform versions, `terraform apply -refresh-only` is the preferred form) before any further apply or destroy to reconcile state drift from the partial operation.
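
If the output of the failed run was saved to a file, the Lock ID can be pulled out with a small filter. The following is a minimal sketch, assuming the lock error contains a `Lock Info:` block with an `ID:` line; the function name `extract_lock_id` and the file name `apply.log` are hypothetical.

```shell
# Hypothetical helper: print the first Lock ID found in Terraform output on stdin.
# Scans every field so it works with or without the "│" box-drawing prefix.
extract_lock_id() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }'
}

# Example usage (apply.log is an assumed file name):
#   lock_id="$(extract_lock_id < apply.log)"
#   terraform force-unlock "$lock_id"
```

Check the extracted value by eye before passing it to `force-unlock`; unlocking the wrong lock can interrupt a colleague's run.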

Does -parallelism=1 guarantee no context canceled errors?

It significantly reduces the probability by serializing all API calls, eliminating rate-limit retries that exhaust context deadlines. However, it does not fix a hard CI runner job timeout — if your GitHub Actions job has a 30-minute timeout and your serial destroy takes 45 minutes, you will still hit context canceled. Address both: lower parallelism AND increase the CI job timeout-minutes.
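
A rough budget check makes the interaction concrete. The sketch below estimates wall-clock minutes as resources × average seconds per resource ÷ parallelism. It ignores dependency ordering and retries, so treat the result as a lower bound. `estimate_minutes` and `fits_job_timeout` are hypothetical names, and the per-resource time is a figure you would have to measure yourself.

```shell
# Estimated minutes, rounded up: ceil(resources * secs_each / (parallelism * 60)).
estimate_minutes() {
  local resources=$1 secs_each=$2 parallelism=$3
  echo $(( (resources * secs_each + parallelism * 60 - 1) / (parallelism * 60) ))
}

# Succeeds (exit 0) only if the estimate fits inside the CI job timeout.
# Arguments: resources, seconds per resource, parallelism, timeout in minutes.
fits_job_timeout() {
  local timeout_minutes=$4
  [ "$(estimate_minutes "$1" "$2" "$3")" -le "$timeout_minutes" ]
}
```

With 300 resources at an assumed 9 seconds each, `estimate_minutes 300 9 1` gives 45, which does not fit a 30-minute job but does fit a 120-minute one.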

Which Terraform provider versions fixed the context propagation bug?

The AWS provider addressed the most severe context-propagation issues in the v5.x series (specifically around 5.30+). The Google provider fixed similar issues in v5.x as well. Always check the provider's GitHub changelog for 'context canceled', 'context deadline', or 'HTTP client timeout' fix entries before pinning a version. Running `terraform providers lock` after upgrading ensures your lock file is updated consistently across all team members and CI runners.
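
To gate a pipeline on a minimum provider version, a version-aware comparison is enough. This sketch assumes a `sort` that supports `-V` (GNU coreutils does); `version_at_least` is a hypothetical helper, and the 5.30.0 floor in the usage comment is only the approximate figure mentioned above, not a verified changelog entry.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (installed version taken from `terraform version` output):
#   version_at_least "5.54.1" "5.30.0" || echo "AWS provider too old" >&2
```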

Fixing Terraform 'Error: context canceled' During Large Apply or Destroy Operations

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 20–45 mins

TL;DR

What broke: Terraform's internal Go context expired or was interrupted mid-apply/destroy, orphaning real cloud resources while the state file reflects a partial mutation — state drift is now live.
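How to fix it: Unlock the state, add explicit timeouts blocks to long-running resources, reduce -parallelism, pin the provider to a version without the context-propagation bug, then refresh and review a fresh plan before re-running. Reserve -refresh=false for the retried destroy, once the state has been reconciled.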

The Incident (What does the error mean?)

Raw terminal output during a 300-resource destroy:

╷
│ Error: context canceled
│
│   with aws_instance.worker[147],
│   on main.tf line 84, in resource "aws_instance" "worker":
│   84: resource "aws_instance" "worker" {
│
│ context canceled
╵

Error: context canceled

  with module.vpc.aws_route_table_association.private[3],
...

Terraform acquires a state lock. To unlock:
  terraform force-unlock 8f3a2c1d-...

Immediate consequence: Terraform exits non-zero. The state lock is held. Some resources were destroyed or created; others were not. Your .tfstate now describes infrastructure that does not match reality. Every subsequent plan will produce a diff that may re-create already-existing resources or attempt to destroy already-deleted ones.

The Attack Vector / Blast Radius

This is not a transient warning — it is a state corruption event in progress.

Why it happens (root causes in order of frequency):

Provider HTTP client inherits the root context: The AWS/GCP/AzureRM provider SDK passes Terraform's root context.Context directly into API calls. On large applies, the provider's default HTTP timeout (often 20–30 min hardcoded) fires before all resources complete, canceling all in-flight goroutines simultaneously.
SIGINT/SIGTERM from CI runner: GitHub Actions, Jenkins, and GitLab CI impose their own job-level timeouts. When the runner kills the process (SIGTERM), Terraform's signal handler triggers a graceful shutdown that emits context canceled for every pending operation.
Parallelism exhaustion + provider rate limiting: The default -parallelism=10 with 200+ resources triggers API rate-limit 429 responses. The provider retries inside a context that is already approaching its deadline, and the operation is canceled when that deadline passes.
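
For the CI-runner case, one possible mitigation is to give Terraform its own soft deadline, shorter than the job timeout, so it receives an interrupt while it still has time to write state and release the lock. The following is a sketch under that assumption, using the coreutils `timeout` command; `run_with_soft_deadline` is a hypothetical wrapper name and the durations are placeholders.

```shell
# Hypothetical wrapper: run a command, send SIGINT once soft_limit elapses so the
# process can shut down gracefully, then SIGKILL if it is still alive after grace.
# Arguments: soft_limit, grace, then the command and its arguments.
run_with_soft_deadline() {
  local soft_limit=$1 grace=$2
  shift 2
  timeout -s INT -k "$grace" "$soft_limit" "$@"
}

# Example usage (100-minute soft limit inside a 120-minute CI job):
#   run_with_soft_deadline 100m 5m terraform apply -auto-approve -parallelism=5
```

Pending operations will still report context canceled when the interrupt arrives; the aim is only to avoid a hard kill that leaves the lock held.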

Blast radius:

State lock left open — All other pipelines targeting this workspace are blocked indefinitely.
Partial resource graph — Dependencies mid-chain (e.g., a subnet destroyed but its route table association not yet removed) leave cloud resources in an invalid configuration that may incur cost or create security exposure (orphaned security groups with open rules).
Drift accumulates silently — If the team runs terraform apply -auto-approve to "fix" it without a fresh plan, they risk double-creating resources or hitting naming collisions.

How to Fix It (The Solution)

Step 0: Unlock State Immediately

# Get the Lock ID from the error output
terraform force-unlock 8f3a2c1d-4b5e-...

Do not skip this. Every minute the lock is held blocks your entire team.

Basic Fix — Add Timeout Blocks and Reduce Parallelism

 resource "aws_instance" "worker" {
   count         = 200
   ami           = var.ami_id
   instance_type = "t3.medium"
+
+  timeouts {
+    create = "30m"
+    update = "30m"
+    delete = "30m"
+  }
 }

- terraform apply
+ terraform apply -parallelism=5

Lowering parallelism reduces concurrent API calls, which reduces the chance of hitting rate limits that cause retries that exhaust the context deadline.
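
To avoid repeating the flag in every invocation, the value can be derived from the resource count and exported through `TF_CLI_ARGS_apply`, which Terraform reads as extra arguments for `apply`. A minimal sketch: `pick_parallelism` and its 100-resource threshold are assumptions for illustration, not Terraform defaults.

```shell
# Hypothetical rule of thumb: keep the default of 10 for small states,
# drop to 5 once the state holds 100 resources or more.
pick_parallelism() {
  local resource_count=$1
  if [ "$resource_count" -ge 100 ]; then
    echo 5
  else
    echo 10
  fi
}

# Example usage (requires an initialized working directory):
#   count=$(( $(terraform state list | wc -l) ))
#   export TF_CLI_ARGS_apply="-parallelism=$(pick_parallelism "$count")"
```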

Enterprise Best Practice — Provider-Level Timeout + Retry Configuration

 terraform {
   required_providers {
     aws = {
       source  = "hashicorp/aws"
-      version = "~> 4.0"
+      version = "~> 5.54"  # context propagation fixes backported in 5.x
     }
   }
 }
 
 provider "aws" {
   region = var.region
+
+  # The AWS provider exposes no HTTP client timeout argument; tune retries instead
+  # so throttled calls back off rather than burning the remaining deadline
+  max_retries = 25
+  retry_mode  = "adaptive"
+
+  default_tags {
+    tags = local.common_tags
+  }
 }

For the CI runner timeout problem, configure the pipeline wrapper — not Terraform itself:

 # .github/workflows/terraform.yml
 jobs:
   terraform-apply:
-    timeout-minutes: 30
+    timeout-minutes: 120
     steps:
       - name: Terraform Apply
         run: |
-          terraform apply -auto-approve
+          terraform apply -auto-approve -parallelism=5
         env:
           TF_CLI_ARGS: "-no-color"

For Terragrunt or wrapper scripts, pass the flags explicitly:

- terraform destroy -auto-approve
+ terraform destroy -auto-approve -parallelism=3 -refresh=false

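-refresh=false skips the pre-destroy refresh API calls, which alone can consume 30–40% of your context deadline on large state files. It also makes Terraform trust the recorded state as-is, so use it only after the state has been reconciled following any earlier failure.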

Recovery Sequence After a Partial Apply

# 1. Unlock state
terraform force-unlock <LOCK_ID>

# 2. Refresh state to reconcile drift; do NOT skip
#    (on recent Terraform versions, `terraform apply -refresh-only` is the preferred form)
terraform refresh

# 3. Inspect the plan before touching anything
terraform plan -out=recovery.tfplan

# 4. Review recovery.tfplan manually, then apply (options must precede the plan file)
terraform apply -parallelism=3 recovery.tfplan
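
In a pipeline, the planning step can report whether any drift remains by using `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when changes are present. The sketch below only maps that status to a label; `describe_plan_status` and the label strings are hypothetical.

```shell
# Map the exit status of `terraform plan -detailed-exitcode` to a label.
# 0 = no changes, 2 = changes present, anything else = error.
describe_plan_status() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Example usage:
#   terraform plan -detailed-exitcode -out=recovery.tfplan
#   describe_plan_status "$?"
```

A "drift" result is the expected outcome after a partial apply and is the cue to review the saved plan by hand before applying it.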

Prevention in CI/CD

1. Checkov — Enforce Timeout Blocks on Long-Running Resources

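# .checkov.yaml
# CKV_TF_RESOURCE_TIMEOUTS is a custom check you write yourself, not a built-in:
# it should fail if aws_instance, aws_db_instance, or google_container_cluster
# is missing an explicit timeouts block. The directory name below is a placeholder.
external-checks-dir:
  - ./checkov_custom
check:
  - CKV_TF_RESOURCE_TIMEOUTS

Use checkov -d . --config-file .checkov.yaml as a pre-apply gate in your pipeline.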

2. OPA/Conftest Policy — Block applies without parallelism cap

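# policy/terraform_apply.rego
# Assumes a wrapper-generated input such as {"parallelism": 10, "resource_count": 250};
# Terraform's plan JSON does not include the parallelism setting.
package terraform.apply

deny[msg] {
  input.resource_count > 100
  input.parallelism > 10
  msg := "Parallelism must be <= 10 for applies touching > 100 resources to prevent context cancellation under API rate limits."
}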
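
The policy above reads `input.parallelism`, which Terraform does not emit on its own, so the pipeline has to supply it. The following is a minimal sketch of a wrapper that writes such an input document. `make_policy_input` is a hypothetical helper, and `resource_count` is an assumed extra field that a policy could also consult.

```shell
# Hypothetical helper: emit the JSON input document for the policy.
# Arguments: parallelism, resource count (both integers).
make_policy_input() {
  printf '{"parallelism": %d, "resource_count": %d}\n' "$1" "$2"
}

# Example usage:
#   make_policy_input 10 250 > input.json
#   conftest test --policy policy/ --namespace terraform.apply input.json
```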

3. Terraform Cloud / Enterprise — Use Remote Runs with Extended Timeouts

If you are running terraform apply locally or in ephemeral CI containers, migrate large workspace applies to Terraform Cloud remote execution or Atlantis with persistent runners. Remote runs are not subject to CI job-level timeout-minutes and have configurable execution timeouts up to 24 hours.

4. Split Large Workspaces

The real fix for consistently hitting context deadlines on destroy/apply is workspace decomposition. A workspace with 300+ resources is an anti-pattern:

 infra/
 ├── networking/     # VPCs, subnets, route tables — ~30 resources
 ├── compute/        # EC2, ASGs — ~80 resources  
 ├── data/           # RDS, ElastiCache — ~40 resources
 └── app/            # ECS services, ALBs — ~60 resources

Each workspace applies independently. Context cancellation in compute/ does not affect networking/. State locks are isolated. Blast radius is contained.
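
Once the workspaces are split, a short loop can run them in dependency order. This is a sketch that assumes the `infra/` layout shown above and that networking must come first; the ordering of the remaining stacks, the `STACKS` variable, and `for_each_stack` are hypothetical.

```shell
# Assumed ordering: networking first, since the other stacks depend on it.
STACKS="networking data compute app"

# Run a Terraform subcommand in each stack directory, stopping at the first failure.
# TF can be overridden (for example TF=echo) for a dry run.
for_each_stack() {
  local tf=${TF:-terraform} stack
  for stack in $STACKS; do
    "$tf" -chdir="infra/$stack" "$@" || return 1
  done
}

# Example usage:
#   for_each_stack apply -parallelism=5
```

For a destroy, the list would need to be walked in reverse so that dependents are removed before the networking they rely on.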