Fixing Terraform 'Error: context canceled' During Large Apply or Destroy Operations
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 20–45 mins
TL;DR
- What broke: Terraform's internal Go context expired or was interrupted mid-apply/destroy, orphaning real cloud resources while the state file reflects a partial mutation. State drift is now live.
- How to fix it: Add or increase resource-level timeouts blocks, reduce -parallelism, pin the provider to a version without the context-propagation bug, and re-run with -refresh=false after unlocking state.
- Sandbox: Use our Client-Side Sandbox below to auto-refactor your provider and resource timeout blocks without leaking your credentials.
The Incident (What does the error mean?)
Raw terminal output during a 300-resource destroy:
╷
│ Error: context canceled
│
│ with aws_instance.worker[147],
│ on main.tf line 84, in resource "aws_instance" "worker":
│ 84: resource "aws_instance" "worker" {
│
│ context canceled
╵
Error: context canceled
with module.vpc.aws_route_table_association.private[3],
...
Terraform acquires a state lock. To unlock:
terraform force-unlock 8f3a2c1d-...
Immediate consequence: Terraform exits non-zero. If the process was killed before it could shut down cleanly, the state lock is still held. Some resources were destroyed or created; others were not. Operations that were in flight when the context was canceled may not be recorded, so your .tfstate can describe infrastructure that does not match reality. Every subsequent plan may then produce a diff that re-creates already-existing resources or attempts to destroy already-deleted ones.
The Attack Vector / Blast Radius
This is not a transient warning — it is a state corruption event in progress.
Why it happens (root causes in order of frequency):
1. Provider HTTP client inherits the root context: The AWS/GCP/AzureRM provider SDK passes Terraform's root context.Context directly into API calls. On large applies, the provider's default HTTP timeout (often 20–30 min hardcoded) fires before all resources complete, canceling all in-flight goroutines simultaneously.
2. SIGINT/SIGTERM from CI runner: GitHub Actions, Jenkins, and GitLab CI impose their own job-level timeouts. When the runner kills the process (SIGTERM), Terraform's signal handler triggers a graceful shutdown that emits context canceled for every pending operation.
3. Parallelism exhaustion + provider rate limiting: Default -parallelism=10 with 200+ resources causes API rate-limit 429s. The provider retries inside a context that is already approaching deadline, then cancels.
Blast radius:
- State lock left open: All other pipelines targeting this workspace are blocked indefinitely.
- Partial resource graph: Dependencies mid-chain (e.g., a subnet destroyed but its route table association not yet removed) leave cloud resources in an invalid configuration that may incur cost or create security exposure (orphaned security groups with open rules).
- Drift accumulates silently: If the team runs terraform apply -auto-approve to "fix" it without a fresh plan, they risk double-creating resources or hitting naming collisions.
How to Fix It (The Solution)
Step 0: Unlock State Immediately
# Get the Lock ID from the error output
terraform force-unlock 8f3a2c1d-4b5e-...
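Do not skip this. Every minute the lock is held blocks your entire team. First confirm that no other Terraform run is still active against this workspace, because force-unlock removes the lock regardless of who holds it.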
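The lock ID is printed in the Lock Info block that Terraform shows when a later run fails to acquire the lock. Below is a minimal sketch for pulling it out of captured output. It assumes the usual "ID:" line format; extract_lock_id and the sample ID are hypothetical.

```shell
# Print the first lock ID found in Terraform output read from stdin.
# Assumes the "Lock Info:" block contains a line of the form "  ID:   <id>".
extract_lock_id() {
  awk '$1 == "ID:" { print $2; exit }'
}

# Example with a captured error message (the ID below is made up).
sample='Error: Error acquiring the state lock

Lock Info:
  ID:        11111111-2222-3333-4444-555555555555
  Path:      example/terraform.tfstate
  Operation: OperationTypeApply'

printf '%s\n' "$sample" | extract_lock_id
```

In a pipeline, the same function could read the output of a failed terraform plan, and the result would be passed to terraform force-unlock once you have confirmed that no other run is active.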
Basic Fix — Add Timeout Blocks and Reduce Parallelism
  resource "aws_instance" "worker" {
    count         = 200
    ami           = var.ami_id
    instance_type = "t3.medium"
+
+   timeouts {
+     create = "30m"
+     update = "30m"
+     delete = "30m"
+   }
  }

- terraform apply
+ terraform apply -parallelism=5
Lowering parallelism reduces concurrent API calls. Fewer concurrent calls mean fewer rate-limit responses, fewer retries, and less time burned against the context deadline.
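If the flag is easy to forget, Terraform also reads per-command default arguments from TF_CLI_ARGS_<command> environment variables. Here is a small sketch, assuming a cap of 5 suits your account's rate limits:

```shell
# Apply a parallelism cap to every apply and destroy started from this shell.
# The value 5 is an assumption; tune it to your provider's rate limits.
export TF_CLI_ARGS_apply="-parallelism=5"
export TF_CLI_ARGS_destroy="-parallelism=5"

# Show what Terraform will pick up.
env | grep '^TF_CLI_ARGS_' | sort
```

Setting the variables in the CI job environment keeps the cap in one place instead of repeating it in every command line.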
Enterprise Best Practice — Provider-Level Timeout + Retry Configuration
  terraform {
    required_providers {
      aws = {
        source  = "hashicorp/aws"
-       version = "~> 4.0"
+       version = "~> 5.54" # context propagation fixes backported in 5.x
      }
    }
  }
  provider "aws" {
    region = var.region
+
+   # Note: http_proxy only routes API traffic through a proxy. It does not
+   # change any timeout; operation timeouts are set per resource in timeouts blocks.
+   # max_retries caps how often a throttled API call is retried.
+   max_retries = 25
+   http_proxy  = var.http_proxy
+
+   default_tags {
+     tags = local.common_tags
+   }
  }
For the CI runner timeout problem, configure the pipeline wrapper — not Terraform itself:
  # .github/workflows/terraform.yml
  jobs:
    terraform-apply:
-     timeout-minutes: 30
+     timeout-minutes: 120
      steps:
        - name: Terraform Apply
          run: |
-           terraform apply -auto-approve
+           terraform apply -auto-approve -parallelism=5
          env:
            TF_CLI_ARGS: "-no-color"
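A hard kill from the CI runner gives Terraform no time to finish in-flight operations and release the lock. One option is to interrupt Terraform yourself a few minutes before the job limit. The sketch below assumes a timeout utility (GNU coreutils or compatible) is available on the runner. run_with_soft_deadline and the durations are illustrative, not part of Terraform.

```shell
# Run a command, send SIGINT after the soft limit, and SIGKILL only if it
# is still running after the grace period. Requires the timeout utility.
#   $1 = soft limit (e.g. 100m), $2 = grace period (e.g. 10m), rest = command
run_with_soft_deadline() {
  soft="$1"
  grace="$2"
  shift 2
  timeout -s INT -k "$grace" "$soft" "$@"
}

# Harmless demonstration: sleep is interrupted after 1 second.
status=0
run_with_soft_deadline 1 1 sleep 5 || status=$?
echo "exit status: $status"
```

In the workflow, the run step might become run_with_soft_deadline 100m 10m terraform apply -auto-approve -parallelism=5, which keeps the total under a 120-minute job limit. A non-zero status from the wrapper means the soft limit was reached or the command itself failed.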
For Terragrunt or wrapper scripts, pass the flags explicitly:
- terraform destroy -auto-approve
+ terraform destroy -auto-approve -parallelism=3 -refresh=false
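-refresh=false skips the pre-destroy refresh API calls, which alone can consume 30–40% of your context deadline on large state files. Terraform then plans from the recorded state only, so use it when the state is known to be current.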
Recovery Sequence After a Partial Apply
# 1. Unlock state
terraform force-unlock <LOCK_ID>
# 2. Refresh state to reconcile drift (do NOT skip); the standalone terraform refresh command is deprecated
terraform apply -refresh-only
# 3. Inspect the plan before touching anything
terraform plan -out=recovery.tfplan
# 4. Review recovery.tfplan manually, then apply (flags must come before the plan file)
terraform apply -parallelism=3 recovery.tfplan
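To make the plan step scriptable, terraform plan accepts -detailed-exitcode, which returns 0 when there are no changes, 1 on error, and 2 when changes are pending. Here is a minimal sketch of gating on that status; classify_plan_exit is a hypothetical helper.

```shell
# Map the exit status of "terraform plan -detailed-exitcode" to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;         # state and real infrastructure agree
    2) echo "changes-pending" ;; # review the saved plan before applying
    *) echo "plan-failed" ;;     # 1 (or anything else) means the plan errored
  esac
}

# Typical use (commented out so the sketch runs without Terraform):
#   terraform plan -detailed-exitcode -out=recovery.tfplan
#   classify_plan_exit "$?"
classify_plan_exit 2
```

A pipeline could stop for manual review on changes-pending and abort on plan-failed, so nothing is applied unattended after a partial run.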
Prevention in CI/CD
1. Checkov — Enforce Timeout Blocks on Long-Running Resources
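# .checkov.yaml
# CKV_TF_RESOURCE_TIMEOUTS is a custom check ID, not a Checkov built-in. You must write
# the policy yourself: fail if aws_instance, aws_db_instance, google_container_cluster
# are missing explicit timeouts blocks.
check:
  - CKV_TF_RESOURCE_TIMEOUTS
Use checkov -d . --external-checks-dir ./checkov_policies --check CKV_TF_RESOURCE_TIMEOUTS as a pre-apply gate in your pipeline (the policy directory name is an example). The built-in CKV2_AWS_* family does not include this custom check.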
2. OPA/Conftest Policy — Block applies without parallelism cap
# policy/terraform_apply.rego
package terraform.apply
# input.parallelism is not part of Terraform's plan JSON; the pipeline must supply it.
deny[msg] {
  input.parallelism > 10
  msg := "Parallelism must be <= 10 to prevent context cancellation under API rate limits."
}
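Conftest evaluates the policy against a JSON document, so the pipeline has to provide input.parallelism itself. Below is a sketch of generating that input. policy_input and the file name in the usage comment are assumptions.

```shell
# Emit the JSON document the policy reads. The "parallelism" key matches
# input.parallelism in the Rego rule; the function name is hypothetical.
policy_input() {
  printf '{"parallelism": %d}\n' "$1"
}

policy_input 5

# With Conftest installed, the gate would then look like:
#   policy_input 5 > apply_input.json
#   conftest test --policy policy/ --namespace terraform.apply apply_input.json
```

The --namespace flag is needed because the policy lives in package terraform.apply rather than Conftest's default main package.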
3. Terraform Cloud / Enterprise — Use Remote Runs with Extended Timeouts
If you are running terraform apply locally or in ephemeral CI containers, migrate large workspace applies to Terraform Cloud remote execution or Atlantis with persistent runners. Remote runs are not subject to CI job-level timeout-minutes and have configurable execution timeouts up to 24 hours.
4. Split Large Workspaces
The real fix for consistently hitting context deadlines on destroy/apply is workspace decomposition. A workspace with 300+ resources is an anti-pattern:
infra/
├── networking/ # VPCs, subnets, route tables — ~30 resources
├── compute/ # EC2, ASGs — ~80 resources
├── data/ # RDS, ElastiCache — ~40 resources
└── app/ # ECS services, ALBs — ~60 resources
Each workspace applies independently. Context cancellation in compute/ does not affect networking/. State locks are isolated. Blast radius is contained.
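With the split layout, a wrapper can apply the workspaces in dependency order and stop at the first failure. This is a minimal sketch that assumes the directory names above. The networking, data, compute, app order is a guess at the dependencies, and apply_all and DRY_RUN are hypothetical names.

```shell
# Apply each workspace directory in the order given, stopping on failure.
# With DRY_RUN=1 the commands are printed instead of executed.
apply_all() {
  root="$1"
  shift
  for ws in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "terraform -chdir=$root/$ws apply -parallelism=5"
    else
      terraform -chdir="$root/$ws" apply -parallelism=5 || return 1
    fi
  done
}

# Print the commands without running Terraform.
DRY_RUN=1
apply_all infra networking data compute app
```

Because each directory has its own state and lock, a failure in one workspace leaves the earlier ones fully applied and the later ones untouched.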