What is the default Terraform parallelism and why does it cause API rate limit errors?

Terraform defaults to `parallelism=10`, meaning it executes up to 10 concurrent resource operations simultaneously. On a state file with 200+ resources, a single `terraform refresh` can generate thousands of API calls within seconds across multiple AWS services. Most AWS APIs have per-service rate limits (e.g., EC2 `DescribeInstances` ~100 req/s, IAM `GetRole` ~20 req/s) that are easily exceeded by this default, causing `ThrottlingException` errors that abort the entire refresh operation.

Is it safe to run `terraform plan -refresh=false` in production?

Yes, with a controlled workflow. Running `plan -refresh=false` tells Terraform to trust the existing state file rather than re-querying the cloud APIs. It is safe when you have recently run a targeted `terraform apply -refresh-only` to update state, or when you are confident no out-of-band changes have occurred. The risk is operating on stale state — if someone manually changed a resource in the console, Terraform won't detect the drift. The recommended pattern is: run `-refresh-only` on a schedule (nightly) for drift detection, and use `-refresh=false` for PR validation pipelines to avoid throttling on every commit.

How do I fix a stuck Terraform state lock caused by a failed refresh due to rate limiting?

A failed `terraform refresh` that errors mid-execution may not release the DynamoDB state lock. First, identify the lock ID from the error output or by querying DynamoDB directly: `aws dynamodb scan --table-name terraform-state-lock`. Then force-unlock using the lock ID: `terraform force-unlock `. **Only do this if you are certain no other Terraform process is actively running against that state.** After unlocking, address the rate limit root cause (reduce parallelism, add retries) before re-running. Never force-unlock a state that another legitimate process holds — it will cause state corruption.

How to Fix Terraform 'Error: Failed to refresh state' API Rate Limit Errors

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: terraform refresh (or plan/apply with implicit refresh) is issuing parallel Describe*/Get*/List* API calls across hundreds of resources simultaneously, exceeding your cloud provider's per-second or burst rate limit — Terraform aborts the entire state refresh, leaving terraform.tfstate potentially stale.
How to fix it: Reduce Terraform's default parallelism from 10 to 3–5, add provider-level retry/backoff configuration, and surgically use -target to scope refreshes to only the affected resource graph.
Use our Client-Side Sandbox above to paste your provider block and resource count — it will auto-generate the tuned configuration with correct parallelism and retry settings.

The Incident (What Does the Error Mean?)

Raw error output from a live AWS environment:

Error: Failed to refresh state

Causes:
  - aws_instance.web[42]: error reading EC2 Instance (i-0abc123def456789): 
    RequestError: send request failed
    caused by: Post "https://ec2.us-east-1.amazonaws.com/": 
    ThrottlingException: Rate exceeded
    status code: 400, request id: a1b2c3d4-...

  - aws_security_group.app[17]: error reading Security Group (sg-0def987abc): 
    ThrottlingException: Rate exceeded

  - aws_iam_role.lambda_exec[3]: error reading IAM Role (lambda-exec-prod): 
    Throttling: Rate exceeded
    status code: 400

Immediate consequence: Terraform cannot reconcile the difference between your .tfstate file and the actual cloud state. Every subsequent plan or apply will either fail outright or operate on stale state — meaning Terraform may attempt to recreate resources that already exist, or skip drift that genuinely occurred. In a production environment, this is a silent correctness failure, not just a slowdown.

The Attack Vector / Blast Radius

Terraform's default parallelism = 10 means it fires 10 concurrent API calls per resource operation. On a state with 200+ resources — common in any non-trivial AWS account — a single terraform refresh generates 2,000+ API calls in under 60 seconds across EC2, IAM, RDS, S3, and VPC APIs simultaneously.

AWS API rate limits are per-service, per-region, per-account. EC2 DescribeInstances allows ~100 req/s. IAM GetRole allows ~20 req/s. When Terraform floods all of these simultaneously:

Partial refresh failure: Some resources refresh successfully, others don't. Your state is now partially stale — the worst possible outcome because Terraform believes it has accurate state when it doesn't.
Cascading plan corruption: A stale aws_security_group state causes Terraform to generate a diff that destroys and recreates the SG, which cascades to every EC2 instance referencing it.
CI/CD pipeline deadlock: In a pipeline with terraform plan on every PR, throttling errors cause non-deterministic failures. Engineers re-run pipelines, which compounds the throttling, triggering AWS account-level throttling that affects other services and teams.
State lock contention: If using remote state (S3 + DynamoDB), a failed refresh that doesn't cleanly release the state lock forces manual terraform force-unlock operations — a dangerous intervention in production.

The blast radius in a multi-workspace monorepo is severe. One throttled workspace can exhaust the API budget for the entire AWS account, breaking deployments in unrelated workspaces.

How to Fix It (The Solution)

Basic Fix — Reduce Parallelism at CLI Level

Drop parallelism immediately to stop the bleeding:

# Instead of:
terraform refresh

# Use:
terraform refresh -parallelism=3

# Or for plan/apply:
terraform plan -parallelism=3 -refresh=true

For targeted refresh when you know the affected resource:

terraform plan -target=aws_instance.web -target=aws_security_group.app -parallelism=3

Enterprise Best Practice — Provider-Level Retry + Parallelism Configuration

Hardcode the fix at the provider level so it applies to every operation, not just CLI invocations:

# versions.tf
 terraform {
   required_providers {
     aws = {
       source  = "hashicorp/aws"-      version = "~> 5.0"
+      version = "~> 5.31" # 5.31+ has improved retry backoff
     }
   }
 }

# provider.tf
 provider "aws" {
   region = var.aws_region
-  # No retry configuration — Terraform uses SDK defaults (3 retries, no jitter)
+
+  # Increase max retries with exponential backoff for throttling errors
+  max_retries = 10
+
+  default_tags {
+    tags = local.common_tags
+  }
 }

# terraform.tf — root module backend/workspace config
 terraform {
   backend "s3" {
     bucket         = "my-tfstate-bucket"
     key            = "prod/terraform.tfstate"
     region         = "us-east-1"
     dynamodb_table = "terraform-state-lock"
     encrypt        = true
   }
 }

-# No parallelism override — defaults to 10
+# Enforce reduced parallelism for all operations in this workspace
+# Set via TF_CLI_ARGS_plan, TF_CLI_ARGS_apply, TF_CLI_ARGS_refresh env vars
+# in your CI/CD system (GitHub Actions, GitLab CI, etc.):
+# TF_CLI_ARGS_plan="-parallelism=5"
+# TF_CLI_ARGS_apply="-parallelism=5"
+# TF_CLI_ARGS_refresh="-parallelism=5"

# For large state files: disable automatic refresh on plan,
# run explicit targeted refreshes instead

-# Old workflow (hammers all APIs on every plan):
-# terraform plan

+# New workflow — decouple refresh from plan:
+# Step 1: Refresh only changed resource types
+# terraform apply -refresh-only -target=module.compute -parallelism=3
+
+# Step 2: Plan without re-refresh (uses state from step 1)
+# terraform plan -refresh=false
+
+# Step 3: Apply
+# terraform apply -refresh=false

For AWS specifically — request a rate limit increase via Service Quotas for the APIs you're hitting most (DescribeInstances, DescribeSecurityGroups, GetRole). This is a 2-minute console operation and takes effect within minutes for most services.

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.

Prevention in CI/CD

1. Enforce parallelism via environment variables in your pipeline — not per-command flags:

# .github/workflows/terraform.yml
env:
  TF_CLI_ARGS_plan: "-parallelism=5"
  TF_CLI_ARGS_apply: "-parallelism=5"
  TF_CLI_ARGS_refresh: "-parallelism=3"

2. Add a Checkov policy to catch missing max_retries on AWS provider:

# checkov/custom_checks/check_aws_provider_retries.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories

class AWSProviderMaxRetries(BaseResourceCheck):
    def __init__(self):
        name = "Ensure AWS provider has max_retries configured"
        id = "CKV_CUSTOM_AWS_001"
        supported_resources = ["provider"]
        categories = [CheckCategories.GENERAL_SECURITY]
        super().__init__(name=name, id=id, categories=categories,
                         supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        if conf.get("name") == "aws":
            retries = conf.get("max_retries", [None])
            if retries and int(retries[0]) >= 5:
                return CheckResult.PASSED
        return CheckResult.FAILED

3. Split large monolithic workspaces. A workspace with 500+ resources will always be throttle-prone. Use workspace decomposition — separate workspaces for networking, compute, iam, data — and orchestrate them with Terragrunt or Atlantis. This limits the blast radius of any single refresh operation to ~50–100 resources.

4. Monitor API throttling proactively with CloudWatch:

# Monitor EC2 throttling — alert before Terraform hits it
resource "aws_cloudwatch_metric_alarm" "ec2_throttling" {
  alarm_name          = "terraform-ec2-api-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "ThrottledRequests"
  namespace           = "AWS/EC2"
  period              = "60"
  statistic           = "Sum"
  threshold           = "10"
  alarm_description   = "EC2 API throttling detected — reduce Terraform parallelism"
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
}

5. Use terraform plan -refresh=false in PR validation pipelines. Reserve full state refresh for scheduled nightly drift detection jobs, not every PR. This eliminates the throttling vector entirely from your critical path CI.