How to Fix Terraform 'Error: Failed to refresh state' API Rate Limit Errors
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke:
terraform refresh(orplan/applywith implicit refresh) is issuing parallelDescribe*/Get*/List*API calls across hundreds of resources simultaneously, exceeding your cloud provider's per-second or burst rate limit — Terraform aborts the entire state refresh, leavingterraform.tfstatepotentially stale. - How to fix it: Reduce Terraform's default parallelism from
10to3–5, add provider-level retry/backoff configuration, and surgically use-targetto scope refreshes to only the affected resource graph. - Use our Client-Side Sandbox above to paste your provider block and resource count — it will auto-generate the tuned configuration with correct parallelism and retry settings.
The Incident (What Does the Error Mean?)
Raw error output from a live AWS environment:
Error: Failed to refresh state
Causes:
- aws_instance.web[42]: error reading EC2 Instance (i-0abc123def456789):
RequestError: send request failed
caused by: Post "https://ec2.us-east-1.amazonaws.com/":
ThrottlingException: Rate exceeded
status code: 400, request id: a1b2c3d4-...
- aws_security_group.app[17]: error reading Security Group (sg-0def987abc):
ThrottlingException: Rate exceeded
- aws_iam_role.lambda_exec[3]: error reading IAM Role (lambda-exec-prod):
Throttling: Rate exceeded
status code: 400
Immediate consequence: Terraform cannot reconcile the difference between your .tfstate file and the actual cloud state. Every subsequent plan or apply will either fail outright or operate on stale state — meaning Terraform may attempt to recreate resources that already exist, or skip drift that genuinely occurred. In a production environment, this is a silent correctness failure, not just a slowdown.
The Attack Vector / Blast Radius
Terraform's default parallelism = 10 means it fires 10 concurrent API calls per resource operation. On a state with 200+ resources — common in any non-trivial AWS account — a single terraform refresh generates 2,000+ API calls in under 60 seconds across EC2, IAM, RDS, S3, and VPC APIs simultaneously.
AWS API rate limits are per-service, per-region, per-account. EC2 DescribeInstances allows ~100 req/s. IAM GetRole allows ~20 req/s. When Terraform floods all of these simultaneously:
- Partial refresh failure: Some resources refresh successfully, others don't. Your state is now partially stale — the worst possible outcome because Terraform believes it has accurate state when it doesn't.
- Cascading plan corruption: A stale
aws_security_groupstate causes Terraform to generate a diff that destroys and recreates the SG, which cascades to every EC2 instance referencing it. - CI/CD pipeline deadlock: In a pipeline with
terraform planon every PR, throttling errors cause non-deterministic failures. Engineers re-run pipelines, which compounds the throttling, triggering AWS account-level throttling that affects other services and teams. - State lock contention: If using remote state (S3 + DynamoDB), a failed refresh that doesn't cleanly release the state lock forces manual
terraform force-unlockoperations — a dangerous intervention in production.
The blast radius in a multi-workspace monorepo is severe. One throttled workspace can exhaust the API budget for the entire AWS account, breaking deployments in unrelated workspaces.
How to Fix It (The Solution)
Basic Fix — Reduce Parallelism at CLI Level
Drop parallelism immediately to stop the bleeding:
# Instead of:
terraform refresh
# Use:
terraform refresh -parallelism=3
# Or for plan/apply:
terraform plan -parallelism=3 -refresh=true
For targeted refresh when you know the affected resource:
terraform plan -target=aws_instance.web -target=aws_security_group.app -parallelism=3
Enterprise Best Practice — Provider-Level Retry + Parallelism Configuration
Hardcode the fix at the provider level so it applies to every operation, not just CLI invocations:
# versions.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"- version = "~> 5.0"
+ version = "~> 5.31" # 5.31+ has improved retry backoff
}
}
}
# provider.tf
provider "aws" {
region = var.aws_region
- # No retry configuration — Terraform uses SDK defaults (3 retries, no jitter)
+
+ # Increase max retries with exponential backoff for throttling errors
+ max_retries = 10
+
+ default_tags {
+ tags = local.common_tags
+ }
}
# terraform.tf — root module backend/workspace config
terraform {
backend "s3" {
bucket = "my-tfstate-bucket"
key = "prod/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
-# No parallelism override — defaults to 10
+# Enforce reduced parallelism for all operations in this workspace
+# Set via TF_CLI_ARGS_plan, TF_CLI_ARGS_apply, TF_CLI_ARGS_refresh env vars
+# in your CI/CD system (GitHub Actions, GitLab CI, etc.):
+# TF_CLI_ARGS_plan="-parallelism=5"
+# TF_CLI_ARGS_apply="-parallelism=5"
+# TF_CLI_ARGS_refresh="-parallelism=5"
# For large state files: disable automatic refresh on plan,
# run explicit targeted refreshes instead
-# Old workflow (hammers all APIs on every plan):
-# terraform plan
+# New workflow — decouple refresh from plan:
+# Step 1: Refresh only changed resource types
+# terraform apply -refresh-only -target=module.compute -parallelism=3
+
+# Step 2: Plan without re-refresh (uses state from step 1)
+# terraform plan -refresh=false
+
+# Step 3: Apply
+# terraform apply -refresh=false
For AWS specifically — request a rate limit increase via Service Quotas for the APIs you're hitting most (DescribeInstances, DescribeSecurityGroups, GetRole). This is a 2-minute console operation and takes effect within minutes for most services.
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Enforce parallelism via environment variables in your pipeline — not per-command flags:
# .github/workflows/terraform.yml
env:
TF_CLI_ARGS_plan: "-parallelism=5"
TF_CLI_ARGS_apply: "-parallelism=5"
TF_CLI_ARGS_refresh: "-parallelism=3"
2. Add a Checkov policy to catch missing max_retries on AWS provider:
# checkov/custom_checks/check_aws_provider_retries.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories
class AWSProviderMaxRetries(BaseResourceCheck):
def __init__(self):
name = "Ensure AWS provider has max_retries configured"
id = "CKV_CUSTOM_AWS_001"
supported_resources = ["provider"]
categories = [CheckCategories.GENERAL_SECURITY]
super().__init__(name=name, id=id, categories=categories,
supported_resources=supported_resources)
def scan_resource_conf(self, conf):
if conf.get("name") == "aws":
retries = conf.get("max_retries", [None])
if retries and int(retries[0]) >= 5:
return CheckResult.PASSED
return CheckResult.FAILED
3. Split large monolithic workspaces. A workspace with 500+ resources will always be throttle-prone. Use workspace decomposition — separate workspaces for networking, compute, iam, data — and orchestrate them with Terragrunt or Atlantis. This limits the blast radius of any single refresh operation to ~50–100 resources.
4. Monitor API throttling proactively with CloudWatch:
# Monitor EC2 throttling — alert before Terraform hits it
resource "aws_cloudwatch_metric_alarm" "ec2_throttling" {
alarm_name = "terraform-ec2-api-throttling"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "2"
metric_name = "ThrottledRequests"
namespace = "AWS/EC2"
period = "60"
statistic = "Sum"
threshold = "10"
alarm_description = "EC2 API throttling detected — reduce Terraform parallelism"
alarm_actions = [aws_sns_topic.ops_alerts.arn]
}
5. Use terraform plan -refresh=false in PR validation pipelines. Reserve full state refresh for scheduled nightly drift detection jobs, not every PR. This eliminates the throttling vector entirely from your critical path CI.