
How to Fix Terraform 'Error: Failed to refresh state' API Rate Limit Errors

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: terraform refresh (or plan/apply with implicit refresh) is issuing parallel Describe*/Get*/List* API calls across hundreds of resources simultaneously, exceeding your cloud provider's per-second or burst rate limit — Terraform aborts the entire state refresh, leaving terraform.tfstate potentially stale.
  • How to fix it: Reduce Terraform's default parallelism from 10 to 3–5, add provider-level retry/backoff configuration, and surgically use -target to scope refreshes to only the affected resource graph.

The Incident (What Does the Error Mean?)

Raw error output from a live AWS environment:

Error: Failed to refresh state

Causes:
  - aws_instance.web[42]: error reading EC2 Instance (i-0abc123def456789): 
    RequestError: send request failed
    caused by: Post "https://ec2.us-east-1.amazonaws.com/": 
    ThrottlingException: Rate exceeded
    status code: 400, request id: a1b2c3d4-...

  - aws_security_group.app[17]: error reading Security Group (sg-0def987abc): 
    ThrottlingException: Rate exceeded

  - aws_iam_role.lambda_exec[3]: error reading IAM Role (lambda-exec-prod): 
    Throttling: Rate exceeded
    status code: 400

Immediate consequence: Terraform cannot reconcile the difference between your .tfstate file and the actual cloud state. Every subsequent plan or apply will either fail outright or operate on stale state — meaning Terraform may attempt to recreate resources that already exist, or skip drift that genuinely occurred. In a production environment, this is a silent correctness failure, not just a slowdown.
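Before retrying, it helps to know exactly which resources failed. The following is a minimal sketch, assuming the error output follows the "Causes:" layout shown above; the function name failed_addresses and the regular expression are illustrative, not part of Terraform.

```python
import re

# Matches cause lines such as "  - aws_instance.web[42]: error reading ..."
CAUSE_LINE = re.compile(r"^\s*-\s+(?P<address>[\w.\-\[\]\"]+):\s+error\b")


def failed_addresses(output: str) -> list[str]:
    """Return the resource addresses listed as causes, in order, without duplicates."""
    seen: list[str] = []
    for line in output.splitlines():
        match = CAUSE_LINE.match(line)
        if match and match.group("address") not in seen:
            seen.append(match.group("address"))
    return seen


SAMPLE = """Error: Failed to refresh state

Causes:
  - aws_instance.web[42]: error reading EC2 Instance (i-0abc123def456789):
    ThrottlingException: Rate exceeded
  - aws_security_group.app[17]: error reading Security Group (sg-0def987abc):
    ThrottlingException: Rate exceeded
"""

if __name__ == "__main__":
    for address in failed_addresses(SAMPLE):
        print(address)
```

The extracted addresses can then be passed to -target to scope the next refresh to the resources that actually failed.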


The Attack Vector / Blast Radius

Terraform's default -parallelism=10 means it walks the resource graph with up to 10 concurrent operations, and each resource read can issue several API calls. On a state with 200+ resources (common in any non-trivial AWS account), a single terraform refresh can generate on the order of 2,000 API calls in under 60 seconds across the EC2, IAM, RDS, S3, and VPC APIs.

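AWS API rate limits are enforced per service and per account, and per region for regional services (IAM is global). Limits also vary by action: as rough, illustrative figures, EC2 DescribeInstances tolerates on the order of 100 req/s and IAM GetRole on the order of 20 req/s; check the current AWS documentation for exact values. When Terraform floods all of these simultaneously: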

  1. Partial refresh failure: Some resources refresh successfully, others don't. Your state is now partially stale — the worst possible outcome because Terraform believes it has accurate state when it doesn't.
  2. Cascading plan corruption: A stale aws_security_group state causes Terraform to generate a diff that destroys and recreates the SG, which cascades to every EC2 instance referencing it.
  3. CI/CD pipeline deadlock: In a pipeline with terraform plan on every PR, throttling errors cause non-deterministic failures. Engineers re-run pipelines, which compounds the throttling, triggering AWS account-level throttling that affects other services and teams.
  4. State lock contention: If using remote state (S3 + DynamoDB), a failed refresh that doesn't cleanly release the state lock forces manual terraform force-unlock operations — a dangerous intervention in production.

The blast radius in a multi-workspace monorepo is severe. One throttled workspace can exhaust the API budget for the entire AWS account, breaking deployments in unrelated workspaces.
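To see why lowering parallelism helps, the burst can be modeled with a simple token bucket. This is only a rough sketch: the refill rate, burst size, and per-call latency below are made-up inputs, real AWS limits differ per action, and the model ignores retries.

```python
def burst_schedule(total_calls, concurrency, call_latency_s):
    """Timestamps when `concurrency` workers issue calls back to back."""
    return [(i // concurrency) * call_latency_s for i in range(total_calls)]


def simulate_throttling(request_times, rate_per_s, burst):
    """Count requests rejected by a token bucket refilled at rate_per_s, capped at burst."""
    tokens = float(burst)
    last = 0.0
    rejected = 0
    for t in sorted(request_times):
        # Refill for the elapsed time, never exceeding the bucket capacity
        tokens = min(float(burst), tokens + (t - last) * rate_per_s)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0
        else:
            rejected += 1
    return rejected


if __name__ == "__main__":
    for concurrency in (10, 5, 3):
        times = burst_schedule(total_calls=400, concurrency=concurrency, call_latency_s=0.05)
        rejected = simulate_throttling(times, rate_per_s=20, burst=20)
        print(f"parallelism={concurrency}: {rejected} of 400 calls throttled")
```

Under these assumptions, lower concurrency spreads the same calls over a longer window, so more of them land after the bucket has refilled.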


How to Fix It (The Solution)

Basic Fix — Reduce Parallelism at CLI Level

Drop parallelism immediately to stop the bleeding:

# Instead of:
terraform refresh

# Use:
terraform refresh -parallelism=3

# Or for plan/apply:
terraform plan -parallelism=3 -refresh=true

For targeted refresh when you know the affected resource:

terraform plan -target=aws_instance.web -target=aws_security_group.app -parallelism=3
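When the list of affected resources comes from a script, the targeted command can be assembled programmatically. A minimal sketch follows; targeted_plan_command is a hypothetical helper, and only the -parallelism and -target flags shown above are used.

```python
import shlex


def targeted_plan_command(addresses, parallelism=3):
    """Build a `terraform plan` argument list scoped to the given resource addresses."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    command = ["terraform", "plan", f"-parallelism={parallelism}"]
    command += [f"-target={address}" for address in addresses]
    return command


if __name__ == "__main__":
    cmd = targeted_plan_command(["aws_instance.web[42]", "aws_security_group.app"])
    # shlex.join quotes arguments containing brackets so the shell does not glob them
    print(shlex.join(cmd))
```

Passing the list directly to subprocess.run avoids shell quoting issues entirely; shlex.join is only needed when printing a command for someone to paste.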

Enterprise Best Practice — Provider-Level Retry + Parallelism Configuration

Hardcode the fix at the provider level so it applies to every operation, not just CLI invocations:

# versions.tf
 terraform {
   required_providers {
     aws = {
       source  = "hashicorp/aws"
-      version = "~> 5.0"
+      version = "~> 5.31" # 5.31+ has improved retry backoff
     }
   }
 }

# provider.tf
 provider "aws" {
   region = var.aws_region
-  # No explicit retry configuration (the AWS provider documents a default max_retries of 25)
+
+  # Set max_retries explicitly so retry behavior is visible and tunable;
+  # throttled calls are retried with exponential backoff
+  max_retries = 10
+
+  default_tags {
+    tags = local.common_tags
+  }
 }

# terraform.tf (root module backend/workspace config)
 terraform {
   backend "s3" {
     bucket         = "my-tfstate-bucket"
     key            = "prod/terraform.tfstate"
     region         = "us-east-1"
     dynamodb_table = "terraform-state-lock"
     encrypt        = true
   }
 }

-# No parallelism override — defaults to 10
+# Enforce reduced parallelism for all operations in this workspace
+# Set via TF_CLI_ARGS_plan, TF_CLI_ARGS_apply, TF_CLI_ARGS_refresh env vars
+# in your CI/CD system (GitHub Actions, GitLab CI, etc.):
+# TF_CLI_ARGS_plan="-parallelism=5"
+# TF_CLI_ARGS_apply="-parallelism=5"
+# TF_CLI_ARGS_refresh="-parallelism=5"

# For large state files: disable automatic refresh on plan,
# run explicit targeted refreshes instead

-# Old workflow (hammers all APIs on every plan):
-# terraform plan

+# New workflow — decouple refresh from plan:
+# Step 1: Refresh only changed resource types
+# terraform apply -refresh-only -target=module.compute -parallelism=3
+
+# Step 2: Plan without re-refresh (uses state from step 1)
+# terraform plan -refresh=false
+
+# Step 3: Apply
+# terraform apply -refresh=false
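
The max_retries setting in the provider block relies on exponential backoff between attempts. The sketch below illustrates the general idea with "full jitter"; it is not the AWS SDK's actual algorithm, and base_s and cap_s are hypothetical values chosen for illustration.

```python
import random


def backoff_delays(max_retries, base_s=0.5, cap_s=20.0, rng=None):
    """Return one randomized delay per retry attempt (exponential backoff, full jitter)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # The ceiling doubles on each attempt until it reaches the cap
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter: pick uniformly between zero and the ceiling
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(10), start=1):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

Jitter matters when many workers are throttled at once: without it, they all retry at the same instants and hit the limit again together.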

For AWS specifically: where the limit is adjustable, request a rate limit increase through Service Quotas (or an AWS Support case) for the APIs you're hitting most (DescribeInstances, DescribeSecurityGroups, GetRole). Submitting the request takes a couple of minutes; approval time varies by service and is not guaranteed.




Prevention in CI/CD

1. Enforce parallelism via environment variables in your pipeline — not per-command flags:

# .github/workflows/terraform.yml
env:
  TF_CLI_ARGS_plan: "-parallelism=5"
  TF_CLI_ARGS_apply: "-parallelism=5"
  TF_CLI_ARGS_refresh: "-parallelism=3"
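
A pipeline step can verify that these variables are actually set before Terraform runs. This is a minimal sketch; parallelism_violations and the max_allowed threshold are assumptions for illustration, not part of Terraform or GitHub Actions.

```python
import os
import re

PARALLELISM = re.compile(r"-parallelism=(\d+)")


def parallelism_violations(env, max_allowed=5, commands=("plan", "apply", "refresh")):
    """Return a list of problems found in the TF_CLI_ARGS_<command> variables."""
    problems = []
    for command in commands:
        name = f"TF_CLI_ARGS_{command}"
        match = PARALLELISM.search(env.get(name, ""))
        if match is None:
            problems.append(f"{name}: -parallelism is not set")
        elif int(match.group(1)) > max_allowed:
            problems.append(f"{name}: -parallelism={match.group(1)} exceeds {max_allowed}")
    return problems


if __name__ == "__main__":
    issues = parallelism_violations(os.environ)
    for issue in issues:
        print(issue)
    raise SystemExit(1 if issues else 0)
```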

2. Add a Checkov policy to catch missing max_retries on AWS provider:

# checkov/custom_checks/check_aws_provider_retries.py
from checkov.common.models.enums import CheckCategories, CheckResult
from checkov.terraform.checks.provider.base_check import BaseProviderCheck


class AWSProviderMaxRetries(BaseProviderCheck):
    def __init__(self):
        name = "Ensure AWS provider has max_retries configured"
        id = "CKV_CUSTOM_AWS_001"
        # Provider blocks are scanned by provider checks, not resource checks
        supported_provider = ["aws"]
        categories = [CheckCategories.GENERAL_SECURITY]
        super().__init__(name=name, id=id, categories=categories,
                         supported_provider=supported_provider)

    def scan_provider_conf(self, conf):
        # Checkov wraps attribute values in lists; a missing attribute yields None here
        retries = conf.get("max_retries", [None])[0]
        try:
            return CheckResult.PASSED if int(retries) >= 5 else CheckResult.FAILED
        except (TypeError, ValueError):
            # Unset, or an expression that cannot be evaluated statically
            return CheckResult.FAILED


# Instantiate the check so Checkov registers it when loading this module
check = AWSProviderMaxRetries()

3. Split large monolithic workspaces. A workspace with 500+ resources will always be throttle-prone. Use workspace decomposition — separate workspaces for networking, compute, iam, data — and orchestrate them with Terragrunt or Atlantis. This limits the blast radius of any single refresh operation to ~50–100 resources.
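
To decide where to split, it helps to count how many resource instances would land in each candidate workspace. The sketch below reads a state file exported with terraform state pull; the GROUPS mapping from resource type to workspace is a hypothetical example and should be adapted to your own layout.

```python
import json
from collections import Counter

# Hypothetical mapping from resource type to target workspace
GROUPS = {
    "aws_vpc": "networking",
    "aws_subnet": "networking",
    "aws_security_group": "networking",
    "aws_instance": "compute",
    "aws_iam_role": "iam",
    "aws_db_instance": "data",
    "aws_s3_bucket": "data",
}


def instances_per_group(state):
    """Count managed resource instances per target workspace in a parsed state file."""
    counts = Counter()
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue  # skip data sources
        group = GROUPS.get(resource.get("type"), "other")
        counts[group] += len(resource.get("instances", []))
    return dict(counts)


if __name__ == "__main__":
    # Produced beforehand with: terraform state pull > state.json
    with open("state.json", encoding="utf-8") as handle:
        print(instances_per_group(json.load(handle)))
```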

4. Monitor API throttling proactively with CloudWatch:

# Monitor EC2 API throttling and alert before Terraform runs start failing.
# Assumes EC2 API metrics (namespace AWS/EC2/API) are enabled for the account;
# they are opt-in and not published by default.
resource "aws_cloudwatch_metric_alarm" "ec2_throttling" {
  alarm_name          = "terraform-ec2-api-throttling"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "RequestLimitExceeded"
  namespace           = "AWS/EC2/API"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "EC2 API throttling detected: reduce Terraform parallelism"
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
}

5. Use terraform plan -refresh=false in PR validation pipelines. Reserve full state refresh for scheduled nightly drift detection jobs, not every PR. This takes the refresh of managed resources out of your critical-path CI, although data sources may still be read during plan.
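
A nightly drift job can rely on the -detailed-exitcode flag of terraform plan, which exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The following is a minimal sketch; the helper names and status labels are illustrative.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`
DRIFT_STATUS = {0: "no-drift", 1: "error", 2: "drift"}


def drift_command(parallelism=3):
    """Build the plan command for a scheduled drift check with reduced parallelism."""
    return [
        "terraform", "plan",
        "-detailed-exitcode",
        "-input=false",
        f"-parallelism={parallelism}",
    ]


def classify_exit_code(code):
    """Map a plan exit code to a status label."""
    return DRIFT_STATUS.get(code, "unknown")


def run_drift_check(workdir, parallelism=3):
    """Run the plan in workdir and return the status label."""
    result = subprocess.run(drift_command(parallelism), cwd=workdir, check=False)
    return classify_exit_code(result.returncode)
```

Treating "error" separately from "drift" lets the job alert on throttling failures without reporting them as configuration drift.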
