How to Fix Terraform Over-Provisioning: Reducing m5.4xlarge Instance Count and Cloud Cost Overruns

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: LOW | Time to Fix: 20 mins

TL;DR

  • What broke: The Terraform plan statically provisions 50x m5.4xlarge On-Demand instances (16 vCPU, 64 GiB RAM each) with no auto-scaling, no Spot strategy, and no right-sizing analysis. The estimated monthly bill exceeds the budget by 450%.
  • How to fix it: Replace the static count with an Auto Scaling Group using a mixed instances policy (Spot + On-Demand), downsize to m5.xlarge or m5.2xlarge based on actual workload profiling, and enforce cost guardrails via AWS Budgets + OPA.

The Incident (What Does the Error Mean?)

Raw planner warning:

Warning: Plan will provision 50x m5.4xlarge instances.
Estimated monthly cloud cost exceeds budget threshold by 450%.
Over-provisioning detected.

Immediate consequence: At $0.768/hr per m5.4xlarge On-Demand (us-east-1), 50 instances running 730 hrs/month cost about $28,000/month ($0.768 × 730 × 50 = $28,032) for compute alone, before EBS, data transfer, NAT, or LB costs. If your budget threshold was $5,000, you are looking at a $23,000+ monthly overrun from a single terraform apply.
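The arithmetic above can be kept next to the config so the estimate is recomputed whenever the inputs change. This is a minimal sketch: the local names are hypothetical, the hourly rate is the figure quoted in this article and should be re-checked against current AWS pricing, and the $5,000 budget is the example threshold from the text.

```hcl
# Hypothetical locals for a back-of-the-envelope compute estimate.
locals {
  hourly_rate_usd = 0.768 # m5.4xlarge On-Demand, us-east-1 (rate quoted in this article)
  hours_per_month = 730
  instance_count  = 50
  budget_usd      = 5000 # example budget threshold from the text

  # Compute only: excludes EBS, data transfer, NAT, and load balancers.
  monthly_compute_usd = local.hourly_rate_usd * local.hours_per_month * local.instance_count
  overrun_usd         = local.monthly_compute_usd - local.budget_usd
}

output "monthly_compute_usd" {
  value = local.monthly_compute_usd
}

output "overrun_usd" {
  value = local.overrun_usd
}
```

Running terraform plan with these locals should show a monthly figure of roughly $28,000 and an overrun of roughly $23,000, matching the numbers above.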

This is almost always caused by one of three things: a count variable that wasn't parameterized and defaulted to a large number, a copy-paste from a load test environment, or a missing environment-specific tfvars file that should have overridden the instance count to 2 or 3 for non-prod.
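All three causes come down to the instance count not being an explicit, validated, per-environment input. One way to close that gap is sketched below, assuming a hypothetical variable name (app_instance_count) and a hypothetical tfvars layout. The ceiling of 5 is an illustrative choice, not a recommendation from AWS or Terraform.

```hcl
# variables.tf
variable "app_instance_count" {
  type        = number
  description = "Number of app servers; set per environment in a tfvars file."
  # Deliberately no default: a missing tfvars file fails the plan
  # instead of silently falling back to a large number.

  validation {
    condition     = var.app_instance_count >= 1 && var.app_instance_count <= 5
    error_message = "app_instance_count must be between 1 and 5. Use an Auto Scaling Group for anything larger."
  }
}

# main.tf
resource "aws_instance" "app_server" {
  count         = var.app_instance_count
  instance_type = "m5.xlarge"
  ami           = var.ami_id
  subnet_id     = var.subnet_id
}

# envs/dev.tfvars (hypothetical path), passed with -var-file:
# app_instance_count = 2
```

With no default, a pipeline that runs with -input=false and forgets the -var-file flag stops with a missing-variable error rather than applying a copy-pasted load-test count.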


The Attack Vector / Blast Radius

This isn't a security misconfig — it's a financial blast radius event. The cascading failure path:

  1. terraform apply runs unattended in CI/CD (common in GitOps pipelines with auto-approve). No human sees the cost estimate before apply.
  2. 50x m5.4xlarge spin up within 3-5 minutes. AWS does not block this. Service Quotas may not block this if your account has elevated EC2 limits.
  3. AWS Budgets alert fires, but only after billing data refreshes. That happens a few times a day at most, so the alert can arrive many hours (up to about 24) after launch. By then, the instances have been running.
  4. Orphaned compute: If this was a failed deployment, the ASG or instances may not be in any service mesh or load balancer. They run idle, burning $28K/month doing nothing.
  5. Quota exhaustion side effect: Spinning up 50x m5.4xlarge consumes 800 vCPUs (16 per instance). This can exhaust your regional On-Demand vCPU quota, blocking legitimate scaling events for production workloads running in the same region.

The m5.4xlarge selection itself is a secondary problem. Unless your p99 memory utilization is above 32 GiB per node (more than an m5.2xlarge provides, and validated by CloudWatch or Datadog), this instance type is almost certainly wrong. Most web/API workloads that teams accidentally over-provision on m5.4xlarge actually profile to m5.xlarge (4 vCPU / 16 GiB) under real traffic.
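Instance specs and the vCPU footprint can be read from the AWS API instead of being hard-coded, which keeps the right-sizing and quota discussion honest. The sketch below makes several assumptions: the quota code L-1216C47A is taken to be the "Running On-Demand Standard" vCPU quota and should be confirmed in the Service Quotas console, the check block needs Terraform 1.5 or later, and the comparison is against the total quota rather than remaining headroom.

```hcl
# Look up vCPU and memory for the candidate types from the AWS API.
data "aws_ec2_instance_type" "candidates" {
  for_each      = toset(["m5.xlarge", "m5.2xlarge", "m5.4xlarge"])
  instance_type = each.value
}

# Assumption: quota code for Running On-Demand Standard instances (vCPUs).
data "aws_servicequotas_service_quota" "ondemand_standard_vcpus" {
  service_code = "ec2"
  quota_code   = "L-1216C47A"
}

locals {
  planned_count = 50 # the count from the failing plan
  planned_vcpus = local.planned_count * data.aws_ec2_instance_type.candidates["m5.4xlarge"].default_vcpus
}

output "candidate_specs" {
  value = {
    for name, t in data.aws_ec2_instance_type.candidates :
    name => { vcpus = t.default_vcpus, memory_mib = t.memory_size }
  }
}

# A failed check is reported as a warning; it does not block apply.
check "vcpu_quota_headroom" {
  assert {
    condition     = local.planned_vcpus <= data.aws_servicequotas_service_quota.ondemand_standard_vcpus.value
    error_message = "Planned vCPUs exceed the regional On-Demand Standard quota."
  }
}
```

Comparing candidate_specs against observed p99 memory and CPU is a quick sanity check before committing to an instance type.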


How to Fix It (The Solution)

Basic Fix — Right-Size the Instance and Cap the Count

If you just need to stop the bleeding immediately:

 resource "aws_instance" "app_server" {
-  count         = 50
-  instance_type = "m5.4xlarge"
+  count         = 3
+  instance_type = "m5.xlarge"
   ami           = var.ami_id
   subnet_id     = var.subnet_id
 }

This is a stopgap only. Static count on aws_instance is the root architectural problem.


Enterprise Best Practice — Auto Scaling Group with Mixed Instances Policy

Replace the static aws_instance block entirely. This is the correct pattern for any workload that had 50 instances in the plan — clearly it needs to scale, just not to 50 all at once.

-resource "aws_instance" "app_server" {
-  count         = 50
-  instance_type = "m5.4xlarge"
-  ami           = var.ami_id
-  subnet_id     = var.subnet_id
-}

+resource "aws_launch_template" "app" {
+  name_prefix   = "app-lt-"
+  image_id      = var.ami_id
+  instance_type = "m5.xlarge"  # baseline; overridden by mixed policy
+
+  monitoring { enabled = true }
+
+  tag_specifications {
+    resource_type = "instance"
+    tags = { Environment = var.environment, CostCenter = var.cost_center }
+  }
+}
+
+resource "aws_autoscaling_group" "app" {
+  name                = "app-asg-${var.environment}"
+  vpc_zone_identifier = var.private_subnet_ids
+  min_size            = 2
+  max_size            = 10   # hard ceiling — never 50
+  desired_capacity    = 2
+
+  mixed_instances_policy {
+    instances_distribution {
+      on_demand_base_capacity                  = 2
+      on_demand_percentage_above_base_capacity = 20
+      spot_allocation_strategy                 = "capacity-optimized"
+    }
+    launch_template {
+      launch_template_specification {
+        launch_template_id = aws_launch_template.app.id
+        version            = "$Latest"
+      }
+      override {
+        instance_type = "m5.xlarge"
+      }
+      override {
+        instance_type = "m5.2xlarge"
+      }
+      override {
+        instance_type = "m4.xlarge"
+      }
+    }
+  }
+
+  tag {
+    key                 = "Name"
+    value               = "app-${var.environment}"
+    propagate_at_launch = true
+  }
+}

Key decisions in this config:

  • max_size = 10 is a hard financial guardrail baked into IaC. Even a runaway scaling policy cannot exceed this.
  • on_demand_base_capacity = 2 ensures you always have 2 stable On-Demand nodes for baseline reliability.
  • Spot handles 80% of the capacity above that On-Demand base (on_demand_percentage_above_base_capacity = 20), typically at a 60-70% discount to On-Demand pricing.
  • Three instance type overrides give the Spot allocator enough diversity to avoid capacity gaps.
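The config above caps capacity but does not yet define what drives scaling between min_size and max_size. A target tracking policy is one common option. This is a minimal sketch: the policy name and the 60% CPU target are assumptions, to be tuned from workload profiling.

```hcl
# Scales the ASG defined above between min_size and max_size.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "app-cpu-target-${var.environment}" # hypothetical name
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # assumed CPU target in percent; tune from profiling
  }
}
```

Once a scaling policy manages capacity, it is usually worth adding lifecycle { ignore_changes = [desired_capacity] } to the ASG so a later terraform apply does not reset the group to 2 instances mid-traffic.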


Prevention in CI/CD

The real fix is ensuring this warning blocks the pipeline rather than printing a warning that gets ignored.

1. Infracost — cost estimation as a PR gate:

# .github/workflows/infracost.yml
- name: Run Infracost
  uses: infracost/actions/setup@v2
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Check cost threshold
  run: |
    infracost breakdown --path=. --format=json --out-file=/tmp/infracost.json
    infracost comment github --path=/tmp/infracost.json \
      --repo=$GITHUB_REPOSITORY \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }} \
      --behavior=update
    # Fail if the total estimated monthly cost is $2000 or more
    # (reuses the breakdown written above; jq -e exits non-zero on false)
    jq -e '.totalMonthlyCost | tonumber < 2000' /tmp/infracost.json

2. OPA/Conftest policy — block large static instance counts:

# policies/no_large_static_count.rego
package terraform.aws

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  resource.change.after.instance_type == "m5.4xlarge"
  msg := sprintf("Direct aws_instance with m5.4xlarge is prohibited. Use aws_autoscaling_group. Resource: %s", [resource.address])
}

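deny[msg] {
  # Count instances that exist after apply; pure deletes (after == null) are
  # excluded so that migrating away from aws_instance is not blocked.
  instances := [r | r := input.resource_changes[_]; r.type == "aws_instance"; r.change.after != null]
  count(instances) > 5
  msg := sprintf("Plan leaves %d static aws_instance resources (limit 5). Use an ASG with a mixed instances policy.", [count(instances)])
}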

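Run in CI: conftest test tfplan.json --policy policies/ --namespace terraform.aws

Conftest evaluates only the main package by default, so the --namespace flag is needed for these rules to run. tfplan.json is the output of terraform show -json on a saved plan.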

3. Checkov — catch missing max_size guardrails:

checkov -d . --check CKV_AWS_315  # Ensures ASG uses launch templates
# Add custom check for max_size ceiling

4. AWS Service Control Policy (SCP) — account-level hard stop:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLargeInstanceTypes",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {
          "ec2:InstanceType": ["m5.4xlarge", "m5.8xlarge", "m5.16xlarge"]
        },
        "StringNotEquals": {
          "aws:PrincipalTag/Team": "platform-approved"
        }
      }
    }
  ]
}

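This SCP blocks any principal without the Team=platform-approved tag from launching these large instance types at the AWS Organizations level, whether the request comes from Terraform, the console, or an SDK. It cannot be overridden by IAM policies in member accounts, though SCPs do not apply to the organization's management account.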
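The TL;DR lists AWS Budgets as one of the guardrails, and a budget can live in the same Terraform codebase. The following is a sketch under stated assumptions: the $5,000 limit is the example threshold used earlier, the 80% forecast trigger is an arbitrary starting point, and budget_alert_email is a hypothetical variable.

```hcl
variable "budget_alert_email" {
  type        = string
  description = "Hypothetical variable: address that receives budget alerts."
}

resource "aws_budgets_budget" "monthly_compute" {
  name         = "monthly-ec2-${var.environment}"
  budget_type  = "COST"
  limit_amount = "5000" # example threshold from this article
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Restrict the budget to EC2 compute spend.
  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Compute Cloud - Compute"]
  }

  # Alert when forecasted spend crosses 80% of the limit.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}
```

Because budget alerts lag behind actual usage, this works as a backstop behind the pipeline gates above, not as a replacement for them.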
