How to Fix Terraform Over-Provisioning: Reducing m5.4xlarge Instance Count and Cloud Cost Overruns
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: LOW | Time to Fix: 20 mins
TL;DR
- What broke: The Terraform plan statically provisions 50x m5.4xlarge On-Demand instances (16 vCPU, 64 GiB RAM each) with no auto-scaling, no Spot strategy, and no right-sizing analysis. The monthly bill exceeds budget by 450%.
- How to fix it: Replace the static count with an Auto Scaling Group using a mixed instances policy (Spot + On-Demand), downsize to m5.xlarge or m5.2xlarge based on actual workload profiling, and enforce cost guardrails via AWS Budgets + OPA.
- Fast path: Use our Client-Side Sandbox above to auto-refactor this. Paste your aws_instance or aws_autoscaling_group block and get a right-sized, spot-optimized config generated locally without leaking your account IDs.
The Incident (What Does the Error Mean?)
Raw planner warning:
Warning: Plan will provision 50x m5.4xlarge instances.
Estimated monthly cloud cost exceeds budget threshold by 450%.
Over-provisioning detected.
Immediate consequence: At $0.768/hr per m5.4xlarge On-Demand (us-east-1), 50 instances running 730 hrs/month = **$28,000/month** for compute alone, before EBS, data transfer, NAT, or LB costs. If your budget threshold was $5,000, you are looking at a $23,000+ monthly overrun on a single terraform apply.
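The arithmetic behind that figure can be reproduced in a scratch Terraform configuration. This is a minimal sketch: the local names are illustrative, and the hourly rate and budget are the figures quoted in the text rather than values pulled from a pricing API.

```hcl
# Scratch config for sanity-checking the numbers; it creates no resources.
locals {
  hourly_rate_usd = 0.768 # m5.4xlarge On-Demand, us-east-1 (figure quoted in the text)
  instance_count  = 50
  hours_per_month = 730
  budget_usd      = 5000 # example budget threshold from the text

  # Compute-only monthly cost: rate x instances x hours.
  monthly_compute_usd = local.hourly_rate_usd * local.instance_count * local.hours_per_month
  overrun_usd         = local.monthly_compute_usd - local.budget_usd
}

output "monthly_compute_usd" {
  value = local.monthly_compute_usd
}

output "overrun_usd" {
  value = local.overrun_usd
}
```

Evaluating the locals in terraform console (or applying this outputs-only file) should give roughly 28,032 and 23,032, which is where the rounded $28,000 and $23,000+ figures come from.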
This is almost always caused by one of three things: a count variable that wasn't parameterized and defaulted to a large number, a copy-paste from a load test environment, or a missing environment-specific tfvars file that should have overridden the instance count to 2 or 3 for non-prod.
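One way to close off all three causes is to make the count an explicit, validated input. The following is a sketch under assumptions: the variable name instance_count, the ceiling of 5, and the envs/staging.tfvars path are hypothetical and not taken from the original config.

```hcl
# variables.tf
# Assumption: the instance count is exposed as var.instance_count.
variable "instance_count" {
  description = "Number of app instances for this environment."
  type        = number

  # No default on purpose: with -input=false in CI, a missing tfvars file
  # fails the plan instead of silently falling back to a large number.

  validation {
    condition     = var.instance_count >= 1 && var.instance_count <= 5
    error_message = "The instance_count value must be between 1 and 5. Larger fleets belong in an Auto Scaling Group."
  }
}

# envs/staging.tfvars (separate file, shown here as a comment)
# instance_count = 2
```

A non-prod pipeline would then run something like terraform plan -input=false -var-file=envs/staging.tfvars, so a copy-pasted count of 50 is rejected at plan time.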
The Attack Vector / Blast Radius
This isn't a security misconfig — it's a financial blast radius event. The cascading failure path:
- terraform apply runs unattended in CI/CD (common in GitOps pipelines with auto-approve). No human sees the cost estimate before apply.
- 50x m5.4xlarge spin up within 3-5 minutes. AWS does not block this. Service Quotas may not block this if your account has elevated EC2 limits.
- AWS Budgets alert fires — but only after the first billing period snapshot, which can be 24 hours later. By then, the instances have been running.
- Orphaned compute: If this was a failed deployment, the ASG or instances may not be in any service mesh or load balancer. They run idle, burning $28K/month doing nothing.
- Quota exhaustion side effect: Spinning up 50x m5.4xlarge consumes 800 vCPUs (50 x 16). This can exhaust your regional On-Demand vCPU quota, blocking legitimate scaling events for production workloads running in the same region.
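Since the Budgets alert in the path above arrives late, it helps to at least define it in the same codebase, with a forecast-based notification alongside the actual-spend one. This is a sketch, not the author's configuration: the budget name, the 80% threshold, and var.budget_alert_email are assumptions, and the $5,000 limit is the example threshold from the text.

```hcl
# Assumption: alerts go to a single email address supplied per environment.
variable "budget_alert_email" {
  description = "Recipient for budget alerts (hypothetical variable)."
  type        = string
}

resource "aws_budgets_budget" "monthly_cost" {
  name         = "monthly-cost-guardrail"
  budget_type  = "COST"
  limit_amount = "5000" # example threshold from the text
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Alert when the forecast for the month crosses 80% of the limit.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [var.budget_alert_email]
  }

  # Alert again when actual spend crosses the limit itself.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}
```

Budget alerts still depend on billing data refreshes, so this complements the pre-apply gates rather than replacing them.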
The m5.4xlarge selection itself is a secondary problem. An m5.4xlarge has 16 vCPUs and 64 GiB of RAM, so unless profiling (CloudWatch agent metrics or Datadog) shows p99 memory use per node above the 32 GiB an m5.2xlarge provides, this instance type is almost certainly wrong. Most web/API workloads that teams accidentally over-provision on m5.4xlarge actually profile to m5.xlarge (4 vCPU / 16GB) under real traffic.
How to Fix It (The Solution)
Basic Fix — Right-Size the Instance and Cap the Count
If you just need to stop the bleeding immediately:
 resource "aws_instance" "app_server" {
-  count         = 50
-  instance_type = "m5.4xlarge"
+  count         = 3
+  instance_type = "m5.xlarge"
   ami           = var.ami_id
   subnet_id     = var.subnet_id
 }
This is a stopgap only. Static count on aws_instance is the root architectural problem.
Enterprise Best Practice — Auto Scaling Group with Mixed Instances Policy
Replace the static aws_instance block entirely. This is the correct pattern for any workload that had 50 instances in the plan — clearly it needs to scale, just not to 50 all at once.
-resource "aws_instance" "app_server" {
-  count         = 50
-  instance_type = "m5.4xlarge"
-  ami           = var.ami_id
-  subnet_id     = var.subnet_id
-}
+resource "aws_launch_template" "app" {
+  name_prefix   = "app-lt-"
+  image_id      = var.ami_id
+  instance_type = "m5.xlarge" # baseline; overridden by mixed policy
+
+  monitoring { enabled = true }
+
+  tag_specifications {
+    resource_type = "instance"
+    tags          = { Environment = var.environment, CostCenter = var.cost_center }
+  }
+}
+
+resource "aws_autoscaling_group" "app" {
+  name                = "app-asg-${var.environment}"
+  vpc_zone_identifier = var.private_subnet_ids
+  min_size            = 2
+  max_size            = 10 # hard ceiling, never 50
+  desired_capacity    = 2
+
+  mixed_instances_policy {
+    instances_distribution {
+      on_demand_base_capacity                  = 2
+      on_demand_percentage_above_base_capacity = 20
+      spot_allocation_strategy                 = "capacity-optimized"
+    }
+    launch_template {
+      launch_template_specification {
+        launch_template_id = aws_launch_template.app.id
+        version            = "$Latest"
+      }
+      override {
+        instance_type = "m5.xlarge"
+      }
+      override {
+        instance_type = "m5.2xlarge"
+      }
+      override {
+        instance_type = "m4.xlarge"
+      }
+    }
+  }
+
+  tag {
+    key                 = "Name"
+    value               = "app-${var.environment}"
+    propagate_at_launch = true
+  }
+}
Key decisions in this config:
- max_size = 10 is a hard financial guardrail baked into IaC. Even a runaway scaling policy cannot exceed this.
- on_demand_base_capacity = 2 ensures you always have 2 stable On-Demand nodes for baseline reliability.
- With on_demand_percentage_above_base_capacity = 20, Spot handles 80% of the capacity above that base, typically at a 60-70% cost reduction versus On-Demand.
- Three instance type overrides give the Spot allocator enough diversity to avoid capacity gaps.
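The ASG above sets bounds but defines no scaling policy, so on its own it would stay at the desired capacity of 2. A target tracking policy is one common way to let it move between min_size and max_size. This is a sketch under assumptions: the policy name and the 60% CPU target are illustrative, and it reuses aws_autoscaling_group.app and var.environment from the config above.

```hcl
# Scales the group to hold average CPU near the target,
# always within the ASG's min_size/max_size bounds.
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "app-cpu-target-${var.environment}"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # assumed target; tune from real workload profiling
  }
}
```

Once a policy manages capacity, it is common to add lifecycle { ignore_changes = [desired_capacity] } to the ASG so that later applies do not reset the group back to 2.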
Prevention in CI/CD
The real fix is ensuring this warning blocks the pipeline rather than printing a warning that gets ignored.
1. Infracost — cost estimation as a PR gate:
# .github/workflows/infracost.yml (steps excerpt)
- name: Run Infracost
  uses: infracost/actions/setup@v2
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}
- name: Check cost threshold
  run: |
    infracost breakdown --path=. --format=json --out-file=/tmp/infracost.json
    infracost comment github --path=/tmp/infracost.json \
      --repo=$GITHUB_REPOSITORY \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }} \
      --behavior=update
    # Fail if the total estimated monthly cost is $2000 or more
    # (reuses the JSON written above instead of running breakdown twice)
    jq -e '.totalMonthlyCost | tonumber < 2000' /tmp/infracost.json
2. OPA/Conftest policy — block large static instance counts:
# policies/no_large_static_count.rego
# Rego v0 syntax; on OPA 1.0+ write `deny contains msg if { ... }`.
package terraform.aws

# Block m5.4xlarge on standalone aws_instance resources.
deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    resource.change.after.instance_type == "m5.4xlarge"
    msg := sprintf("Direct aws_instance with m5.4xlarge is prohibited. Use aws_autoscaling_group. Resource: %s", [resource.address])
}

# Block plans that leave more than 5 aws_instance resources in place.
# Pure deletes are excluded so the policy never blocks tearing instances down.
deny[msg] {
    kept := [r | r := input.resource_changes[_]; r.type == "aws_instance"; r.change.actions != ["delete"]]
    count(kept) > 5
    msg := "Static aws_instance count > 5 is prohibited. Use ASG with mixed instances policy."
}
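Run in CI: conftest test tfplan.json --policy policies/ --namespace terraform.aws (Conftest only evaluates the main package by default, so without --namespace these rules would never run.)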
3. Checkov — catch missing max_size guardrails:
checkov -d . --check CKV_AWS_315 # Ensures ASG uses launch templates
# Add custom check for max_size ceiling
4. AWS Service Control Policy (SCP) — account-level hard stop:
{
  "Effect": "Deny",
  "Action": "ec2:RunInstances",
  "Resource": "arn:aws:ec2:*:*:instance/*",
  "Condition": {
    "StringEquals": {
      "ec2:InstanceType": ["m5.4xlarge", "m5.8xlarge", "m5.16xlarge"]
    },
    "StringNotEquals": {
      "aws:PrincipalTag/Team": "platform-approved"
    }
  }
}
This SCP blocks any non-approved team from launching large instance types at the AWS Organizations level — Terraform, console, or SDK. It cannot be overridden by IAM policies in member accounts.
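The JSON above is a single statement; a deployable SCP also needs the Version and Statement wrapper and an attachment to an OU or account. A minimal sketch of managing that in Terraform follows, assuming it runs with AWS Organizations permissions in the management account. The policy name, the Sid, and var.workloads_ou_id are hypothetical.

```hcl
# Assumption: the target OU ID is passed in; it is not part of the original article.
variable "workloads_ou_id" {
  description = "ID of the OU (or account) the SCP should apply to."
  type        = string
}

resource "aws_organizations_policy" "deny_large_m5" {
  name = "deny-large-m5-instances"
  type = "SERVICE_CONTROL_POLICY"

  # Same statement as the JSON above, wrapped in a full policy document.
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyLargeM5"
      Effect   = "Deny"
      Action   = "ec2:RunInstances"
      Resource = "arn:aws:ec2:*:*:instance/*"
      Condition = {
        StringEquals = {
          "ec2:InstanceType" = ["m5.4xlarge", "m5.8xlarge", "m5.16xlarge"]
        }
        StringNotEquals = {
          "aws:PrincipalTag/Team" = "platform-approved"
        }
      }
    }]
  })
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.deny_large_m5.id
  target_id = var.workloads_ou_id
}
```

Attaching to a non-production OU first is a reasonable way to confirm the condition keys behave as expected before widening the scope.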