Why did Terraform plan 50 instances instead of the expected count?

Almost always a variable resolution failure. The `count` parameter was set to a variable (e.g., `var.instance_count`) that was not supplied via a `-var-file` flag or environment-specific `terraform.tfvars`. Terraform fell back to a default value of 50 set in `variables.tf`, which was likely copied from a load-test or capacity-planning environment. Always validate: `terraform plan -var-file=envs/prod.tfvars` and never set large numbers as variable defaults.

Is m5.4xlarge ever the right instance type for application workloads?

Rarely for general web/API workloads. m5.4xlarge (16 vCPU / 64 GiB RAM) is appropriate for in-memory caches, ML inference, or high-throughput data processing nodes. For most application tiers, profile your actual p95 CPU and memory utilization in CloudWatch or Datadog first. Most teams that reach for m5.4xlarge find their workload fits comfortably on m5.2xlarge or m5.xlarge, which cuts per-instance On-Demand cost by 50% or 75% respectively.

How do I set a hard cost ceiling so Terraform apply never exceeds budget again?

Three layers: (1) Infracost in CI/CD as a PR gate with a `jq` assertion that fails the pipeline if estimated monthly cost exceeds your threshold. (2) An OPA/Conftest policy that denies plans with static `aws_instance` counts above a safe ceiling. (3) An AWS Budget with an SNS alert and a Budgets action that attaches a deny policy to the deployer role once spend exceeds the threshold. This is the only layer of the three that acts after `terraform apply` has started, but Budgets data lags actual usage by hours, so treat it as a backstop and not as a real-time brake.

How to Fix Terraform Over-Provisioning: Reducing m5.4xlarge Instance Count and Cloud Cost Overruns

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: LOW | Time to Fix: 20 mins

TL;DR

What broke: Terraform plan is statically provisioning 50x m5.4xlarge On-Demand instances (16 vCPU, 64 GiB RAM each) with no auto-scaling, no spot strategy, and no right-sizing analysis. Monthly bill exceeds budget by 450%.
How to fix it: Replace the static count with an Auto Scaling Group using a mixed instances policy (Spot + On-Demand), downsize to m5.xlarge or m5.2xlarge based on actual workload profiling, and enforce cost guardrails via AWS Budgets + OPA.

The Incident (What Does the Error Mean?)

Raw planner warning:

Warning: Plan will provision 50x m5.4xlarge instances.
Estimated monthly cloud cost exceeds budget threshold by 450%.
Over-provisioning detected.

Immediate consequence: At $0.768/hr per m5.4xlarge On-Demand (us-east-1), 50 instances running 730 hrs/month = **$28,000/month** for compute alone — before EBS, data transfer, NAT, or LB costs. If your budget threshold was $5,000, you are looking at a $23,000+ monthly overrun on a single terraform apply.
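
The arithmetic behind that figure can be checked without touching AWS. A minimal sketch, assuming the us-east-1 On-Demand rate quoted above and the $5,000 budget used as an example; the file name and the local and output names are illustrative, not part of the original config.

```hcl
# cost_check.tf (hypothetical file): standalone, needs no provider or credentials.
locals {
  hourly_rate_usd = 0.768 # m5.4xlarge On-Demand, us-east-1 (rate quoted above)
  instance_count  = 50
  hours_per_month = 730
  budget_usd      = 5000 # example threshold from the text, not a real budget

  monthly_compute_usd = local.hourly_rate_usd * local.instance_count * local.hours_per_month
  overrun_usd         = local.monthly_compute_usd - local.budget_usd
}

output "monthly_compute_usd" {
  value = local.monthly_compute_usd # 0.768 * 50 * 730 = 28,032
}

output "overrun_usd" {
  value = local.overrun_usd # 28,032 - 5,000 = 23,032
}
```

Swapping in the m5.xlarge rate ($0.192/hr) and a count of 3 gives a quick before/after comparison for the basic fix described later.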

This is almost always caused by one of three things: a count variable that wasn't parameterized and defaulted to a large number, a copy-paste from a load test environment, or a missing environment-specific tfvars file that should have overridden the instance count to 2 or 3 for non-prod.
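
One way to close off all three causes is to declare the count variable with no default and a validation block, so a missing tfvars file fails the plan instead of silently falling back to a large number. A minimal sketch, assuming the variable is named `instance_count` and that 10 is an acceptable ceiling; both are assumptions to adjust for your environment.

```hcl
# variables.tf
variable "instance_count" {
  description = "Number of app instances. Must be set per environment."
  type        = number
  # No default on purpose: with -input=false, plan errors out when no value is supplied.

  validation {
    condition     = var.instance_count >= 1 && var.instance_count <= 10
    error_message = "The instance_count value must be between 1 and 10. Use an Auto Scaling Group for anything larger."
  }
}

# envs/staging.tfvars (hypothetical path) would then contain a single line:
# instance_count = 2
```

With this in place, `terraform plan -input=false` without a `-var-file` stops with a missing-variable error, and a value of 50 copied from a load-test environment is rejected by the validation rule.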

The Attack Vector / Blast Radius

This isn't a security misconfig — it's a financial blast radius event. The cascading failure path:

terraform apply runs unattended in CI/CD (common in GitOps pipelines with auto-approve). No human sees the cost estimate before apply.
50x m5.4xlarge spin up within 3-5 minutes. AWS does not block this. Service Quotas may not block this if your account has elevated EC2 limits.
AWS Budgets alert fires, but only after billing data refreshes, which can be up to 24 hours later. By then, the instances have been running and accruing cost.
Orphaned compute: If this was a failed deployment, the ASG or instances may not be in any service mesh or load balancer. They run idle, burning $28K/month doing nothing.
Quota exhaustion side effect: Spinning up 50x m5.4xlarge consumes 800 vCPUs (50 × 16). This can exhaust your regional On-Demand vCPU quota, blocking legitimate scaling events for production workloads running in the same region.

The m5.4xlarge selection itself is a secondary problem. Unless your p99 memory usage per node exceeds the 32 GiB an m5.2xlarge provides (validated by CloudWatch or Datadog), this instance type is almost certainly wrong. Most web/API workloads that teams accidentally over-provision on m5.4xlarge actually profile to m5.xlarge (4 vCPU / 16 GiB) under real traffic.

How to Fix It (The Solution)

Basic Fix — Right-Size the Instance and Cap the Count

If you just need to stop the bleeding immediately:

 resource "aws_instance" "app_server" {
-  count         = 50
-  instance_type = "m5.4xlarge"
+  count         = 3
+  instance_type = "m5.xlarge"
   ami           = var.ami_id
   subnet_id     = var.subnet_id
 }

This is a stopgap only. Static count on aws_instance is the root architectural problem.

Enterprise Best Practice — Auto Scaling Group with Mixed Instances Policy

Replace the static aws_instance block entirely. This is the correct pattern for any workload that had 50 instances in the plan — clearly it needs to scale, just not to 50 all at once.

-resource "aws_instance" "app_server" {
-  count         = 50
-  instance_type = "m5.4xlarge"
-  ami           = var.ami_id
-  subnet_id     = var.subnet_id
-}

+resource "aws_launch_template" "app" {
+  name_prefix   = "app-lt-"
+  image_id      = var.ami_id
+  instance_type = "m5.xlarge"  # baseline; overridden by mixed policy
+
+  monitoring { enabled = true }
+
+  tag_specifications {
+    resource_type = "instance"
+    tags = { Environment = var.environment, CostCenter = var.cost_center }
+  }
+}
+
+resource "aws_autoscaling_group" "app" {
+  name                = "app-asg-${var.environment}"
+  vpc_zone_identifier = var.private_subnet_ids
+  min_size            = 2
+  max_size            = 10   # hard ceiling — never 50
+  desired_capacity    = 2
+
+  mixed_instances_policy {
+    instances_distribution {
+      on_demand_base_capacity                  = 2
+      on_demand_percentage_above_base_capacity = 20
+      spot_allocation_strategy                 = "capacity-optimized"
+    }
+    launch_template {
+      launch_template_specification {
+        launch_template_id = aws_launch_template.app.id
+        version            = "$Latest"
+      }
+      override {
+        instance_type = "m5.xlarge"
+      }
+      override {
+        instance_type = "m5.2xlarge"
+      }
+      override {
+        instance_type = "m4.xlarge"
+      }
+    }
+  }
+
+  tag {
+    key                 = "Name"
+    value               = "app-${var.environment}"
+    propagate_at_launch = true
+  }
+}

Key decisions in this config:

max_size = 10 is a hard financial guardrail baked into IaC. Even a runaway scaling policy cannot exceed this.
on_demand_base_capacity = 2 ensures you always have 2 stable On-Demand nodes for baseline reliability.
Spot handles 80% of the capacity above that base (on_demand_percentage_above_base_capacity = 20), typically at a 60-70% discount to On-Demand.
Three instance type overrides give the Spot allocator enough diversity to avoid capacity gaps.
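
As written, the group stays at its desired capacity of 2 because nothing tells it when to scale. A minimal sketch of a target tracking policy that could be attached to the ASG above; the resource name `cpu_target` and the 60% CPU target are assumptions, and the target should come from your own utilization data.

```hcl
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "app-cpu-target-${var.environment}"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      # Average CPU across all instances in the group.
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # assumed target, not a recommendation
  }
}
```

The policy adds or removes instances to hold average CPU near the target, but never beyond max_size. Once a policy manages capacity, consider adding `lifecycle { ignore_changes = [desired_capacity] }` to the ASG so a later apply does not reset the group to 2.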

Prevention in CI/CD

The real fix is ensuring this warning blocks the pipeline rather than printing a warning that gets ignored.

1. Infracost — cost estimation as a PR gate:

# .github/workflows/infracost.yml
- name: Run Infracost
  uses: infracost/actions/setup@v2
  with:
    api-key: ${{ secrets.INFRACOST_API_KEY }}

- name: Check cost threshold
  run: |
    infracost breakdown --path=. --format=json --out-file=/tmp/infracost.json
    infracost comment github --path=/tmp/infracost.json \
      --repo=$GITHUB_REPOSITORY \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }} \
      --behavior=update
    # Fail the step if the total estimated monthly cost is $2000 or more
    # (reuses the JSON written above instead of running breakdown twice)
    jq -e '.totalMonthlyCost | tonumber < 2000' /tmp/infracost.json

2. OPA/Conftest policy — block large static instance counts:

# policies/no_large_static_count.rego
package terraform.aws

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  resource.change.after.instance_type == "m5.4xlarge"
  msg := sprintf("Direct aws_instance with m5.4xlarge is prohibited. Use aws_autoscaling_group. Resource: %s", [resource.address])
}

deny[msg] {
  # Count aws_instance resources that will exist after apply (skip pure deletes).
  instances := [r | r := input.resource_changes[_]; r.type == "aws_instance"; r.change.actions != ["delete"]]
  count(instances) > 5
  msg := "Static aws_instance count > 5 is prohibited. Use ASG with mixed instances policy."
}

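Run in CI against the JSON plan (terraform plan -out=tfplan.binary && terraform show -json tfplan.binary > tfplan.json). Conftest evaluates only the main package by default, so name the policy's namespace explicitly: conftest test tfplan.json --policy policies/ --namespace terraform.aws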

3. Checkov — catch missing max_size guardrails:

checkov -d . --check CKV_AWS_315  # Ensures ASG uses launch templates
# Enforcing a max_size ceiling requires a custom Checkov policy; none is shown here

4. AWS Service Control Policy (SCP) — account-level hard stop:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyLargeInstanceTypesUnlessApproved",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {
          "ec2:InstanceType": ["m5.4xlarge", "m5.8xlarge", "m5.16xlarge"]
        },
        "StringNotEquals": {
          "aws:PrincipalTag/Team": "platform-approved"
        }
      }
    }
  ]
}

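This SCP blocks any non-approved team from launching large instance types at the AWS Organizations level, whether the request comes from Terraform, the console, or an SDK. It cannot be overridden by IAM policies in member accounts. Two limits to keep in mind: SCPs do not apply to the organization's management account, and they do not restrict service-linked roles.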
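
The AWS Budgets layer named in the TL;DR is not shown above. A minimal sketch of a monthly budget with an SNS alert and an automatic action that attaches a deny policy to the deployer role. All four input variables are hypothetical placeholders for resources assumed to exist already, the $5,000 limit is the example figure from earlier, and the argument names should be verified against your AWS provider version.

```hcl
variable "alert_topic_arn" { type = string }        # hypothetical: existing SNS topic
variable "deployer_role_name" { type = string }     # hypothetical: CI deployer role name
variable "deny_policy_arn" { type = string }        # hypothetical: policy denying ec2:RunInstances
variable "budget_action_role_arn" { type = string } # hypothetical: role AWS Budgets assumes

resource "aws_budgets_budget" "compute" {
  name         = "monthly-compute-${var.environment}"
  budget_type  = "COST"
  limit_amount = "5000" # example threshold, not a recommendation
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Early warning when forecasted spend passes 80% of the limit.
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_sns_topic_arns = [var.alert_topic_arn]
  }
}

# Backstop: attach the deny policy to the deployer role at 100% of actual spend.
resource "aws_budgets_budget_action" "freeze_deployer" {
  budget_name        = aws_budgets_budget.compute.name
  action_type        = "APPLY_IAM_POLICY"
  approval_model     = "AUTOMATIC"
  notification_type  = "ACTUAL"
  execution_role_arn = var.budget_action_role_arn

  action_threshold {
    action_threshold_type  = "PERCENTAGE"
    action_threshold_value = 100
  }

  definition {
    iam_action_definition {
      policy_arn = var.deny_policy_arn
      roles      = [var.deployer_role_name]
    }
  }

  subscriber {
    address           = var.alert_topic_arn
    subscription_type = "SNS"
  }
}
```

Because budget data lags actual usage, this action limits how long an overrun continues; it does not prevent the initial launch. The CI gates and the SCP remain the preventive controls.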