
Fixing AWS ASG Cluster Autoscaler 'failed to scale up group: cloudprovider returned error' with Mixed Instances

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins


TL;DR

  • What broke: Cluster Autoscaler (CA) called the AWS Auto Scaling API to add capacity to a mixed-instance ASG and got a hard error back — pods are now stuck Pending, your workload is not scaling, and CA is retrying into a backoff loop.
  • How to fix it: The root cause is almost always one of three things: a stale/pinned Launch Template version, missing IAM permissions (autoscaling:CreateOrUpdateTags, ec2:RunInstances on the LT), or a malformed MixedInstancesPolicy with an instance type that's been retired or is unavailable in your AZ.

The Incident (What Does the Error Mean?)

Your CA logs look like this:

E1107 03:14:22.841271       1 scale_up.go:453] Failed to scale up group
  aws:///us-east-1a/my-mixed-asg: cloudprovider returned error:
  ValidationError: Invalid IAM Instance Profile name
  status code: 400, request id: a1b2c3d4-...

or this variant:

E1107 03:14:22.841271       1 scale_up.go:453] Failed to scale up group
  aws:///us-east-1b/my-mixed-asg: cloudprovider returned error:
  You have requested more vCPU capacity than your current vCPU limit
  of 32 allows for the instance bucket that the specified instance
  type belongs to. (Service: AmazonEC2)

or the silent killer:

W1107 03:14:23.100000       1 clusterstate.go:288] Failed to find
  template node for node group aws:///us-east-1c/my-mixed-asg

Immediate consequence: CA marks the node group as failed, backs off exponentially, and skips it for scale-up until the backoff expires. Pods remain Pending. HPA can still raise replica counts, but the new pods have no nodes to land on. Your SLO is burning.
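To make the backoff concrete, here is a minimal sketch of a doubling delay with a cap. The 5-minute initial delay and 30-minute cap are assumptions taken from CA's documented defaults for --initial-node-group-backoff-duration and --max-node-group-backoff-duration; confirm them for your CA version. backoff_schedule is a hypothetical helper for illustration, not part of CA.

```shell
#!/usr/bin/env bash
# backoff_schedule INITIAL_MIN MAX_MIN FAILURES
# Prints the wait (in minutes) applied after each consecutive failure,
# doubling every time and capping at MAX_MIN.
backoff_schedule() {
  local initial="$1" max="$2" failures="$3"
  local delay="$initial" i out=""
  for ((i = 1; i <= failures; i++)); do
    out+="${delay} "
    delay=$((delay * 2))
    if ((delay > max)); then delay="$max"; fi
  done
  echo "${out% }"
}

# Assumed defaults: 5-minute initial delay, 30-minute cap, five failures in a row.
backoff_schedule 5 30 5
```

Under those assumptions, a group that keeps failing is retried only once every 30 minutes after the fourth failure, which is why Pending pods can sit for a long time even after the underlying problem is fixed.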


The Attack Vector / Blast Radius

This is a silent cascading failure. CA does not crash: its pod stays Running and passes its health checks while your application starves for nodes. The blast radius:

  1. Pending pods accumulate. Deployments that need burst capacity (batch jobs, traffic spikes) queue indefinitely. No error surfaces to app teams — just latency degradation.
  2. Mixed instance policy makes diagnosis harder. With 4–8 instance types in the pool, AWS may reject only one type (e.g., c5.2xlarge is not offered in a specific AZ), but CA treats the entire node group as failed.
  3. Launch Template version drift. If your ASG references $Default or $Latest and your Terraform/CDK pipeline pushed a new LT version that broke the IAM instance profile ARN format (name vs. ARN mismatch), every scale-up attempt fails with a 400.
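  4. vCPU quota exhaustion. On-demand vCPU limits are per-account and per-region, and they apply to a bucket of instance families rather than a single family: the Standard quota is shared by the A, C, D, H, I, M, R, T, and Z families, so c5 and m5 nodes draw from the same pool. A mixed-instance policy that over-indexes on c5 or m5 will hit the quota wall with no prior warning.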
  5. AZ imbalance. CA tries a specific AZ. If that AZ has no capacity for your requested instance types (spot or on-demand), the error propagates up as a generic cloudprovider error.
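The quota case in item 4 is simple arithmetic, sketched below. nodes_that_fit is a hypothetical helper; the limit of 32 comes from the sample log, the 24 vCPUs in use is an invented figure, and 8 is the vCPU count of a c5.2xlarge.

```shell
#!/usr/bin/env bash
# nodes_that_fit LIMIT IN_USE VCPUS_PER_NODE
# Prints how many more nodes of a given size fit under an On-Demand vCPU quota.
# All three inputs are plain integers.
nodes_that_fit() {
  local limit="$1" in_use="$2" per_node="$3"
  local free=$((limit - in_use))
  if ((free < 0)); then free=0; fi
  echo $((free / per_node))   # integer division rounds down
}

# Quota of 32, 24 vCPUs already running (assumed), c5.2xlarge = 8 vCPUs.
nodes_that_fit 32 24 8
```

The current limit can usually be read with aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A --query 'Quota.Value'. L-1216C47A is the quota code for Running On-Demand Standard instances; verify it in your own account before relying on it.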

How to Fix It

Step 0: Isolate the Exact Error

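# --tail=-1 is required: with a label selector, kubectl logs returns only the last 10 lines per pod.
# klog error lines start with "E<MMDD>", not the word ERROR.
kubectl logs -n kube-system \
  -l app=cluster-autoscaler \
  --since=1h --tail=-1 | grep -E "(^E[0-9]{4}|cloudprovider|scale_up|ValidationError|Invalid)"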

Match your error to one of the three root causes below.
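A rough way to do that matching in bulk is a small classifier over the log lines. This is a sketch: classify_ca_error and its category labels are hypothetical, and it recognizes only the three message fragments shown in the sample logs earlier.

```shell
#!/usr/bin/env bash
# classify_ca_error LINE
# Maps a single CA log line to a coarse category based on known substrings.
classify_ca_error() {
  case "$1" in
    *"Invalid IAM Instance Profile"*) echo "launch-template-or-iam" ;;
    *"vCPU limit"*)                   echo "vcpu-quota" ;;
    *"Failed to find template node"*) echo "template-node-missing" ;;
    *)                                echo "unknown" ;;
  esac
}

# Example with one of the sample messages.
classify_ca_error "cloudprovider returned error: ValidationError: Invalid IAM Instance Profile name"
```

To tally a whole log, pipe the grep output from Step 0 into a loop such as: while IFS= read -r line; do classify_ca_error "$line"; done | sort | uniq -c. Anything reported as unknown still needs to be read by hand.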


Root Cause 1: Launch Template Version Conflict

Bad: ASG pinned to a deleted or malformed LT version, or using $Latest which resolved to a broken version.

# Terraform: aws_autoscaling_group mixed_instances_policy
  launch_template {
    launch_template_specification {
      launch_template_id = aws_launch_template.eks_node.id
-     version            = "$Latest"
+     version            = aws_launch_template.eks_node.latest_version
    }
  }

Enterprise Best Practice: Pin to a specific version managed by your pipeline. Use aws_launch_template.eks_node.latest_version as a Terraform reference so it updates atomically with the template, and gate LT changes behind a terraform plan approval step.
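Before changing Terraform, it can help to confirm what the live ASG actually references. The sketch below only classifies a version string; is_floating_lt_version is a hypothetical helper, and the ASG name in the usage note is the sample name from the logs.

```shell
#!/usr/bin/env bash
# is_floating_lt_version VERSION
# Succeeds (exit 0) when the launch template version is a floating alias
# rather than a pinned number.
is_floating_lt_version() {
  case "$1" in
    '$Latest' | '$Default') return 0 ;;
    *) return 1 ;;
  esac
}

for v in '$Latest' '$Default' '7'; do
  if is_floating_lt_version "$v"; then
    echo "$v: floating"
  else
    echo "$v: pinned"
  fi
done
```

The version string to feed in can be fetched with aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-mixed-asg --query 'AutoScalingGroups[0].MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification.Version' --output text.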


Root Cause 2: IAM Permission Gap

The IAM role CA runs under, or the EC2 instance profile it inherits on the node, is missing permissions. The baseline policy for mixed-instance ASGs:

# IAM Policy for Cluster Autoscaler
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "autoscaling:DescribeAutoScalingGroups",
          "autoscaling:DescribeAutoScalingInstances",
          "autoscaling:DescribeLaunchConfigurations",
          "autoscaling:DescribeScalingActivities",
-         "autoscaling:SetDesiredCapacity",
-         "autoscaling:TerminateInstanceInAutoScalingGroup"
+         "autoscaling:SetDesiredCapacity",
+         "autoscaling:TerminateInstanceInAutoScalingGroup",
+         "autoscaling:CreateOrUpdateTags",
+         "autoscaling:DescribeTags",
+         "ec2:DescribeLaunchTemplateVersions",
+         "ec2:DescribeInstanceTypes",
+         "ec2:RunInstances",
+         "iam:PassRole"
        ],
        "Resource": "*"
      }
    ]
  }

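Enterprise Best Practice: Move ec2:RunInstances and iam:PassRole out of the Resource "*" statement into their own statement, and scope that statement to specific resource ARNs using condition keys (a Condition added to the statement above would also restrict the Describe calls):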

+     "Condition": {
+       "ArnLike": {
+         "ec2:LaunchTemplate": "arn:aws:ec2:us-east-1:123456789012:launch-template/lt-*"
+       }
+     }
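A quick local check for gaps is to grep a policy document for each action named in this section. This is a sketch with clear limits: missing_actions is a hypothetical helper, it does plain string matching, and it will report a false gap if the policy grants the action through a wildcard such as autoscaling:*.

```shell
#!/usr/bin/env bash
# missing_actions POLICY_FILE ACTION...
# Prints every listed action that does not appear, quoted, in the policy file.
missing_actions() {
  local policy_file="$1"
  shift
  local action
  for action in "$@"; do
    # -F: match the action literally; -q: suppress output, use exit status only.
    grep -qF "\"$action\"" "$policy_file" || echo "$action"
  done
}

# Usage (ca-policy.json is a placeholder path):
#   missing_actions ca-policy.json \
#     autoscaling:SetDesiredCapacity \
#     autoscaling:CreateOrUpdateTags \
#     ec2:DescribeLaunchTemplateVersions \
#     ec2:DescribeInstanceTypes
```

Empty output means every listed action is present as a literal string. It does not prove the policy is attached to the role CA actually uses.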

Root Cause 3: Mixed Instances Policy — Unavailable Instance Type or AZ

# aws_autoscaling_group mixed_instances_policy
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.eks_node.id
        version            = aws_launch_template.eks_node.latest_version
      }
      override {
-       instance_type = "c5.2xlarge"   # retired from us-east-1b
+       instance_type = "c5.2xlarge"
+     }
+     override {
+       instance_type = "c5a.2xlarge"  # AMD equivalent, broader AZ availability
      }
      override {
        instance_type = "c6i.2xlarge"
      }
+     override {
+       instance_type = "c6a.2xlarge"
+     }
    }
  }

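Also review the CA Helm values that affect how CA handles similar node groups and where it gets its instance-type data (none of these flags "enables" mixed instances; CA reads the policy from the ASG itself):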

# cluster-autoscaler Helm values
  extraArgs:
    balance-similar-node-groups: "true"
    skip-nodes-with-system-pods: "false"
+   aws-use-static-instance-list: "false"   # forces CA to query EC2 API for live instance data
+   expander: "least-waste"



Prevention in CI/CD

1. Checkov — Catch LT Version Pinning at PR Time

# .checkov.yml
check:
  - CKV_AWS_315  # Ensure EC2 Auto Scaling groups use EC2 launch templates
  # Blocking "$Latest"/"$Default" likely needs a custom Checkov policy;
  # confirm available check IDs for your version with `checkov --list`.

2. OPA/Conftest Policy — Enforce Mixed Instance Diversity

# policy/asg_mixed_instances.rego
package asg

deny[msg] {
  resource := input.resource.aws_autoscaling_group[_]
  count(resource.mixed_instances_policy[_].launch_template[_].override) < 3
  msg := "Mixed instance ASG must define at least 3 instance type overrides for resilience."
}

3. Terraform Sentinel — Block $Latest in Production

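# sentinel/enforce_lt_version.sentinel
import "tfplan/v2" as tfplan

# ASGs being created or updated (deletes have a null change.after).
# Assumes every matched ASG defines a mixed_instances_policy block.
asgs = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_autoscaling_group" and
    rc.mode is "managed" and
    (rc.change.actions contains "create" or rc.change.actions contains "update")
}

main = rule {
  all asgs as _, rc {
    rc.change.after.mixed_instances_policy[0].launch_template[0].launch_template_specification[0].version is not "$Latest"
  }
}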

4. CloudWatch Alarm on CA Scale-Up Failures

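# Assumes the CA Prometheus metric is already published to CloudWatch under this
# namespace; adjust --namespace and add --dimensions to match how you export it.
aws cloudwatch put-metric-alarm \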
  --alarm-name "CA-ScaleUp-Failure" \
  --metric-name "cluster_autoscaler_failed_scale_ups_total" \
  --namespace "ContainerInsights" \
  --statistic Sum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-pagerduty

5. Pre-Deployment ASG Validation Script

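#!/usr/bin/env bash
# Validate that every instance type in the mixed policy is offered in each target AZ.
# This checks offerings only; it cannot tell you whether capacity is available right now.
set -uo pipefail

INSTANCE_TYPES=("c5.2xlarge" "c5a.2xlarge" "c6i.2xlarge")
AZS=("us-east-1a" "us-east-1b" "us-east-1c")
missing=0

for TYPE in "${INSTANCE_TYPES[@]}"; do
  for AZ in "${AZS[@]}"; do
    # Abort on API failure (bad credentials, throttling) so it is not misread as "not offered".
    if ! RESULT=$(aws ec2 describe-instance-type-offerings \
      --location-type availability-zone \
      --filters "Name=location,Values=$AZ" "Name=instance-type,Values=$TYPE" \
      --query 'InstanceTypeOfferings[0].InstanceType' \
      --output text); then
      echo "ERROR: describe-instance-type-offerings failed for $TYPE in $AZ" >&2
      exit 2
    fi
    if [ "$RESULT" = "None" ] || [ -z "$RESULT" ]; then
      echo "WARN: $TYPE not offered in $AZ; remove it from the mixed policy"
      missing=1
    fi
  done
done

# Exit non-zero so the pipeline stage fails when any type/AZ pair is missing.
exit "$missing"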

Run this in your pipeline before terraform apply. A 30-second check that prevents a 3am page.
