Fixing AWS ASG Cluster Autoscaler 'failed to scale up group: cloudprovider returned error' with Mixed Instances
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: Cluster Autoscaler (CA) called the AWS Auto Scaling API to add capacity to a mixed-instance ASG and got a hard error back. Pods are now stuck Pending, your workload is not scaling, and CA is retrying into a backoff loop.
- How to fix it: The root cause is almost always one of three things: a stale/pinned Launch Template version, missing IAM permissions (autoscaling:CreateOrUpdateTags, ec2:RunInstances on the LT), or a malformed MixedInstancesPolicy with an instance type that's been retired or is unavailable in your AZ.
- Sandbox fix: Use our Client-Side Sandbox below to auto-refactor your CA Helm values or ASG Terraform config: paste the failing config and get a corrected diff without sending your ARNs anywhere.
The Incident (What Does the Error Mean?)
Your CA logs look like this:
E1107 03:14:22.841271 1 scale_up.go:453] Failed to scale up group
aws:///us-east-1a/my-mixed-asg: cloudprovider returned error:
ValidationError: Invalid IAM Instance Profile name
status code: 400, request id: a1b2c3d4-...
or this variant:
E1107 03:14:22.841271 1 scale_up.go:453] Failed to scale up group
aws:///us-east-1b/my-mixed-asg: cloudprovider returned error:
You have requested more vCPU capacity than your current vCPU limit
of 32 allows for the instance bucket that the specified instance
type belongs to. (Service: AmazonEC2)
or the silent killer:
W1107 03:14:23.100000 1 clusterstate.go:288] Failed to find
template node for node group aws:///us-east-1c/my-mixed-asg
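Immediate consequence: CA marks the node group as unhealthy and puts it into exponential backoff, skipping it for scale-up until the backoff window expires. Pods remain Pending. HPA can still raise replica counts, but the new replicas have nowhere to run. Your SLO is burning.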
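To get a feel for how long a node group can sit idle under exponential backoff, here is a minimal sketch of a doubling schedule with a cap. The 5-minute initial delay and 30-minute cap are assumptions based on what are believed to be the defaults of CA's --initial-node-group-backoff-duration and --max-node-group-backoff-duration flags; confirm both against your CA version. The function name backoff_schedule is hypothetical.

```shell
#!/usr/bin/env bash
# backoff_schedule INITIAL_MIN MAX_MIN FAILURES
# Prints the wait (in minutes) applied after each consecutive failure,
# doubling each time and capping at MAX_MIN.
# The doubling-with-cap model is an assumption, not read from a live CA.
backoff_schedule() {
  local delay="$1" max="$2" failures="$3" i
  for (( i = 1; i <= failures; i++ )); do
    echo "failure $i: wait ${delay}m"
    delay=$(( delay * 2 ))
    if (( delay > max )); then
      delay="$max"
    fi
  done
}
```

Running backoff_schedule 5 30 4 prints waits of 5, 10, 20, and 30 minutes, which is over an hour of cumulative delay after only four failed attempts.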
The Attack Vector / Blast Radius
This is a silent cascading failure. CA does not crash — it keeps running, reporting healthy in its own metrics, while your application starves for nodes. The blast radius:
- Pending pods accumulate. Deployments that need burst capacity (batch jobs, traffic spikes) queue indefinitely. No error surfaces to app teams — just latency degradation.
- Mixed instance policy makes diagnosis harder. With 4–8 instance types in the pool, AWS may reject only one type (e.g., c5.2xlarge retired from a specific AZ), but CA treats the entire node group as failed.
- Launch Template version drift. If your ASG references $Default or $Latest and your Terraform/CDK pipeline pushed a new LT version that broke the IAM instance profile ARN format (name vs. ARN mismatch), every scale-up attempt fails with a 400.
- vCPU quota exhaustion. On-demand vCPU limits per instance family are per-account, per-region. A mixed-instance policy that over-indexes on c5 or m5 will silently hit the quota wall with no prior warning.
- AZ imbalance. CA tries a specific AZ. If that AZ has no capacity for your requested instance types (spot or on-demand), the error propagates up as a generic cloudprovider error.
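The vCPU quota case is simple arithmetic, and a quick check makes the failure mode concrete. The sketch below assumes you already know your quota and current usage (for example from the Service Quotas console); the function name vcpu_headroom is hypothetical.

```shell
#!/usr/bin/env bash
# vcpu_headroom LIMIT IN_USE NEW_INSTANCES VCPUS_PER_INSTANCE
# Prints the vCPUs left after the requested scale-up.
# A negative result means the request exceeds the quota and will be rejected.
vcpu_headroom() {
  local limit="$1" in_use="$2" count="$3" per_instance="$4"
  echo $(( limit - in_use - count * per_instance ))
}
```

With the 32 vCPU limit from the log above and three c5.2xlarge nodes already running (8 vCPUs each, 24 in use), vcpu_headroom 32 24 2 8 prints -8: adding two more nodes cannot succeed, while adding one leaves exactly zero headroom.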
How to Fix It
Step 0: Isolate the Exact Error
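# --tail=-1 matters: with a label selector, kubectl logs returns only the last 10 lines per pod by default
kubectl logs -n kube-system \
  -l app=cluster-autoscaler \
  --since=1h --tail=-1 | grep -E "(Failed to|cloudprovider|scale_up|ValidationError|Invalid)"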
Match your error to one of the three root causes below.
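If you are triaging many log lines, the matching can be roughed out as a small helper. This is a heuristic sketch, not CA logic: the category labels are made up for this article, and the mapping of each message to a root cause is an assumption drawn from the examples in this post plus a few AWS error codes (AccessDenied, UnauthorizedOperation, VcpuLimitExceeded, InsufficientInstanceCapacity).

```shell
#!/usr/bin/env bash
# classify_ca_error LOG_LINE
# Prints a hypothetical category label for one CA log line.
# Patterns are checked top to bottom; the first match wins.
classify_ca_error() {
  local line="$1"
  case "$line" in
    *"Invalid IAM Instance Profile"*)
      echo "launch-template" ;;   # Root Cause 1: broken LT version
    *"AccessDenied"*|*"UnauthorizedOperation"*|*"not authorized"*)
      echo "iam" ;;               # Root Cause 2: permission gap
    *"vCPU limit"*|*"VcpuLimitExceeded"*)
      echo "vcpu-quota" ;;        # quota wall, see blast radius
    *"Failed to find template node"*|*"InsufficientInstanceCapacity"*)
      echo "mixed-instances" ;;   # Root Cause 3: instance type or AZ
    *)
      echo "unknown" ;;
  esac
}
```

Feed it one line at a time, for example by piping the filtered logs through `while IFS= read -r l; do classify_ca_error "$l"; done | sort | uniq -c` to see which category dominates.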
Root Cause 1: Launch Template Version Conflict
Bad: ASG pinned to a deleted or malformed LT version, or using $Latest which resolved to a broken version.
# Terraform: aws_autoscaling_group mixed_instances_policy
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.eks_node.id
- version = "$Latest"
+ version = aws_launch_template.eks_node.latest_version
}
}
Enterprise Best Practice: Pin to a specific version managed by your pipeline. Use aws_launch_template.eks_node.latest_version as a Terraform reference so it updates atomically with the template, and gate LT changes behind a terraform plan approval step.
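A quick way to find floating versions before they bite is to grep your Terraform tree. The sketch below assumes your launch template references live in *.tf files and are written as a literal version = "$Latest" or version = "$Default"; the function name find_floating_lt_versions is hypothetical, and values built through variables or locals will not be caught.

```shell
#!/usr/bin/env bash
# find_floating_lt_versions [DIR]
# Lists every *.tf line under DIR (default: current directory) that sets
# version to the literal "$Latest" or "$Default".
# Exit status follows grep: 0 if any match was found, 1 if none.
find_floating_lt_versions() {
  local dir="${1:-.}"
  grep -rnE --include='*.tf' \
    'version[[:space:]]*=[[:space:]]*"\$(Latest|Default)"' "$dir"
}
```

Because the exit status is 0 when a floating version is found, a pipeline step can fail the build with `if find_floating_lt_versions infra/; then exit 1; fi` (the infra/ path is only an example).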
Root Cause 2: IAM Permission Gap
CA's node group IAM role or the EC2 instance profile is missing permissions. The minimum required policy for mixed-instance ASGs:
# IAM Policy for Cluster Autoscaler
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
- "autoscaling:SetDesiredCapacity",
- "autoscaling:TerminateInstanceInAutoScalingGroup"
+ "autoscaling:SetDesiredCapacity",
+ "autoscaling:TerminateInstanceInAutoScalingGroup",
+ "autoscaling:CreateOrUpdateTags",
+ "ec2:DescribeLaunchTemplateVersions",
+ "ec2:DescribeInstanceTypes",
+ "ec2:RunInstances",
+ "iam:PassRole"
],
"Resource": "*"
}
]
}
Enterprise Best Practice: Scope ec2:RunInstances and iam:PassRole to specific resource ARNs using condition keys:
+ "Condition": {
+ "ArnLike": {
+ "ec2:LaunchTemplate": "arn:aws:ec2:us-east-1:123456789012:launch-template/lt-*"
+ }
+ }
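To compare an existing policy document against the list above, a plain text check is often enough. This sketch assumes the policy is saved locally as JSON and lists actions verbatim; it does not expand wildcards such as "autoscaling:*", so a policy that grants access that way will be reported as missing actions. The function name missing_ca_actions is hypothetical, and the action list is copied from the policy in this section.

```shell
#!/usr/bin/env bash
# Actions taken from the policy shown in this section; adjust to your baseline.
REQUIRED_ACTIONS=(
  "autoscaling:DescribeAutoScalingGroups"
  "autoscaling:DescribeAutoScalingInstances"
  "autoscaling:DescribeLaunchConfigurations"
  "autoscaling:DescribeScalingActivities"
  "autoscaling:SetDesiredCapacity"
  "autoscaling:TerminateInstanceInAutoScalingGroup"
  "autoscaling:CreateOrUpdateTags"
  "ec2:DescribeLaunchTemplateVersions"
  "ec2:DescribeInstanceTypes"
  "ec2:RunInstances"
  "iam:PassRole"
)

# missing_ca_actions POLICY_FILE
# Prints each required action that does not appear as a quoted string
# in POLICY_FILE. Returns 1 if anything is missing, 0 otherwise.
missing_ca_actions() {
  local policy_file="$1" action status=0
  for action in "${REQUIRED_ACTIONS[@]}"; do
    if ! grep -qF "\"${action}\"" "$policy_file"; then
      echo "$action"
      status=1
    fi
  done
  return "$status"
}
```

One way to get the file is to export the policy document with the AWS CLI or copy it from the console, then run missing_ca_actions policy.json.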
Root Cause 3: Mixed Instances Policy — Unavailable Instance Type or AZ
# aws_autoscaling_group mixed_instances_policy
mixed_instances_policy {
instances_distribution {
on_demand_base_capacity = 1
on_demand_percentage_above_base_capacity = 20
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.eks_node.id
version = aws_launch_template.eks_node.latest_version
}
override {
- instance_type = "c5.2xlarge" # retired from us-east-1b
+ instance_type = "c5.2xlarge"
+ }
+ override {
+ instance_type = "c5a.2xlarge" # AMD equivalent, broader AZ availability
}
override {
instance_type = "c6i.2xlarge"
}
+ override {
+ instance_type = "c6a.2xlarge"
+ }
}
}
Also verify your CA Helm values explicitly enable mixed-instance node groups:
# cluster-autoscaler Helm values
  extraArgs:
    balance-similar-node-groups: "true"
    skip-nodes-with-system-pods: "false"
+   aws-use-static-instance-list: "false" # forces CA to query EC2 API for live instance data
+   expander: "least-waste"
Prevention in CI/CD
1. Checkov — Catch LT Version Pinning at PR Time
# .checkov.yml
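check:
- CKV_AWS_315 # Ensure EC2 Auto Scaling groups use EC2 launch templates
# Check IDs and descriptions change between Checkov releases; confirm with `checkov --list`.
# This check does not inspect the LT version itself; the Sentinel rule in step 3 covers $Latest.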
2. OPA/Conftest Policy — Enforce Mixed Instance Diversity
# policy/asg_mixed_instances.rego
package asg
deny[msg] {
resource := input.resource.aws_autoscaling_group[_]
count(resource.mixed_instances_policy[_].launch_template[_].override) < 3
msg := "Mixed instance ASG must define at least 3 instance type overrides for resilience."
}
3. Terraform Sentinel — Block $Latest in Production
# sentinel/enforce_lt_version.sentinel
import "tfplan/v2" as tfplan
main = rule {
    all tfplan.resource_changes as _, rc {
        # Pass any resource that is not an ASG; otherwise require a non-floating version
        rc.type is not "aws_autoscaling_group" or
        rc.change.after.mixed_instances_policy[0].launch_template[0].launch_template_specification[0].version is not "$Latest"
    }
}
4. CloudWatch Alarm on CA Scale-Up Failures
aws cloudwatch put-metric-alarm \
--alarm-name "CA-ScaleUp-Failure" \
--metric-name "cluster_autoscaler_failed_scale_ups_total" \
--namespace "ContainerInsights" \
--statistic Sum \
--period 300 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:oncall-pagerduty
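Before wiring up the alarm, it helps to confirm the counter is actually moving. The sketch below sums cluster_autoscaler_failed_scale_ups_total across all label sets from Prometheus text output read on stdin. It assumes CA exposes metrics on its /metrics endpoint (port 8085 is assumed to be the default; check your --address setting), and the function name sum_failed_scale_ups is hypothetical.

```shell
#!/usr/bin/env bash
# sum_failed_scale_ups
# Reads Prometheus exposition text on stdin and prints the total of
# cluster_autoscaler_failed_scale_ups_total across all label sets.
# Comment lines (# HELP, # TYPE) start with '#' and are skipped by the anchor.
sum_failed_scale_ups() {
  awk '/^cluster_autoscaler_failed_scale_ups_total/ { total += $NF }
       END { print total + 0 }'
}
```

With a port-forward to the CA pod in place, `curl -s localhost:8085/metrics | sum_failed_scale_ups` gives the running total. The value is a cumulative counter, so compare two readings a few minutes apart rather than reading a single number.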
5. Pre-Deployment ASG Validation Script
#!/bin/bash
# Validate all instance types in mixed policy are available in target AZs.
# Exits non-zero if any type is missing, so the pipeline stops before terraform apply.
INSTANCE_TYPES=("c5.2xlarge" "c5a.2xlarge" "c6i.2xlarge")
AZS=("us-east-1a" "us-east-1b" "us-east-1c")
STATUS=0
for TYPE in "${INSTANCE_TYPES[@]}"; do
  for AZ in "${AZS[@]}"; do
    # Query returns the instance type if offered in this AZ, "None" otherwise
    RESULT=$(aws ec2 describe-instance-type-offerings \
      --location-type availability-zone \
      --filters "Name=location,Values=$AZ" \
                "Name=instance-type,Values=$TYPE" \
      --query 'InstanceTypeOfferings[0].InstanceType' \
      --output text 2>/dev/null)
    if [ "$RESULT" == "None" ] || [ -z "$RESULT" ]; then
      echo "WARN: $TYPE not available in $AZ; remove it from the mixed policy"
      STATUS=1
    fi
  done
done
exit "$STATUS"
Run this in your pipeline before terraform apply. A 30-second check that prevents a 3am page.