Fixing AKS 'Failed to Get Node Info' During NodePool Scale-Up: Root Cause & Resolution
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause
TL;DR
- What broke: The AKS cluster autoscaler or a manual scale-up triggered new node provisioning, but the kubelet on the new nodes failed to register. "Failed to get node info" surfaces in kube-controller-manager logs, halting pod scheduling on the new capacity.
- How to fix it: Identify whether the failure is Azure VM quota exhaustion, a broken kubelet bootstrap token/RBAC binding, or a corrupted node object in etcd, then execute the targeted remediation below.
The Incident (What Does the Error Mean?)
Raw error from kube-controller-manager or kubectl get events:
E0601 14:32:11.847201 1 node_lifecycle_controller.go:1494] "Failed to get node info" err="node \"aks-nodepool1-12345678-vmss000003\" not found" node="aks-nodepool1-12345678-vmss000003"
Or from the cluster autoscaler:
scale_up.go:453] Failed to get node info for aks-nodepool1-12345678-vmss000003: nodes "aks-nodepool1-12345678-vmss000003" not found
Immediate consequence: The new VM spun up in the VMSS and the OS booted, but the kubelet never completed node registration with the API server. The node object either never appeared in etcd, or appeared in a NotReady state and was removed shortly afterward. Pending pods remain unscheduled. If the scale-up is autoscaler-driven, the autoscaler may enter a backoff loop and stop scaling entirely, silently starving your workload of capacity.
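To see which nodes are affected, it can help to pull the node names out of the raw log lines. The following is a minimal sketch, not an official tool: the function name missing_nodes is hypothetical, and the regex assumes the aks-<pool>-<id>-vmss<suffix> naming seen in the errors above.

```shell
#!/usr/bin/env bash
# Sketch: read log lines on stdin and print each node name that appears in a
# "Failed to get node info" message, once per node.
# Assumption: node names follow the aks-<pool>-<digits>-vmss<6 chars> pattern.
missing_nodes() {
  grep -i 'failed to get node info' \
    | grep -oE 'aks-[a-z0-9]+-[0-9]+-vmss[0-9a-z]{6}' \
    | sort -u
}
```

Usage: pipe saved controller-manager or autoscaler log output into missing_nodes and feed the resulting names into the fixes below.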
The Attack Vector / Blast Radius
This is a cascading availability failure, not a security exploit — but the blast radius is severe:
- Autoscaler deadlock: The autoscaler marks the scale-up as failed and increments its backoff timer. In high-traffic scenarios, the whole backoff window (10 minutes here) passes without new capacity while pods pile up in Pending.
- VMSS billing leak: The Azure VM is running and billing you even though it contributes zero capacity. At scale, a stuck nodepool can orphan 10–50 VMs.
- Root causes that compound each other:
  - Azure subscription vCPU quota hit mid-scale: the VM provisions but ARM returns a partial success; the kubelet has no valid identity to register.
  - system:node RBAC bootstrap binding broken: the kubelet TLS bootstrap process cannot obtain a signed certificate, so it cannot authenticate to the API server to GET /api/v1/nodes/{name}.
  - Corrupted or stale node object: a previous node with the same VMSS instance name exists in a Terminating state, blocking re-registration.
  - Custom VNet/NSG blocking port 10250: the kubelet API is unreachable, so the node controller cannot collect node info.
How to Fix It
Step 1: Triage in 60 Seconds
# Check for stuck/ghost node objects
kubectl get nodes | grep -E 'NotReady|Unknown'
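# Check autoscaler status for the specific failure class
# (on AKS the cluster autoscaler runs in the managed control plane, so there is no pod to tail;
# full logs require the cluster-autoscaler resource-log category in diagnostic settings)
kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml
# Check Azure quota for the region/SKU
az vm list-usage --location eastus --output table | grep -i 'vcpus'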
# Check kubelet bootstrap RBAC
kubectl get clusterrolebinding kubelet-bootstrap
kubectl get clusterrolebinding system:node-proxier
Fix A: Stale Node Object Blocking Re-Registration (Most Common)
- # Stale node object holds the name, new kubelet cannot register
- kubectl get node aks-nodepool1-12345678-vmss000003
- # STATUS: Terminating / Unknown for >10 minutes
+ # Force delete the ghost node object
+ kubectl delete node aks-nodepool1-12345678-vmss000003 --grace-period=0 --force
+
+ # Then delete the orphaned VMSS instance and let the autoscaler reprovision it
+ az vmss delete-instances \
+ --resource-group MC_myRG_myAKS_eastus \
+ --name aks-nodepool1-12345678-vmss \
+ --instance-ids 3
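The --instance-ids value is not always the trailing digits of the node name. The sketch below assumes the usual VMSS convention that the 6-character hostname suffix is the instance ID encoded in base 36 (so vmss00000a is instance 10). The function name vmss_instance_id is hypothetical, and the convention may not hold for node pools that are not backed by a uniform VMSS.

```shell
#!/usr/bin/env bash
# Sketch: derive the VMSS instance ID from an AKS node name.
# Assumption: the 6-character suffix after "vmss" is the instance ID in base 36.
vmss_instance_id() {
  local node="$1"
  # Reject names that do not end in vmss + 6 alphanumeric characters.
  [[ "$node" =~ vmss([0-9a-zA-Z]{6})$ ]] || return 1
  # Bash arithmetic accepts base#digits; letters are case-insensitive up to base 36.
  echo $(( 36#${BASH_REMATCH[1]} ))
}
```

Usage: vmss_instance_id aks-nodepool1-12345678-vmss000003 prints 3, matching the command above. Before deleting anything, it is worth confirming the mapping with az vmss list-instances, which reports both the instance ID and the computer name.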
Fix B: Broken kubelet Bootstrap RBAC (Enterprise Best Practice)
AKS manages this binding, but custom AAD/RBAC configurations or manual cluster modifications can break it.
- # Missing or malformed bootstrap binding
- kubectl get clusterrolebinding kubelet-bootstrap
- # Error: clusterrolebindings.rbac.authorization.k8s.io "kubelet-bootstrap" not found
+ # Restore the bootstrap binding
+ kubectl create clusterrolebinding kubelet-bootstrap \
+ --clusterrole=system:node-bootstrapper \
+ --group=system:bootstrappers
+
+ # Verify node-autoapprove binding exists
+ kubectl get clusterrolebinding system:node-autoapprove-bootstrap
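+ # If missing, recreate it with the built-in CSR auto-approval role:
+ kubectl create clusterrolebinding system:node-autoapprove-bootstrap \
+ --clusterrole=system:certificates.k8s.io:certificatesigningrequests:nodeclient \
+ --group=system:bootstrappers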
Fix C: Azure vCPU Quota Exhaustion
- # Nodepool scale-up requesting Standard_D4s_v3 (4 vCPUs x 10 nodes = 40 vCPUs)
- # Current quota: 32 vCPUs — silent partial provision, 2 VMs boot without valid ARM identity
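+ # Request quota increase (takes 1–24 hrs via support ticket).
+ # Needs the az 'support' extension; other required arguments
+ # (description, severity, problem classification, contact details) are omitted here.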
+ az support tickets create \
+ --ticket-name "vCPU quota increase eastus" \
+ --title "Increase Standard DSv3 quota" \
+ --contact-country "US" \
+ --contact-email "[email protected]"
+
+ # Immediate workaround: switch nodepool to a SKU with available quota
+ az aks nodepool add \
+ --resource-group myRG \
+ --cluster-name myAKS \
+ --name fallbackpool \
+ --node-vm-size Standard_D4s_v4 \
+ --node-count 5
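The quota arithmetic behind this failure is simple enough to check by hand before scaling. Here is a rough sketch under stated assumptions: the function name nodes_within_quota and its argument order are hypothetical, and the limit and usage figures would come from az vm list-usage for the relevant VM family.

```shell
#!/usr/bin/env bash
# Sketch: how many nodes of a given size fit in the remaining vCPU quota.
# Arguments (hypothetical order): quota limit, vCPUs already in use, vCPUs per node.
nodes_within_quota() {
  local limit="$1" used="$2" vcpus_per_node="$3"
  local headroom=$(( limit - used ))
  # Usage can exceed the limit after a quota reduction; treat that as zero headroom.
  if (( headroom < 0 )); then headroom=0; fi
  # Integer division rounds down, so a partial node does not count.
  echo $(( headroom / vcpus_per_node ))
}
```

With the numbers from the example above (limit 32, nothing else in use, 4 vCPUs per Standard_D4s_v3 node), nodes_within_quota 32 0 4 prints 8, so a request for 10 nodes leaves 2 that cannot be provisioned cleanly.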
Fix D: NSG Blocking Kubelet Port 10250
- # NSG rule denying inbound 10250 from API server subnet to node subnet
- {
- "name": "DenyAll",
- "priority": 100,
- "direction": "Inbound",
- "access": "Deny",
- "protocol": "*",
- "sourceAddressPrefix": "*",
- "destinationPortRange": "*"
- }
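+ # Add explicit allow rule for AKS control plane to reach kubelet.
+ # NSG priorities must be 100-4096 and lower numbers win, so renumber the DenyAll
+ # rule above (e.g. to 200) and give the allow rule 100.
+ {
+ "name": "AllowAKSControlPlaneKubelet",
+ "priority": 100,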
+ "direction": "Inbound",
+ "access": "Allow",
+ "protocol": "Tcp",
+ "sourceAddressPrefix": "AzureCloud.EastUS",
+ "destinationPortRange": "10250"
+ }
Prevention in CI/CD
1. Pre-Scale Quota Gate in your Pipeline
# Azure DevOps / GitHub Actions step before any nodepool scale operation
- name: Check vCPU Quota Before Scale
  run: |
    # currentValue is the vCPU count already in use, not what is free
    USED=$(az vm list-usage --location "$REGION" \
      --query "[?localName=='Standard DSv3 Family vCPUs'].currentValue" -o tsv)
    LIMIT=$(az vm list-usage --location "$REGION" \
      --query "[?localName=='Standard DSv3 Family vCPUs'].limit" -o tsv)
    HEADROOM=$((LIMIT - USED))
    REQUIRED=$((NODE_COUNT * VCPUS_PER_NODE))
    if [ "$HEADROOM" -lt "$REQUIRED" ]; then
      echo "FATAL: Insufficient vCPU quota. Headroom: $HEADROOM, Required: $REQUIRED"
      exit 1
    fi
2. OPA/Gatekeeper Policy — Enforce NodePool SKU Allowlist
# K8sAllowedNodeSKUs is a custom kind: a matching ConstraintTemplate must be installed first
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedNodeSKUs
metadata:
  name: enforce-approved-vm-skus
spec:
  match:
    kinds:
      - apiGroups: ["*"]
        kinds: ["Node"]
  parameters:
    allowedSKUs:
      - "Standard_D4s_v3"
      - "Standard_D8s_v3"
      - "Standard_D4s_v4"
3. Checkov / Terraform Lint for NSG Rules
# Add to your Terraform CI pipeline
checkov -d ./infra/aks \
--check CKV_AZURE_77 \
--check CKV_AZURE_9 \
--compact
# CKV_AZURE_77: Ensure that UDP services are restricted from the internet
# CKV_AZURE_9: Ensure that RDP access is restricted from the internet
# Neither check detects a deny-all rule shadowing AKS traffic; that needs a custom policy
4. Alert on Node Registration Latency
# Prometheus alert: fire if a node stays NotReady > 5 minutes post-provision
- alert: AKSNodeRegistrationStuck
  expr: |
    (kube_node_status_condition{condition="Ready",status="false"} == 1)
    and on(node)
    (time() - kube_node_created > 300)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} stuck in NotReady > 5min post-provision"
    runbook: "https://your-wiki/aks-node-registration-runbook"