Why does 'failed to get node info' only happen during scale-up and not for existing nodes?

Existing nodes have already completed TLS bootstrap and have valid node objects in etcd. The error is specific to the registration handshake — new VMSS instances must obtain a bootstrap token, get a signed kubelet certificate from the API server, and create their node object. Any failure in that chain (quota, RBAC, NSG, stale object) surfaces as 'failed to get node info' because the node controller tries to read a node that doesn't fully exist yet.
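One way to see where the registration chain stalls is to look for kubelet certificate signing requests that were never approved. The sketch below is illustrative only: `pending_csrs` is a hypothetical helper name, and it assumes the default column layout of `kubectl get csr`, where the condition is the last column.

```shell
# Hypothetical helper: reads `kubectl get csr` output on stdin and prints the
# names of requests whose last column (CONDITION) is still "Pending".
# Assumes the default table layout with a single header row.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get csr | pending_csrs
```

A non-empty result during a scale-up suggests the new kubelet reached the API server but its certificate request was not approved, which points at the RBAC or auto-approval part of the chain rather than quota or networking.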

The node shows as 'Ready' in kubectl but pods still won't schedule on it after the scale-up. What's wrong?

Check for taints left behind by a failed or partial registration cycle: `kubectl describe node | grep Taint`. AKS sometimes applies `node.kubernetes.io/not-ready:NoSchedule` or `node.cloudprovider.kubernetes.io/uninitialized:NoSchedule`, and these can persist after recovery. Remove a leftover taint with `kubectl taint node <node-name> node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-` (note the trailing `-`, which deletes the taint). Also verify that the node's allocatable resources weren't zeroed out by the partial registration.
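To find every affected node at once rather than describing nodes one by one, the taint keys can be listed in a single table and filtered. This is a minimal sketch: `nodes_with_registration_taints` is a hypothetical helper, and it assumes `kubectl` prints multiple taint keys as a comma-separated list in the second column.

```shell
# Hypothetical helper: reads a two-column NAME/TAINTS table on stdin and prints
# the nodes that still carry the not-ready or uninitialized taint key.
nodes_with_registration_taints() {
  awk 'NR > 1 && $2 ~ /(^|,)node\.(kubernetes\.io\/not-ready|cloudprovider\.kubernetes\.io\/uninitialized)(,|$)/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes \
#     -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key' \
#     | nodes_with_registration_taints
```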

How do I prevent the cluster autoscaler from billing me for orphaned VMs during a stuck scale-up?

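Set `--scale-down-unneeded-time=5m` and `--scale-down-unready-time=5m` in your cluster autoscaler configuration. On AKS the autoscaler is managed, so the same settings are applied through the cluster autoscaler profile (without the leading dashes) rather than as command-line flags. This lets the autoscaler deprovision unneeded or unready nodes after 5 minutes rather than the defaults of 10 and 20 minutes respectively. For VMs that never register at all, the relevant timeout is `max-node-provision-time`, which defaults to 15 minutes. Also enable Azure Cost Management alerts on your MC_ resource group with a budget threshold, so that runaway VMSS provisioning triggers an alert before it becomes a significant billing event.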
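On AKS these values are normally applied with `az aks update --cluster-autoscaler-profile`. The sketch below only prints the command so it can be reviewed before anything changes. `autoscaler_profile_cmd` is a hypothetical helper, and `myRG` and `myAKS` are placeholder names.

```shell
# Hypothetical dry-run helper: prints (does not execute) the az command that
# would set both scale-down timers in the AKS cluster autoscaler profile.
# Arguments: resource group, cluster name, optional window (defaults to 5m).
autoscaler_profile_cmd() (
  rg="$1"
  cluster="$2"
  window="${3:-5m}"
  printf 'az aks update --resource-group %s --name %s --cluster-autoscaler-profile scale-down-unneeded-time=%s scale-down-unready-time=%s\n' \
    "$rg" "$cluster" "$window" "$window"
)

# Usage: review the output, then run it manually.
#   autoscaler_profile_cmd myRG myAKS
```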

Fixing AKS 'Failed to Get Node Info' During NodePool Scale-Up: Root Cause & Resolution

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause

TL;DR

What broke: The AKS cluster autoscaler or a manual scale-up triggered new node provisioning, but the kubelet on the new nodes failed to register. 'Failed to get node info' surfaces in the kube-controller-manager logs, and no pods can be scheduled on the new capacity.
How to fix it: Identify whether the failure is Azure VM quota exhaustion, a broken kubelet bootstrap token/RBAC binding, or a corrupted node object in etcd — then execute the targeted remediation below.

The Incident (What Does the Error Mean?)

Raw error from kube-controller-manager or kubectl get events:

E0601 14:32:11.847201       1 node_lifecycle_controller.go:1494] "Failed to get node info" err="node \"aks-nodepool1-12345678-vmss000003\" not found" node="aks-nodepool1-12345678-vmss000003"

Or from the cluster autoscaler:

scale_up.go:453] Failed to get node info for aks-nodepool1-12345678-vmss000003: nodes "aks-nodepool1-12345678-vmss000003" not found

Immediate consequence: The new VM spun up in the VMSS and the OS booted, but the kubelet never completed node registration with the API server. The node object either never appeared in etcd, or appeared in a NotReady state and was removed shortly afterwards. Pending pods remain unscheduled. If the scale-up is autoscaler-driven, the autoscaler may enter a backoff loop and stop scaling entirely, silently starving your workload of capacity.
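When the logs are noisy, it helps to reduce them to the distinct node names that failed. The sketch below handles both log shapes shown above. `failed_node_names` is a hypothetical helper, and the name pattern (`aks-<pool>-<digits>-vmss<6 characters>`) is an assumption based on the sample lines.

```shell
# Hypothetical helper: reads controller-manager or autoscaler log lines on
# stdin and prints each distinct node name that appears on a
# "failed to get node info" line.
failed_node_names() {
  grep -i 'failed to get node info' \
    | grep -oE 'aks-[a-z0-9]+-[0-9]+-vmss[0-9a-z]{6}' \
    | sort -u
}

# Usage: pipe in whichever log source you have, for example
#   kubectl get events -A | failed_node_names
```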

The Attack Vector / Blast Radius

This is a cascading availability failure, not a security exploit — but the blast radius is severe:

Autoscaler deadlock: The autoscaler marks the scale-up as failed and puts the node group into backoff. In high-traffic scenarios, no new capacity is attempted for the length of that backoff window (10 minutes in this scenario) while pods pile up in Pending.
VMSS billing leak: The Azure VM is running and billing you even though it contributes zero capacity. At scale, a stuck nodepool can orphan 10–50 VMs.
Root causes that compound each other:
- Azure subscription vCPU quota hit mid-scale — VM provisions but ARM returns a partial success; kubelet has no valid identity to register.
- system:node RBAC bootstrap binding broken — the kubelet TLS bootstrap process cannot obtain a signed certificate, so it cannot authenticate to the API server to GET /api/v1/nodes/{name}.
- Corrupted or stale node object — a previous node with the same VMSS instance name exists in a Terminating state, blocking re-registration.
- Custom VNet/NSG blocking port 10250 — kubelet API unreachable, node controller cannot collect node info.

How to Fix It

Step 1: Triage in 60 Seconds

# Check for stuck/ghost node objects
kubectl get nodes | grep -E 'NotReady|Unknown'

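# Check autoscaler status for the specific failure class
# (on AKS the managed autoscaler does not run as a pod in kube-system;
# its full logs come from the cluster-autoscaler resource log category)
kubectl get configmap -n kube-system cluster-autoscaler-status -o yaml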

# Check Azure quota for the region/SKU (table rows are named "... vCPUs")
az vm list-usage --location eastus --output table | grep -i 'vcpus'

# Check kubelet bootstrap RBAC
kubectl get clusterrolebinding kubelet-bootstrap
kubectl get clusterrolebinding system:node-proxier

Fix A: Stale Node Object Blocking Re-Registration (Most Common)

- # Stale node object holds the name, new kubelet cannot register
- kubectl get node aks-nodepool1-12345678-vmss000003
- # STATUS: Terminating / Unknown for >10 minutes

+ # Force delete the ghost node object
+ kubectl delete node aks-nodepool1-12345678-vmss000003 --grace-period=0 --force
+
+ # Then delete the surviving VMSS instance and let the autoscaler reprovision it
+ az vmss delete-instances \
+   --resource-group MC_myRG_myAKS_eastus \
+   --name aks-nodepool1-12345678-vmss \
+   --instance-ids 3
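The `--instance-ids` value has to match the VMSS instance behind the ghost node. The sketch below derives it from the node name, assuming the suffix after `vmss` is the instance ID encoded in base 36 (so `vmss000003` maps to 3 and `vmss00000a` would map to 10). `vmss_instance_id` is a hypothetical helper. Confirm the result with `az vmss list-instances` before deleting anything.

```shell
# Hypothetical helper: converts the base-36 suffix of an AKS VMSS node name
# into the decimal instance ID expected by `az vmss delete-instances`.
# Prints nothing and returns non-zero if the suffix is not valid base 36.
vmss_instance_id() {
  printf '%s\n' "${1##*vmss}" | awk '{
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    s = tolower($0)
    n = 0
    for (i = 1; i <= length(s); i++) {
      d = index(digits, substr(s, i, 1)) - 1
      if (d < 0) exit 1        # character outside 0-9a-z
      n = n * 36 + d
    }
    print n
  }'
}

# Usage:
#   vmss_instance_id aks-nodepool1-12345678-vmss000003
```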

Fix B: Broken kubelet Bootstrap RBAC (Enterprise Best Practice)

AKS manages this binding, but custom AAD/RBAC configurations or manual cluster modifications can break it.

- # Missing or malformed bootstrap binding
- kubectl get clusterrolebinding kubelet-bootstrap
- # Error: clusterrolebindings.rbac.authorization.k8s.io "kubelet-bootstrap" not found

+ # Restore the bootstrap binding
+ kubectl create clusterrolebinding kubelet-bootstrap \
+   --clusterrole=system:node-bootstrapper \
+   --group=system:bootstrappers
+
+ # Verify node-autoapprove binding exists
+ kubectl get clusterrolebinding system:node-autoapprove-bootstrap
+ # If missing:
+ kubectl apply -f https://raw.githubusercontent.com/Azure/AKS/master/examples/bootstrap/node-autoapprove.yaml

Fix C: Azure vCPU Quota Exhaustion

- # Nodepool scale-up requesting Standard_D4s_v3 (4 vCPUs x 10 nodes = 40 vCPUs)
- # Current quota: 32 vCPUs — silent partial provision, 2 VMs boot without valid ARM identity

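+ # Request quota increase (takes 1–24 hrs via support ticket).
+ # The command also requires a description, severity, problem classification,
+ # and further contact arguments, omitted here; see `az support tickets create --help`.
+ az support tickets create \
+   --ticket-name "vCPU quota increase eastus" \
+   --title "Increase Standard DSv3 quota" \
+   --contact-country "US" \
+   --contact-email "<contact-email>"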
+
+ # Immediate workaround: add a fallback nodepool on a SKU family with available quota
+ az aks nodepool add \
+   --resource-group myRG \
+   --cluster-name myAKS \
+   --name fallbackpool \
+   --node-vm-size Standard_D4s_v4 \
+   --node-count 5
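The arithmetic behind this failure is easy to check before scaling. The sketch below reproduces the example above (10 nodes at 4 vCPUs each against a 32-vCPU quota). `vcpu_headroom_ok` is a hypothetical helper, and the current usage of 0 in the usage line is an assumption made for illustration.

```shell
# Hypothetical helper: vcpu_headroom_ok LIMIT USED NODE_COUNT VCPUS_PER_NODE
# Prints the computed headroom and requirement, and returns 0 only when the
# remaining quota covers the requested scale-up.
vcpu_headroom_ok() (
  headroom=$(( $1 - $2 ))
  required=$(( $3 * $4 ))
  echo "headroom=$headroom required=$required"
  [ "$headroom" -ge "$required" ]
)

# Usage with the numbers from the example (assumed current usage of 0):
#   vcpu_headroom_ok 32 0 10 4 || echo "scale-up would exceed quota"
```

With those inputs the function reports a headroom of 32 against a requirement of 40 and returns non-zero, which is the condition that produces the partial provision.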

Fix D: NSG Blocking Kubelet Port 10250

- # NSG rule denying inbound 10250 from API server subnet to node subnet
- {
-   "name": "DenyAll",
-   "priority": 100,
-   "direction": "Inbound",
-   "access": "Deny",
-   "protocol": "*",
-   "sourceAddressPrefix": "*",
-   "destinationPortRange": "*"
- }

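+ # Add explicit allow rule for AKS control plane to reach kubelet.
+ # NSG priorities must be between 100 and 4096 and lower numbers win, so
+ # renumber DenyAll to a higher value (for example 4096) and give this rule 100.
+ {
+   "name": "AllowAKSControlPlaneKubelet",
+   "priority": 100,
+   "direction": "Inbound",
+   "access": "Allow",
+   "protocol": "Tcp",
+   "sourceAddressPrefix": "AzureCloud.EastUS",
+   "destinationPortRange": "10250"
+ }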


Prevention in CI/CD

1. Pre-Scale Quota Gate in your Pipeline

# Azure DevOps / GitHub Actions step before any nodepool scale operation
# (REGION, NODE_COUNT, and VCPUS_PER_NODE must be set in the step environment)
- name: Check vCPU Quota Before Scale
  run: |
    # currentValue is the vCPU count already in use, not what is available
    USED=$(az vm list-usage --location "$REGION" \
      --query "[?localName=='Standard DSv3 Family vCPUs'].currentValue" -o tsv)
    LIMIT=$(az vm list-usage --location "$REGION" \
      --query "[?localName=='Standard DSv3 Family vCPUs'].limit" -o tsv)
    HEADROOM=$((LIMIT - USED))
    REQUIRED=$((NODE_COUNT * VCPUS_PER_NODE))
    if [ "$HEADROOM" -lt "$REQUIRED" ]; then
      echo "FATAL: Insufficient vCPU quota. Headroom: $HEADROOM, Required: $REQUIRED"
      exit 1
    fi

2. OPA/Gatekeeper Policy — Enforce NodePool SKU Allowlist

# Requires a ConstraintTemplate that defines the K8sAllowedNodeSKUs kind
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedNodeSKUs
metadata:
  name: enforce-approved-vm-skus
spec:
  match:
    kinds:
      - apiGroups: ["*"]
        kinds: ["Node"]
  parameters:
    allowedSKUs:
      - "Standard_D4s_v3"
      - "Standard_D8s_v3"
      - "Standard_D4s_v4"

3. Checkov / Terraform Lint for NSG Rules

# Add to your Terraform CI pipeline
checkov -d ./infra/aks \
  --check CKV_AZURE_77 \
  --check CKV_AZURE_9 \
  --compact
# CKV_AZURE_77: Ensure that UDP services are restricted from the Internet
# CKV_AZURE_9:  Ensure that RDP access is restricted from the Internet
# Neither check validates deny-all ordering against AKS traffic; confirm check IDs with `checkov --list`

4. Alert on Node Registration Latency

# Prometheus alert — fire if a node stays NotReady > 5 minutes post-provision
- alert: AKSNodeRegistrationStuck
  expr: |
    (kube_node_status_condition{condition="Ready",status=~"false|unknown"} == 1)
    and on (node)
    (time() - kube_node_created > 300)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} stuck in NotReady > 5min post-provision"
    runbook: "https://your-wiki/aks-node-registration-runbook"