Fixing KEDA 'scaledobject failed to get metrics' When Prometheus Is Down or Unreachable
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: KEDA's Prometheus scaler cannot reach serverAddress, causing the ScaledObject to enter a failed metrics state. Replicas freeze at the last known count or scale to zero, depending on the fallback config.
- How to fix it: Verify Prometheus is reachable from the KEDA operator pod, fix the serverAddress URL, confirm the PromQL query returns data, and configure a fallback replica floor.
The Incident (What Does the Error Mean?)
You will see this in kubectl describe scaledobject <name> -n <namespace> or in KEDA operator logs:
ERROR scale_handler failed to get metrics for scaledObject
{
"scaledObject": "my-app/my-scaledobject",
"error": "error getting metric source: unable to get external metric
my-app-prometheus: unable to fetch metrics from prometheus:
Get \"http://prometheus-server.monitoring.svc:9090/api/v1/query\":
dial tcp: lookup prometheus-server.monitoring.svc: no such host"
}
Immediate consequence: KEDA cannot evaluate the trigger. The HPA backing the ScaledObject receives no metric update. Depending on your fallback block, your deployment either freezes at current replicas or scales to zero — both are production-breaking outcomes.
The Attack Vector / Blast Radius
This is not a security exploit — it is a cascading availability failure:
- Prometheus pod crash / OOMKilled / evicted → KEDA loses its only metric source.
- ScaledObject enters Failed condition → Kubernetes HPA stops receiving external metrics.
- Without a fallback block, KEDA defaults to 0 desired replicas on some versions: your entire workload scales to zero under load.
- With a poorly set fallback, replicas freeze at a stale count: your service either under-provisions during a traffic spike or burns compute during idle.
- Network policy misconfiguration is the second most common cause: the KEDA operator namespace cannot reach the Prometheus namespace on port 9090, silently failing DNS resolution or TCP connection.
Blast radius: Full autoscaling blindness. Every ScaledObject using that Prometheus endpoint is affected simultaneously.
How to Fix It
Step 0 — Confirm the actual failure point
# Check ScaledObject conditions
kubectl describe scaledobject my-scaledobject -n my-app
# Check KEDA operator logs directly
kubectl logs -n keda deployment/keda-operator --since=10m | grep -i "prometheus\|failed to get"
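# Test reachability from the KEDA namespace. The keda-operator image is normally
# distroless (no shell, no wget), so kubectl exec into it usually fails.
# Run a throwaway curl pod in the same namespace instead.
kubectl run keda-net-test -n keda --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sS -m 5 "http://prometheus-server.monitoring.svc.cluster.local:9090/api/v1/query?query=up"
If that request fails, the problem is network/DNS, not KEDA config. If your NetworkPolicies select on pod labels, give the test pod the same labels as the operator so it is subject to the same rules.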
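If you run the reachability probe with curl (an assumption; the probe can be run with any HTTP client), the exit code narrows down which layer is failing. The sketch below maps curl's documented exit codes to a likely cause. The function name and the wording of each diagnosis are illustrative, not part of KEDA or curl.

```shell
#!/bin/sh
# diagnose_curl_exit: hypothetical helper that maps a curl exit code
# to the layer most likely at fault. Codes are curl's documented values;
# code 22 is only produced when curl runs with -f/--fail.
diagnose_curl_exit() {
  case "$1" in
    0)  echo "reachable: check the PromQL query and the trigger metadata" ;;
    6)  echo "dns: hostname did not resolve; check Service name and namespace" ;;
    7)  echo "connect: TCP connection failed; check Service, pod status, NetworkPolicy" ;;
    22) echo "http: server returned an error status; check auth and URL path" ;;
    28) echo "timeout: no response in time; check NetworkPolicy and Prometheus load" ;;
    *)  echo "unknown: curl exited with code $1" ;;
  esac
}
```

A typical call would be `curl -sS -f -m 5 "$URL" >/dev/null; diagnose_curl_exit $?`, where `$URL` is the serverAddress plus a query path.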
Basic Fix — Correct the serverAddress and add fallback
  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: my-scaledobject
    namespace: my-app
  spec:
    scaleTargetRef:
      name: my-deployment
    minReplicaCount: 1
    maxReplicaCount: 20
+   fallback:
+     failureThreshold: 3
+     replicas: 5
    triggers:
      - type: prometheus
        metadata:
-         serverAddress: http://prometheus:9090
+         serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
-         query: http_requests_total
+         query: sum(rate(http_requests_total{namespace="my-app"}[2m]))
          threshold: "100"
+         ignoreNullValues: "false"
Key changes:
- serverAddress must use the fully qualified in-cluster DNS name (<service>.<namespace>.svc.cluster.local). Short names fail across namespaces.
- query must be a valid PromQL expression that returns a scalar or single-value instant vector. Bare metric names without aggregation will return multi-series results and cause parsing failures.
- fallback.replicas: 5 ensures your workload holds a safe floor when Prometheus is unreachable instead of collapsing to zero.
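To see why fallback.replicas produces a fixed replica count, it helps to work the arithmetic. As we understand KEDA's documented fallback behaviour, once failureThreshold consecutive failures are reached KEDA reports a substitute metric equal to the target value times the fallback replica count, and the HPA (using an AverageValue target) divides it back out. The helper names below are hypothetical, and the sketch handles integer thresholds only.

```shell
#!/bin/sh
# Substitute metric reported while fallback is active:
# threshold * fallback replicas.
fallback_metric() {
  echo $(( $1 * $2 ))
}

# Replicas an AverageValue HPA target derives from a total metric:
# ceil(metric / threshold), done with integer arithmetic.
hpa_desired_replicas() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

metric=$(fallback_metric 100 5)       # threshold "100", fallback.replicas 5
echo "$metric"                        # prints 500
hpa_desired_replicas "$metric" 100    # prints 5
```

The HPA still clamps the result to minReplicaCount and maxReplicaCount, so a fallback value outside that range will not be honoured as written.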
Enterprise Best Practice — TriggerAuthentication + NetworkPolicy + Fallback
# 1. If Prometheus has auth enabled, use TriggerAuthentication
+ apiVersion: keda.sh/v1alpha1
+ kind: TriggerAuthentication
+ metadata:
+   name: prometheus-auth
+   namespace: my-app
+ spec:
+   secretTargetRef:
+     - parameter: bearerToken
+       name: prometheus-bearer-secret
+       key: token
# 2. ScaledObject referencing auth + hardened query
  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: my-scaledobject
    namespace: my-app
  spec:
    scaleTargetRef:
      name: my-deployment
    minReplicaCount: 2
    maxReplicaCount: 50
+   pollingInterval: 30
+   cooldownPeriod: 60
+   fallback:
+     failureThreshold: 3
+     replicas: 10
    triggers:
      - type: prometheus
        metadata:
-         serverAddress: http://prometheus:9090
+         serverAddress: https://prometheus-server.monitoring.svc.cluster.local:9090
+         query: sum(rate(http_requests_total{job="my-app",namespace="my-app"}[2m]))
          threshold: "100"
+         ignoreNullValues: "false"
+         # Required for the scaler to send the bearerToken from TriggerAuthentication
+         authModes: "bearer"
+       authenticationRef:
+         name: prometheus-auth
# 3. NetworkPolicy: allow ingress to Prometheus from the keda namespace
#    (adjust the podSelector label to match your Prometheus pods)
+ apiVersion: networking.k8s.io/v1
+ kind: NetworkPolicy
+ metadata:
+   name: allow-keda-to-prometheus
+   namespace: monitoring
+ spec:
+   podSelector:
+     matchLabels:
+       app: prometheus
+   ingress:
+     - from:
+         - namespaceSelector:
+             matchLabels:
+               kubernetes.io/metadata.name: keda
+       ports:
+         - protocol: TCP
+           port: 9090
Prevention in CI/CD
1. Validate ScaledObject manifests pre-deploy with Conftest/OPA
# policy/keda_scaledobject.rego
# Legacy Rego syntax. On OPA 1.0 and later, rule heads are written
# `deny contains msg if { ... }`.
package keda

deny[msg] {
  input.kind == "ScaledObject"
  not input.spec.fallback
  msg := "ScaledObject must define a fallback block to prevent scale-to-zero on metric failure"
}

deny[msg] {
  input.kind == "ScaledObject"
  trigger := input.spec.triggers[_]
  trigger.type == "prometheus"
  not contains(trigger.metadata.serverAddress, ".svc.cluster.local")
  msg := "Prometheus serverAddress must use fully qualified cluster DNS to ensure cross-namespace resolution"
}

# Run in CI pipeline. Conftest evaluates only the "main" package by default,
# so name this policy's package explicitly or the rules are silently skipped.
conftest test scaledobject.yaml --policy policy/ --namespace keda
2. Checkov custom check for KEDA
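Checkov has no built-in checks for KEDA CRDs, but it supports custom policies (Python or YAML) loaded with --external-checks-dir. Write one that asserts spec.fallback is present on ScaledObject resources, then select it with --check CKV_CUSTOM_KEDA_001 (an example ID; use whatever ID your custom check declares).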
3. Prometheus Alerting — alert before KEDA notices
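# Fires when the scrape target behind KEDA's query is down.
# If Prometheus itself is down it cannot evaluate this rule, so pair it
# with an external dead-man's-switch alert.
- alert: PrometheusTargetDown
  expr: up{job="keda-metrics-source"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Prometheus target used by KEDA is down: autoscaling blind"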
4. Helm/ArgoCD pre-sync hook
Add a helm test or ArgoCD PreSync hook that runs a wget or curl against the serverAddress before deploying ScaledObject manifests. Run it from within the KEDA operator's namespace, since the operator, not the workload, issues the query. Fail the sync if the endpoint is unreachable.
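A minimal sketch of what such a hook script could look like, assuming the hook container has curl and a POSIX shell and can read the ScaledObject manifest as a file. Both function names are hypothetical. The extraction is a plain text match, not a YAML parser, so it assumes one serverAddress per line with no trailing comment. The probe uses Prometheus' /-/healthy endpoint.

```shell
#!/bin/sh
# Print every serverAddress value found in a manifest, one per line,
# with surrounding quotes stripped.
extract_server_addresses() {
  sed -n 's/^[[:space:]]*serverAddress:[[:space:]]*//p' "$1" | tr -d "\"'"
}

# Probe each address; return non-zero if any of them is unreachable
# so the hook (and therefore the sync) fails.
check_prometheus_reachable() {
  status=0
  for addr in $(extract_server_addresses "$1"); do
    if curl -sS -f -m 5 "$addr/-/healthy" >/dev/null; then
      echo "ok: $addr"
    else
      echo "unreachable: $addr" >&2
      status=1
    fi
  done
  return "$status"
}
```

In a hook, the entrypoint would be something like `check_prometheus_reachable scaledobject.yaml || exit 1`. If Prometheus requires authentication, the curl call needs the same credentials the scaler uses.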
Bottom line: KEDA trusts that your metric backend is alive. It has no circuit breaker by default. You must build the safety net yourself via fallback, network policies, and CI-time policy enforcement.