Fixing KEDA 'scaledobject failed to get metrics' When Prometheus Is Down or Unreachable

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: KEDA's Prometheus scaler cannot reach serverAddress, causing ScaledObject to enter a failed metrics state — replicas freeze at last known count or scale to zero depending on fallback config.
  • How to fix it: Verify Prometheus is reachable from the KEDA operator pod, fix the serverAddress URL, confirm the PromQL query returns data, and configure a fallback replica floor.

The Incident (What Does the Error Mean?)

You will see this in kubectl describe scaledobject <name> -n <namespace> or in KEDA operator logs:

ERROR   scale_handler   failed to get metrics for scaledObject
{
  "scaledObject": "my-app/my-scaledobject",
  "error": "error getting metric source: unable to get external metric
            my-app-prometheus: unable to fetch metrics from prometheus:
            Get \"http://prometheus-server.monitoring.svc:9090/api/v1/query\":
            dial tcp: lookup prometheus-server.monitoring.svc: no such host"
}

Immediate consequence: KEDA cannot evaluate the trigger, so the HPA backing the ScaledObject receives no metric update. Depending on your fallback block and minReplicaCount, the deployment is pinned at the fallback count, freezes at its current replicas, or scales to zero. Unless the fallback was chosen deliberately, each of these can break production.


The Attack Vector / Blast Radius

This is not a security exploit — it is a cascading availability failure:

  1. Prometheus pod crash / OOMKilled / evicted → KEDA loses its only metric source.
  2. ScaledObject enters Failed condition → Kubernetes HPA stops receiving external metrics.
  3. Without a fallback block, the HPA has nothing to act on and holds the last replica count; on some KEDA versions a ScaledObject with minReplicaCount: 0 can additionally be scaled to zero once the cooldown period elapses, even while traffic is arriving.
  4. With a poorly chosen fallback, replicas are pinned at fallback.replicas regardless of real load: too low and the service under-provisions during a traffic spike, too high and it burns compute while idle.
  5. Network policy misconfiguration is the second most common cause: pods in the KEDA operator namespace cannot reach the Prometheus namespace on port 9090 (or cannot reach cluster DNS), so the scaler fails at DNS resolution or TCP connect.

Blast radius: Full autoscaling blindness. Every ScaledObject using that Prometheus endpoint is affected simultaneously.


How to Fix It

Step 0 — Confirm the actual failure point

# Check ScaledObject conditions
kubectl describe scaledobject my-scaledobject -n my-app

# Check KEDA operator logs directly
kubectl logs -n keda deployment/keda-operator --since=10m | grep -i "prometheus\|failed to get"

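# Test reachability FROM the KEDA namespace.
# The keda-operator image is typically distroless (no shell, no wget), so
# kubectl exec into it usually fails. Use a throwaway curl pod instead.
# curlimages/curl is an assumed image; substitute one your cluster allows.
kubectl run prom-check -n keda --rm -i --restart=Never \
  --image=curlimages/curl --command -- \
  curl -sS --max-time 5 \
  'http://prometheus-server.monitoring.svc.cluster.local:9090/api/v1/query?query=up'

If that request fails, the problem is network/DNS, not KEDA config.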


Basic Fix — Correct the serverAddress and add fallback

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-scaledobject
  namespace: my-app
spec:
  scaleTargetRef:
    name: my-deployment
  minReplicaCount: 1
  maxReplicaCount: 20
+ fallback:
+   failureThreshold: 3
+   replicas: 5
  triggers:
    - type: prometheus
      metadata:
-       serverAddress: http://prometheus:9090
+       serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090
-       query: http_requests_total
+       query: sum(rate(http_requests_total{namespace="my-app"}[2m]))
        threshold: "100"
+       ignoreNullValues: "false"

Key changes:

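  • serverAddress should include the Service's namespace, ideally as the fully qualified in-cluster DNS name (<service>.<namespace>.svc.cluster.local). The lookup runs from the KEDA operator pod, so a bare Service name such as prometheus only resolves if Prometheus lives in the KEDA namespace.
  • query must be a valid PromQL expression that returns a scalar or a single-element instant vector. A bare metric name usually matches many series, and the scaler rejects multi-element results. http_requests_total is also a counter, so it needs rate() before it is a useful scaling signal.
  • fallback.replicas: 5 holds the workload at a known replica count once three consecutive metric fetches have failed (failureThreshold: 3), instead of letting it freeze at a stale count or collapse to zero.
  • ignoreNullValues: "false" makes an empty query result count as an error, so a query that silently returns nothing also triggers the fallback instead of being read as zero load.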
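To see why threshold and fallback.replicas interact the way they do, here is a rough worked example. It assumes the trigger uses the default AverageValue metric type, ignores HPA tolerance and stabilization windows, and uses a hypothetical query result of 850 requests per second. The fallback line reflects KEDA's documented behavior of reporting threshold × fallback.replicas as the metric while the fallback is active.

```latex
\[
\text{desiredReplicas}
  = \left\lceil \frac{\text{query result}}{\text{threshold}} \right\rceil
  = \left\lceil \frac{850}{100} \right\rceil
  = 9
\]
\[
\text{fallback metric}
  = \text{threshold} \times \text{fallback.replicas}
  = 100 \times 5
  = 500
\quad\Rightarrow\quad
\left\lceil \frac{500}{100} \right\rceil = 5
\]
```

In both cases the result is still clamped to the range between minReplicaCount and maxReplicaCount (1 to 20 in the manifest above).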

Enterprise Best Practice — TriggerAuthentication + NetworkPolicy + Fallback

# 1. If Prometheus has auth enabled, use TriggerAuthentication
+apiVersion: keda.sh/v1alpha1
+kind: TriggerAuthentication
+metadata:
+  name: prometheus-auth
+  namespace: my-app
+spec:
+  secretTargetRef:
+    - parameter: bearerToken
+      name: prometheus-bearer-secret
+      key: token

# 2. ScaledObject referencing auth + hardened query
 apiVersion: keda.sh/v1alpha1
 kind: ScaledObject
 metadata:
   name: my-scaledobject
   namespace: my-app
 spec:
   scaleTargetRef:
     name: my-deployment
   minReplicaCount: 2
   maxReplicaCount: 50
+  pollingInterval: 30
+  cooldownPeriod: 60
+  fallback:
+    failureThreshold: 3
+    replicas: 10
   triggers:
     - type: prometheus
       metadata:
-        serverAddress: http://prometheus:9090
+        serverAddress: https://prometheus-server.monitoring.svc.cluster.local:9090
+        query: sum(rate(http_requests_total{job="my-app",namespace="my-app"}[2m]))
         threshold: "100"
+        ignoreNullValues: "false"
+        authModes: "bearer"   # required so the scaler reads bearerToken from the TriggerAuthentication
+      authenticationRef:
+        name: prometheus-auth

# 3. NetworkPolicy: allow ingress to Prometheus from the KEDA namespace
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-keda-to-prometheus
+  namespace: monitoring
+spec:
+  podSelector:
+    matchLabels:
+      app: prometheus
+  ingress:
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: keda
+      ports:
+        - protocol: TCP
+          port: 9090


Prevention in CI/CD

1. Validate ScaledObject manifests pre-deploy with Conftest/OPA

# policy/keda_scaledobject.rego
package keda

deny[msg] {
  input.kind == "ScaledObject"
  not input.spec.fallback
  msg := "ScaledObject must define a fallback block to prevent scale-to-zero on metric failure"
}

deny[msg] {
  input.kind == "ScaledObject"
  trigger := input.spec.triggers[_]
  trigger.type == "prometheus"
  not contains(trigger.metadata.serverAddress, ".svc.cluster.local")
  msg := "Prometheus serverAddress must use fully qualified cluster DNS to ensure cross-namespace resolution"
}

# Run in CI pipeline
conftest test scaledobject.yaml --policy policy/

2. Checkov custom check for KEDA

Checkov has no built-in checks for KEDA CRDs, but you can write a custom Python check that asserts spec.fallback is present, give it an ID such as CKV_CUSTOM_KEDA_001, load it with --external-checks-dir, and select it with --check CKV_CUSTOM_KEDA_001.

3. Prometheus Alerting — alert before KEDA notices

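# Fires when the scrape target behind the KEDA query is down.
# A Prometheus that is itself down cannot evaluate this rule, so also watch
# Prometheus from a second instance or an external probe.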
- alert: PrometheusTargetDown
  expr: up{job="keda-metrics-source"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Prometheus target used by KEDA is down — autoscaling blind"

4. Helm/ArgoCD pre-sync hook

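Add a helm test or ArgoCD PreSync hook that runs wget or curl against the serverAddress from inside the cluster before deploying ScaledObject manifests, and fail the sync if the endpoint is unreachable. A probe in the target namespace catches DNS and availability problems; because the KEDA operator issues the real queries, a probe in the KEDA namespace is also needed to catch NetworkPolicy gaps.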
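
As an illustration of the hook approach, the Job below is a minimal sketch of an ArgoCD PreSync probe. The Job name, the curlimages/curl image, and the timeouts are assumptions; the URL reuses the serverAddress from the examples above. curl --fail exits non-zero on an HTTP error or connection failure, which fails the Job and therefore the sync.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: prometheus-reachability-check     # illustrative name
  namespace: my-app
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 1                # retry once, then fail the sync
  activeDeadlineSeconds: 60      # give up if the probe hangs
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: probe
          image: curlimages/curl:latest   # assumed image; pin a version in practice
          command:
            - curl
            - --fail
            - --silent
            - --show-error
            - --max-time
            - "5"
            - "http://prometheus-server.monitoring.svc.cluster.local:9090/api/v1/query?query=up"
```

This probe runs in my-app, so it checks DNS and Prometheus availability but not the network path from the KEDA namespace. Running a copy in the keda namespace would cover that path too, assuming your ArgoCD project is allowed to create resources there.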


Bottom line: KEDA trusts that your metric backend is alive. It has no circuit breaker by default. You must build the safety net yourself via fallback, network policies, and CI-time policy enforcement.
