How to Fix Argo Rollouts 'Analysis Run Failed' Canary Metric Errors

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: Argo Rollouts AnalysisRun hit Failed phase because the metric query returned no data, the successCondition expression evaluated false, or the Prometheus/Datadog provider was misconfigured — rollout is now permanently blocked or auto-aborted.
  • How to fix it: Correct the successCondition/failureCondition expressions (Argo Rollouts evaluates them with the expr library, not CEL), fix the metric provider endpoint or query, and validate with kubectl argo rollouts get rollout <name> --watch.

The Incident (What Does the Error Mean?)

Raw event output from a failing rollout:

Message: AnalysisRun default/myapp-canary-analysis-run-abc123 Failed
Reason:  Metric "error-rate" assessed Failed due to failureCondition 'result > 0.05' evaluated true
Status:  Phase: Failed
         Message: metric "error-rate" completed. Phase: Failed

Or the silent killer — a metric that returns no data:

Message: metric "p99-latency" assessed Inconclusive
Reason:  Insufficient data: no data returned from provider
Phase:   Inconclusive -> (after failureLimit exceeded) -> Failed

Immediate consequence: The canary step is aborted and Argo Rollouts automatically rolls back to the stable ReplicaSet. If progressDeadlineSeconds is tight, you can get cascading rollback loops. An Inconclusive run (or, for blue-green rollouts, autoPromotionEnabled: false) instead leaves the rollout paused until someone promotes or aborts it, which blocks every subsequent deploy of that Rollout.


The Attack Vector / Blast Radius

This is a deployment availability failure, not a security exploit — but the blast radius is severe:

  1. Rollback storm: Auto-rollback fires, but if the stable version also has issues, you're oscillating between two bad states.
  2. Metric provider misconfiguration: A wrong Prometheus URL or a bad bearer token in secretKeyRef makes every measurement error out or come back Inconclusive. Once the applicable limit (failureLimit, inconclusiveLimit, or consecutiveErrorLimit) is exceeded, every canary that uses the template is blocked and you lose the ability to ship.
  3. Expression and result-type mismatch: successCondition: result[0] >= 0.99 indexes a vector, so it only works while the query returns a non-empty vector. If the query is wrapped in scalar(), or the vector comes back empty, the expression typically errors instead of comparing against a value like 0.9901, and the entire pipeline stalls.
  4. RBAC gap: If the argo-rollouts ServiceAccount lacks get/list on the AnalysisRun CRD or the Secret containing the metric provider token, every run fails with a permissions error that looks like a metric failure.

How to Fix It (The Solution)

Root Cause Checklist

Run this first:

# Get the full failure reason
kubectl describe analysisrun <run-name> -n <namespace>

# Check controller logs for provider errors
kubectl logs -n argo-rollouts deploy/argo-rollouts | grep -i "analysis\|metric\|error" | tail -50

# Manually test your Prometheus query
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
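
When the describe output is long, a per-metric summary can be easier to scan. The sketch below wraps kubectl get with a jsonpath template over status.metricResults. The function name summarize_analysisrun is hypothetical, and it assumes your controller version populates the failed, inconclusive, and error counters.

```shell
# Print one line per metric: name, phase, counters, and the last message.
# $1 = AnalysisRun name, $2 = namespace (both supplied by the caller).
summarize_analysisrun() {
  kubectl get analysisrun "$1" -n "$2" \
    -o jsonpath='{range .status.metricResults[*]}{.name}{"\t"}{.phase}{"\t"}failed={.failed} inconclusive={.inconclusive} error={.error}{"\t"}{.message}{"\n"}{end}'
}
```

Call it as summarize_analysisrun <run-name> <namespace>. A metric whose error counter climbs while failed stays empty usually points at the provider or at expression evaluation rather than at a value crossing the threshold.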

Basic Fix — Correcting successCondition and Query Scalar Return

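The most common failure: the condition and the query result type disagree. An instant-vector query must be indexed as result[0], which breaks when the vector is empty; wrapping the query in scalar() returns a single float that the condition can compare as result directly.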

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-analysis
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
-         sum(rate(http_requests_total{status=~"5..",app="myapp"}[5m]))
-         /
-         sum(rate(http_requests_total{app="myapp"}[5m]))
+         scalar(
+           sum(rate(http_requests_total{status=~"5..",app="myapp"}[5m]))
+           /
+           sum(rate(http_requests_total{app="myapp"}[5m]))
+         )
-   successCondition: result[0] < 0.05
+   successCondition: result < 0.05
-   failureCondition: result[0] >= 0.05
+   failureCondition: result >= 0.05
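
An aborted rollout typically stays aborted after the template is corrected, so the fix needs a follow-up step. A minimal sketch, assuming the corrected template is saved as error-rate-analysis.yaml (a hypothetical filename) and the kubectl argo rollouts plugin is installed; apply_fix_and_retry is an illustrative name, not part of the plugin.

```shell
# Apply the corrected AnalysisTemplate, retry the aborted rollout,
# then print its current state. $1 = rollout name, $2 = namespace.
apply_fix_and_retry() {
  kubectl apply -n "$2" -f error-rate-analysis.yaml &&
    kubectl argo rollouts retry rollout "$1" -n "$2" &&
    kubectl argo rollouts get rollout "$1" -n "$2"
}
```

For example, apply_fix_and_retry myapp default. Each command runs only if the previous one succeeded, so a rejected template never triggers a retry.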

Enterprise Best Practice — Full Hardened AnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
  namespace: production
spec:
+ args:
+ - name: service-name
+ - name: namespace
  metrics:
  - name: success-rate
-   interval: 30s
-   count: 5
+   interval: 1m
+   count: 10
+   initialDelay: 2m   # wait for metrics to populate post-deploy
    failureLimit: 2
+   inconclusiveLimit: 3  # prevent infinite Inconclusive -> stall
    provider:
      prometheus:
-       address: http://prometheus:9090
+       address: http://prometheus-operated.monitoring.svc.cluster.local:9090
        query: |
+         scalar(
            sum(rate(
-             http_requests_total{status!~"5.."}[2m]
+             http_requests_total{status!~"5..",service="{{args.service-name}}"}[5m]
            ))
            /
            sum(rate(
-             http_requests_total[2m]
+             http_requests_total{service="{{args.service-name}}"}[5m]
            ))
+         )
-   successCondition: result[0] >= 0.95
+   successCondition: result >= 0.99
-   failureCondition: result[0] < 0.95
+   failureCondition: result < 0.95

Critical additions explained:

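  • initialDelay: 2m gives canary pods time to receive traffic before the first measurement. Without it, the first query returns no data, which ends in Inconclusive and then failure.
  • inconclusiveLimit: 3 tolerates up to three Inconclusive measurements (results that satisfy neither condition, such as NaN from an empty query or a success rate between 0.95 and 0.99) before the run itself ends Inconclusive and the rollout pauses.
  • scalar() wrapper forces Prometheus to return a single float rather than a labeled vector; the float is NaN when the inner vector does not contain exactly one sample.
  • Parameterized args let you reuse the same template across services without hardcoding label selectors.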
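
Because the hardened template declares service-name and namespace without default values, the Rollout that references it has to supply both. The sketch below writes a hypothetical Rollout manifest showing where those values go; the rollout name, image, replica count, and step weights are placeholders, not values from the original configuration.

```shell
# Write an illustrative Rollout that passes args to the
# canary-success-rate AnalysisTemplate from a canary analysis step.
cat > rollout-with-analysis.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp                # placeholder name
  namespace: production      # must match the template's namespace
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.2.3   # placeholder image
  strategy:
    canary:
      steps:
      - setWeight: 20
      - analysis:
          templates:
          - templateName: canary-success-rate
          args:
          - name: service-name
            value: myapp
          - name: namespace
            value: production
      - setWeight: 100
EOF
```

The file can then be checked the same way as the template, with kubectl apply --dry-run=server -f rollout-with-analysis.yaml.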


Prevention in CI/CD

1. Validate AnalysisTemplates Pre-Merge with kubectl --dry-run

kubectl apply --dry-run=server -f analysis-template.yaml

2. OPA/Conftest Policy — Enforce initialDelay and inconclusiveLimit

# policy/analysis_template.rego
package argorollouts

deny[msg] {
  input.kind == "AnalysisTemplate"
  metric := input.spec.metrics[_]
  not metric.initialDelay
  msg := sprintf("Metric '%v' is missing initialDelay — will fail on cold start", [metric.name])
}

deny[msg] {
  input.kind == "AnalysisTemplate"
  metric := input.spec.metrics[_]
  not metric.inconclusiveLimit
  msg := sprintf("Metric '%v' has no inconclusiveLimit — rollout can stall indefinitely", [metric.name])
}
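
# In your CI pipeline; --namespace is needed because the policy package is argorollouts, not conftest's default main
conftest test analysis-template.yaml --policy policy/ --namespace argorollouts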

3. Prometheus Query Smoke Test in CI

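# Validate that the query returns a usable scalar before it ever hits production
RESPONSE=$(curl -sf -G "$PROM_URL/api/v1/query" \
  --data-urlencode "query=$METRIC_QUERY") || { echo "ERROR: Prometheus request failed"; exit 1; }

# A scalar() query yields resultType "scalar" and result [<timestamp>, "<value>"]
echo "$RESPONSE" | jq -e '.data.resultType == "scalar" and .data.result[1] != "NaN"' > /dev/null \
  || { echo "ERROR: Query returns vector or no data"; exit 1; }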

4. RBAC — Minimum ServiceAccount Permissions

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-rollouts-analysis
rules:
- apiGroups: ["argoproj.io"]
  resources: ["analysisruns", "analysistemplates", "clusteranalysistemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["secrets"]  # for metric provider tokens
  verbs: ["get"]

Bind this to the argo-rollouts ServiceAccount in every namespace where rollouts run. Missing secrets/get is a silent killer — the controller logs a permissions error that surfaces as a metric failure, not an RBAC error.
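
One way to do the binding and confirm it took effect is sketched below. It assumes the controller runs as ServiceAccount argo-rollouts in namespace argo-rollouts (override SA_NAME and SA_NAMESPACE if yours differs); bind_and_verify is a hypothetical helper name.

```shell
# Assumed controller identity; adjust to match your installation.
SA_NAMESPACE="${SA_NAMESPACE:-argo-rollouts}"
SA_NAME="${SA_NAME:-argo-rollouts}"

# For each namespace given, bind the ClusterRole with a RoleBinding and
# then check that the controller identity can read Secrets there.
# kubectl create fails if the RoleBinding already exists.
bind_and_verify() {
  for ns in "$@"; do
    kubectl create rolebinding argo-rollouts-analysis \
      --clusterrole=argo-rollouts-analysis \
      --serviceaccount="${SA_NAMESPACE}:${SA_NAME}" \
      -n "$ns" || return 1
    kubectl auth can-i get secrets \
      --as="system:serviceaccount:${SA_NAMESPACE}:${SA_NAME}" \
      -n "$ns" || { echo "RBAC gap in namespace $ns" >&2; return 1; }
  done
}
```

For example, bind_and_verify production staging. The can-i check exits non-zero when access is denied, so the function stops at the first namespace with a gap.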
