How to Fix Argo Rollouts 'Analysis Run Failed' Canary Metric Errors
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: An Argo Rollouts AnalysisRun hit the Failed phase because the metric query returned no data, the successCondition expression evaluated to false, or the Prometheus/Datadog provider was misconfigured. The rollout is now blocked or auto-aborted.
- How to fix it: Correct the successCondition/failureCondition expressions, fix the metric provider endpoint or query, and validate with kubectl argo rollouts get rollout <name> --watch.
- Shortcut: Use our Client-Side Sandbox below to auto-refactor your failing AnalysisTemplate YAML without leaking your internal metric endpoints.
The Incident (What Does the Error Mean?)
Raw event output from a failing rollout:
Message: AnalysisRun default/myapp-canary-analysis-run-abc123 Failed
Reason: Metric "error-rate" assessed Failed due to failureCondition 'result > 0.05' evaluated true
Status: Phase: Failed
Message: metric "success-rate" completed. Phase: Failed
Or the silent killer — a metric that returns no data:
Message: metric "p99-latency" assessed Inconclusive
Reason: Insufficient data: no data returned from provider
Phase: Inconclusive -> (after failureLimit exceeded) -> Failed
Immediate consequence: The canary step is aborted and Argo Rollouts shifts traffic back to the stable ReplicaSet. The Rollout then stays in an aborted (Degraded) state until you retry it or push a new revision. A run that ends Inconclusive instead pauses the rollout at its current step until someone promotes or aborts it, so every later deploy of that service stalls behind it.
The Attack Vector / Blast Radius
This is a deployment availability failure, not a security exploit — but the blast radius is severe:
- Rollback storm: Auto-rollback fires, but if the stable version also has issues, you're oscillating between two bad states.
- Metric provider misconfiguration: A wrong Prometheus URL or a bad bearer token in secretKeyRef means every analysis run fails quietly with Inconclusive, and once failureLimit is exceeded, all canaries are blocked and you lose the ability to ship.
- Expression/result shape mismatch: successCondition: result[0] >= 0.99 only works when the query returns a vector. If label cardinality changes what the query returns (several series, or a scalar instead of a vector), the condition fails or checks the wrong series even when the real value is a healthy 0.9901, and the entire pipeline stalls.
- RBAC gap: If the argo-rollouts ServiceAccount lacks get/list on the AnalysisRun CRD or on the Secret containing the metric provider token, every run fails with a permissions error that looks like a metric failure.
How to Fix It (The Solution)
Root Cause Checklist
Run this first:
# Get the full failure reason
kubectl describe analysisrun <run-name> -n <namespace>
# Check controller logs for provider errors
kubectl logs -n argo-rollouts deploy/argo-rollouts | grep -i "analysis\|metric\|error" | tail -50
# Manually test your Prometheus query
curl -G 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
Basic Fix — Correcting successCondition and Query Scalar Return
The most common failure: Prometheus query returns a vector, but the condition expects a scalar.
  apiVersion: argoproj.io/v1alpha1
  kind: AnalysisTemplate
  metadata:
    name: error-rate-analysis
  spec:
    metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
-           sum(rate(http_requests_total{status=~"5..",app="myapp"}[5m]))
-           /
-           sum(rate(http_requests_total{app="myapp"}[5m]))
+           scalar(
+             sum(rate(http_requests_total{status=~"5..",app="myapp"}[5m]))
+             /
+             sum(rate(http_requests_total{app="myapp"}[5m]))
+           )
-     successCondition: result[0] < 0.05
+     successCondition: result < 0.05
-     failureCondition: result[0] >= 0.05
+     failureCondition: result >= 0.05
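To see why the indexing matters, the sketch below inspects a decoded Prometheus instant-query response and reports which condition form matches the shape of the value. The function name condition_form and both sample payloads are illustrative assumptions; only the resultType field and the vector/scalar layouts come from the Prometheus HTTP API.

```python
import json


def condition_form(response: dict) -> str:
    """Suggest how a condition should reference `result` for this response.

    `response` is the decoded JSON body of a Prometheus instant query.
    """
    data = response["data"]
    kind = data["resultType"]
    if kind == "scalar":
        # A scalar payload is [timestamp, "value"]: one float, no indexing.
        return "result"
    if kind == "vector":
        series = len(data["result"])
        if series == 0:
            return "no data"
        if series == 1:
            return "result[0]"
        # Several series: result[0] would silently pick just one of them.
        return f"ambiguous: {series} series"
    return f"unsupported: {kind}"


# Hypothetical payloads, shaped like /api/v1/query responses.
vector_resp = json.loads(
    '{"status": "success", "data": {"resultType": "vector", "result": '
    '[{"metric": {}, "value": [1700000000, "0.9901"]}]}}'
)
scalar_resp = json.loads(
    '{"status": "success", "data": {"resultType": "scalar", '
    '"result": [1700000000, "0.9901"]}}'
)

print(condition_form(vector_resp))  # result[0]
print(condition_form(scalar_resp))  # result
```

Running this against the JSON returned by the curl command in the checklist should show quickly whether the query and the condition agree on the result shape.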
Enterprise Best Practice — Full Hardened AnalysisTemplate
  apiVersion: argoproj.io/v1alpha1
  kind: AnalysisTemplate
  metadata:
    name: canary-success-rate
    namespace: production
  spec:
+   args:
+   - name: service-name
+   - name: namespace
    metrics:
    - name: success-rate
-     interval: 30s
-     count: 5
+     interval: 1m
+     count: 10
+     initialDelay: 2m # wait for metrics to populate post-deploy
      failureLimit: 2
+     inconclusiveLimit: 3 # prevent infinite Inconclusive -> stall
      provider:
        prometheus:
-         address: http://prometheus:9090
+         address: http://prometheus-operated.monitoring.svc.cluster.local:9090
          query: |
+           scalar(
            sum(rate(
-             http_requests_total{status!~"5.."}[2m]
+             http_requests_total{status!~"5..",service="{{args.service-name}}"}[5m]
            ))
            /
            sum(rate(
-             http_requests_total[2m]
+             http_requests_total{service="{{args.service-name}}"}[5m]
            ))
+           )
-     successCondition: result[0] >= 0.95
+     successCondition: result >= 0.99
-     failureCondition: result[0] < 0.95
+     failureCondition: result < 0.95
Critical additions explained:
- initialDelay: 2m gives canary pods time to receive traffic before metrics exist. Without it, the first query returns no data → Inconclusive → failure.
- inconclusiveLimit: 3 caps how many Inconclusive measurements are tolerated before the run is stopped.
- The scalar() wrapper forces Prometheus to return a single float instead of a labeled vector (the value is NaN if the inner expression does not yield exactly one series).
- Parameterized args let you reuse the same template across services without hardcoding label selectors.
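The hardened template leaves a gap between its two thresholds (0.95 to 0.99), and that gap is what inconclusiveLimit is for. The sketch below is a simplified model of how a single measurement is assessed when both conditions are set, or only one of them. The function assess and the lambda predicates are hypothetical stand-ins for the controller's expression evaluation, not its actual code.

```python
import math


def assess(value, success=None, failure=None):
    """Simplified model of assessing one measurement.

    `success` and `failure` are predicates standing in for
    successCondition and failureCondition.
    """
    if success is None and failure is None:
        raise ValueError("this sketch expects at least one condition")
    if success is not None and failure is not None:
        if failure(value):
            return "Failed"
        if success(value):
            return "Successful"
        # Neither condition matched: the measurement is undecided.
        return "Inconclusive"
    if failure is not None:
        # Only a failureCondition: anything that does not fail succeeds.
        return "Failed" if failure(value) else "Successful"
    # Only a successCondition: anything that does not succeed fails.
    return "Successful" if success(value) else "Failed"


def ok(result):
    return result >= 0.99   # successCondition from the template


def bad(result):
    return result < 0.95    # failureCondition from the template


print(assess(0.995, ok, bad))     # Successful
print(assess(0.97, ok, bad))      # Inconclusive
print(assess(0.90, ok, bad))      # Failed
print(assess(math.nan, ok, bad))  # Inconclusive
```

The last call is the interesting one: every comparison against NaN is false, so a scalar() query over an empty or multi-series vector lands in the undecided bucket rather than failing outright.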
Prevention in CI/CD
1. Validate AnalysisTemplates Pre-Merge with kubectl --dry-run
kubectl apply --dry-run=server -f analysis-template.yaml
2. OPA/Conftest Policy — Enforce initialDelay and inconclusiveLimit
# policy/analysis_template.rego
package argorollouts

deny[msg] {
  input.kind == "AnalysisTemplate"
  metric := input.spec.metrics[_]
  not metric.initialDelay
  msg := sprintf("Metric '%v' is missing initialDelay: will fail on cold start", [metric.name])
}

deny[msg] {
  input.kind == "AnalysisTemplate"
  metric := input.spec.metrics[_]
  not metric.inconclusiveLimit
  msg := sprintf("Metric '%v' has no inconclusiveLimit: rollout can stall indefinitely", [metric.name])
}
# In your CI pipeline
conftest test analysis-template.yaml --policy policy/
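If conftest is not available in a given pipeline, the same two checks can be approximated in a few lines of Python. This is a minimal sketch that assumes the template has already been parsed into a dict (for example by a YAML loader); the function name lint_analysis_template and the sample template are illustrative only.

```python
def lint_analysis_template(doc: dict) -> list[str]:
    """Return one message per metric field the policy requires but is absent."""
    if doc.get("kind") != "AnalysisTemplate":
        return []
    problems = []
    for metric in doc.get("spec", {}).get("metrics", []):
        name = metric.get("name", "<unnamed>")
        for field in ("initialDelay", "inconclusiveLimit"):
            if field not in metric:
                problems.append(f"metric '{name}' is missing {field}")
    return problems


# Hypothetical parsed template: initialDelay is set, inconclusiveLimit is not.
template = {
    "kind": "AnalysisTemplate",
    "spec": {
        "metrics": [
            {"name": "success-rate", "interval": "1m", "initialDelay": "2m"},
        ]
    },
}

for problem in lint_analysis_template(template):
    print(problem)  # metric 'success-rate' is missing inconclusiveLimit
```

A CI step could exit non-zero whenever the returned list is non-empty, mirroring the deny rules in the Rego policy.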
3. Prometheus Query Smoke Test in CI
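# Validate that the query yields exactly one value (a scalar or a one-series vector)
RESPONSE=$(curl -sf -G "$PROM_URL/api/v1/query" \
  --data-urlencode "query=$METRIC_QUERY") || { echo "ERROR: Prometheus query failed"; exit 1; }
# A scalar result is a [timestamp, value] pair (length 2), so count it as one value
COUNT=$(echo "$RESPONSE" | jq 'if .data.resultType == "scalar" then 1 else (.data.result | length) end')
[ "$COUNT" -eq 1 ] || { echo "ERROR: query returned $COUNT values, expected exactly 1"; exit 1; }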
4. RBAC — Minimum ServiceAccount Permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-rollouts-analysis
rules:
- apiGroups: ["argoproj.io"]
  resources: ["analysisruns", "analysistemplates", "clusteranalysistemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["secrets"] # for metric provider tokens
  verbs: ["get"]
Bind this to the argo-rollouts ServiceAccount in every namespace where rollouts run. Missing secrets/get is a silent killer — the controller logs a permissions error that surfaces as a metric failure, not an RBAC error.