Why does my Argo Rollouts analysis run show 'Inconclusive' instead of 'Failed' or 'Successful'?

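Inconclusive means a measurement satisfied neither your successCondition nor your failureCondition, typically because the metric query returned no usable data. The most common causes are: (1) the canary pod hasn't received enough traffic yet, so add `initialDelay: 2m` to your metric spec; (2) your Prometheus query returns a labeled vector while the condition expects a scalar, so wrap the query in `scalar()`; (3) the metric provider endpoint is unreachable, although that usually surfaces as Error measurements rather than Inconclusive ones. Once `inconclusiveLimit` is exceeded, the metric is marked Inconclusive and the rollout pauses at its current step until you promote or abort it; it does not transition to Failed on its own.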

How do I test my Prometheus query before deploying an AnalysisTemplate?

Run the query directly against your Prometheus API: `curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=YOUR_QUERY'`. Check `data.resultType` in the response. A `scalar()` query reports `scalar`, and `data.result` is a single `[timestamp, "value"]` pair. An aggregated query reports `vector`, and `data.result` should contain exactly one element. If `result` holds multiple labeled entries, your query is returning a multi-series vector: wrap it in `scalar()` or use aggregation operators like `sum()` without grouping labels to collapse it.
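To make that check repeatable, the response shape can be classified with a small shell helper. This is a sketch rather than an official tool: `prom_result_shape` is a hypothetical function name, it assumes `jq` is installed, and it only inspects the `status`, `data.resultType`, and `data.result` fields of the Prometheus HTTP API response.

```shell
# Classify a Prometheus /api/v1/query response read from stdin.
# Prints one of: scalar, single, multi, empty, error.
# prom_result_shape is an illustrative name; requires jq.
prom_result_shape() {
  jq -r '
    if .status != "success" then "error"
    elif .data.resultType == "scalar" then "scalar"
    elif .data.resultType == "vector" then
      (.data.result | length
       | if . == 0 then "empty" elif . == 1 then "single" else "multi" end)
    else "error"
    end'
}
```

Piping the `curl` command above into `prom_result_shape` shows which condition style fits: `scalar` pairs with `result < 0.05`, `single` pairs with `result[0] < 0.05`, and `multi` or `empty` means the query needs aggregation or has no data yet.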

Can I rerun a failed AnalysisRun without re-triggering the entire rollout?

No. AnalysisRuns are immutable once they reach a terminal phase (Successful, Failed, Error, Inconclusive). To retry, you must either: (1) retry the aborted Rollout with `kubectl argo rollouts retry rollout <name>`, which creates a new AnalysisRun; or (2) if the rollout is in a Degraded state, use `kubectl argo rollouts abort <name>` followed by fixing the AnalysisTemplate and re-deploying. You cannot mutate a completed AnalysisRun object.

How to Fix Argo Rollouts 'Analysis Run Failed' Canary Metric Errors

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: Argo Rollouts AnalysisRun hit Failed phase because the metric query returned no data, the successCondition expression evaluated false, or the Prometheus/Datadog provider was misconfigured — rollout is now permanently blocked or auto-aborted.
How to fix it: Correct the successCondition/failureCondition expressions, fix the metric provider endpoint or query, and validate with kubectl argo rollouts get rollout <name> --watch.

The Incident (What Does the Error Mean?)

Raw event output from a failing rollout:

Message: AnalysisRun default/myapp-canary-analysis-run-abc123 Failed
Reason:  Metric "error-rate" assessed Failed due to failureCondition 'result > 0.05' evaluated true
Status:  Phase: Failed
         Message: metric "error-rate" completed. Phase: Failed

Or the silent killer — a metric that returns no data:

Message: metric "p99-latency" assessed Inconclusive
Reason:  Insufficient data: no data returned from provider
Phase:   Inconclusive -> (after inconclusiveLimit exceeded) -> rollout paused

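Immediate consequence: When the run fails, the canary step is aborted and Argo Rollouts automatically rolls back to the stable ReplicaSet. If progressDeadlineSeconds is tight, you can get cascading rollback loops. An Inconclusive run pauses the rollout instead, and it stays paused until someone promotes or aborts it, blocking every subsequent deploy of that Rollout. Blue-green rollouts with autoPromotionEnabled: false stall the same way while they wait for manual promotion.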

The Attack Vector / Blast Radius

This is a deployment availability failure, not a security exploit — but the blast radius is severe:

Rollback storm: Auto-rollback fires, but if the stable version also has issues, you're oscillating between two bad states.
Metric provider misconfiguration: A wrong Prometheus URL or bad bearer token in secretKeyRef makes every measurement error out. Once consecutiveErrorLimit (default 4) is exceeded the run ends in Error, and every canary that uses the template is blocked until the provider config is fixed, so you lose the ability to ship.
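Condition and result shape mismatch: successCondition: result[0] >= 0.99 assumes the query returns a vector, while result >= 0.99 assumes a scalar. If label cardinality makes the query return several series, result[0] evaluates only one of them, so a healthy aggregate such as 0.9901 can still be judged on a single series. The entire pipeline stalls.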
RBAC gap: If the argo-rollouts ServiceAccount lacks get/list on the AnalysisRun CRD or the Secret containing the metric provider token, every run fails with a permissions error that looks like a metric failure.

How to Fix It (The Solution)

Root Cause Checklist

Run this first:

# Get the full failure reason
kubectl describe analysisrun <run-name> -n <namespace>

# Check controller logs for provider errors
kubectl logs -n argo-rollouts deploy/argo-rollouts | grep -i "analysis\|metric\|error" | tail -50

# Manually test your Prometheus query
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'

Basic Fix — Correcting `successCondition` and Query Scalar Return

The most common failure: Prometheus query returns a vector, but the condition expects a scalar.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-analysis
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
-         sum(rate(http_requests_total{status=~"5..",app="myapp"}[5m]))
-         /
-         sum(rate(http_requests_total{app="myapp"}[5m]))
+         scalar(
+           sum(rate(http_requests_total{status=~"5..",app="myapp"}[5m]))
+           /
+           sum(rate(http_requests_total{app="myapp"}[5m]))
+         )
-   successCondition: result[0] < 0.05
+   successCondition: result < 0.05
-   failureCondition: result[0] >= 0.05
+   failureCondition: result >= 0.05

Enterprise Best Practice — Full Hardened AnalysisTemplate

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
  namespace: production
spec:
+ args:
+ - name: service-name
+ - name: namespace
  metrics:
  - name: success-rate
-   interval: 30s
-   count: 5
+   interval: 1m
+   count: 10
+   initialDelay: 2m   # wait for metrics to populate post-deploy
    failureLimit: 2
+   inconclusiveLimit: 3  # tolerate up to 3 Inconclusive measurements (the default is 0)
    provider:
      prometheus:
-       address: http://prometheus:9090
+       address: http://prometheus-operated.monitoring.svc.cluster.local:9090
        query: |
+         scalar(
            sum(rate(
-             http_requests_total{status!~"5.."}[2m]
+             http_requests_total{status!~"5..",service="{{args.service-name}}"}[5m]
            ))
            /
            sum(rate(
-             http_requests_total[2m]
+             http_requests_total{service="{{args.service-name}}"}[5m]
            ))
+         )
-   successCondition: result[0] >= 0.95
+   successCondition: result >= 0.99
-   failureCondition: result[0] < 0.95
+   failureCondition: result < 0.95

Critical additions explained:

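initialDelay: 2m gives canary pods time to receive traffic before the first measurement. Without it, the first query returns no data and the measurement comes back Inconclusive.
inconclusiveLimit: 3 tolerates up to three Inconclusive measurements (the default is 0) before the metric is marked Inconclusive and the rollout pauses for a manual promote or abort.
The scalar() wrapper forces Prometheus to return a single float, not a labeled vector. It returns NaN when the inner vector does not contain exactly one series, and NaN satisfies neither condition.
Parameterized args let you reuse the same template across services without hardcoding label selectors.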
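
One consequence of the thresholds in the template above is worth spelling out: a value between 0.95 and 0.99 satisfies neither condition, so the measurement is Inconclusive and counts against inconclusiveLimit. The sketch below mimics that decision for a single measurement. `assess_measurement` is a hypothetical helper, not part of Argo Rollouts, and it hardcodes the two comparisons (`result >= 0.99`, `result < 0.95`) as defaults instead of evaluating real condition expressions.

```shell
# Sketch of how one measurement is classified when BOTH conditions are set.
#   $1 = measured value
#   $2 = success threshold (successCondition: result >= $2, default 0.99)
#   $3 = failure threshold (failureCondition: result <  $3, default 0.95)
assess_measurement() {
  local value=$1
  local success_min=${2:-0.99}
  local failure_below=${3:-0.95}
  # NaN (for example scalar() over an empty vector) satisfies neither comparison.
  case "$value" in
    NaN|nan) echo "Inconclusive"; return ;;
  esac
  awk -v v="$value" -v s="$success_min" -v f="$failure_below" 'BEGIN {
    if (v + 0 < f + 0)        print "Failed"        # failureCondition is true
    else if (v + 0 >= s + 0)  print "Successful"    # successCondition is true
    else                      print "Inconclusive"  # neither condition is true
  }'
}
```

For example, `assess_measurement 0.97` prints Inconclusive. If that band should count as a pass, raise the failure threshold or lower the success threshold so the two conditions meet.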

Prevention in CI/CD

1. Validate AnalysisTemplates Pre-Merge with `kubectl --dry-run`

kubectl apply --dry-run=server -f analysis-template.yaml

2. OPA/Conftest Policy — Enforce `initialDelay` and `inconclusiveLimit`

# policy/analysis_template.rego
package argorollouts

deny[msg] {
  input.kind == "AnalysisTemplate"
  metric := input.spec.metrics[_]
  not metric.initialDelay
  msg := sprintf("Metric '%v' is missing initialDelay — will fail on cold start", [metric.name])
}

deny[msg] {
  input.kind == "AnalysisTemplate"
  metric := input.spec.metrics[_]
  not metric.inconclusiveLimit
  msg := sprintf("Metric '%v' has no inconclusiveLimit: the rollout pauses on the first Inconclusive measurement", [metric.name])
}

# In your CI pipeline
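# --namespace is required because the policy package is "argorollouts", not Conftest's default "main"
conftest test analysis-template.yaml --policy policy/ --namespace argorollouts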

3. Prometheus Query Smoke Test in CI

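# Validate the query returns exactly one usable value before it ever hits production.
# A scalar() query reports resultType "scalar" with result [timestamp, "value"];
# an aggregated query reports a "vector" that must hold exactly one element.
curl -sf -G "$PROM_URL/api/v1/query" --data-urlencode "query=$METRIC_QUERY" \
  | jq -e '.data | (.resultType == "scalar" and .result[1] != "NaN")
                   or (.resultType == "vector" and (.result | length) == 1)' > /dev/null \
  || { echo "ERROR: query returns a multi-series vector or no data"; exit 1; }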

4. RBAC — Minimum ServiceAccount Permissions

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argo-rollouts-analysis
rules:
- apiGroups: ["argoproj.io"]
  resources: ["analysisruns", "analysistemplates", "clusteranalysistemplates"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["secrets"]  # for metric provider tokens
  verbs: ["get"]

Bind this to the argo-rollouts ServiceAccount in every namespace where rollouts run. Missing secrets/get is a silent killer — the controller logs a permissions error that surfaces as a metric failure, not an RBAC error.
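
A quick way to confirm the binding took effect is to impersonate the ServiceAccount with `kubectl auth can-i`. The sketch below assumes the controller runs as the `argo-rollouts` ServiceAccount in the `argo-rollouts` namespace (override with the hypothetical `SA_NAMESPACE` and `SA_NAME` variables), and `check_analysis_rbac` is an illustrative name. Impersonation also requires that your own user is allowed to use `--as`.

```shell
# Check whether the controller's ServiceAccount can read what analysis needs.
#   $1 = namespace where the Rollout and its metric-provider Secret live
# Prints one line per permission; returns non-zero if any is missing.
check_analysis_rbac() {
  target_ns=$1
  sa="system:serviceaccount:${SA_NAMESPACE:-argo-rollouts}:${SA_NAME:-argo-rollouts}"
  failed=0
  for check in "get:secrets" \
               "create:analysisruns.argoproj.io" \
               "get:analysistemplates.argoproj.io"; do
    verb=${check%%:*}       # text before the first colon
    resource=${check#*:}    # text after the first colon
    # kubectl auth can-i exits 0 when allowed and non-zero when denied.
    if kubectl auth can-i "$verb" "$resource" --as="$sa" -n "$target_ns" >/dev/null 2>&1; then
      echo "ok $verb $resource"
    else
      echo "MISSING $verb $resource"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_analysis_rbac production` before a deploy turns a confusing metric failure into an explicit `MISSING get secrets` line.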