Fixing Prometheus Operator 'CRD Validation Failed' After Upgrading CRDs in Kubernetes

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on CR sprawl

TL;DR

  • What broke: After upgrading Prometheus Operator CRDs, existing PrometheusRule, ServiceMonitor, PodMonitor, or Prometheus CRs contain fields that are removed, renamed, or restructured in the new schema — the operator's webhook or controller rejects them on reconciliation.
  • How to fix it: Identify deprecated/removed fields via kubectl get crd prometheusrules.monitoring.coreos.com -o json | jq '.spec.versions', diff your live CRs against the new schema, and patch or migrate each offending resource.

The Incident (What Does the Error Mean?)

Raw error from kubectl describe prometheusrule or operator pod logs:

ERROR controller-runtime/controller "msg"="Reconciler error" 
  "error"="PrometheusRule.monitoring.coreos.com \"my-alerts\" is invalid: 
  spec.groups[0].rules[0].alert: Invalid value: \"high-error-rate\": 
  spec.groups[0].rules[0].alert in body should match '^[a-zA-Z_][a-zA-Z0-9_]*$'"

or the more destructive variant during a Helm upgrade:

Error: UPGRADE FAILED: cannot patch "prometheus-k8s" with kind Prometheus: 
  Prometheus.monitoring.coreos.com "prometheus-k8s" is invalid: 
  spec.retention: Invalid value: "15d": 
  spec.retention in body must be of type string: must match pattern '^[0-9]+(ms|s|m|h|d|w|y)$'
  spec.ruleSelector: field is immutable after creation

Immediate consequence: The Prometheus Operator skips reconciliation for every affected CR, or its controller loop crashes outright. Alerting rules stop syncing and new ServiceMonitor targets never get scraped. In the worst case, the operator pod enters CrashLoopBackOff and monitoring goes dark across the cluster.


The Attack Vector / Blast Radius

This is not a security exploit — it is an operational blast radius scenario with monitoring-wide consequences:

  1. Silent alert death: PrometheusRule CRs that fail validation are silently dropped from the Prometheus config. No error surfaces in Grafana. Your on-call team has no idea alerting is broken until an incident goes undetected.
  2. Cascading scrape loss: If the Prometheus CR itself fails validation, the operator stops generating the prometheus.yaml config entirely. Every scrape target — all your services — disappears from Prometheus.
  3. Helm rollback trap: Rolling back the Helm release does NOT downgrade CRDs. CRDs are cluster-scoped and Helm explicitly skips CRD deletion on rollback. You are now running old operator code against new CRD schemas, which is an undefined and dangerous state.
  4. Webhook enforcement: If the operator's validating admission webhook is enabled for prometheusrules.monitoring.coreos.com, the API server rejects every kubectl apply of a non-conforming PrometheusRule, in every namespace, until the schema mismatch is resolved.

Blast radius: 100% of Prometheus-scraped metrics, all alerting rules, and all recording rules across every namespace using the operator.


How to Fix It

Step 1: Identify the Exact Schema Break

# Find all CRD versions and their schema
kubectl get crd prometheusrules.monitoring.coreos.com -o json \
  | jq '.spec.versions[] | {name: .name, schema: .schema.openAPIV3Schema.properties.spec}'

# Validate all existing PrometheusRules against the live CRD schema.
# A server-side dry run re-submits each object through API validation and
# admission without persisting anything.
kubectl get prometheusrule -A -o yaml \
  | kubectl apply --dry-run=server -f -

# Check operator logs for specific field rejections
kubectl logs -n monitoring deploy/prometheus-operator --since=30m \
  | grep -E '(invalid|validation|failed|error)' | head -40
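If the filtered log is still noisy, one option is to collapse it into a count per rejected field path. The sketch below assumes rejections follow the "spec.<path>: Invalid value" shape shown in the errors above; rejected_field_paths is a hypothetical helper name, not part of kubectl or the operator.

```shell
#!/usr/bin/env bash
# Summarize rejected field paths from operator log text on stdin.
# List indices such as [0] are normalized to [] so that the same field
# rejected in different groups or rules is counted once per path.
rejected_field_paths() {
  grep -oE 'spec(\.[A-Za-z]+(\[[0-9]+\])?)+: Invalid value' \
    | sed -E 's/: Invalid value$//; s/\[[0-9]+\]/[]/g' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1 " " $2}'   # normalize uniq's padding to "<count> <path>"
}
```

Pipe the log command above into it, for example kubectl logs -n monitoring deploy/prometheus-operator --since=30m | rejected_field_paths, to see which fields account for most failures.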

Basic Fix — Patch the Offending CR

Identify the deprecated field and patch it directly. Common breakages by version:

Upgrade Path      Broken Field                      Fix
v0.52 → v0.60+    spec.baseImage (Prometheus CR)    Replace with spec.image
v0.60 → v0.65+    spec.thanos.baseImage             Replace with spec.thanos.image
v0.65 → v0.70+    spec.ruleSelector immutability    Recreate CR, do not patch
Any               alertname with special chars      Rename alert to match ^[a-zA-Z_][a-zA-Z0-9_]*$

# PrometheusRule fix: invalid alert name with hyphen
 apiVersion: monitoring.coreos.com/v1
 kind: PrometheusRule
 metadata:
   name: my-alerts
   namespace: monitoring
 spec:
   groups:
   - name: example
     rules:
-    - alert: high-error-rate
+    - alert: HighErrorRate
       expr: rate(http_errors_total[5m]) > 0.05
       for: 5m
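
To find every alert name that would trip the same pattern before the operator does, a line-based scan is often enough. This is a minimal sketch: find_invalid_alert_names is a hypothetical helper, and it assumes each alert is declared on a single "alert: <name>" line, which is the usual shape of kubectl's YAML output for PrometheusRules.

```shell
#!/usr/bin/env bash
# Print alert names from PrometheusRule YAML on stdin that do NOT match
# ^[a-zA-Z_][a-zA-Z0-9_]*$ (the pattern quoted in the error above).
# Line-based on purpose: no YAML parser is required, but multi-line or
# templated alert names are not handled.
find_invalid_alert_names() {
  sed -nE 's/^[[:space:]]*(-[[:space:]]+)?alert:[[:space:]]*//p' \
    | sed -E -e 's/[[:space:]]+$//' -e "s/^[\"']//" -e "s/[\"']\$//" \
    | grep -Ev '^[a-zA-Z_][a-zA-Z0-9_]*$' || true   # no matches is not an error
}
```

A typical invocation would be kubectl get prometheusrule -A -o yaml | find_invalid_alert_names; empty output means no alert name violates the pattern.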

# Prometheus CR fix: deprecated baseImage field
 apiVersion: monitoring.coreos.com/v1
 kind: Prometheus
 metadata:
   name: prometheus-k8s
   namespace: monitoring
 spec:
-  baseImage: quay.io/prometheus/prometheus
-  tag: v2.37.0
+  image: quay.io/prometheus/prometheus:v2.37.0
   replicas: 2
   retention: 15d

Enterprise Best Practice — Bulk Migration with Schema Diffing

# 1. Export all CRs before upgrade (DO THIS FIRST, always)
for crd in prometheusrules servicemonitors podmonitors probes prometheuses alertmanagers; do
  kubectl get ${crd} -A -o yaml > backup-${crd}-$(date +%Y%m%d).yaml
done

# 2. For kube-prometheus-stack Helm upgrades, dump the new chart version's
#    default values and merge your overrides into them before upgrading
helm show values prometheus-community/kube-prometheus-stack \
  --version <NEW_VERSION> > new-values.yaml

# 3. Dry-run the upgrade to surface all validation errors before applying
helm upgrade prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --dry-run \
  --values new-values.yaml \
  2>&1 | grep -A5 'invalid\|validation'

# 4. For immutable field changes (e.g., ruleSelector), you MUST delete and recreate
# Export, delete, re-apply — the operator will re-sync state from the CR
kubectl get prometheus prometheus-k8s -n monitoring -o yaml > prometheus-cr-backup.yaml
kubectl delete prometheus prometheus-k8s -n monitoring
# Edit prometheus-cr-backup.yaml to fix immutable fields
kubectl apply -f prometheus-cr-backup.yaml

# ServiceMonitor fix: removed 'targetPort' in favor of 'port' (string name)
 apiVersion: monitoring.coreos.com/v1
 kind: ServiceMonitor
 metadata:
   name: my-service-monitor
 spec:
   endpoints:
-  - targetPort: 8080
+  - port: metrics
     interval: 30s
     path: /metrics
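
Before patching resources one at a time, it can help to list every place the exported backups still set one of the fields named above. The sketch below is an assumption-laden convenience, not an operator tool: scan_deprecated_fields is a hypothetical name, the field list (baseImage, targetPort) is taken only from this article, and the match is textual, so run it against monitoring CR backups rather than Service manifests, where targetPort is a legitimate field.

```shell
#!/usr/bin/env bash
# Report file, line number, and content for lines that set a field this
# article lists as deprecated or removed. Accepts one or more YAML files.
scan_deprecated_fields() {
  grep -HnE '^[[:space:]]*(-[[:space:]]+)?(baseImage|targetPort):' "$@" || true
}
```

For example, scan_deprecated_fields backup-*.yaml over the Step 1 exports prints one file:line:content entry per offending field; nested fields such as spec.thanos.baseImage are caught because only the key name is matched.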


Prevention in CI/CD

1. Pre-Upgrade CRD Schema Validation Gate

Add this to your upgrade pipeline before any Helm upgrade runs:

# .github/workflows/prometheus-crd-validate.yaml
- name: Validate Prometheus CRs against new CRD schema
  run: |
    # Assumes kubeconform is already on the runner's PATH.
    # CRD schemas come from the community datreeio/CRDs-catalog, which may lag
    # or lead the operator version you are upgrading to.
    kubeconform \
      -schema-location default \
      -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
      -summary \
      ./monitoring/

2. OPA/Gatekeeper Policy — Enforce Alert Naming Convention

# ConstraintTemplate: EnforcePrometheusAlertNaming
package prometheusrule.alertnaming

violation[{"msg": msg}] {
  rule := input.review.object.spec.groups[_].rules[_]
  rule.alert
  not re_match(`^[a-zA-Z_][a-zA-Z0-9_]*$`, rule.alert)
  msg := sprintf("Alert name '%v' contains invalid characters. Must match ^[a-zA-Z_][a-zA-Z0-9_]*$", [rule.alert])
}

3. Helm Pre-Upgrade Hook — Backup CRs Automatically

# templates/pre-upgrade-backup-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backup-monitoring-crs
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "-10"
    "helm.sh/hook-delete-policy": hook-succeeded
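spec:
  template:
    spec:
      serviceAccountName: monitoring-backup-sa
      restartPolicy: Never  # required for Jobs: must be Never or OnFailure
      containers:
      - name: backup
        image: bitnami/kubectl:latest
        command:
        - /bin/sh
        - -c
        - |
          for crd in prometheusrules servicemonitors podmonitors prometheuses; do
            kubectl get ${crd} -A -o yaml > /backup/${crd}-$(date +%s).yaml
          done
        volumeMounts:
        - name: backup
          mountPath: /backup
      volumes:
      - name: backup
        persistentVolumeClaim:
          claimName: monitoring-backup-pvc  # hypothetical PVC; without a volume the files vanish with the pod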

4. Renovate/Dependabot — Pin CRD Versions Explicitly

Never allow automatic minor/patch bumps on kube-prometheus-stack without a validation gate. In renovate.json:

{
  "packageRules": [
    {
      "matchPackageNames": ["kube-prometheus-stack"],
      "matchManagers": ["helmv3"],
      "enabled": true,
      "automerge": false,
      "reviewers": ["platform-team"],
      "labels": ["requires-crd-migration-review"]
    }
  ]
}

The non-negotiable rule: CRD upgrades are one-way. Test in a staging cluster that mirrors production CR sprawl. Never upgrade CRDs and operator code simultaneously without a validated rollback plan for every CR type in use.
