
Fixing Cert-Manager 'Certificate Request Failed' Due to Let's Encrypt Rate Limits

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–60 mins (rate limit window may force a 1-hour to 7-day wait)

TL;DR

  • What broke: Cert-Manager is repeatedly submitting ACME certificate orders to Let's Encrypt, exhausting the rate limit (5 failed validations/hostname/hour, 50 certs/registered domain/week). All new CertificateRequest objects are rejected with 429 Too Many Requests.
  • How to fix it: Stop the bleeding immediately by deleting the stuck Order and pausing any GitOps sync that keeps recreating it, diagnose the root cause (wrong solver, missing DNS record, misconfigured ingress), switch to the Let's Encrypt Staging endpoint to test, then re-issue against production once validated.

The Incident (What Does the Error Mean?)

Raw error from kubectl describe certificaterequest <name> -n <namespace>:

Status:
  Conditions:
    Message: Failed to wait for order resource "example-com-xyz" to become ready: 
             order is in "invalid" state: 
             Failed to finalize order: 429 urn:ietf:params:acme:error:rateLimited :: 
             Error creating new order :: too many certificates already issued for 
             registered domain "example.com": see https://letsencrypt.org/docs/rate-limits/
    Reason:  Failed
    Status:  False
    Type:    Ready

And from cert-manager controller logs (kubectl logs -n cert-manager deploy/cert-manager):

E0610 14:32:01.123456 1 sync.go:182] cert-manager/controller/orders 
  "msg"="Failed to determine the list of Challenge resources needed for the Order" 
  "error"="429: too many certificates already issued for: example.com"

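Immediate consequence: Every Certificate resource requesting a new name set under the affected registered domain is stuck with Ready=False. Ingress TLS termination fails for new deployments. Existing certificates continue serving until expiry. Renewals of an unchanged name set are generally exempt from the per-registered-domain limit, but they remain subject to Let's Encrypt's duplicate-certificate limit, so a loop that keeps re-issuing the same names can still block rotation for up to 7 days (the weekly issuance window).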


The Attack Vector / Blast Radius

This is not a security exploit — it is a self-inflicted denial-of-service against your own certificate pipeline. The blast radius cascades as follows:

  1. Helm chart re-deploys or GitOps reconciliation loops (ArgoCD, Flux) repeatedly delete and recreate Certificate objects, each triggering a new ACME Order. The quota of 50 certs/registered domain/week can be consumed in minutes.
  2. Wildcard vs. SAN misconfiguration causes cert-manager to request *.example.com AND example.com as separate orders instead of a single SAN certificate — doubling consumption.
  3. HTTP-01 solver misconfiguration (wrong ingress class, missing /.well-known/acme-challenge/ route) causes every validation attempt to fail. Let's Encrypt's 5 failed validations per hostname per hour limit triggers independently and stacks on top of the issuance limit.
  4. Once rate-limited, all subdomains under the registered domain (api.example.com, app.example.com) are blocked — not just the offending hostname.
  5. Staging environments sharing the same registered domain as production will consume the same production rate limit quota if pointed at the production ACME endpoint.
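
To make item 1 concrete, here is a rough back-of-the-envelope sketch in Python. The sync interval and the number of Certificates recreated per sync are hypothetical inputs, and the model pessimistically assumes every recreation ends in a successful issuance that counts against the 50/week limit cited above.

```python
import math

WEEKLY_CERT_LIMIT = 50  # certs per registered domain per week, as cited above


def minutes_until_exhausted(sync_interval_s: int, certs_per_sync: int,
                            limit: int = WEEKLY_CERT_LIMIT) -> float:
    """Minutes until `limit` issuances are consumed, assuming every sync
    re-issues `certs_per_sync` certificates (a deliberately pessimistic model)."""
    if sync_interval_s <= 0 or certs_per_sync <= 0:
        raise ValueError("sync_interval_s and certs_per_sync must be positive")
    syncs_needed = math.ceil(limit / certs_per_sync)
    # The first sync happens at t=0, so only (syncs_needed - 1) intervals elapse.
    return (syncs_needed - 1) * sync_interval_s / 60
```

Under these assumptions, a 3-minute sync interval that recreates 5 Certificates per sync reaches the limit after 27 minutes.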

How to Fix It

Step 0: Stop the Bleeding (Immediate)

Stop the retry loop first. Identify the failing Certificates, delete the stuck Order, and pause any GitOps sync that keeps recreating the resource:

# Identify all failing Certificate resources
kubectl get certificates -A | grep -v True

# Delete the stuck Order to clear the failed attempt (cert-manager may create a new Order when it next retries, so also pause whatever keeps re-triggering issuance)
kubectl delete order -n <namespace> <order-name>

# If ArgoCD/Flux is re-creating it, suspend the Application temporarily
argocd app patch <app-name> --patch '{"spec":{"syncPolicy":null}}' --type merge
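
Filtering with grep depends on column layout, so a slightly more robust variant is sketched below. It assumes you feed it the output of `kubectl get certificates -A -o json`; the function name is hypothetical.

```python
import json


def not_ready_certificates(cert_list_json: str) -> list:
    """Return (namespace, name, message) for every Certificate whose Ready
    condition is missing or has a status other than "True"."""
    failing = []
    for item in json.loads(cert_list_json).get("items", []):
        meta = item.get("metadata", {})
        conditions = item.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            message = ready.get("message", "") if ready else "no Ready condition yet"
            failing.append((meta.get("namespace", ""), meta.get("name", ""), message))
    return failing
```

The message field usually carries the ACME error text, which makes it easy to separate rate-limited Certificates from ones failing for other reasons.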

Step 1: Check Remaining Rate Limit Quota

Estimate how much of the weekly quota is already used from Certificate Transparency logs (for example via crt.sh) before attempting ANY re-issuance:

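# Estimate certs issued for your domain in the past 7 days via crt.sh
# (%25 is the URL-encoded % wildcard; unique serials avoid counting a precertificate and its certificate twice)
curl -s "https://crt.sh/?q=%25.example.com&output=json" | \
  jq '[.[] | select(.not_before > ((now - 604800) | todate)) | .serial_number] | unique | length'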

If the count is ≥ 50, you must wait. The window is rolling 7 days.
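
If you must wait, it helps to know roughly when. The sketch below assumes a simple rolling 7-day window and uses `not_before` as an approximation of issuance time. The entry shape (`serial_number`, `not_before`) follows the crt.sh JSON output, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WINDOW = timedelta(days=7)
WEEKLY_CERT_LIMIT = 50  # per registered domain, as cited above


def _parse_utc(ts: str) -> datetime:
    # Timestamps such as "2024-06-10T14:32:01" carry no zone; treat them as UTC.
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)


def quota_status(entries: list, now: datetime,
                 limit: int = WEEKLY_CERT_LIMIT) -> tuple:
    """Return (certs issued in the last 7 days, time the next slot frees up).

    Entries are de-duplicated by serial number. The second value is None
    while the count is still below `limit`.
    """
    issued = {}
    for entry in entries:
        not_before = _parse_utc(entry["not_before"])
        if now - WINDOW < not_before <= now:
            issued[entry["serial_number"]] = not_before
    count = len(issued)
    if count < limit:
        return count, None
    # One slot frees up once the (count - limit + 1) oldest issuances age out.
    oldest_first = sorted(issued.values())
    next_free: Optional[datetime] = oldest_first[count - limit] + WINDOW
    return count, next_free
```

Treat the result as an estimate only: CT logs can lag, and the window model is a simplification of how Let's Encrypt enforces the limit.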

Step 2: Switch to Staging for Diagnosis (Basic Fix)

 apiVersion: cert-manager.io/v1
 kind: ClusterIssuer
 metadata:
-  name: letsencrypt-prod
+  name: letsencrypt-staging
 spec:
   acme:
-    server: https://acme-v02.api.letsencrypt.org/directory
+    server: https://acme-staging-v02.api.letsencrypt.org/directory
     email: <your-email>
     privateKeySecretRef:
-      name: letsencrypt-prod-account-key
+      name: letsencrypt-staging-account-key
     solvers:
     - http01:
         ingress:
           class: nginx

Update your Certificate resources to reference letsencrypt-staging. Staging has much higher rate limits and follows the same ACME flow, but its certificates chain to an untrusted test root, so browsers will reject them. Validate the full ACME flow here before switching back to production.
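
If you keep issuers as generated manifests, the same switch can be scripted. The sketch below works on an already-parsed manifest (a plain dict) and mirrors the diff above. The function name is hypothetical, and loading or dumping YAML is left to whatever tooling you already use.

```python
import copy

PROD_URL = "https://acme-v02.api.letsencrypt.org/directory"
STAGING_URL = "https://acme-staging-v02.api.letsencrypt.org/directory"


def to_staging(issuer: dict) -> dict:
    """Return a copy of a production ClusterIssuer manifest retargeted at staging."""
    acme = issuer.get("spec", {}).get("acme", {})
    if acme.get("server") != PROD_URL:
        raise ValueError("issuer does not point at the Let's Encrypt production endpoint")
    staged = copy.deepcopy(issuer)  # never mutate the production manifest
    staged.setdefault("metadata", {})["name"] = "letsencrypt-staging"
    staged["spec"]["acme"]["server"] = STAGING_URL
    # Staging and production ACME accounts are separate, so use a separate Secret.
    staged["spec"]["acme"]["privateKeySecretRef"] = {"name": "letsencrypt-staging-account-key"}
    return staged
```

Refusing to convert anything that is not the production endpoint is a deliberate guard against running the helper twice or against the wrong issuer.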

Step 3: Fix the Root Cause — Consolidate SANs (Enterprise Best Practice)

A common cause of rate limit exhaustion is issuing one Certificate per subdomain instead of a single certificate with multiple SANs.

 apiVersion: cert-manager.io/v1
 kind: Certificate
 metadata:
   name: example-com-tls
   namespace: production
 spec:
   secretName: example-com-tls
   issuerRef:
     name: letsencrypt-prod
     kind: ClusterIssuer
   dnsNames:
-    - api.example.com
-# (separate Certificate objects for each subdomain — burns rate limit)
+    - example.com
+    - api.example.com
+    - app.example.com
+    - dashboard.example.com
+    # Consolidate ALL subdomains into one Certificate = one issuance
   renewBefore: 360h # Renew 15 days before expiry, not on every deploy
+  duration: 2160h   # 90 days — explicit, prevents accidental short-lived cert loops
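
When many per-subdomain Certificates already exist, the merged dnsNames list can be derived rather than typed by hand. A minimal sketch, assuming the manifests are already parsed into dicts; the function name is hypothetical, and the 100-name cap reflects Let's Encrypt's documented per-certificate limit.

```python
MAX_NAMES_PER_CERT = 100  # Let's Encrypt's per-certificate name cap


def consolidate_dns_names(certificates: list) -> list:
    """Merge the dnsNames of several Certificate manifests into one
    de-duplicated list (first-seen order) for a single SAN certificate."""
    seen = set()
    merged = []
    for cert in certificates:
        for name in cert.get("spec", {}).get("dnsNames", []):
            key = name.lower().rstrip(".")  # DNS names are case-insensitive
            if key not in seen:
                seen.add(key)
                merged.append(key)
    if len(merged) > MAX_NAMES_PER_CERT:
        raise ValueError(f"{len(merged)} names exceed the {MAX_NAMES_PER_CERT}-name cap")
    return merged
```

One trade-off to keep in mind: with a single SAN certificate, one hostname that fails validation blocks issuance for all of the names on the order.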

Step 4: Fix HTTP-01 Solver for Ingress Class Mismatch

If using HTTP-01 and validation keeps failing (burning the 5/hour failed validation limit):

 solvers:
 - http01:
     ingress:
-      class: nginx
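+      ingressClassName: nginx  # supported by the HTTP-01 solver since cert-manager v1.12; sets spec.ingressClassName
+      # OR, for annotation-based ingress controllers, use ingressTemplate INSTEAD of
+      # ingressClassName (class, ingressClassName and name are mutually exclusive):
+      # ingressTemplate:
+      #   metadata:
+      #     annotations:
+      #       kubernetes.io/ingress.class: "nginx"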

Verify the challenge pod is reachable:

# List pending challenges, then test the HTTP-01 path manually
kubectl get challenges -A
CHALLENGE_TOKEN=$(kubectl get challenge -n <ns> <name> -o jsonpath='{.spec.token}')
# The response body should equal the Challenge's .spec.key
curl -v "http://api.example.com/.well-known/acme-challenge/${CHALLENGE_TOKEN}"
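
The same self-check can be automated so it runs before every production attempt. This is a sketch using only the Python standard library; the function names are hypothetical, and `expected_key` is assumed to come from the Challenge's `.spec.key`.

```python
import urllib.request


def challenge_url(hostname: str, token: str) -> str:
    """Build the HTTP-01 URL that the ACME server fetches over plain HTTP."""
    return f"http://{hostname}/.well-known/acme-challenge/{token}"


def response_matches(body: bytes, expected_key: str) -> bool:
    """The solver must answer with exactly the key authorization string."""
    return body.decode("utf-8", errors="replace").strip() == expected_key


def self_check(hostname: str, token: str, expected_key: str,
               timeout: float = 10.0) -> bool:
    """Fetch the challenge URL and compare the body with the expected key.
    urlopen raises on 4xx/5xx, which callers can treat as a failed check."""
    with urllib.request.urlopen(challenge_url(hostname, token), timeout=timeout) as resp:
        return resp.status == 200 and response_matches(resp.read(), expected_key)
```

Run it from outside the cluster if possible, since a check from inside can succeed while the public path (DNS, load balancer, firewall) is still broken.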



Prevention in CI/CD

1. OPA/Gatekeeper Policy — Enforce SAN Consolidation

Block Certificate resources with fewer than 2 DNS names to force consolidation:

package certmanager.rateguard

deny[msg] {
  input.apiVersion == "cert-manager.io/v1"
  input.kind == "Certificate"
  count(input.spec.dnsNames) == 1
  not startswith(input.spec.dnsNames[0], "*.")  # allow explicit wildcards
  msg := sprintf(
    "Certificate '%v' has only 1 dnsName. Consolidate subdomains to prevent LE rate limits.",
    [input.metadata.name]
  )
}

2. Enforce Staging Issuer in Non-Production Namespaces

# Same package as rule 1 (certmanager.rateguard)
deny[msg] {
  input.apiVersion == "cert-manager.io/v1"
  input.kind == "Certificate"
  input.spec.issuerRef.name == "letsencrypt-prod"
  namespace := input.metadata.namespace
  namespace != "production"
  msg := sprintf("Namespace '%v' must use letsencrypt-staging, not letsencrypt-prod.", [namespace])
}
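
For a quick local check without OPA installed, the two rules can be approximated in Python. This is a rough mirror, not a replacement for the Rego policies. The function name is hypothetical, and unlike the Rego rule it treats a missing namespace as non-production.

```python
def deny(manifest: dict, prod_namespaces: frozenset = frozenset({"production"})) -> list:
    """Return violation messages for one parsed manifest (empty list = pass)."""
    if manifest.get("apiVersion") != "cert-manager.io/v1" or manifest.get("kind") != "Certificate":
        return []
    messages = []
    meta = manifest.get("metadata", {})
    spec = manifest.get("spec", {})
    dns_names = spec.get("dnsNames", [])
    # Rule 1: a single non-wildcard name suggests one Certificate per subdomain.
    if len(dns_names) == 1 and not dns_names[0].startswith("*."):
        messages.append(
            f"Certificate '{meta.get('name', '')}' has only 1 dnsName. "
            "Consolidate subdomains to prevent LE rate limits."
        )
    # Rule 2: only production namespaces may use the production issuer.
    namespace = meta.get("namespace", "")
    if spec.get("issuerRef", {}).get("name") == "letsencrypt-prod" and namespace not in prod_namespaces:
        messages.append(f"Namespace '{namespace}' must use letsencrypt-staging, not letsencrypt-prod.")
    return messages
```

Keeping the messages identical to the Rego versions makes it easier to compare results when both checks run side by side.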

3. Helm/Kustomize Lint with Conftest in CI

# .github/workflows/cert-lint.yaml
- name: Lint cert-manager manifests
  run: |
    # --namespace selects the Rego package to evaluate, not a Kubernetes namespace
    helm template ./charts/app | conftest test - \
      --policy ./policies/certmanager/ \
      --namespace certmanager.rateguard

4. Alerting — Catch Rate Limits Before They Hit

# Prometheus alert on cert-manager order failures
- alert: CertManagerOrderFailureSpike
  expr: increase(certmanager_http_acme_client_request_count{status="429"}[10m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Let's Encrypt rate limit hit — cert issuance blocked"
    runbook: "https://your-wiki/cert-manager-rate-limit-runbook"

5. Never Delete Certificate Secrets in GitOps

A frequent cause of rate limit exhaustion in GitOps pipelines is pruning the TLS Secret that cert-manager manages, forcing a full re-issuance on every sync. Add these annotations to the Secret (cert-manager can apply them for you through the Certificate's spec.secretTemplate.annotations):

apiVersion: v1
kind: Secret
metadata:
  name: example-com-tls
  annotations:
    argocd.argoproj.io/sync-options: Prune=false  # tells Argo CD not to prune this resource
    helm.sh/resource-policy: keep
