Fixing Cert-Manager 'Certificate Request Failed' Due to Let's Encrypt Rate Limits
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–60 mins (rate limit window may force a 1-hour to 7-day wait)
TL;DR
- What broke: Cert-Manager is repeatedly submitting ACME certificate orders to Let's Encrypt, exhausting the rate limit (5 failed validations/hostname/hour, 50 certs/registered domain/week). All new CertificateRequest objects are rejected with 429 Too Many Requests.
- How to fix it: Stop the bleeding immediately by patching the failing Certificate resource to suspend retries, diagnose the root cause (wrong solver, missing DNS record, misconfigured ingress), switch to the Let's Encrypt Staging endpoint to test, then re-issue against production once validated.
- Use our Client-Side Sandbox below to paste your failing Certificate or ClusterIssuer YAML and auto-refactor it without leaking your domain or ACME account key.
The Incident (What Does the Error Mean?)
Raw error from kubectl describe certificaterequest <name> -n <namespace>:
Status:
  Conditions:
    Message:  Failed to wait for order resource "example-com-xyz" to become ready:
              order is in "invalid" state:
              Failed to finalize order: 429 urn:ietf:params:acme:error:rateLimited ::
              Error creating new order :: too many certificates already issued for
              registered domain "example.com": see https://letsencrypt.org/docs/rate-limits/
    Reason:   Failed
    Type:     Ready False
And from cert-manager controller logs (kubectl logs -n cert-manager deploy/cert-manager):
E0610 14:32:01.123456 1 sync.go:182] cert-manager/controller/orders
"msg"="Failed to determine the list of Challenge resources needed for the Order"
"error"="429: too many certificates already issued for: example.com"
Immediate consequence: Every Certificate resource targeting the affected registered domain is stuck with Ready=False. Ingress TLS termination fails for new deployments. Existing certificates continue serving until expiry, but any new issuance attempt for the domain can keep failing for up to 7 days (the weekly issuance window).
The Attack Vector / Blast Radius
This is not a security exploit — it is a self-inflicted denial-of-service against your own certificate pipeline. The blast radius cascades as follows:
- Helm chart re-deploys or GitOps reconciliation loops (ArgoCD, Flux) repeatedly delete and recreate Certificate objects, each triggering a new ACME Order. 50 certs/registered domain/week is consumed in minutes.
- Wildcard vs. SAN misconfiguration causes cert-manager to request *.example.com AND example.com as separate orders instead of a single SAN certificate, doubling consumption.
- HTTP-01 solver misconfiguration (wrong ingress class, missing /.well-known/acme-challenge/ route) causes every validation attempt to fail. Let's Encrypt's 5 failed validations per hostname per hour limit triggers independently and stacks on top of the issuance limit.
- Once rate-limited, all subdomains under the registered domain (api.example.com, app.example.com) are blocked, not just the offending hostname.
- Staging environments sharing the same registered domain as production will consume the same production rate limit quota if pointed at the production ACME endpoint.
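To put a number on "consumed in minutes", here is a rough back-of-the-envelope sketch. It assumes every reconciliation recreates some Certificates and that each recreation results in one issued certificate; the sync interval and per-sync count are hypothetical inputs, not values from any real cluster.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope estimate: how long does a recreate loop take to use up
# the weekly issuance limit? Inputs are illustrative assumptions.

WEEKLY_CERT_LIMIT=50  # certs per registered domain per week, as quoted above

# minutes_to_exhaust <sync_interval_minutes> <certs_issued_per_sync>
minutes_to_exhaust() {
  local interval_min=$1 certs_per_sync=$2
  # Syncs needed to reach the limit, rounded up.
  local syncs=$(( (WEEKLY_CERT_LIMIT + certs_per_sync - 1) / certs_per_sync ))
  echo $(( syncs * interval_min ))
}

# Example: a 3-minute sync loop that recreates 5 Certificates each time.
minutes_to_exhaust 3 5   # 10 syncs x 3 min, prints 30
```

Under these assumptions the weekly quota is gone in about half an hour, after which every hostname under the registered domain is blocked.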
How to Fix It
Step 0: Stop the Bleeding (Immediate)
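Stop cert-manager from retrying immediately. Identify the failing Certificate resources, delete the stuck Order, and pause any GitOps controller that keeps re-creating them:
# Identify all failing Certificate resources
kubectl get certificates -A | grep -v True
# Delete the stuck Order to interrupt the current retry loop
# (cert-manager can create a new Order on its next retry, so fix the root cause before then)
kubectl delete order -n <namespace> <order-name>
# If ArgoCD is re-creating it, disable automated sync on the Application temporarily
argocd app patch <app-name> --patch '{"spec":{"syncPolicy":null}}' --type merge
# If Flux is re-creating it, suspend the Kustomization instead
flux suspend kustomization <kustomization-name> -n <flux-namespace>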
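The grep -v True filter above also drops any row that merely contains the string "True" somewhere else, for example in a resource name. A slightly stricter sketch, assuming the default kubectl column layout (NAMESPACE, NAME, READY, SECRET, AGE), compares only the READY column. The function name and the sample rows are hypothetical.

```shell
#!/usr/bin/env bash
# not_ready_certs: filter `kubectl get certificates -A` output (read on stdin)
# down to the header plus rows whose READY column (3rd field) is not "True".
not_ready_certs() {
  awk 'NR == 1 || $3 != "True"'
}

# Sample input standing in for real kubectl output.
printf '%s\n' \
  'NAMESPACE   NAME              READY   SECRET            AGE' \
  'production  example-com-tls   False   example-com-tls   3d' \
  'production  api-tls           True    api-tls           9d' |
  not_ready_certs
```

Against a live cluster the same filter would be used as `kubectl get certificates -A | not_ready_certs`.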
Step 1: Check Remaining Rate Limit Quota
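Estimate how much of the weekly quota is already used before attempting ANY re-issuance. One way is to count recent certificates in Certificate Transparency logs via crt.sh:
# Count distinct Let's Encrypt certs logged for your domain in the past 7 days via crt.sh.
# %25 is the URL-encoded % wildcard; a precertificate and its final cert share one serial number,
# so de-duplicating on serial_number avoids counting each issuance twice.
curl -s "https://crt.sh/?q=%25.example.com&output=json" | \
  jq '[.[] | select(.issuer_name | test("Let.s Encrypt")) | select(.not_before > (now - 604800 | todate)) | .serial_number] | unique | length'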
If the count is ≥ 50, you must wait. The window is rolling 7 days.
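If the quota is exhausted, the next question is how long to wait. The sketch below assumes a strict rolling window as described above (Let's Encrypt's real accounting may differ) and takes issuance times as epoch seconds on stdin. The function name and the toy numbers are hypothetical.

```shell
#!/usr/bin/env bash
# next_slot_epoch <now_epoch> [limit] [window_seconds]
# Reads issuance times (epoch seconds, one per line) on stdin and prints the
# epoch time at which one more issuance would fit under the limit, assuming a
# strict rolling window.
next_slot_epoch() {
  local now=$1 limit=${2:-50} window=${3:-604800}
  local cutoff=$(( now - window ))
  local in_window count nth ts
  # Keep only issuances still inside the window, oldest first.
  in_window=$(awk -v c="$cutoff" '$1 > c' | sort -n)
  count=$(printf '%s' "$in_window" | awk 'NF { n++ } END { print n + 0 }')
  if [ "$count" -lt "$limit" ]; then
    echo "$now"   # quota is available right now
    return 0
  fi
  # The (count - limit + 1)-th oldest issuance has to age out first.
  nth=$(( count - limit + 1 ))
  ts=$(printf '%s\n' "$in_window" | sed -n "${nth}p")
  echo $(( ts + window ))
}

# Toy example with limit=2: 300000 is already outside the window, the other
# three are inside, so two of them must expire before a slot frees up.
printf '%s\n' 300000 400000 500000 600000 | next_slot_epoch 1000000 2   # prints 1104800
```

With real data, the timestamps would be the not_before values from the crt.sh query converted to epoch seconds, and the defaults (50 certificates, 604800 seconds) match the limits quoted in this article.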
Step 2: Switch to Staging for Diagnosis (Basic Fix)
 apiVersion: cert-manager.io/v1
 kind: ClusterIssuer
 metadata:
-  name: letsencrypt-prod
+  name: letsencrypt-staging
 spec:
   acme:
-    server: https://acme-v02.api.letsencrypt.org/directory
+    server: https://acme-staging-v02.api.letsencrypt.org/directory
     email: <your-email-address>
     privateKeySecretRef:
-      name: letsencrypt-prod-account-key
+      name: letsencrypt-staging-account-key
     solvers:
     - http01:
         ingress:
           class: nginx
Update your Certificate resources to reference letsencrypt-staging. Staging follows the same ACME flow with much higher rate limits, but its certificates chain to an untrusted test root, so browsers will reject them. Validate the full ACME flow here before switching back to production.
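Before re-running any issuance, it helps to confirm which endpoint an issuer actually points at. The following is a minimal sketch that classifies an ACME directory URL using the two Let's Encrypt URLs shown above; the function name acme_env is hypothetical.

```shell
#!/usr/bin/env bash
# acme_env <directory_url>: classify an ACME directory URL as Let's Encrypt
# staging, Let's Encrypt production, or something else.
acme_env() {
  case "$1" in
    https://acme-staging-v02.api.letsencrypt.org/*) echo staging ;;
    https://acme-v02.api.letsencrypt.org/*)         echo production ;;
    *)                                              echo unknown ;;
  esac
}

acme_env "https://acme-staging-v02.api.letsencrypt.org/directory"   # prints staging
```

On a cluster, the URL could be read with `kubectl get clusterissuer <name> -o jsonpath='{.spec.acme.server}'` and passed to the function, for example to abort a test script when the result is production.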
Step 3: Fix the Root Cause — Consolidate SANs (Enterprise Best Practice)
The most common cause of rate limit exhaustion is issuing one Certificate per subdomain instead of a single cert with multiple SANs.
 apiVersion: cert-manager.io/v1
 kind: Certificate
 metadata:
   name: example-com-tls
   namespace: production
 spec:
   secretName: example-com-tls
   issuerRef:
     name: letsencrypt-prod
     kind: ClusterIssuer
   dnsNames:
-  - api.example.com
-  # (separate Certificate objects for each subdomain: burns rate limit)
+  - example.com
+  - api.example.com
+  - app.example.com
+  - dashboard.example.com
+  # Consolidate ALL subdomains into one Certificate = one issuance
   renewBefore: 360h # Renew 15 days before expiry, not on every deploy
+  duration: 2160h # 90 days, set explicitly to prevent accidental short-lived cert loops
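To find consolidation candidates, one option is to group the hostnames currently requested across all Certificates by registered domain. The sketch below makes a loud simplifying assumption: it treats the last two DNS labels as the registered domain, which is wrong for suffixes such as co.uk (Let's Encrypt uses the Public Suffix List). The function name and sample hostnames are hypothetical.

```shell
#!/usr/bin/env bash
# group_by_registered_domain: read hostnames on stdin (one per line) and print
# "<domain> <count>" per group.
# ASSUMPTION: registered domain = last two labels; not valid for e.g. co.uk.
group_by_registered_domain() {
  awk -F. 'NF >= 2 { d = $(NF-1) "." $NF; n[d]++ }
           END { for (d in n) print d, n[d] }' | sort
}

printf '%s\n' api.example.com app.example.com example.com shop.example.org |
  group_by_registered_domain
# prints:
# example.com 3
# example.org 1
```

The hostnames could come from `kubectl get certificates -A -o json | jq -r '.items[].spec.dnsNames[]'`. Any domain with a high count spread over many single-name Certificates is a candidate for one SAN certificate like the one above.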
Step 4: Fix HTTP-01 Solver for Ingress Class Mismatch
If using HTTP-01 and validation keeps failing (burning the 5/hour failed validation limit):
 solvers:
 - http01:
     ingress:
-      class: nginx
+      ingressClassName: nginx # sets spec.ingressClassName on the solver Ingress (cert-manager v1.12+)
+      # OR, for annotation-based ingress controllers, use ingressTemplate INSTEAD of ingressClassName:
+      # ingressTemplate:
+      #   metadata:
+      #     annotations:
+      #       kubernetes.io/ingress.class: "nginx"
Verify the challenge pod is reachable:
# Find the challenge pod and test the HTTP-01 path manually
kubectl get challenges -A
CHALLENGE_TOKEN=$(kubectl get challenge -n <ns> <name> -o jsonpath='{.spec.token}')
curl -v http://api.example.com/.well-known/acme-challenge/$CHALLENGE_TOKEN
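The manual curl above only shows that something answers on the path. A slightly stronger preflight compares the response body with the value the ACME server expects. This sketch assumes the Challenge's .spec.key field holds that expected HTTP-01 response, as the cert-manager Challenge API describes; the function names and the example token are hypothetical.

```shell
#!/usr/bin/env bash
# challenge_url <host> <token>: build the HTTP-01 validation URL.
challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# check_http01 <host> <token> <expected_key>: fetch the challenge URL and
# compare the body with the expected key authorization.
check_http01() {
  local host=$1 token=$2 expected=$3 body
  # -f fails on HTTP errors, -L follows redirects, --max-time bounds the wait.
  body=$(curl -fsSL --max-time 10 "$(challenge_url "$host" "$token")") || {
    echo "FAIL: no successful response from $host" >&2
    return 1
  }
  if [ "$body" = "$expected" ]; then
    echo "OK: $host serves the expected key"
  else
    echo "FAIL: $host returned an unexpected body" >&2
    return 1
  fi
}

challenge_url api.example.com abc123
# prints http://api.example.com/.well-known/acme-challenge/abc123
```

In practice the token and key would be read from the Challenge with `-o jsonpath='{.spec.token}'` and `-o jsonpath='{.spec.key}'`, and check_http01 would be run from outside the cluster so the request takes the same path as Let's Encrypt's validators.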
Prevention in CI/CD
1. OPA/Gatekeeper Policy — Enforce SAN Consolidation
Block Certificate resources with fewer than 2 DNS names to force consolidation:
package certmanager.rateguard

deny[msg] {
    input.apiVersion == "cert-manager.io/v1"
    input.kind == "Certificate"
    count(input.spec.dnsNames) == 1
    not startswith(input.spec.dnsNames[0], "*.") # allow explicit wildcards
    msg := sprintf(
        "Certificate '%v' has only 1 dnsName. Consolidate subdomains to prevent LE rate limits.",
        [input.metadata.name]
    )
}
2. Enforce Staging Issuer in Non-Production Namespaces
deny[msg] {
    input.kind == "Certificate"
    input.spec.issuerRef.name == "letsencrypt-prod"
    namespace := input.metadata.namespace
    not namespace == "production"
    msg := sprintf("Namespace '%v' must use letsencrypt-staging, not letsencrypt-prod.", [namespace])
}
3. Helm/Kustomize Lint with Conftest in CI
# .github/workflows/cert-lint.yaml
- name: Lint cert-manager manifests
  run: |
    helm template ./charts/app | conftest test - \
      --policy ./policies/certmanager/ \
      --namespace certmanager.rateguard  # Rego package to evaluate, not a Kubernetes namespace
4. Alerting — Catch Rate Limits Before They Hit
# Prometheus alert on cert-manager order failures
- alert: CertManagerOrderFailureSpike
  expr: increase(certmanager_http_acme_client_request_count{status="429"}[10m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Let's Encrypt rate limit hit: cert issuance blocked"
    runbook: "https://your-wiki/cert-manager-rate-limit-runbook"
5. Never Delete Certificate Secrets in GitOps
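A common cause of rate limit exhaustion in GitOps pipelines is pruning the tls Secret that cert-manager manages, forcing a full re-issuance on every sync. Add these annotations to the Secret (because cert-manager creates it, they can be set through spec.secretTemplate.annotations on the Certificate):
apiVersion: v1
kind: Secret
metadata:
  name: example-com-tls
  annotations:
    argocd.argoproj.io/sync-options: Prune=false # tells Argo CD not to prune this resource
    helm.sh/resource-policy: keep # tells Helm not to delete this resource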