Fixing cert-manager CAA Record Validation Failures: ACME Challenge Pending Solved
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15 mins
TL;DR
- What broke: cert-manager's ACME HTTP-01 or DNS-01 challenge is permanently pending because your domain's CAA DNS records do not authorize Let's Encrypt (or whichever CA your Issuer references) to issue certificates.
- How to fix it: Add the correct issue or issuewild CAA record for your CA (e.g., letsencrypt.org) to your DNS zone, or remove the overly restrictive existing CAA record.
- Shortcut: Use our Client-Side Sandbox below to auto-refactor your DNS zone file and cert-manager Certificate manifest.
The Incident (What does the error mean?)
Raw event output from kubectl describe challenge <challenge-name> -n cert-manager:
Status:
Reason: CAA record for "example.com" prevents issuance by this CA
State: pending
Events:
Warning PresentError 4m cert-manager Error presenting challenge: CAA record for "example.com" does not permit issuance by "letsencrypt.org"
Warning Failed 2m cert-manager Accepting challenge authorization failed: acme: authorization error for example.com: 403 urn:ietf:params:acme:error:caa :: CAA record for example.com prevents issuance
Immediate consequence: The Certificate object stays at Ready=False with the Issuing condition set indefinitely. Any Ingress or workload relying on this TLS secret gets no cert. Worse, on a renewal the existing cert expires unrenewed, causing a hard TLS handshake failure for all inbound traffic.
The Attack Vector / Blast Radius
CAA records (RFC 8659) are a DNS-layer CA authorization control. When misconfigured, they silently brick your entire automated cert lifecycle:
- Expired cert cascade: cert-manager will retry on its backoff schedule but never succeed. Your existing secret goes stale. HTTPS breaks site-wide.
- Multi-cluster blast radius: If you use a shared DNS zone across clusters (common in hub-spoke EKS/GKE setups), a single bad CAA record blocks cert issuance across all clusters in that zone.
- Wildcard vs. apex mismatch: issue controls all certs for the name, including wildcards, unless an issuewild record exists; once any issuewild record is present, it alone governs *.example.com. An issuewild record that names a different CA therefore blocks wildcard issuance even when issue is correct, a trap that burns engineers who only fix one.
- Invisible in kubectl: The failure lives in the ACME CA's authorization response, not in Kubernetes events until cert-manager surfaces it. Engineers waste time checking Ingress controllers and network policies first.
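The wildcard vs. apex interaction is easier to reason about as code. The following is a minimal sketch of the RFC 8659 precedence between issue and issuewild, not a full CAA evaluator. The function name caa_permits is hypothetical, and the input is assumed to be one record per line in the form printed by dig CAA <domain> +short, with lowercase property tags.

```shell
# caa_permits RECORDS CA_DOMAIN WILDCARD(yes|no) -> prints "allowed" or "blocked"
# Hypothetical helper; RECORDS is assumed to look like: 0 issue "letsencrypt.org"
caa_permits() (
  records=$1; ca=$2; wildcard=$3
  # Escape dots so the CA domain is matched literally by grep -E.
  ca_re=$(printf '%s' "$ca" | sed 's/\./\\./g')
  # For wildcard names, issuewild records take precedence when any exist.
  if [ "$wildcard" = "yes" ] && printf '%s\n' "$records" | grep -q ' issuewild '; then
    relevant=$(printf '%s\n' "$records" | grep ' issuewild ')
  else
    relevant=$(printf '%s\n' "$records" | grep ' issue ' || true)
  fi
  if [ -z "$relevant" ]; then
    # No applicable issue/issuewild record: CAA does not restrict issuance here.
    echo "allowed"
  elif printf '%s\n' "$relevant" | grep -Eq "\"${ca_re}[\";]"; then
    # The CA domain appears as the record value (parameters may follow a ';').
    echo "allowed"
  else
    echo "blocked"
  fi
)
```

For example, with a record set containing issue "digicert.com" and issuewild "letsencrypt.org", calling caa_permits with letsencrypt.org reports the wildcard as allowed but the apex name as blocked, which is exactly the half-fixed state described above.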
How to Fix It (The Solution)
Basic Fix — Add the Missing CAA Record
Check your current CAA records:
dig CAA example.com +short
# or
nslookup -type=CAA example.com
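If the output lists only a different CA (e.g., digicert.com), you need to add the Let's Encrypt entry. Empty output means there is no CAA record at that exact name, which by itself does not block issuance; the CA then checks each parent domain in turn, so repeat the query for the parents (for app.example.com, also check example.com) to find the record that is actually being enforced.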
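CAs look up CAA at the requested name and then at each parent domain, so the restrictive record may live higher up than the name you queried. A rough sketch of that walk follows. The function names caa_candidates and find_relevant_caa are hypothetical, the walk stops before the TLD as a simplification, and the second function needs dig and network access.

```shell
# Print the name itself and each parent domain, one per line, stopping before the TLD.
caa_candidates() (
  name=${1%.}                               # drop a trailing dot if present
  while [ "${name#*.}" != "$name" ]; do     # loop while the name still contains a dot
    echo "$name"
    name=${name#*.}                         # strip the leftmost label
  done
)

# Query each candidate until one returns CAA records (requires dig and network access).
find_relevant_caa() (
  for candidate in $(caa_candidates "$1"); do
    records=$(dig CAA "$candidate" +short)
    if [ -n "$records" ]; then
      echo "CAA records found at $candidate:"
      echo "$records"
      exit 0                                # leaves only this subshell function
    fi
  done
  echo "No CAA records found for $1 or its parent domains."
)
```

Running find_relevant_caa app.eu.example.com would query app.eu.example.com, then eu.example.com, then example.com, and print the first non-empty answer.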
DNS Zone file (BIND format):
  example.com. 300 IN CAA 0 issue "digicert.com"
+ example.com. 300 IN CAA 0 issue "letsencrypt.org"
+ example.com. 300 IN CAA 0 issuewild "letsencrypt.org"
If you use Route53, Cloudflare, or GCP Cloud DNS — add these via the console or IaC. Do NOT just edit a zone file and assume it propagates.
After updating, force cert-manager to re-attempt:
# Delete the failed Order to trigger a new ACME authorization cycle
kubectl delete order -n <namespace> <order-name>
# cert-manager will recreate it automatically from the CertificateRequest
Enterprise Best Practice — IaC-Managed CAA + Issuer Validation
Terraform (Route53 example):
resource "aws_route53_record" "caa" {
  zone_id = var.hosted_zone_id
  name    = "example.com"
  type    = "CAA"
  ttl     = 300
  records = [
    "0 issue \"digicert.com\"",
+   "0 issue \"letsencrypt.org\"",
+   "0 issuewild \"letsencrypt.org\"",
+   "0 iodef \"mailto:[email protected]\"",
  ]
}
cert-manager Certificate manifest — verify your Issuer matches the CAA-authorized CA:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-tls
  namespace: production
spec:
  secretName: example-tls-secret
  dnsNames:
    - example.com
    - "*.example.com"
  issuerRef:
-   name: selfsigned-issuer
-   kind: Issuer
+   name: letsencrypt-prod
+   kind: ClusterIssuer
ClusterIssuer referencing Let's Encrypt production (DNS-01 with Route53):
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            hostedZoneID: ZXXXXXXXXXXXXX
Prevention in CI/CD
1. Pre-deploy DNS validation in your pipeline:
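#!/bin/bash
# ci/validate-caa.sh: run before any cert-manager Certificate apply
set -euo pipefail
DOMAIN="example.com"
REQUIRED_CA="letsencrypt.org"
CAA_RECORDS=$(dig CAA "$DOMAIN" +short)
# No CAA records at this name means no restriction here (parent domains are not checked by this script).
if [ -z "$CAA_RECORDS" ]; then
  echo "No CAA records found for $DOMAIN; issuance is not restricted at this name."
  exit 0
fi
# Match an issue/issuewild property, not an iodef line that merely mentions the CA.
if ! echo "$CAA_RECORDS" | grep -Eq "issue(wild)? \"$REQUIRED_CA"; then
  echo "FATAL: CAA record for $DOMAIN does not authorize $REQUIRED_CA"
  exit 1
fi
echo "CAA validation passed."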
2. OPA/Gatekeeper policy — block Certificate manifests that reference issuers not in the approved list:
package certmanager.caa
allowed_issuers := {"letsencrypt-prod", "letsencrypt-staging"}
violation[{"msg": msg}] {
  # Gatekeeper passes the object under review as input.review.object
  obj := input.review.object
  obj.kind == "Certificate"
  issuer := obj.spec.issuerRef.name
  not allowed_issuers[issuer]
  msg := sprintf("Certificate issuer '%v' is not in the CAA-authorized issuer list.", [issuer])
}
3. Checkov custom check — scan Terraform Route53 records and fail the plan if no CAA record exists for the zone.
4. Prometheus/Alertmanager rule: alert when certmanager_certificate_expiration_timestamp_seconds - time() drops below 72h (259200 seconds) while certmanager_certificate_ready_status{condition="False"} is 1, to catch silent renewal failures before expiry.
5. DNS TTL discipline: Keep CAA record TTL at 300s (5 min) during active migrations. A 3600s TTL on a wrong CAA record means resolvers can keep serving the stale record for up to an hour after your fix, and issuance stays blocked until it expires.
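To check the TTL currently attached to your CAA records, a small filter over dig's answer section is usually enough. This is a sketch only: caa_max_ttl is a hypothetical helper name, it assumes the standard "name TTL class type rdata" column layout of dig +noall +answer, and a caching resolver reports the remaining TTL rather than the configured one, so query an authoritative nameserver if you need the configured value.

```shell
# Read `dig +noall +answer` output on stdin and print the largest TTL among CAA records.
# Prints 0 when no CAA records are present in the input.
caa_max_ttl() {
  awk '$4 == "CAA" { if ($2 > max) max = $2 } END { print max + 0 }'
}

# Usage (requires dig and network access):
#   dig +noall +answer CAA example.com | caa_max_ttl
```

A CI step could compare the printed value against 300 and warn when a migration is in progress and the TTL is higher.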