Fixing cert-manager CAA Record Validation Failures: ACME Challenge Pending Solved

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15 mins

TL;DR

  • What broke: cert-manager's ACME HTTP-01 or DNS-01 challenge is permanently pending because your domain's CAA DNS records do not authorize Let's Encrypt (or whichever CA your Issuer references) to issue certificates.
  • How to fix it: Add the correct issue or issuewild CAA record for your CA (e.g., letsencrypt.org) to your DNS zone, or remove the overly restrictive existing CAA record.

The Incident (What does the error mean?)

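Raw event output from kubectl describe challenge <challenge-name> -n <namespace> (Challenge resources are created in the Certificate's namespace, not necessarily in cert-manager's own):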

Status:
  Reason: CAA record for "example.com" prevents issuance by this CA
  State:  pending
Events:
  Warning  PresentError  4m   cert-manager  Error presenting challenge: CAA record for "example.com" does not permit issuance by "letsencrypt.org"
  Warning  Failed        2m   cert-manager  Accepting challenge authorization failed: acme: authorization error for example.com: 403 urn:ietf:params:acme:error:caa :: CAA record for example.com prevents issuance

Immediate consequence: The Certificate object stays at Ready=False with Issuing=True indefinitely. Any Ingress or workload relying on this TLS secret gets no cert. Worse, on a renewal the existing cert expires unrenewed, causing a hard TLS handshake failure for all inbound traffic.


The Attack Vector / Blast Radius

CAA records (RFC 8659) are a DNS-layer CA authorization control. When misconfigured, they silently brick your entire automated cert lifecycle:

  • Expired cert cascade: cert-manager will retry on its backoff schedule but never succeed. Your existing secret goes stale. HTTPS breaks site-wide.
  • Multi-cluster blast radius: If you use a shared DNS zone across clusters (common in hub-spoke EKS/GKE setups), a single bad CAA record blocks cert issuance across all clusters in that zone.
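  • Wildcard vs. apex mismatch: issue governs non-wildcard names; issuewild governs wildcard names such as *.example.com. When no issuewild record exists, the issue records also apply to wildcards, but as soon as any issuewild record is present it alone decides wildcard issuance. A zone with issue "letsencrypt.org" plus a leftover issuewild "digicert.com" therefore issues example.com fine and blocks *.example.com, a trap that burns engineers who only fix one.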
  • Hard to spot with kubectl: The failure originates in the ACME CA's authorization response and shows up only on the Challenge and Order resources once cert-manager surfaces it; Ingress and Pod events say nothing about CAA. Engineers waste time checking Ingress controllers and network policies first.
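
The issue/issuewild interaction is easy to misread, so here is a minimal sketch of the decision as RFC 8659 describes it. The function name caa_allows and its arguments are hypothetical. It assumes input in the format printed by dig +short CAA, and it ignores critical flags and CAA parameters such as validationmethods.

```shell
#!/usr/bin/env bash
# caa_allows CA KIND  (hypothetical helper)
# Reads `dig +short CAA` style lines on stdin, e.g.: 0 issue "letsencrypt.org"
# KIND is "wildcard" or "plain". Returns 0 if CA may issue, 1 otherwise.
caa_allows() {
  local ca="$1" kind="$2" records issue wild relevant
  records=$(cat)
  issue=$(awk 'tolower($2) == "issue"' <<<"$records")
  wild=$(awk 'tolower($2) == "issuewild"' <<<"$records")

  if [[ "$kind" == "wildcard" && -n "$wild" ]]; then
    relevant="$wild"     # any issuewild record overrides issue for wildcard names
  else
    relevant="$issue"    # otherwise issue records decide, for wildcards too
  fi

  # No relevant property in the record set: nothing restricts issuance here.
  [[ -z "$relevant" ]] && return 0

  # Compare the issuer domain (quotes and ";parameters" stripped) with CA.
  awk -v ca="$ca" '
    { v = $3; gsub(/"/, "", v); sub(/;.*/, "", v)
      if (tolower(v) == tolower(ca)) found = 1 }
    END { exit !found }' <<<"$relevant"
}
```

Assuming dig is installed, a quick check for a wildcard certificate could look like: dig +short CAA example.com | caa_allows letsencrypt.org wildcard && echo "wildcard issuance allowed".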

How to Fix It (The Solution)

Basic Fix — Add the Missing CAA Record

Check your current CAA records:

dig CAA example.com +short
# or
nslookup -type=CAA example.com

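If the output lists only a different CA (e.g., digicert.com), you need to add the Let's Encrypt entry. If the output is empty, this name has no CAA restriction of its own: the CA then checks each parent domain in turn, so run the same query against the parents (for app.example.com, check example.com). Only when no CAA record set exists anywhere up the tree may any CA issue.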
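
Because a CA climbs from the requested name toward the root until it finds a CAA record set, the record that blocks you may sit on a parent domain. The sketch below automates that climb. The function names caa_lookup_names and find_relevant_caa are hypothetical, and the sketch does not handle CNAME answers, which dig +short would print in place of CAA data.

```shell
#!/usr/bin/env bash
# Print the name itself, then each parent, in the order a CA checks them.
caa_lookup_names() {
  local name="${1%.}"      # drop a trailing dot if present
  name="${name#\*.}"       # a wildcard name is checked from its base domain
  while [[ -n "$name" ]]; do
    echo "$name"
    [[ "$name" == *.* ]] || break
    name="${name#*.}"      # strip the leftmost label
  done
}

# Print the first non-empty CAA record set found while climbing,
# preceded by a comment line naming where it was found.
find_relevant_caa() {
  local name records
  for name in $(caa_lookup_names "$1"); do
    records=$(dig +short CAA "$name")
    if [[ -n "$records" ]]; then
      echo "# CAA set found at: $name"
      echo "$records"
      return 0
    fi
  done
  echo "# no CAA records at $1 or any parent: issuance unrestricted"
}
```

For example, find_relevant_caa app.eu.example.com queries app.eu.example.com, eu.example.com, example.com, and com, and stops at the first name that returns records.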

DNS Zone file (BIND format):

  example.com.  300  IN  CAA  0 issue "digicert.com"
+ example.com.  300  IN  CAA  0 issue "letsencrypt.org"
+ example.com.  300  IN  CAA  0 issuewild "letsencrypt.org"

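If your zone is hosted on Route53, Cloudflare, or GCP Cloud DNS, add these records via the console or IaC; editing a local zone file changes nothing at the provider. Whatever the provider, do NOT assume the change has propagated: confirm that the authoritative nameservers serve the new records before retrying.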
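
One way to check propagation is to bypass caching resolvers and ask each authoritative nameserver directly. The following is a minimal sketch under that assumption. The function name caa_propagated is hypothetical, and the match is a simple substring test on the quoted CA domain, not a full CAA parse.

```shell
#!/usr/bin/env bash
# caa_propagated DOMAIN CA  (hypothetical helper)
# Asks every authoritative nameserver of DOMAIN for its CAA records.
# Returns 0 if all of them serve a value for CA, 1 if any does not,
# 2 if DOMAIN has no NS records (it is probably not the zone apex).
caa_propagated() {
  local domain="$1" ca="$2" ns seen=0 missing=0
  for ns in $(dig +short NS "$domain"); do
    seen=1
    if dig +short CAA "$domain" "@${ns}" | grep -Fq "\"${ca}"; then
      echo "ok       $ns"
    else
      echo "MISSING  $ns"
      missing=1
    fi
  done
  if [[ "$seen" -eq 0 ]]; then
    echo "no NS records for $domain" >&2
    return 2
  fi
  return "$missing"
}
```

Run it as caa_propagated example.com letsencrypt.org and only delete the failed Order once every nameserver reports ok.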

After updating, force cert-manager to re-attempt:

# Delete the failed Order to trigger a new ACME authorization cycle
kubectl delete order -n <namespace> <order-name>
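# cert-manager will recreate it automatically from the CertificateRequest.
# If the CertificateRequest itself is already marked Failed, request a fresh issuance instead:
cmctl renew -n <namespace> <certificate-name>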

Enterprise Best Practice — IaC-Managed CAA + Issuer Validation

Terraform (Route53 example):

 resource "aws_route53_record" "caa" {
   zone_id = var.hosted_zone_id
   name    = "example.com"
   type    = "CAA"
   ttl     = 300
   records = [
     "0 issue \"digicert.com\"",
+    "0 issue \"letsencrypt.org\"",
+    "0 issuewild \"letsencrypt.org\"",
+    "0 iodef \"mailto:[email protected]\"",
   ]
 }

cert-manager Certificate manifest — verify your Issuer matches the CAA-authorized CA:

 apiVersion: cert-manager.io/v1
 kind: Certificate
 metadata:
   name: example-tls
   namespace: production
 spec:
   secretName: example-tls-secret
   dnsNames:
     - example.com
     - "*.example.com"
   issuerRef:
-    name: selfsigned-issuer
-    kind: Issuer
+    name: letsencrypt-prod
+    kind: ClusterIssuer

ClusterIssuer referencing Let's Encrypt production (DNS-01 with Route53):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          route53:
            region: us-east-1
            hostedZoneID: ZXXXXXXXXXXXXX


Prevention in CI/CD

1. Pre-deploy DNS validation in your pipeline:

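#!/bin/bash
# ci/validate-caa.sh: run before any cert-manager Certificate apply
set -euo pipefail
DOMAIN="example.com"
REQUIRED_CA="letsencrypt.org"

CAA_RECORDS=$(dig CAA "$DOMAIN" +short)
# An empty answer means no CAA restriction at this name (parent domains are not checked here).
if [[ -z "$CAA_RECORDS" ]]; then
  echo "No CAA records found for $DOMAIN; issuance is not restricted at this name."
  exit 0
fi
# Match only issue/issuewild values, not iodef or other properties.
if ! grep -Eq "issue(wild)? \"${REQUIRED_CA}[\"; ]" <<<"$CAA_RECORDS"; then
  echo "FATAL: CAA record for $DOMAIN does not authorize $REQUIRED_CA" >&2
  exit 1
fi
echo "CAA validation passed."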

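2. OPA/Gatekeeper policy: block Certificate manifests that reference issuers not in the approved list. As written, the rule evaluates a raw manifest as input (for example with conftest); inside a Gatekeeper ConstraintTemplate the object is exposed as input.review.object, so the paths need that prefix: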

package certmanager.caa

allowed_issuers := {"letsencrypt-prod", "letsencrypt-staging"}

violation[{"msg": msg}] {
  input.kind == "Certificate"
  issuer := input.spec.issuerRef.name
  not allowed_issuers[issuer]
  msg := sprintf("Certificate issuer '%v' is not in the CAA-authorized issuer list.", [issuer])
}

3. Checkov custom check — scan Terraform Route53 records and fail the plan if no CAA record exists for the zone.
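
A real Checkov custom check is written against Checkov's Python API, which is not reproduced here. As a rough stand-in for the same gate, the sketch below greps the Terraform sources for any record declared with type "CAA". The function name has_caa_record is hypothetical, and a text match like this cannot tell which zone a record belongs to.

```shell
#!/usr/bin/env bash
# has_caa_record DIR  (hypothetical helper, not a Checkov check)
# Succeeds if any *.tf file under DIR declares a record with type = "CAA".
has_caa_record() {
  local dir="$1"
  grep -rEq --include='*.tf' 'type[[:space:]]*=[[:space:]]*"CAA"' "$dir"
}
```

In a pipeline step this could be used as: has_caa_record ./terraform || { echo "FATAL: no CAA record defined in Terraform"; exit 1; }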

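4. Prometheus alerting rule (routed through Alertmanager): alert when certmanager_certificate_expiration_timestamp_seconds - time() drops below 72h (259200 s) while certmanager_certificate_ready_status{condition="False"} is 1, to catch silent renewal failures before expiry.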

5. DNS TTL discipline: Keep CAA record TTL at 300s (5 min) during active migrations. A 3600s TTL on a wrong CAA record means resolvers can keep serving the stale record for up to an hour, so issuance stays blocked that long after your fix.
