
Fixing Linkerd 'Proxy Injection Failed' Identity Trust Domain Mismatch in Production

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: The identityTrustDomain set in the Linkerd control plane (e.g., cluster.local) does not match the SPIFFE trust domain embedded in the identity issuer certificate (issuer.crt; ca.crt is the trust anchor), causing the proxy injector webhook to reject all annotated pods.
  • How to fix it: Re-align the identityTrustDomain Helm value (--identity-trust-domain when installing with the linkerd CLI) with the SAN URI in your issuer cert, then re-roll affected workloads. If the cert was generated with the wrong domain, you must rotate it.

The Incident (What Does the Error Mean?)

Raw error from kubectl describe pod or the injector webhook logs:

Error from server: error when creating "deploy.yaml":
admission webhook "linkerd-proxy-injector.linkerd.io" denied the request:
proxy injection failed: identity trust domain mismatch:
  issuer has domain "prod.example.com" but control plane expects "cluster.local"

Or from linkerd check:

× issuer cert is signed by the trust anchor
    issuer certificate is not signed by any of the trust anchors
    see https://linkerd.io/2/checks/#l5d-identity-issuer-cert-signed-by-trust-anchor

Immediate consequence: The proxy injector webhook, a mutating admission controller, rejects pod creation in every namespace with linkerd.io/inject: enabled, so your deployment rolls out zero replicas. In an existing cluster mid-upgrade, running pods lose the ability to renew their SPIFFE SVIDs, causing mTLS session failures within the SVID TTL window (default: 24h).


The Attack Vector / Blast Radius

This isn't just an ops nuisance — it's a mesh-wide identity collapse.

Why it's dangerous:

  1. mTLS falls back silently in some configurations. If proxy.defaultInboundPolicy is set to all-unauthenticated, workloads that were injected before the mismatch continue running but with no mutual TLS. An attacker with network access to the pod CIDR can intercept east-west traffic in plaintext.

  2. SPIFFE SVID chaining breaks. Linkerd's identity service issues X.509 SVIDs scoped to the trust domain (e.g., spiffe://cluster.local/ns/default/sa/myapp). A mismatched domain means the proxy cannot validate peer certificates against the trust bundle — all peer authentication policies evaluate to DENY or skip, depending on your Server and AuthorizationPolicy resources.

  3. Blast radius on upgrade: This most commonly surfaces during linkerd upgrade when a custom CA was generated with a hardcoded domain and the new Helm chart defaults differ. Every namespace with injection enabled is simultaneously affected. Rollback requires cert rotation, not just a Helm rollback.

  4. Audit gap: Because pods fail at admission, no workload logs are generated — the failure is invisible to application-level alerting. Only webhook audit logs or linkerd check catches it.
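
Item 2 depends on the trust domain being the host part of every SVID's SPIFFE ID. The following is a minimal sketch of that relationship, assuming the spiffe://<trust-domain>/ns/<namespace>/sa/<service-account> layout used in the example above. The names SpiffeId, parse_spiffe_id, and same_trust_domain are hypothetical helpers for illustration, not part of Linkerd.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class SpiffeId(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(uri: str) -> SpiffeId:
    """Split a SPIFFE ID of the assumed form spiffe://<td>/ns/<ns>/sa/<sa>."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    segments = parts.path.strip("/").split("/")
    if len(segments) != 4 or segments[0] != "ns" or segments[2] != "sa":
        raise ValueError(f"unexpected SPIFFE path layout: {parts.path!r}")
    return SpiffeId(parts.netloc, segments[1], segments[3])


def same_trust_domain(a: str, b: str) -> bool:
    """Two peers can only chain to the same trust bundle if their domains match."""
    return parse_spiffe_id(a).trust_domain == parse_spiffe_id(b).trust_domain
```

Two identities that differ only in their trust domain, such as spiffe://cluster.local/ns/default/sa/myapp and spiffe://prod.example.com/ns/default/sa/myapp, are treated as unrelated by this check, which mirrors why peer validation fails after a mismatch.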


How to Fix It

Step 1: Confirm the Mismatch

# Extract the trust domain the control plane expects
# (data.values is a YAML document, so grep for the key instead of parsing it as JSON)
kubectl -n linkerd get cm linkerd-config -o jsonpath='{.data.values}' | \
  grep '^identityTrustDomain:'

# Extract the trust domain burned into the issuer cert
# (with identity.issuer.scheme: kubernetes.io/tls the Secret key is tls\.crt, not crt\.pem)
kubectl -n linkerd get secret linkerd-identity-issuer \
  -o jsonpath='{.data.crt\.pem}' | base64 -d | \
  openssl x509 -noout -text | grep -A1 "Subject Alternative Name"

You will see something like:

# ConfigMap says:  cluster.local
# Cert SAN says:   URI:spiffe://prod.example.com

That delta is your outage.
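
The comparison can also be done mechanically. This is a minimal sketch, assuming the values blob is YAML with a top-level identityTrustDomain key and that the certificate carries a URI:spiffe://... SAN as in the sample output above. The function names are hypothetical, and the inputs are the text captured from the two commands in this step.

```python
import re
from typing import Optional


def configured_trust_domain(values_yaml: str) -> Optional[str]:
    """Return the top-level identityTrustDomain from the linkerd-config values text."""
    match = re.search(r"^identityTrustDomain:[ \t]*(\S+)", values_yaml, re.MULTILINE)
    return match.group(1).strip("\"'") if match else None


def cert_trust_domain(openssl_text: str) -> Optional[str]:
    """Return the host part of the first spiffe:// URI SAN in `openssl x509 -text` output."""
    match = re.search(r"URI:spiffe://([^/,\s]+)", openssl_text)
    return match.group(1) if match else None


def describe_mismatch(values_yaml: str, openssl_text: str) -> Optional[str]:
    """Return a description of the mismatch, or None when both domains agree."""
    expected = configured_trust_domain(values_yaml)
    actual = cert_trust_domain(openssl_text)
    if expected is None or actual is None:
        return f"could not determine both domains (config={expected!r}, cert={actual!r})"
    if expected != actual:
        return f"config expects {expected!r} but cert carries {actual!r}"
    return None
```

Returning None for the healthy case keeps the helper easy to wire into a script that exits non-zero only when a description comes back.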


Basic Fix — Align Helm Value to Existing Cert

If the cert was intentionally generated with prod.example.com and the Helm value is wrong:

# linkerd-values-override.yaml
# (identityTrustDomain is a top-level chart value, not a child of identity)
 identity:
   issuer:
     scheme: kubernetes.io/tls
-identityTrustDomain: cluster.local
+identityTrustDomain: prod.example.com

helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  -n linkerd \
  -f linkerd-values-override.yaml \
  --reuse-values

# Force restart the injector and identity controller
kubectl -n linkerd rollout restart deploy/linkerd-proxy-injector
kubectl -n linkerd rollout restart deploy/linkerd-identity

# Validate
linkerd check

Enterprise Best Practice — Rotate the CA to Match Cluster Convention

If you're standardizing on cluster.local (recommended for portability) and the cert is wrong, rotate the trust anchor. Do not skip the step-cli verification.

# Generate new root CA with correct trust domain
step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca \
  --no-password \
  --insecure \
  --san "root.linkerd.cluster.local"

# Generate issuer cert signed by new root
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca \
  --not-after 8760h \
  --no-password \
  --insecure \
  --ca ca.crt \
  --ca-key ca.key

# Verify that the issuer chains to the new root before installing it
step certificate verify issuer.crt --roots ca.crt

# Helm upgrade with explicit cert injection
# (identityTrustDomain and identityTrustAnchorsPEM are top-level chart values)
 identity:
   issuer:
     scheme: kubernetes.io/tls
+identityTrustDomain: cluster.local
+identityTrustAnchorsPEM: |
+  <contents of ca.crt>

# The identity.issuer.tls.* values are read when identity.issuer.scheme is linkerd.io/tls.
# With kubernetes.io/tls, update the linkerd-identity-issuer Secret instead.
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key \
  --set identityTrustDomain=cluster.local \
  --reuse-values

# Re-roll workloads so they pick up new SVIDs
# (this restarts every Deployment in every namespace; add statefulsets/daemonsets if you mesh them)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl -n "$ns" rollout restart deploy 2>/dev/null
done

linkerd check --proxy

⚠️ During the rotation window, pods with old SVIDs (signed by the old CA) and pods with new SVIDs cannot mutually authenticate. Schedule this in a maintenance window or use Linkerd's trust anchor rotation procedure which supports a dual-trust-anchor bundle.
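
A dual-trust-anchor bundle is both root certificates concatenated into one PEM file, so that proxies accept SVIDs from either CA while workloads restart. Below is a minimal sketch of building such a bundle, assuming PEM-encoded inputs. build_trust_bundle is a hypothetical helper, and the authoritative sequence of steps is the one in Linkerd's own rotation documentation.

```python
import re

# Matches one PEM certificate block, including the BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def build_trust_bundle(*pem_texts: str) -> str:
    """Concatenate every certificate block from the inputs, dropping exact duplicates."""
    seen = set()
    blocks = []
    for text in pem_texts:
        for block in PEM_BLOCK.findall(text):
            # Normalize whitespace so the same cert with different indentation dedupes.
            normalized = "\n".join(line.strip() for line in block.splitlines())
            if normalized not in seen:
                seen.add(normalized)
                blocks.append(normalized)
    if not blocks:
        raise ValueError("no PEM certificates found")
    return "\n".join(blocks) + "\n"
```

The result could be written to a file such as bundle.crt (a placeholder name) and passed to the identityTrustAnchorsPEM value with --set-file before the issuer is switched to the new root.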




Prevention in CI/CD

1. Pre-Flight linkerd check in Your Deploy Pipeline

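# .github/workflows/deploy.yaml (or equivalent)
- name: Linkerd pre-flight check
  run: |
    # linkerd check exits non-zero when any check fails, so test the exit status
    # directly. (--pre only runs pre-installation checks, so it is omitted here.)
    if ! linkerd check > /tmp/linkerd-check.log 2>&1; then
      cat /tmp/linkerd-check.log
      echo "Linkerd control plane check failed. Blocking deploy."
      exit 1
    fi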

2. OPA/Gatekeeper Policy: Enforce Trust Domain Consistency

# opa-linkerd-trustdomain.rego
package linkerd.trustdomain

# linkerd-config stores the Helm values as a YAML string under data.values,
# so match the YAML form of the key rather than a JSON fragment.
violation[{"msg": msg}] {
  input.review.object.kind == "ConfigMap"
  input.review.object.metadata.name == "linkerd-config"
  input.review.object.metadata.namespace == "linkerd"
  values := input.review.object.data.values
  not contains(values, "identityTrustDomain: cluster.local")
  msg := "linkerd-config identityTrustDomain must be cluster.local"
}

3. cert-manager + Trust Domain Pinning

Use cert-manager with a ClusterIssuer to enforce the correct SAN on every generated cert:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 8760h
  renewBefore: 720h
  isCA: true
  privateKey:
    algorithm: ECDSA
  dnsNames:
    - identity.linkerd.cluster.local
  uris:
    - spiffe://cluster.local  # <-- THIS must match identityTrustDomain
  issuerRef:
    name: linkerd-trust-anchor
    kind: ClusterIssuer

4. Checkov / Helm Chart Linting

# Render the chart and run Checkov against the rendered manifests
helm template linkerd-control-plane linkerd/linkerd-control-plane \
  -f values.yaml > rendered.yaml

checkov -f rendered.yaml --check CKV2_K8S_6

# Custom script (not shown here): cross-check rendered trust domain vs. CA cert SAN
python3 scripts/validate_linkerd_trust_domain.py rendered.yaml ca.crt

Pin this validation in your Helm pre-upgrade hook and your GitOps reconciliation loop (Flux/ArgoCD pre-sync hook). A 30-second check here prevents a 30-minute outage.
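
The custom scripts/validate_linkerd_trust_domain.py referenced above is not shown in this guide. The following is one possible sketch, not the original script. It assumes the rendered manifest contains identityTrustDomain: <value> lines, that the certificate carries a spiffe:// URI SAN, and that the openssl binary is on the PATH. All function names are hypothetical.

```python
import re
import subprocess
import sys


def rendered_trust_domains(manifest_text: str) -> set:
    """Collect every identityTrustDomain value that appears in the rendered chart."""
    pattern = re.compile(r"^[ \t]*identityTrustDomain:[ \t]*(\S+)", re.MULTILINE)
    return {value.strip("\"'") for value in pattern.findall(manifest_text)}


def cert_trust_domain(cert_path: str) -> str:
    """Read the first spiffe:// URI SAN from the certificate via the openssl CLI."""
    text = subprocess.run(
        ["openssl", "x509", "-noout", "-text", "-in", cert_path],
        check=True, capture_output=True, text=True,
    ).stdout
    match = re.search(r"URI:spiffe://([^/,\s]+)", text)
    if match is None:
        raise ValueError(f"{cert_path}: no spiffe:// URI SAN found")
    return match.group(1)


def validate(manifest_text: str, cert_domain: str) -> list:
    """Return a list of problems; an empty list means the domains agree."""
    domains = rendered_trust_domains(manifest_text)
    if not domains:
        return ["no identityTrustDomain found in rendered manifest"]
    return [
        f"rendered trust domain {d!r} != certificate trust domain {cert_domain!r}"
        for d in sorted(domains) if d != cert_domain
    ]


def main(argv: list) -> int:
    """Exit status: 0 when consistent, 1 on mismatch, 2 on bad usage."""
    if len(argv) != 3:
        print(f"usage: {argv[0]} RENDERED_YAML CERT_PEM", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as fh:
        problems = validate(fh.read(), cert_trust_domain(argv[2]))
    for problem in problems:
        print(problem, file=sys.stderr)
    return 1 if problems else 0
```

To run it as a script, call sys.exit(main(sys.argv)) under the usual __main__ guard. Keeping validate free of file and subprocess access makes it straightforward to unit-test in the same pipeline.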
