How to Fix AWS Load Balancer Controller Webhook Certificate Expired Error in EKS

Threat/Impact Level: CRITICAL | Downtime Risk: HIGH | Time to Fix: 10–20 mins


TL;DR

  • What broke: The TLS certificate backing the aws-load-balancer-webhook-service expired. The API server can no longer complete the TLS handshake with the webhook, so every create or update of an Ingress or a Service of type LoadBalancer is rejected.
  • How to fix it: Delete the expired aws-load-balancer-tls secret, regenerate the certificate (via cert-manager or manual openssl), update the caBundle on both webhook configurations, and restart the controller pod so it serves the new certificate.

The Incident (What Does the Error Mean?)

You'll see one or more of these in your pod events or kubectl apply output:

Error from server (InternalError): error when creating "ingress.yaml":
  Internal error occurred: failed calling webhook
  "mwebhook.elbv2.k8s.aws": failed to call webhook:
  Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s":
  x509: certificate has expired or is not yet valid:
  current time 2024-11-01T10:22:31Z is after 2024-06-15T00:00:00Z

or in controller logs:

tls: no certificates configured
WebhookServer: unable to start server: tls handshake error

Immediate consequence: Every Ingress resource and every LoadBalancer-type Service mutation is rejected by the API server. Running workloads keep their existing ALBs/NLBs, but zero new provisioning or updates work. Deployments that trigger service mutations will hang or fail. This is a hard cluster-level block, not a per-namespace issue.


The Attack Vector / Blast Radius

This is not a slow degradation — it is a binary failure. The Kubernetes API server enforces the webhook call before persisting any object mutation. failurePolicy: Fail (the default in the AWS LBC Helm chart) means the API server will reject the request entirely rather than allow it through.

Blast radius:

Affected Resource | Impact
New Ingress objects | 500 Internal Error on creation
Service type LoadBalancer | Cannot be created or patched
HPA / rollout triggers that touch services | Blocked
GitOps reconciliation (ArgoCD/Flux) | Sync loops fail, alert storm
Cluster autoscaler node registration | Unaffected (different path)

The secondary blast radius is operational: every engineer on-call starts chasing ghost issues in their application because the error surfaces at the kubectl apply layer, not in app logs. The cert expiry is frequently misdiagnosed as an IAM or IRSA issue for 20–40 minutes.
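
To shorten that misdiagnosis window, the captured error text can be triaged before anyone opens the IAM console. The following is a minimal sketch; the function name `classify_apply_error` and its three labels are illustrative, not part of any tool, and the match strings are taken from the error shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: classify the text of a failed `kubectl apply`.
# Prints one of: webhook-cert-expired, webhook-other, not-webhook.
classify_apply_error() {
  msg="$1"
  case "$msg" in
    # Webhook call failed AND the TLS layer reported an expired/not-yet-valid cert
    *"failed calling webhook"*"x509: certificate has expired or is not yet valid"*)
      echo "webhook-cert-expired" ;;
    # Webhook call failed for some other reason (timeout, no endpoints, ...)
    *"failed calling webhook"*)
      echo "webhook-other" ;;
    # Nothing webhook-related in the message; look at IAM/IRSA or the manifest
    *)
      echo "not-webhook" ;;
  esac
}
```

Typical use would be `classify_apply_error "$(kubectl apply -f ingress.yaml 2>&1)"`; only a `webhook-cert-expired` result points at the rotation steps in this article.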

Security note: If your team's response is to set failurePolicy: Ignore as a workaround, you have now disabled admission control for load balancer resources entirely. Malformed or malicious Ingress annotations will pass through without validation.
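
After an incident it is worth checking whether anyone left that workaround in place. A rough sketch, assuming `kubectl` access to the cluster; the function name `list_ignore_webhooks` is made up for this example, and the configuration name you pass in depends on how the controller was installed.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every webhook entry whose
# failurePolicy is Ignore.
# $1: kind (mutatingwebhookconfiguration or validatingwebhookconfiguration)
# $2: name of the webhook configuration object
list_ignore_webhooks() {
  kubectl get "$1" "$2" \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}' \
    | awk -F '\t' '$2 == "Ignore" { print $1 }'
}
```

Empty output means no entry in that configuration is set to Ignore. Run it once per webhook configuration, mutating and validating.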


How to Fix It

Step 1 — Confirm the expiry

# Check the certificate expiry in the secret
kubectl get secret aws-load-balancer-tls \
  -n kube-system \
  -o jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -dates

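# Check the webhook config CA bundle.
# Configuration names depend on how the controller was installed (Helm installs
# commonly use aws-load-balancer-webhook), so list them first and substitute as needed.
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations \
  | grep -i load-balancer

kubectl get mutatingwebhookconfiguration \
  aws-load-balancer-mutating-webhook-configuration \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
  | base64 -d \
  | openssl x509 -noout -dates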
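
The `notAfter=` line printed by those commands can be turned into a day count, which is easier to read in a runbook or a cron check. This is a sketch only: `days_until_expiry` is a hypothetical helper, and it assumes GNU `date` (the `-d` flag), so it will not work unchanged with BSD/macOS `date`.

```shell
#!/bin/sh
# Hypothetical helper: days between "now" and a certificate's notAfter date.
# $1: a line such as "notAfter=Jun 15 00:00:00 2024 GMT" (openssl -enddate output)
# $2: optional "now" in epoch seconds; defaults to the current time
# Prints a negative number when the certificate has already expired.
days_until_expiry() {
  not_after="${1#notAfter=}"                      # strip the "notAfter=" prefix
  now="${2:-$(date -u +%s)}"
  expiry=$(date -u -d "$not_after" +%s) || return 1   # GNU date assumed
  # Integer division truncates toward zero, so partial days are dropped
  echo $(( (expiry - now) / 86400 ))
}
```

For example, `days_until_expiry "$(kubectl get secret aws-load-balancer-tls -n kube-system -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate)"` prints the remaining days for the secret checked above.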

Basic Fix — Manual Certificate Rotation

Use this if you are not running cert-manager.

# 1. Generate a new self-signed cert (10-year validity; adjust to policy)
#    -addext requires OpenSSL 1.1.1 or later; SAN type prefixes are case-sensitive ("DNS:")
OPENSSL_SAN="DNS:aws-load-balancer-webhook-service.kube-system.svc,\
DNS:aws-load-balancer-webhook-service.kube-system.svc.cluster.local"

openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout tls.key -out tls.crt \
  -days 3650 \
  -subj "/CN=aws-load-balancer-webhook-service.kube-system.svc" \
  -addext "subjectAltName=${OPENSSL_SAN}"

# 2. Delete and recreate the secret
kubectl delete secret aws-load-balancer-tls -n kube-system
kubectl create secret tls aws-load-balancer-tls \
  -n kube-system \
  --cert=tls.crt \
  --key=tls.key

# 3. Patch the CA bundle on every webhook entry (each configuration holds several)
CA_BUNDLE=$(base64 < tls.crt | tr -d '\n')   # portable alternative to GNU-only `base64 -w0`

patch_ca_bundle() {   # $1: kind, $2: configuration name
  count=$(( $(kubectl get "$1" "$2" -o jsonpath='{.webhooks[*].name}' | wc -w) ))
  i=0
  while [ "$i" -lt "$count" ]; do
    kubectl patch "$1" "$2" --type='json' \
      -p="[{\"op\":\"replace\",\"path\":\"/webhooks/${i}/clientConfig/caBundle\",\"value\":\"${CA_BUNDLE}\"}]"
    i=$((i + 1))
  done
}

patch_ca_bundle mutatingwebhookconfiguration \
  aws-load-balancer-mutating-webhook-configuration
patch_ca_bundle validatingwebhookconfiguration \
  aws-load-balancer-validating-webhook-configuration

# 4. Restart the controller
kubectl rollout restart deployment aws-load-balancer-controller -n kube-system
kubectl rollout status deployment aws-load-balancer-controller -n kube-system
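
Once the rollout finishes, it can help to confirm that every webhook entry trusts the certificate now stored in the secret. The sketch below relies on one assumption specific to this manual fix: the certificate is self-signed, so the base64 caBundle should be identical to the secret's `tls.crt` field. With a separate CA (for example under cert-manager) the two values legitimately differ and this check does not apply. `stale_ca_webhooks` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print webhook entries whose caBundle does not match
# the tls.crt stored in the aws-load-balancer-tls secret.
# Only meaningful for the self-signed setup, where cert and CA are the same.
# $1: kind (mutatingwebhookconfiguration or validatingwebhookconfiguration)
# $2: name of the webhook configuration object
stale_ca_webhooks() {
  want=$(kubectl get secret aws-load-balancer-tls -n kube-system \
    -o jsonpath='{.data.tls\.crt}') || return 1
  kubectl get "$1" "$2" \
    -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.clientConfig.caBundle}{"\n"}{end}' \
    | awk -F '\t' -v want="$want" 'NF && $2 != want { print $1 }'
}
```

Empty output for both the mutating and the validating configuration means no entry was left pointing at the old certificate.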

Enterprise Best Practice — cert-manager Automated Rotation

This is the only acceptable long-term solution. Manual cert rotation is a toil trap.

# helm/values-aws-lbc.yaml

- # No cert-manager integration; using manually managed secret
- webhookTLS:
-   secretName: aws-load-balancer-tls

+ # cert-manager issues and auto-rotates the webhook certificate
+ enableCertManager: true
+
+ # Ensure cert-manager is installed before this chart
+ # helm repo add jetstack https://charts.jetstack.io
+ # helm install cert-manager jetstack/cert-manager \
+ #   --namespace cert-manager --create-namespace \
+ #   --set installCRDs=true
+ #   (cert-manager 1.15+ deprecates installCRDs in favor of --set crds.enabled=true)

# cert-manager Certificate resource (if managing manually via CRD)

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: aws-load-balancer-serving-cert
  namespace: kube-system
spec:
  dnsNames:
-   - aws-load-balancer-webhook-service
+   - aws-load-balancer-webhook-service.kube-system.svc
+   - aws-load-balancer-webhook-service.kube-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: aws-load-balancer-selfsigned-issuer
- duration: 8760h   # 1 year, manual rotation required
+ duration: 8760h
+ renewBefore: 720h  # Renew 30 days before expiry — cert-manager handles this automatically
  secretName: aws-load-balancer-tls

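With cert-manager and renewBefore set, rotation is fully automated. cert-manager rewrites the aws-load-balancer-tls secret before expiry, and the controller's webhook server should reload the certificate from the mounted secret once the kubelet syncs the change, so no pod restart is needed.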
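
To confirm that cert-manager has actually taken over, the Certificate's status can be read directly. A minimal sketch, assuming the Certificate is named as in the manifest above and lives in kube-system; `cert_status` and `cert_is_ready` are hypothetical helper names. The `status.notAfter` and `status.renewalTime` fields are populated by cert-manager once issuance succeeds.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a cert-manager Certificate.
# $1: Certificate name, $2: namespace (defaults to kube-system)

# Prints "<Ready status> <notAfter> <renewalTime>" on one line.
cert_status() {
  kubectl get certificate "$1" -n "${2:-kube-system}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" "}{.status.notAfter}{" "}{.status.renewalTime}{"\n"}'
}

# Succeeds only when the Ready condition is "True".
cert_is_ready() {
  status=$(cert_status "$@") || return 1
  [ "${status%% *}" = "True" ]
}
```

With `duration: 8760h` and `renewBefore: 720h`, the reported renewalTime should land 30 days before notAfter; `cert_is_ready aws-load-balancer-serving-cert || echo "certificate not ready"` makes a simple smoke test after a chart upgrade.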



Prevention in CI/CD

1. Certificate expiry alerting (Prometheus)

# prometheus-rules.yaml
groups:
  - name: eks-webhook-cert
    rules:
      - alert: EKSWebhookCertExpiringSoon
        expr: |
          (certmanager_certificate_expiration_timestamp_seconds
            {name="aws-load-balancer-serving-cert"} - time()) / 86400 < 30
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "AWS LBC webhook cert expires in < 30 days"

2. Block non-cert-manager LBC deployments with OPA/Gatekeeper

# opa/lbc-certmanager-required.rego
# Gatekeeper evaluates rules named `violation`; HelmRelease is the Flux CRD,
# so this only covers Flux-managed installs.
package lbc

violation[{"msg": msg}] {
  input.review.object.kind == "HelmRelease"
  input.review.object.spec.chart.spec.chart == "aws-load-balancer-controller"
  not input.review.object.spec.values.enableCertManager == true
  msg := "AWS LBC HelmRelease must have enableCertManager: true"
}

3. Checkov IaC scan in pipeline

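# .github/workflows/checkov.yaml (relevant step)
# CKV_K8S_WEBHOOK_CERT_MANAGED is not a built-in Checkov check; it must be supplied
# as a custom policy. The ./checkov-policies path below is a placeholder.
- name: Scan Helm values for LBC misconfig
  run: |
    checkov -d ./helm \
      --framework kubernetes \
      --external-checks-dir ./checkov-policies \
      --check CKV_K8S_WEBHOOK_CERT_MANAGED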
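
Independent of any scanner, a pipeline can fail fast with a plain text check on the values file. This is a deliberately crude sketch: `values_enable_cert_manager` is a hypothetical helper, it only matches a top-level `enableCertManager: true` line, and it cannot see overrides passed with `--set` or layered values files.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given Helm values file contains a
# top-level "enableCertManager: true" line (trailing comment allowed).
# This is a text match, not a YAML parser.
values_enable_cert_manager() {
  grep -Eq '^enableCertManager:[[:space:]]*true[[:space:]]*(#.*)?$' "$1"
}
```

In a CI step this could be used as `values_enable_cert_manager helm/values-aws-lbc.yaml || { echo "enableCertManager must be true" >&2; exit 1; }`.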

4. Helm upgrade in ArgoCD — force cert-manager dependency

# argocd-application.yaml
spec:
  source:
    helm:
      values: |
        enableCertManager: true
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true
    retry:
      limit: 3

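Bottom line: Any cluster running AWS Load Balancer Controller without cert-manager managing the webhook TLS is a scheduled outage waiting to happen. Chart-generated and hand-rolled certificates have a fixed lifetime and nothing renews them, so confirm the real expiry date with the Step 1 commands instead of assuming a validity period. Most teams discover the expiry in production, at 2 AM.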
