How to Fix AWS Load Balancer Controller Webhook Certificate Expired Error in EKS
Threat/Impact Level: CRITICAL | Downtime Risk: HIGH | Time to Fix: 10–20 mins
TL;DR
- What broke: The TLS certificate backing the aws-load-balancer-webhook-service expired. Kubernetes rejects all webhook calls, blocking every Ingress and Service of type LoadBalancer from being created or updated.
- How to fix it: Delete the expired aws-load-balancer-tls secret, re-run cert generation (via cert-manager or manual openssl), and restart the controller pod to force re-registration.
- Fast path: Use our Client-Side Sandbox above to paste your kubectl describe mutatingwebhookconfiguration output and auto-generate the exact rotation commands for your cluster.
The Incident (What Does the Error Mean?)
You'll see one or more of these in your pod events or kubectl apply output:
Error from server (InternalError): error when creating "ingress.yaml":
Internal error occurred: failed calling webhook
"mwebhook.elbv2.k8s.aws": failed to call webhook:
Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s":
x509: certificate has expired or is not yet valid:
current time 2024-11-01T10:22:31Z is after 2024-06-15T00:00:00Z
or in controller logs:
tls: no certificates configured
WebhookServer: unable to start server: tls handshake error
Immediate consequence: Every Ingress resource and every LoadBalancer-type Service mutation is rejected by the API server. Running workloads keep their existing ALBs/NLBs, but zero new provisioning or updates work. Deployments that trigger service mutations will hang or fail. This is a hard cluster-level block, not a per-namespace issue.
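To separate this failure from other admission errors during triage, the apply output can be matched against the signatures shown above. The following is a minimal sketch, assuming POSIX sh and grep; the function name is_webhook_cert_error is hypothetical and not part of any tool.

```shell
# Hypothetical helper: exit 0 when the given text looks like the webhook
# certificate-expiry failure, non-zero otherwise.
is_webhook_cert_error() {
  printf '%s\n' "$1" | grep -q 'failed calling webhook' &&
    printf '%s\n' "$1" | grep -q 'x509: certificate has expired or is not yet valid'
}

# Usage sketch (not executed here):
#   out=$(kubectl apply -f ingress.yaml 2>&1)
#   is_webhook_cert_error "$out" && echo "webhook cert problem, skip the IAM hunt"
```

A match only indicates where to look first; Step 1 below is still needed to confirm the actual expiry date.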
The Attack Vector / Blast Radius
This is not a slow degradation — it is a binary failure. The Kubernetes API server enforces the webhook call before persisting any object mutation. failurePolicy: Fail (the default in the AWS LBC Helm chart) means the API server will reject the request entirely rather than allow it through.
Blast radius:
| Affected Resource | Impact |
|---|---|
| New Ingress objects | 500 Internal Error on creation |
| Service type LoadBalancer | Cannot be created or patched |
| HPA / rollout triggers that touch services | Blocked |
| GitOps reconciliation (ArgoCD/Flux) | Sync loops fail, alert storm |
| Cluster autoscaler node registration | Unaffected (different path) |
The secondary blast radius is operational: every engineer on-call starts chasing ghost issues in their application because the error surfaces at the kubectl apply layer, not in app logs. The cert expiry is frequently misdiagnosed as an IAM or IRSA issue for 20–40 minutes.
Security note: If your team's response is to set failurePolicy: Ignore as a workaround, you have now disabled admission control for load balancer resources entirely. Malformed or malicious Ingress annotations will pass through without validation.
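One way to catch a lingering failurePolicy: Ignore workaround is to list the policy of every webhook entry. This is a sketch under assumptions: the helper name check_failure_policy is hypothetical, and the configuration name must be whatever your cluster actually uses.

```shell
# Hypothetical helper: print "<webhook name> <failurePolicy>" for each entry
# and return non-zero if any entry is not set to Fail.
# $1 = kind (mutatingwebhookconfiguration or validatingwebhookconfiguration)
# $2 = configuration name (assumption: look it up in your own cluster)
check_failure_policy() {
  policies=$(kubectl get "$1" "$2" \
    -o jsonpath='{range .webhooks[*]}{.name}{" "}{.failurePolicy}{"\n"}{end}') || return 2
  printf '%s\n' "$policies"
  # grep -v succeeds when some line is NOT "... Fail"; invert that result.
  ! printf '%s\n' "$policies" | grep -qv ' Fail$'
}

# Usage sketch (not executed here):
#   check_failure_policy mutatingwebhookconfiguration \
#     aws-load-balancer-mutating-webhook-configuration || echo "policy weakened"
```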
How to Fix It
Step 1 — Confirm the expiry
# Check the certificate expiry in the secret
kubectl get secret aws-load-balancer-tls \
-n kube-system \
-o jsonpath='{.data.tls\.crt}' \
| base64 -d \
| openssl x509 -noout -dates
# Check the webhook config CA bundle
kubectl get mutatingwebhookconfiguration \
aws-load-balancer-mutating-webhook-configuration \
-o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
| base64 -d \
| openssl x509 -noout -dates
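The notAfter value printed by the commands above can be turned into a day count, which is easier to alert on or paste into an incident channel. A minimal sketch, assuming GNU date (the -d flag is not available on stock macOS); days_left is a hypothetical helper name.

```shell
# Hypothetical helper: whole days between "now" and an openssl notAfter value.
# A negative number means the certificate has already expired.
# $1 = date text from `openssl x509 -noout -enddate`, with or without the
#      "notAfter=" prefix
# $2 = optional "now" in epoch seconds (defaults to the current time)
days_left() {
  end_text=${1#notAfter=}
  end_epoch=$(date -u -d "$end_text" +%s) || return 1   # GNU date syntax
  now_epoch=${2:-$(date -u +%s)}
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage sketch (not executed here):
#   days_left "$(kubectl get secret aws-load-balancer-tls -n kube-system \
#     -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate)"
```

With the dates from the sample error message, the result would be negative, confirming the cert is already past its notAfter.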
Basic Fix — Manual Certificate Rotation
Use this if you are not running cert-manager.
# 1. Generate a new self-signed cert (10-year validity; adjust to policy)
#    -addext needs OpenSSL 1.1.1+; the SAN type prefix is case-sensitive (DNS:, not dns:)
OPENSSL_SAN="DNS:aws-load-balancer-webhook-service.kube-system.svc,\
DNS:aws-load-balancer-webhook-service.kube-system.svc.cluster.local"
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout tls.key -out tls.crt \
  -days 3650 \
  -subj "/CN=aws-load-balancer-webhook-service.kube-system.svc" \
  -addext "subjectAltName=${OPENSSL_SAN}"
# 2. Delete and recreate the secret
kubectl delete secret aws-load-balancer-tls -n kube-system
kubectl create secret tls aws-load-balancer-tls \
-n kube-system \
--cert=tls.crt \
--key=tls.key
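# 3. Patch the CA bundle on every webhook entry (each configuration lists several, not just index 0)
#    Confirm the configuration names first:
#    kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
CA_BUNDLE=$(base64 < tls.crt | tr -d '\n')   # portable; base64 -w0 is GNU-only
while read -r kind name; do
  count=$(kubectl get "$kind" "$name" -o jsonpath='{.webhooks[*].name}' | wc -w)
  for i in $(seq 0 $((count - 1))); do
    kubectl patch "$kind" "$name" --type='json' \
      -p="[{\"op\":\"replace\",\"path\":\"/webhooks/${i}/clientConfig/caBundle\",\"value\":\"${CA_BUNDLE}\"}]"
  done
done <<'EOF'
mutatingwebhookconfiguration aws-load-balancer-mutating-webhook-configuration
validatingwebhookconfiguration aws-load-balancer-validating-webhook-configuration
EOF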
# 4. Restart the controller
kubectl rollout restart deployment aws-load-balancer-controller -n kube-system
kubectl rollout status deployment aws-load-balancer-controller -n kube-system
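After the manual rotation, it is worth checking that no webhook entry still carries the old CA. The sketch below compares each caBundle with the secret's tls.crt. It only makes sense for the self-signed case above, where the serving cert is its own CA, and ca_bundle_in_sync is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if every webhook entry in the given
# configuration has a caBundle identical to the secret's tls.crt.
# $1 = kind, $2 = configuration name (assumption: names from your cluster)
ca_bundle_in_sync() {
  secret_crt=$(kubectl get secret aws-load-balancer-tls -n kube-system \
    -o jsonpath='{.data.tls\.crt}') || return 2
  bundles=$(kubectl get "$1" "$2" \
    -o jsonpath='{range .webhooks[*]}{.clientConfig.caBundle}{"\n"}{end}') || return 2
  [ -n "$bundles" ] || return 2
  # The loop runs in a subshell; exit 1 there becomes the function's status.
  printf '%s\n' "$bundles" | while IFS= read -r bundle; do
    [ "$bundle" = "$secret_crt" ] || exit 1
  done
}

# Usage sketch (not executed here):
#   ca_bundle_in_sync validatingwebhookconfiguration \
#     aws-load-balancer-validating-webhook-configuration || echo "stale caBundle"
```

A server-side dry run (kubectl apply --dry-run=server -f ingress.yaml) is a reasonable end-to-end follow-up, since it should still pass through admission webhooks that declare no side effects.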
Enterprise Best Practice — cert-manager Automated Rotation
This is the only acceptable long-term solution. Manual cert rotation is a toil trap.
# helm/values-aws-lbc.yaml
- # No cert-manager integration; using manually managed secret
- webhookTLS:
- secretName: aws-load-balancer-tls
+ # cert-manager issues and auto-rotates the webhook certificate
+ enableCertManager: true
+
+ # Ensure cert-manager is installed before this chart
+ # helm repo add jetstack https://charts.jetstack.io
+ # helm install cert-manager jetstack/cert-manager \
+ # --namespace cert-manager --create-namespace \
+ # --set installCRDs=true   # cert-manager v1.15+ prefers crds.enabled=true
# cert-manager Certificate resource (if managing manually via CRD)
  apiVersion: cert-manager.io/v1
  kind: Certificate
  metadata:
    name: aws-load-balancer-serving-cert
    namespace: kube-system
  spec:
    dnsNames:
-     - aws-load-balancer-webhook-service
+     - aws-load-balancer-webhook-service.kube-system.svc
+     - aws-load-balancer-webhook-service.kube-system.svc.cluster.local
    issuerRef:
      kind: Issuer
      name: aws-load-balancer-selfsigned-issuer
-   duration: 8760h # 1 year, manual rotation required
+   duration: 8760h
+   renewBefore: 720h # Renew 30 days before expiry; cert-manager handles this automatically
    secretName: aws-load-balancer-tls
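With cert-manager and renewBefore set, rotation is fully automated. cert-manager rewrites the secret before expiry, and the controller's webhook server reloads the certificate once the kubelet syncs the updated secret into the pod's volume, so a pod restart is normally not required.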
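To confirm that cert-manager has actually taken over, the Certificate's status can be inspected. A minimal sketch, assuming the Certificate is named aws-load-balancer-serving-cert in kube-system as in the manifest above; cert_manager_status is a hypothetical helper name.

```shell
# Hypothetical helper: print the Ready condition, notAfter and renewalTime of
# the Certificate, and return non-zero unless Ready is "True".
cert_manager_status() {
  status=$(kubectl get certificate aws-load-balancer-serving-cert -n kube-system \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status} {.status.notAfter} {.status.renewalTime}') || return 2
  printf '%s\n' "$status"
  case "$status" in
    True\ *) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage sketch (not executed here):
#   cert_manager_status || echo "Certificate is not Ready; check the Issuer"
```

The renewalTime field should sit roughly renewBefore ahead of notAfter; if it does not, the renewBefore setting probably did not reach the cluster.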
Prevention in CI/CD
1. Certificate expiry alerting (Prometheus)
# prometheus-rules.yaml
groups:
  - name: eks-webhook-cert
    rules:
      - alert: EKSWebhookCertExpiringSoon
        expr: |
          (certmanager_certificate_expiration_timestamp_seconds{name="aws-load-balancer-serving-cert"}
            - time()) / 86400 < 30
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "AWS LBC webhook cert expires in < 30 days"
2. Block non-cert-manager LBC deployments with OPA/Gatekeeper
# opa/lbc-certmanager-required.rego
package lbc

# Gatekeeper ConstraintTemplates evaluate a rule named "violation", not "deny"
violation[{"msg": msg}] {
  input.review.object.kind == "HelmRelease"
  input.review.object.spec.chart.spec.chart == "aws-load-balancer-controller"
  not input.review.object.spec.values.enableCertManager == true
  msg := "AWS LBC HelmRelease must have enableCertManager: true"
}
3. Checkov IaC scan in pipeline
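# .github/workflows/checkov.yaml (relevant step)
# CKV_K8S_WEBHOOK_CERT_MANAGED is not a built-in Checkov check ID; it has to be a
# custom policy you author and load with --external-checks-dir (placeholder path below)
- name: Scan Helm values for LBC misconfig
  run: |
    checkov -d ./helm \
      --framework kubernetes \
      --external-checks-dir ./checkov-policies \
      --check CKV_K8S_WEBHOOK_CERT_MANAGED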
4. Helm upgrade in ArgoCD — force cert-manager dependency
# argocd-application.yaml
spec:
  source:
    helm:
      values: |
        enableCertManager: true
  syncPolicy:
    syncOptions:
      - RespectIgnoreDifferences=true
    retry:
      limit: 3
Bottom line: Any cluster running AWS Load Balancer Controller without cert-manager managing the webhook TLS is a scheduled outage waiting to happen. The default Helm chart cert is valid for 1 year (validity can differ by chart version and issuance method, so confirm the actual notAfter date with the Step 1 check). Most teams discover this at month 13, in production, at 2 AM.