Fixing MutatingAdmissionWebhook Timeout: context deadline exceeded in Kubernetes
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: The kube-apiserver called your MutatingAdmissionWebhook and got no response before timeoutSeconds elapsed, so every admission request hitting that webhook is now blocked or failing.
- How to fix it: Reduce timeoutSeconds, set failurePolicy: Ignore for non-critical webhooks, fix the webhook pod's health, and verify network reachability from the API server to the webhook service.
The Incident (What does the error mean?)
Raw error surfaced on kubectl apply or during pod scheduling:
Error from server (InternalError): Internal error occurred:
failed calling webhook "webhook.example.com": Post
"https://webhook-service.default.svc:443/mutate": context deadline exceeded
The kube-apiserver opens an HTTPS connection to your webhook's Service endpoint and waits. It never gets a response. After timeoutSeconds (default: 10s, max: 30s), the API server gives up. If failurePolicy is Fail (the default), every resource subject to that webhook is now uncreateable. Deployments stall. Operators loop. Cluster admission is effectively bricked for the affected resource types.
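The interaction between timeoutSeconds and failurePolicy described above can be modelled in a few lines. This is only a sketch of the decision, not the API server's actual code; the Webhook class and admission_outcome function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MAX_TIMEOUT_SECONDS = 30  # upper bound accepted for timeoutSeconds


@dataclass
class Webhook:
    name: str
    timeout_seconds: int = 10     # default in admissionregistration.k8s.io/v1
    failure_policy: str = "Fail"  # default in admissionregistration.k8s.io/v1


def admission_outcome(webhook: Webhook, response_seconds: Optional[float]) -> str:
    """Model the fate of one request; None means the webhook never answers."""
    if not 1 <= webhook.timeout_seconds <= MAX_TIMEOUT_SECONDS:
        raise ValueError("timeoutSeconds must be between 1 and 30")
    answered_in_time = (
        response_seconds is not None
        and response_seconds < webhook.timeout_seconds
    )
    if answered_in_time:
        return "webhook response applied"
    # The call timed out: this is the "context deadline exceeded" case.
    if webhook.failure_policy == "Ignore":
        return "request admitted without mutation"
    return "request rejected"
```

With the defaults, a webhook that never answers rejects the request, while the same webhook with failure_policy="Ignore" lets it through unmutated.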
The Attack Vector / Blast Radius
This is a single point of failure by design — admission webhooks are synchronous, in-band with every API server write. The blast radius:
- All new Pod/Deployment/Service creates and updates are rejected for any namespace not excluded by namespaceSelector.
- Node autoscaler cannot provision new nodes if the webhook intercepts Node or Pod objects.
- Operators and controllers enter crash loops retrying failed applies, amplifying API server load.
- Istio, OPA/Gatekeeper, Vault Agent Injector, and cert-manager all use mutating webhooks. One dead pod takes down the injection pipeline for the entire cluster.
- If failurePolicy: Fail is set on a webhook targeting */* resources, a single unresponsive webhook pod = full cluster write outage.
The secondary failure mode: developers start deleting the MutatingWebhookConfiguration in a panic, disabling security controls (mTLS injection, secret mutation) cluster-wide as a side effect.
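To make the worst case above concrete, the following sketch flags webhook entries that combine failurePolicy: Fail, wildcard rules, and no namespaceSelector. It is a crude heuristic, not an official check: the function name is hypothetical, and it assumes the webhook entry has already been loaded into a Python dict (for example from kubectl get ... -o json).

```python
def has_cluster_wide_blast_radius(webhook: dict) -> bool:
    """True if one unresponsive backend could block writes cluster-wide."""
    # An unset failurePolicy is treated as Fail, the v1 default.
    if webhook.get("failurePolicy", "Fail") != "Fail":
        return False
    # Any namespaceSelector is taken to mean some namespaces are excluded.
    if webhook.get("namespaceSelector"):
        return False
    for rule in webhook.get("rules", []):
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        if "*" in groups and ("*" in resources or "*/*" in resources):
            return True
    return False
```

A result of True does not prove an outage will happen; it only marks the configuration as one where a single dead webhook pod is enough to cause it.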
How to Fix It (The Solution)
Step 1: Immediate Triage — Identify the webhook and its pod
# Find all mutating webhooks
kubectl get mutatingwebhookconfigurations
# Inspect the broken one
kubectl describe mutatingwebhookconfiguration <name>
# Find the backing service and pod
kubectl get svc -n <webhook-namespace>
kubectl get pods -n <webhook-namespace> -l <selector>
kubectl logs -n <webhook-namespace> <webhook-pod> --tail=100
Look for: pod in CrashLoopBackOff, OOMKilled, TLS handshake errors, or simply no running replicas.
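If there are many replicas, the same symptoms can be picked out of the JSON form of the pod list. The sketch below assumes the output of kubectl get pods -n <webhook-namespace> -l <selector> -o json has been parsed into a dict; triage_pods is a hypothetical helper, and it covers only the CrashLoopBackOff, OOMKilled, not-ready, and no-replica cases (TLS handshake errors still have to be read from the logs).

```python
from typing import List


def triage_pods(pod_list: dict) -> List[str]:
    """Summarise the failure symptoms found in a parsed PodList."""
    items = pod_list.get("items", [])
    if not items:
        return ["no pods found: the Service has no backing replicas"]
    findings = []
    for pod in items:
        name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting", {})
            last_terminated = status.get("lastState", {}).get("terminated", {})
            if waiting.get("reason") == "CrashLoopBackOff":
                findings.append(f"{name}: CrashLoopBackOff")
            if last_terminated.get("reason") == "OOMKilled":
                findings.append(f"{name}: OOMKilled")
            if not status.get("ready", False):
                findings.append(f"{name}: container {status['name']} not ready")
    return findings
```

One way to use it is to pipe the kubectl output into a small script that calls triage_pods(json.load(sys.stdin)) and prints each finding.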
Basic Fix — Patch failurePolicy and timeoutSeconds immediately
If the webhook is non-critical (e.g., a sidecar injector for observability, not security), set it to Ignore to unblock the cluster right now:
 apiVersion: admissionregistration.k8s.io/v1
 kind: MutatingWebhookConfiguration
 metadata:
   name: my-webhook
 webhooks:
   - name: webhook.example.com
     admissionReviewVersions: ["v1"]
     sideEffects: None  # required in admissionregistration.k8s.io/v1
     clientConfig:
       service:
         name: webhook-service
         namespace: default
         path: /mutate
-    failurePolicy: Fail
+    failurePolicy: Ignore
-    timeoutSeconds: 30
+    timeoutSeconds: 10
     rules:
       - operations: ["CREATE", "UPDATE"]
         apiGroups: [""]
         apiVersions: ["v1"]
         resources: ["pods"]
⚠️ Only use Ignore for non-security-critical webhooks. For Gatekeeper/OPA or Vault injection, fix the pod; do not set Ignore.
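During an incident it can be quicker to patch the two fields in place than to re-apply the whole manifest. The sketch below builds a JSON Patch document for that purpose; build_unblock_patch is a hypothetical helper, and it assumes you already know the order of the entries in the webhooks list (from kubectl get ... -o yaml), since JSON Patch addresses list entries by index.

```python
import json
from typing import List


def build_unblock_patch(webhook_names: List[str], target: str,
                        timeout_seconds: int = 10) -> str:
    """Return a JSON Patch that sets failurePolicy: Ignore on one webhook."""
    index = webhook_names.index(target)  # raises ValueError if target is absent
    operations = [
        # "replace" assumes the fields exist; the API server defaults both.
        {"op": "replace",
         "path": f"/webhooks/{index}/failurePolicy",
         "value": "Ignore"},
        {"op": "replace",
         "path": f"/webhooks/{index}/timeoutSeconds",
         "value": timeout_seconds},
    ]
    return json.dumps(operations)
```

The resulting string can be passed to kubectl patch mutatingwebhookconfiguration <name> --type=json -p '<patch>'. The same warning applies: do this only for webhooks that are not security-critical.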
Enterprise Best Practice — Hardened, HA Webhook Configuration
 apiVersion: admissionregistration.k8s.io/v1
 kind: MutatingWebhookConfiguration
 metadata:
   name: my-webhook
 webhooks:
   - name: webhook.example.com
     admissionReviewVersions: ["v1"]
     sideEffects: None  # required in admissionregistration.k8s.io/v1
     clientConfig:
       service:
         name: webhook-service
         namespace: webhook-system
         path: /mutate
+      # Ensure caBundle is current; stale certs cause TLS failures that masquerade as timeouts
+      caBundle: <base64-encoded-CA-cert>
-    failurePolicy: Fail
+    failurePolicy: Fail  # Keep Fail for security webhooks, but fix HA below
-    timeoutSeconds: 30
+    timeoutSeconds: 15   # Never use 30s; degrades API server responsiveness
+    # Exclude kube-system and the webhook's own namespace to prevent deadlock
+    namespaceSelector:
+      matchExpressions:
+        - key: kubernetes.io/metadata.name
+          operator: NotIn
+          values: ["kube-system", "webhook-system", "kube-public"]
+    # Scope rules tightly; never use wildcard resources in production
     rules:
       - operations: ["CREATE", "UPDATE"]
         apiGroups: ["apps", ""]
         apiVersions: ["v1"]
-        resources: ["*"]
+        resources: ["pods", "deployments"]
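The namespaceSelector above only prevents the deadlock if it really excludes the namespaces you expect. The following sketch evaluates a label selector against a namespace's labels using the standard In, NotIn, Exists, and DoesNotExist semantics; selector_matches is a hypothetical helper written for illustration, not part of any Kubernetes client.

```python
from typing import Dict


def selector_matches(selector: dict, labels: Dict[str, str]) -> bool:
    """True if the webhook would be called for a namespace with these labels."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key = expr["key"]
        operator = expr["operator"]
        values = expr.get("values", [])
        if operator == "In":
            ok = key in labels and labels[key] in values
        elif operator == "NotIn":
            # NotIn also matches namespaces that lack the label entirely.
            ok = key not in labels or labels[key] not in values
        elif operator == "Exists":
            ok = key in labels
        elif operator == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unknown operator: {operator}")
        if not ok:
            return False
    return True
```

With the selector from the configuration above, a namespace labelled kubernetes.io/metadata.name: kube-system does not match (the webhook is skipped), while an application namespace does.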
Webhook Deployment — enforce HA:
 apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: webhook-server
   namespace: webhook-system
 spec:
-  replicas: 1
+  replicas: 3
+  strategy:
+    rollingUpdate:
+      maxUnavailable: 0  # Zero-downtime rollouts for admission-critical workloads
   template:
     spec:
       containers:
         - name: webhook
+          resources:
+            requests:
+              cpu: 100m
+              memory: 128Mi
+            limits:
+              cpu: 500m
+              memory: 256Mi
+          readinessProbe:
+            httpGet:
+              path: /healthz
+              port: 8443
+              scheme: HTTPS
+            initialDelaySeconds: 5
+            periodSeconds: 5
Also verify network policy isn't blocking API server → webhook traffic:
# kube-apiserver egress to webhook service port (typically 443 or 8443)
# If using Calico/Cilium, ensure a NetworkPolicy allows ingress from the API server node IPs
kubectl get networkpolicy -n webhook-system
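Listing policies shows what is configured, not whether a connection actually succeeds. A rough reachability probe such as the one below can help separate "connection refused", "TCP connects but TLS never completes", and "healthy". It is a sketch under assumptions: probe_webhook is a hypothetical helper, certificate verification is deliberately disabled because this is a diagnostic, and the result only reflects the API server's view if it is run from a host on the same network path, such as a control-plane node.

```python
import socket
import ssl
import time


def probe_webhook(host: str, port: int, timeout_seconds: float) -> str:
    """Attempt a TCP connect plus TLS handshake and classify the result."""
    context = ssl.create_default_context()
    # Diagnostic only: skip verification so a stale CA does not hide
    # whether the endpoint is reachable at all.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                pass  # handshake completed
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"
    return f"ok in {time.monotonic() - start:.2f}s"
```

A "timeout" result with a short timeout_seconds points at dropped packets or a hung server, which is the pattern that surfaces as context deadline exceeded; "refused" usually means nothing is listening behind the Service.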
Prevention in CI/CD
1. Conftest/OPA — enforce namespaceSelector exclusions and timeoutSeconds caps at apply time:
# policy/webhook_timeout.rego
package kubernetes.admission

deny[msg] {
    input.kind == "MutatingWebhookConfiguration"
    webhook := input.webhooks[_]
    webhook.timeoutSeconds > 15
    msg := sprintf("Webhook '%v' timeoutSeconds exceeds 15s: API server degradation risk", [webhook.name])
}

deny[msg] {
    input.kind == "MutatingWebhookConfiguration"
    webhook := input.webhooks[_]
    not webhook.namespaceSelector
    msg := sprintf("Webhook '%v' missing namespaceSelector: kube-system deadlock risk", [webhook.name])
}
2. Checkov — scan webhook YAML in PRs:
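# Note: CKV_K8S_35 is Checkov's check for Secrets exposed as environment variables; it does not inspect webhooks.
# Run the full Kubernetes framework instead; webhook-specific rules would need a custom policy.
checkov -f mutating-webhook.yaml --framework kubernetes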
3. cert-manager — automate caBundle rotation to prevent TLS failures that surface as timeouts:
# Annotate the MutatingWebhookConfiguration for auto-injection
metadata:
  annotations:
    cert-manager.io/inject-ca-from: webhook-system/webhook-tls
4. Alerting — fire before users hit the error:
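# Prometheus alert
- alert: WebhookHighLatency
  # The metric is a histogram, so alert on a quantile of its buckets.
  # type="admitting" selects mutating webhooks; the name label identifies the webhook.
  expr: histogram_quantile(0.99, sum by (name, le) (rate(apiserver_admission_webhook_admission_duration_seconds_bucket{type="admitting"}[5m]))) > 5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "MutatingWebhook p99 latency > 5s, timeout imminent"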
5. Chaos testing — run kube-monkey or manually kubectl scale deploy webhook-server --replicas=0 in staging and verify failurePolicy behavior matches your intent before it happens in production.