Fixing CoreDNS 'dial tcp 10.96.0.10:53: connection refused' Under High Load in Kubernetes
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: CoreDNS pods are OOM-killed or CPU-throttled under a query surge, causing the ClusterIP 10.96.0.10:53 to refuse connections cluster-wide.
- How to fix it: Scale CoreDNS replicas, tune resource limits, enable DNS caching, and reduce ndots on workload pods to slash upstream query volume.
- Shortcut: Use our Client-Side Sandbox above to drop in your CoreDNS ConfigMap and Deployment YAML; it auto-refactors the config locally without sending your data anywhere.
The Incident (What Does the Error Mean?)
Raw error output from pod logs or application stderr:
dial tcp 10.96.0.10:53: connect: connection refused
EOF
context deadline exceeded
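10.96.0.10 is the default kube-dns ClusterIP on kubeadm-style clusters (the tenth address of the default 10.96.0.0/12 service CIDR). connection refused means nothing is accepting connections behind that address on port 53: the CoreDNS pods have crashed, are CrashLooping, or have failed their readiness probes and been removed from the Service endpoints. Heavy CPU throttling first shows up as timeouts (context deadline exceeded); it turns into refusals once health probes fail and the pods are restarted or dropped from the endpoints.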
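To tell a refusal apart from a timeout quickly, a small TCP probe can help. The sketch below is a minimal illustration, not part of any Kubernetes tooling: `probe_tcp` is a hypothetical helper name, and it assumes you run it from inside a pod so the ClusterIP is reachable.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is accepting connections at that address (no ready endpoint).
        return "refused"
    except socket.timeout:
        # Packets were dropped or the server was too slow to complete the handshake.
        return "timeout"
    except OSError as exc:
        # Anything else (no route, DNS failure for a hostname, and so on).
        return f"error: {exc}"
```

Calling `probe_tcp("10.96.0.10", 53)` returns "refused" when no CoreDNS endpoint is ready and "timeout" when pods exist but cannot keep up, which points you at crashes versus throttling.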
Immediate consequence: Every pod in every namespace that relies on in-cluster DNS (which is everything using service discovery) fails. Database connections drop. gRPC channels collapse. HTTP clients throw no such host. The blast radius is the entire cluster.
The Attack Vector / Blast Radius
This is a cascading resource exhaustion failure, not a single-pod issue.
The failure chain:
- Query volume spikes — typically during a deployment rollout, HPA scale-out, or a noisy-neighbor batch job.
- CoreDNS pods hit CPU limits and get throttled by the CFS scheduler. With a tight limit such as limits.cpu: 100m, a single pod handles roughly 1,500 QPS before throttling.
- Throttled CoreDNS stops responding within the 5s client timeout. Clients retry, multiplying query load.
- The ndots: 5 default in /etc/resolv.conf means any name with fewer than five dots is tried against every search domain before it is tried as-is. A lookup for redis.default.svc.cluster.local (four dots) generates one failing query per search domain and only then the query that succeeds: 4 queries with the three standard cluster search domains, 6 if the node adds two more. Under load this is a 4x to 6x query amplifier.
- CoreDNS pods are OOM-killed when memory for in-flight queries and cached records outgrows a tight limit. A forward plugin with no cache in front of it sends every external lookup straight to the upstream resolvers.
- All pods cluster-wide enter DNS failure state simultaneously.
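The throttling step in the chain can be turned into back-of-the-envelope arithmetic. The sketch below takes the ~1,500 QPS at 100m figure quoted above at face value and assumes CPU cost per query is constant (a simplification: real cost varies with cache hit rate and query type). The function names are illustrative.

```python
def cpu_per_query_us(cpu_limit_millicores: int, qps_at_limit: float) -> float:
    """CPU microseconds spent per query if the pod saturates its limit at qps_at_limit."""
    cpu_seconds_per_second = cpu_limit_millicores / 1000.0
    return cpu_seconds_per_second / qps_at_limit * 1_000_000


def max_qps(cpu_limit_millicores: int, per_query_us: float) -> float:
    """Sustained QPS a pod can serve before throttling, under a linear cost model."""
    return (cpu_limit_millicores / 1000.0) / (per_query_us / 1_000_000)


cost = cpu_per_query_us(100, 1500)   # implied cost from the figures quoted above
print(round(cost, 1))                # 66.7 (microseconds of CPU per query)
print(round(max_qps(500, cost)))     # 7500 (QPS per pod at limits.cpu: 500m)
```

Under these assumptions a pod budgeted at 100m has only 10ms of CPU per 100ms CFS period, so a retry storm that doubles query volume pushes it into throttling almost immediately.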
Secondary blast: Prometheus, Grafana, and your logging pipeline usually run as ordinary pods (hostNetwork: false) and resolve their targets through cluster DNS, so they go dark too, eliminating your observability exactly when you need it.
How to Fix It
Basic Fix — Scale Replicas and Loosen Resource Limits
# coredns Deployment in kube-system
  spec:
-   replicas: 1
+   replicas: 3
    template:
      spec:
        containers:
        - name: coredns
          resources:
            limits:
-             cpu: 100m
-             memory: 70Mi
+             cpu: 500m
+             memory: 256Mi
            requests:
-             cpu: 100m
-             memory: 70Mi
+             cpu: 200m
+             memory: 128Mi
Also add a PodDisruptionBudget immediately:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb
  namespace: kube-system
spec:
  minAvailable: 2
  selector:
    matchLabels:
      k8s-app: kube-dns
Enterprise Best Practice — Cache Tuning + ndots Reduction + HPA
1. CoreDNS ConfigMap — enable aggressive caching and health checks:
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: coredns
    namespace: kube-system
  data:
    Corefile: |
      .:53 {
          errors
          health {
+             lameduck 5s
          }
+         ready
          kubernetes cluster.local in-addr.arpa ip6.arpa {
              pods insecure
              fallthrough in-addr.arpa ip6.arpa
+             ttl 30
          }
          # Keep the metrics endpoint; the latency alert further down depends on it.
          prometheus :9153
+         cache 30 {
+             success 9984 30
+             denial 9984 5
+         }
          forward . /etc/resolv.conf {
+             max_concurrent 1000
+             prefer_udp
          }
-         cache 30
          loop
          reload
          loadbalance
      }
2. Reduce ndots on workload pods to stop the query amplification storm:
# In your application Deployment spec.template.spec
  dnsConfig:
+   options:
+   - name: ndots
+     value: "2"
+   - name: single-request-reopen
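With ndots: 2, redis.default.svc.cluster.local (four dots) is tried as an absolute name first and resolves in 1 query. With the default ndots: 5, the resolver walks every search domain first: 4 queries with the three standard cluster search domains, 6 if the node contributes two more. This is the single highest-impact change for QPS reduction.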
3. Add HPA for CoreDNS (Kubernetes 1.23+):
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: coredns
+  namespace: kube-system
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: coredns
+  # Stay above the PodDisruptionBudget's minAvailable: 2 so node drains are not blocked.
+  minReplicas: 3
+  maxReplicas: 10
+  metrics:
+  - type: Resource
+    resource:
+      name: cpu
+      target:
+        type: Utilization
+        averageUtilization: 70
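To reason about how far this HPA will scale during a surge, the documented scaling rule (desired = ceil(current * currentMetric / targetMetric), skipped when the ratio is within a 10% tolerance by default) can be sketched as below. This is a simplified model that ignores stabilization windows and pod readiness. The function name and the bounds passed in the example are illustrative, so substitute your own minReplicas and maxReplicas.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale by the utilization ratio, then clamp to the bounds."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas            # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 3 pods averaging 140% of their CPU request against a 70% target
print(desired_replicas(3, 140, 70, min_replicas=3, max_replicas=10))   # 6
# 6 pods at 300% would want 26, but maxReplicas caps the answer
print(desired_replicas(6, 300, 70, min_replicas=3, max_replicas=10))   # 10
```

Utilization is measured against requests.cpu, not limits.cpu, so raising the request changes when the HPA reacts.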
4. Spread CoreDNS pods across nodes with topology constraints:
# In the coredns Deployment spec.template.spec
+ topologySpreadConstraints:
+ - maxSkew: 1
+   topologyKey: kubernetes.io/hostname
+   whenUnsatisfiable: DoNotSchedule
+   labelSelector:
+     matchLabels:
+       k8s-app: kube-dns
Prevention in CI/CD
Gate these checks before any cluster change reaches production:
1. OPA/Gatekeeper policy — enforce minimum CoreDNS replicas:
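package coredns

violation[{"msg": msg}] {
    # Only the Deployment carries spec.replicas; skip other objects named "coredns".
    input.review.object.kind == "Deployment"
    input.review.object.metadata.name == "coredns"
    input.review.object.spec.replicas < 2
    msg := "CoreDNS must run at least 2 replicas"
}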
2. Checkov custom check — flag missing dnsConfig.ndots override in Deployments:
# .checkov/custom_checks/ndots_check.yaml
metadata:
  id: CKV_CUSTOM_NDOTS
  name: "Ensure a dnsConfig ndots override is present"
  category: NETWORKING
definition:
  and:
    - cond_type: attribute
      resource_types: [kubernetes_deployment]
      attribute: spec.template.spec.dnsConfig.options
      operator: contains
      value: "ndots"
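The check above only confirms that an ndots option is present. To also enforce the value, a pipeline step could run something like the sketch below against each parsed manifest. It is a minimal illustration, not a Checkov API: `ndots_violations` is a hypothetical helper, and it assumes the Deployment has already been loaded into a plain dict by whatever YAML parser your pipeline uses.

```python
def ndots_violations(deployment: dict, max_ndots: int = 2) -> list[str]:
    """Return human-readable problems with a Deployment's dnsConfig ndots setting."""
    name = deployment.get("metadata", {}).get("name", "<unnamed>")
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    options = (pod_spec.get("dnsConfig") or {}).get("options") or []
    for option in options:
        if option.get("name") != "ndots":
            continue
        try:
            value = int(option.get("value", ""))
        except (TypeError, ValueError):
            return [f"{name}: ndots value {option.get('value')!r} is not an integer"]
        if value > max_ndots:
            return [f"{name}: ndots is {value}, expected <= {max_ndots}"]
        return []                          # override present and small enough
    return [f"{name}: no dnsConfig ndots override, so the cluster default applies"]
```

An empty list means the manifest passes; any returned message can be printed and used to fail the job.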
3. Load test CoreDNS in staging before every rollout:
# Run dnsperf against the kube-dns ClusterIP from inside the cluster.
# Assumes /queries.txt (one "name type" pair per line) exists in the image.
kubectl run dnsperf --image=guessi/dnsperf:latest --rm -it -- \
  dnsperf -s 10.96.0.10 -d /queries.txt -l 30 -c 20 -Q 5000
Set a hard gate: if QPS < 3000 or latency p99 > 50ms, block the deployment pipeline.
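One way to wire that gate into a pipeline is to parse the dnsperf summary and fail the job when a threshold is missed. The sketch below is a rough illustration under two assumptions: the summary contains "Queries per second:" and "Average Latency (s):" lines, and average latency stands in for p99, which the default summary does not report. Check both against the dnsperf version you run. The function name is hypothetical.

```python
import re

QPS_RE = re.compile(r"Queries per second:\s+([0-9.]+)")
LATENCY_RE = re.compile(r"Average Latency \(s\):\s+([0-9.]+)")


def gate(report: str, min_qps: float = 3000.0, max_latency_ms: float = 50.0) -> list[str]:
    """Return the reasons a dnsperf run fails the gate; an empty list means pass."""
    qps = QPS_RE.search(report)
    latency = LATENCY_RE.search(report)
    if qps is None or latency is None:
        return ["could not find QPS or latency in dnsperf output"]
    failures = []
    if float(qps.group(1)) < min_qps:
        failures.append(f"QPS {float(qps.group(1)):.0f} is below {min_qps:.0f}")
    latency_ms = float(latency.group(1)) * 1000.0   # dnsperf reports seconds
    if latency_ms > max_latency_ms:
        failures.append(f"latency {latency_ms:.1f}ms is above {max_latency_ms:.0f}ms")
    return failures
```

In a pipeline, feed the captured dnsperf output to `gate` and exit non-zero when the returned list is not empty.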
4. Alert before the outage — not after:
# Prometheus alert
- alert: CoreDNSHighLatency
  # Aggregate buckets across pods so the quantile reflects cluster-wide latency.
  expr: histogram_quantile(0.99, sum by (le) (rate(coredns_dns_request_duration_seconds_bucket[5m]))) > 0.05
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "CoreDNS p99 latency > 50ms: scale now, before connections are refused"