Fixing CoreDNS NXDOMAIN Errors in Kubernetes: Internal Service Discovery Troubleshooting Guide
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: CoreDNS is returning NXDOMAIN for internal <service>.<namespace>.svc.cluster.local lookups, meaning pods cannot resolve other services and inter-pod communication is dead.
- How to fix it: Check CoreDNS pod logs for loop/forward errors, validate the Corefile for correct cluster.local zone config, confirm the Service and Endpoints objects exist, and verify pod dnsConfig.ndots is set correctly.
- Shortcut: Use our Client-Side Sandbox above to paste your CoreDNS ConfigMap or failing nslookup output and auto-generate the corrected config.
The Incident (What Does the Error Mean?)
You'll see this in pod logs or from a debug pod running nslookup or dig:
$ nslookup my-service.my-namespace.svc.cluster.local
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
nslookup: can't resolve 'my-service.my-namespace.svc.cluster.local'
$ dig my-service.my-namespace.svc.cluster.local @10.96.0.10
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345
Or in CoreDNS pod logs (the exact line format depends on which plugins, such as log or errors, are enabled):
[ERROR] plugin/errors: 2 my-service.my-namespace.svc.cluster.local. A: NXDOMAIN
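Immediate consequence: Any workload using DNS-based service discovery fails; environment-variable-based discovery (the injected *_SERVICE_HOST variables) carries ClusterIPs directly and is not affected. HTTP clients fail with name-resolution errors or timeouts. gRPC channels drop. Readiness probes that depend on downstream services fail. This can cascade into pod restarts, HPA thrash, and potentially a full application outage.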
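To reproduce the failing lookup from application code rather than from nslookup or dig, a minimal Python sketch follows. The helper names service_fqdn and try_resolve are hypothetical, and the default cluster domain cluster.local is an assumption taken from the examples above; run it inside a pod so that it uses the cluster resolver.

```python
import socket


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the fully qualified in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def try_resolve(name: str) -> list[str]:
    """Return the sorted, de-duplicated addresses for name, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # Raised for NXDOMAIN as well as for other resolver failures.
        return []
    return sorted({info[4][0] for info in infos})
```

Calling try_resolve(service_fqdn("my-service", "my-namespace")) from a debug pod returns an empty list while the NXDOMAIN condition persists and the Service's ClusterIP once it is fixed.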
The Attack Vector / Blast Radius
This isn't a single-pod problem. CoreDNS is the cluster-wide DNS resolver. When it returns NXDOMAIN for internal names:
- Every pod in every namespace that calls the affected service fails simultaneously.
- Sidecar proxies (Envoy/Istio) that rely on DNS for cluster-internal routing break their upstream connection pools.
- Init containers waiting on DNS resolution loop indefinitely, blocking pod startup across deployments.
- Database connection pools (Postgres, Redis) that resolve by service name drop all connections and cannot reconnect.
- If the root cause is a CoreDNS loop (forward pointing back to itself via the node's /etc/resolv.conf), CoreDNS either crash-loops once the loop plugin detects it or, without that plugin, spikes to 100% CPU, making the problem cluster-wide and self-reinforcing.
Common root causes ranked by frequency:
- Service or namespace does not exist: typo in service name, wrong namespace, service not yet created.
- Corefile forward plugin loop: node's /etc/resolv.conf points to 127.0.0.x, CoreDNS forwards to itself.
- Wrong ndots value: pod sends short names without search domain expansion, CoreDNS never sees the FQDN.
- Corefile zone misconfiguration: cluster.local zone missing or health/ready plugin blocking.
- CoreDNS pods are down or OOMKilled: DNS queries hit no healthy upstream.
How to Fix It (The Solution)
Step 0: Confirm the Service Exists
kubectl get svc my-service -n my-namespace
kubectl get endpoints my-service -n my-namespace
If the service is missing, that's your answer — no DNS fix will help. Create the Service object first.
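A Service can also exist while its Endpoints object lists no ready addresses, which points to a selector or readiness problem rather than a DNS one. The sketch below, with the hypothetical helper ready_addresses, extracts the ready pod IPs from the JSON printed by kubectl get endpoints my-service -n my-namespace -o json.

```python
import json


def ready_addresses(endpoints_json: str) -> list[str]:
    """Extract ready pod IPs from `kubectl get endpoints <name> -o json` output."""
    obj = json.loads(endpoints_json)
    ips = []
    # "subsets" is absent (or null) when no pod matches the Service selector.
    for subset in obj.get("subsets") or []:
        # Only "addresses" holds ready pods; "notReadyAddresses" is ignored here.
        for addr in subset.get("addresses") or []:
            ips.append(addr["ip"])
    return ips
```

An empty result means the Service has nothing to route to even if the DNS name resolves.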
Step 1: Check CoreDNS Pod Health
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
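Look for: Loop ... detected (from plugin/loop), SERVFAIL, or plugin/errors in the logs. OOMKilled appears in pod status rather than in logs, so also check Last State in kubectl describe pod -n kube-system -l k8s-app=kube-dns.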
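When the log tail is long, the markers above can be counted mechanically. This is a rough sketch: the patterns are assumptions about typical CoreDNS log lines, not an exhaustive list, and classify_log_lines is a hypothetical helper name.

```python
import re
from collections import Counter

# Illustrative patterns only; adjust them to the lines your CoreDNS version emits.
PATTERNS = {
    "loop": re.compile(r"plugin/loop: Loop .* detected"),
    "servfail": re.compile(r"\bSERVFAIL\b"),
    "errors": re.compile(r"plugin/errors"),
}


def classify_log_lines(lines):
    """Count how many log lines match each marker pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Feed it the output of the kubectl logs command above, split into lines; a non-zero loop count points at the forward misconfiguration covered in the next section.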
Basic Fix: Repair the Corefile (Loop + Forward Misconfiguration)
The most common production-breaking Corefile:
# kubectl edit configmap coredns -n kube-system
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
-       forward . /etc/resolv.conf
+       forward . 8.8.8.8 8.8.4.4 {
+           max_concurrent 1000
+       }
        cache 30
        loop
        reload
        loadbalance
    }
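Why: On many nodes, /etc/resolv.conf contains nameserver 127.0.0.53 (systemd-resolved) or 127.0.0.1. CoreDNS inherits the node's resolv.conf, and inside the CoreDNS pod a 127.x.x.x address refers to the pod itself, so forwarding there creates a resolution loop. Replace it with explicit upstream resolvers; substitute your own internal resolvers for 8.8.8.8 and 8.8.4.4 if pods must resolve private domains.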
After editing, CoreDNS picks up the change automatically via the reload plugin once the updated ConfigMap has propagated to the pods, which can take a minute or two. To force it immediately:
kubectl rollout restart deployment/coredns -n kube-system
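To check whether a node is exposed to the loopback condition described above, the following sketch lists loopback nameserver entries in a resolv.conf file. The function name loopback_nameservers is hypothetical; it takes the file contents as text so it can be run against a copy pulled from any node.

```python
import ipaddress


def loopback_nameservers(resolv_conf_text: str) -> list[str]:
    """Return nameserver entries that point at a loopback address."""
    found = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            try:
                addr = ipaddress.ip_address(parts[1])
            except ValueError:
                continue  # not a plain IP literal; skip it
            if addr.is_loopback:
                found.append(parts[1])
    return found
```

A non-empty result for the file that kubelet hands to pods means forward . /etc/resolv.conf will send queries back into the CoreDNS pod.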
Enterprise Best Practice: Fix ndots + Custom DNS Per Namespace
Pods default to ndots:5, so any name with fewer than five dots is first tried against every search-domain suffix before being queried as an absolute name. This adds latency and floods CoreDNS with NXDOMAIN responses that can mask real ones. For production workloads:
# Deployment pod spec
spec:
  template:
    spec:
+     dnsConfig:
+       options:
+         - name: ndots
+           value: "2"
+         - name: single-request-reopen
      containers:
        - name: app
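With ndots:2, names containing two or more dots (for example api.example.com) are queried as-is first, which skips the search-domain round trips for most external lookups and removes the NXDOMAIN noise they generate. A one-dot name like my-service.my-namespace still goes through the search list and resolves via the svc.cluster.local suffix. Names with two or more dots that are not fully qualified, such as my-service.my-namespace.svc, cost one extra query, so prefer full FQDNs for those.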
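The effect of ndots on query order can be illustrated with a simplified model of a glibc-style stub resolver. This is a sketch under that assumption (other resolvers, such as musl, may behave differently), and candidate_queries is a hypothetical helper, not part of any library.

```python
def candidate_queries(name: str, search: list[str], ndots: int) -> list[str]:
    """Ordered absolute names a glibc-style stub resolver would try for name."""
    if name.endswith("."):
        return [name]  # already absolute: no search-list expansion
    expanded = [f"{name}.{domain}." for domain in search]
    as_is = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [as_is] + expanded
    # Too few dots: walk the search list first, the name as given last.
    return expanded + [as_is]
```

With the usual pod search list (my-namespace.svc.cluster.local, svc.cluster.local, cluster.local), api.example.com is queried as-is only after three search-list attempts under ndots:5, but first under ndots:2.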
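For multi-tenant clusters, use NodeLocal DNSCache to take load off CoreDNS and reduce its impact as a single point of failure:
# Deploy NodeLocal DNSCache (reduces CoreDNS load, avoids the conntrack race).
# The manifest ships with __PILLAR__* placeholders; download it and substitute them (see the NodeLocal DNSCache docs) before applying.
curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml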
Prevention in CI/CD
1. Validate CoreDNS Config in Pipeline
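# CoreDNS has no dry-run flag, so start it briefly and confirm it stays up (a Corefile parse error makes it exit immediately).
# The kubernetes plugin generally cannot initialize without API server access, so outside a cluster validate a copy with that block removed.
docker run --rm -d --name corefile-check -v "$(pwd)/Corefile:/Corefile" coredns/coredns:1.11.1 -conf /Corefile -dns.port 1053
sleep 2 && docker kill corefile-check >/dev/null && echo "Corefile valid"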
2. OPA/Gatekeeper Policy: Block forward . /etc/resolv.conf
package coredns
# Written for plain OPA/conftest input. Under Gatekeeper the object lives at input.review.object and the rule is named violation.
deny[msg] {
    input.kind == "ConfigMap"
    input.metadata.name == "coredns"
    corefile := input.data.Corefile
    contains(corefile, "forward . /etc/resolv.conf")
    msg := "CoreDNS Corefile must not forward to /etc/resolv.conf: loop risk on nodes using systemd-resolved."
}
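The exact-substring match in the policy above misses variants such as extra spaces or additional upstreams on the same line. As a complementary pre-commit check, here is a whitespace-tolerant sketch; the function name forwards_to_resolv_conf and the regex are illustrative assumptions, not part of OPA or CoreDNS.

```python
import re

# Matches a "forward" directive whose arguments include /etc/resolv.conf,
# confined to a single line and ignoring commented-out lines.
FORWARD_RESOLV = re.compile(
    r"^[ \t]*forward[ \t]+\S+(?:[ \t]+\S+)*?[ \t]+/etc/resolv\.conf\b",
    re.MULTILINE,
)


def forwards_to_resolv_conf(corefile: str) -> bool:
    """Return True if any forward directive targets /etc/resolv.conf."""
    return bool(FORWARD_RESOLV.search(corefile))
```

Run it over the Corefile text extracted from the ConfigMap and fail the pipeline when it returns True.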
3. Alerting: CoreDNS NXDOMAIN Rate Prometheus Alert
- alert: CoreDNSHighNXDOMAINRate
  expr: |
    sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m])) by (server, zone)
    /
    sum(rate(coredns_dns_responses_total[5m])) by (server, zone)
    > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "CoreDNS NXDOMAIN rate exceeds 5%: service discovery degraded"
4. Checkov / Kubesec Scan in CI
checkov -f coredns-configmap.yaml --check 'CKV_K8S_*'
kubesec scan coredns-deployment.yaml
Catch missing resource limits on CoreDNS pods (OOMKill = DNS outage) before they reach production.