Initializing Enclave...

Fixing CoreDNS NXDOMAIN Errors in Kubernetes: Internal Service Discovery Troubleshooting Guide

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: CoreDNS is returning NXDOMAIN for internal <service>.<namespace>.svc.cluster.local lookups, meaning pods cannot resolve other services — inter-pod communication is dead.
  • How to fix it: Check CoreDNS pod logs for loop/forward errors, validate the Corefile for correct cluster.local zone config, confirm the Service and Endpoints objects exist, and verify pod dnsConfig.ndots is set correctly.
  • Shortcut: Use our Client-Side Sandbox above to paste your CoreDNS ConfigMap or failing nslookup output and auto-generate the corrected config.

The Incident (What Does the Error Mean?)

You'll see this in pod logs or from a debug pod running nslookup or dig:

$ nslookup my-service.my-namespace.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'my-service.my-namespace.svc.cluster.local'

$ dig my-service.my-namespace.svc.cluster.local @10.96.0.10
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345

Or in CoreDNS pod logs:

[ERROR] plugin/errors: 2 my-service.my-namespace.svc.cluster.local. A: NXDOMAIN

Immediate consequence: Any workload using environment-variable-based or DNS-based service discovery fails. HTTP clients get connection refused or timeout. gRPC channels drop. Readiness probes fail. This cascades into pod restarts, HPA thrash, and potentially a full application outage.


The Attack Vector / Blast Radius

This isn't a single-pod problem. CoreDNS is the cluster-wide DNS resolver. When it returns NXDOMAIN for internal names:

  • Every pod in every namespace that calls the affected service fails simultaneously.
  • Sidecar proxies (Envoy/Istio) that rely on DNS for cluster-internal routing break their upstream connection pools.
  • Init containers waiting on DNS resolution loop indefinitely, blocking pod startup across deployments.
  • Database connection pools (Postgres, Redis) that resolve by service name drop all connections and cannot reconnect.
  • If the root cause is a CoreDNS loop (forward pointing back to itself via the node's /etc/resolv.conf), CoreDNS CPU spikes to 100%, making the problem cluster-wide and self-reinforcing.

Common root causes ranked by frequency:

  1. Service or namespace does not exist — typo in service name, wrong namespace, service not yet created.
  2. Corefile forward plugin loop — node's /etc/resolv.conf points to 127.0.0.x, CoreDNS forwards to itself.
  3. Wrong ndots value — pod sends short names without search domain expansion, CoreDNS never sees the FQDN.
  4. Corefile zone misconfigurationcluster.local zone missing or health/ready plugin blocking.
  5. CoreDNS pods are down or OOMKilled — DNS queries hit no healthy upstream.

How to Fix It (The Solution)

Step 0: Confirm the Service Exists

kubectl get svc my-service -n my-namespace
kubectl get endpoints my-service -n my-namespace

If the service is missing, that's your answer — no DNS fix will help. Create the Service object first.


Step 1: Check CoreDNS Pod Health

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

Look for: Loop detected, SERVFAIL, OOMKilled, or plugin/errors.


Basic Fix: Repair the Corefile (Loop + Forward Misconfiguration)

The most common production-breaking Corefile:

# kubectl edit configmap coredns -n kube-system

 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: coredns
   namespace: kube-system
 data:
   Corefile: |
     .:53 {
         errors
         health
         ready
         kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
         }
         prometheus :9153
-        forward . /etc/resolv.conf
+        forward . 8.8.8.8 8.8.4.4 {
+          max_concurrent 1000
+        }
         cache 30
         loop
         reload
         loadbalance
     }

Why: On many nodes, /etc/resolv.conf contains nameserver 127.0.0.53 (systemd-resolved) or 127.0.0.1. CoreDNS forwarding to 127.x.x.x creates a resolution loop. Replace with explicit upstream resolvers.

After editing, CoreDNS reloads automatically via the reload plugin (within ~30s). Force it:

kubectl rollout restart deployment/coredns -n kube-system

Enterprise Best Practice: Fix ndots + Custom DNS Per Namespace

Pods with default ndots:5 will attempt 5 search-domain suffixes before trying the FQDN. This causes latency and can mask real NXDOMAIN issues. For production workloads:

 # Deployment pod spec
 spec:
   template:
     spec:
+      dnsConfig:
+        options:
+          - name: ndots
+            value: "2"
+          - name: single-request-reopen
       containers:
         - name: app

With ndots:2, a name like my-service.my-namespace is treated as FQDN-like and resolved directly, reducing DNS query volume by ~60% and eliminating false NXDOMAIN from search-domain exhaustion.

For multi-tenant clusters, use NodeLocal DNSCache to eliminate CoreDNS as a SPOF:

# Deploy NodeLocal DNSCache (reduces CoreDNS load, eliminates conntrack race)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Validate CoreDNS Config in Pipeline

# Use coredns -conf to dry-run validate Corefile before applying
docker run --rm -v $(pwd)/Corefile:/Corefile coredns/coredns:1.11.1 -conf /Corefile -dns.port 1053 &
sleep 2 && kill %1 && echo "Corefile valid"

2. OPA/Gatekeeper Policy: Block forward . /etc/resolv.conf

package coredns

deny[msg] {
  input.kind == "ConfigMap"
  input.metadata.name == "coredns"
  corefile := input.data.Corefile
  contains(corefile, "forward . /etc/resolv.conf")
  msg := "CoreDNS Corefile must not forward to /etc/resolv.conf — loop risk on nodes using systemd-resolved."
}

3. Alerting: CoreDNS NXDOMAIN Rate Prometheus Alert

- alert: CoreDNSHighNXDOMAINRate
  expr: |
    sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m])) by (server, zone)
    /
    sum(rate(coredns_dns_responses_total[5m])) by (server, zone)
    > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "CoreDNS NXDOMAIN rate exceeds 5% — service discovery degraded"

4. Checkov / Kubesec Scan in CI

checkov -f coredns-configmap.yaml --check CKV_K8S_*
kubesec scan coredns-deployment.yaml

Catch missing resource limits on CoreDNS pods (OOMKill = DNS outage) before they reach production.

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →