Why does CoreDNS return NXDOMAIN for a service that clearly exists in kubectl get svc?

A Service can exist and still fail to resolve for several reasons: the pod is querying from a different namespace with a short name (e.g., `my-service` instead of `my-service.my-namespace.svc.cluster.local`), the pod's `ndots` and search-domain settings never produce the correct FQDN, or the CoreDNS pods themselves are in a crash loop or OOMKilled and not serving the `cluster.local` zone. Also run `kubectl get endpoints my-service -n my-namespace`: if the ENDPOINTS column shows `<none>`, the service has no ready backing pods, which is a separate but related failure.

What is the CoreDNS loop plugin and why does it cause NXDOMAIN?

The `loop` plugin detects when CoreDNS is forwarding queries to an upstream that routes them back to CoreDNS itself, typically when the `/etc/resolv.conf` that CoreDNS inherits from the node contains `127.0.0.53` (systemd-resolved) or `127.0.0.1`. When a loop is detected, CoreDNS intentionally exits to prevent infinite recursion, so its pods end up in CrashLoopBackOff and DNS fails for the entire cluster. One fix is to replace `forward . /etc/resolv.conf` with explicit resolvers, such as `forward . 8.8.8.8 8.8.4.4`, in the Corefile. Use your own internal resolvers instead if pods must resolve private names, because public resolvers cannot answer those.

Fixing CoreDNS NXDOMAIN Errors in Kubernetes: Internal Service Discovery Troubleshooting Guide

How do I debug CoreDNS NXDOMAIN without disrupting production pods?

Spin up a throwaway debug pod with DNS tools: `kubectl run dnsutils --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 --restart=Never -it -- bash`. Then run `nslookup my-service.my-namespace.svc.cluster.local` and `dig my-service.my-namespace.svc.cluster.local @10.96.0.10 +short` (substitute your cluster DNS Service IP, shown by `kubectl get svc kube-dns -n kube-system`). You can also enable query logging temporarily by adding the `log` plugin to the Corefile, which prints every query and response. Remove it after diagnosis, as it generates significant log volume under load.

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: CoreDNS is returning NXDOMAIN for internal <service>.<namespace>.svc.cluster.local lookups, meaning pods cannot resolve other services — inter-pod communication is dead.
How to fix it: Check CoreDNS pod logs for loop/forward errors, validate the Corefile for correct cluster.local zone config, confirm the Service and Endpoints objects exist, and verify pod dnsConfig.ndots is set correctly.

The Incident (What Does the Error Mean?)

You'll see this in pod logs or from a debug pod running nslookup or dig:

$ nslookup my-service.my-namespace.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'my-service.my-namespace.svc.cluster.local'

$ dig my-service.my-namespace.svc.cluster.local @10.96.0.10
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345

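Or in CoreDNS pod logs, when the log plugin is enabled (the errors plugin alone does not record NXDOMAIN responses; the client address, ID, sizes, and timing below are illustrative):

[INFO] 10.244.1.5:43211 - 12345 "A IN my-service.my-namespace.svc.cluster.local. udp 70 false 512" NXDOMAIN qr,aa,rd 163 0.000154s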

Immediate consequence: Any workload that relies on DNS-based service discovery fails (environment-variable-based discovery does not use DNS and keeps working). HTTP clients fail with name-resolution errors or timeouts. gRPC channels cannot re-resolve their targets. Readiness probes that depend on downstream services fail. This cascades into pod restarts, HPA thrash, and potentially a full application outage.

The Attack Vector / Blast Radius

This isn't a single-pod problem. CoreDNS is the cluster-wide DNS resolver. When it returns NXDOMAIN for internal names:

Every pod in every namespace that calls the affected service fails simultaneously.
Sidecar proxies (Envoy/Istio) that rely on DNS for cluster-internal routing break their upstream connection pools.
Init containers waiting on DNS resolution loop indefinitely, blocking pod startup across deployments.
Database connection pools (Postgres, Redis) that resolve by service name drop all connections and cannot reconnect.
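If the root cause is a forwarding loop (forward pointing back to CoreDNS itself via the node's /etc/resolv.conf), CoreDNS either exits and crash-loops (when the loop plugin is enabled) or, without the loop plugin, keeps re-forwarding queries to itself until CPU and memory are exhausted. Either way the problem is cluster-wide and self-reinforcing.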

Common root causes ranked by frequency:

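1. Service or namespace does not exist: typo in the service name, wrong namespace, or the service has not been created yet.
2. Corefile forward plugin loop: the node's /etc/resolv.conf points to 127.0.0.x, so CoreDNS forwards to itself.
3. Wrong ndots or search-domain settings: the resolver never expands the short name into the correct FQDN, so CoreDNS is only asked for names that do not exist.
4. Corefile zone misconfiguration: cluster.local is missing from the kubernetes plugin block, the zone does not match the cluster domain that pods are configured with, or CoreDNS pods fail the ready plugin's readiness check and drop out of the kube-dns Service.
5. CoreDNS pods are down or OOMKilled: queries to the cluster DNS Service IP time out because no healthy CoreDNS pod is behind it.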

How to Fix It (The Solution)

Step 0: Confirm the Service Exists

kubectl get svc my-service -n my-namespace
kubectl get endpoints my-service -n my-namespace

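If the service is missing, that's your answer: no DNS fix will help. Create the Service object first. If the Service exists but ENDPOINTS shows <none>, check the Service selector and pod readiness before touching DNS.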
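
To script this check, one option is a small helper that reads the kubectl table output. This is a sketch under assumptions: endpoints_state and its three labels are hypothetical names, and it expects the default table format with --no-headers.

```shell
# endpoints_state: reads `kubectl get endpoints NAME --no-headers` output on stdin.
# Prints not-found (no rows), no-endpoints (ENDPOINTS is <none>), or has-endpoints.
# The function name and labels are illustrative, not part of kubectl.
endpoints_state() {
  awk '
    NR == 1 { seen = 1; if ($2 == "<none>") print "no-endpoints"; else print "has-endpoints" }
    END     { if (!seen) print "not-found" }
  '
}
```

Typical use would be `kubectl get endpoints my-service -n my-namespace --no-headers 2>/dev/null | endpoints_state`. A missing Service produces no rows on stdout, which is why empty input maps to not-found.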

Step 1: Check CoreDNS Pod Health

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

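Look for: plugin/loop: Loop ... detected, SERVFAIL, i/o timeout, or other plugin/errors lines. OOMKilled does not appear in the logs; check the Last State field in kubectl describe pod output for it.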
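
The log scan can be automated with a filter like the one below. It is a minimal sketch: classify_coredns_logs and the labels it prints are hypothetical, and the patterns only cover the loop plugin's fatal message, upstream i/o timeouts, and SERVFAIL.

```shell
# classify_coredns_logs: reads CoreDNS log lines on stdin, prints one label per finding.
# Labels (forwarding-loop, upstream-timeout, servfail, no-known-pattern) are illustrative.
classify_coredns_logs() {
  awk '
    /plugin\/loop: Loop .* detected/ { loop = 1 }
    /i\/o timeout/                   { timeout = 1 }
    /SERVFAIL/                       { servfail = 1 }
    END {
      if (loop)     print "forwarding-loop"
      if (timeout)  print "upstream-timeout"
      if (servfail) print "servfail"
      if (!loop && !timeout && !servfail) print "no-known-pattern"
    }
  '
}
```

It can be fed directly from the command above: `kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200 | classify_coredns_logs`.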

Basic Fix: Repair the Corefile (Loop + Forward Misconfiguration)

The most common production-breaking Corefile, with the fix shown as a diff:

# kubectl edit configmap coredns -n kube-system

 apiVersion: v1
 kind: ConfigMap
 metadata:
   name: coredns
   namespace: kube-system
 data:
   Corefile: |
     .:53 {
         errors
         health
         ready
         kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
         }
         prometheus :9153
-        forward . /etc/resolv.conf
+        forward . 8.8.8.8 8.8.4.4 {
+          max_concurrent 1000
+        }
         cache 30
         loop
         reload
         loadbalance
     }

Why: On many nodes, /etc/resolv.conf contains nameserver 127.0.0.53 (systemd-resolved) or 127.0.0.1. CoreDNS forwarding to 127.x.x.x creates a resolution loop. Replace with explicit upstream resolvers.
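
To confirm whether a node is affected, a check along these lines can be run against its resolver file. This is a sketch: loopback_nameservers is a hypothetical helper, and it only looks at nameserver lines.

```shell
# loopback_nameservers FILE
# Prints every nameserver in FILE that is a loopback address (127.x.x.x or ::1).
# Exit status is 0 if at least one was found, 1 otherwise.
loopback_nameservers() {
  awk '
    $1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2; found = 1 }
    END { exit !found }
  ' "$1"
}
```

On a node, `loopback_nameservers /etc/resolv.conf && echo "loop risk"` flags the problem. On systemd-resolved hosts the real upstream resolvers are usually listed in /run/systemd/resolve/resolv.conf, which is a useful source for the explicit forward targets.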

After editing, CoreDNS reloads automatically via the reload plugin once the updated ConfigMap has been synced into the pod, which can take a minute or two. To force it immediately:

kubectl rollout restart deployment/coredns -n kube-system

Enterprise Best Practice: Fix ndots + Custom DNS Per Namespace

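Pods default to ndots:5, so any name with fewer than five dots is first tried against every search domain (typically three in-cluster suffixes plus any inherited from the node) before it is tried as written. This adds latency and produces a stream of NXDOMAIN responses that can mask real failures. For production workloads: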

 # Deployment pod spec
 spec:
   template:
     spec:
+      dnsConfig:
+        options:
+          - name: ndots
+            value: "2"
+          - name: single-request-reopen
       containers:
         - name: app

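With ndots:2, a name with at least two dots, such as api.example.com, is tried as written first, so external lookups skip the search-domain attempts. Short in-cluster names such as my-service and my-service.my-namespace (zero and one dot) still go through the search list and resolve as before. How much this reduces DNS query volume depends on the share of external lookups, so treat the ~60% figure as a rough, workload-dependent estimate. One caveat: partially qualified names with two or more dots, such as my-service.my-namespace.svc, are now tried as absolute names first and only succeed on the search-list fallback.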
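
The effect of ndots is easier to see by listing the candidate names in order. The function below is a simplified sketch of the glibc-style rule described in resolv.conf(5), not the resolver's actual code; expand_queries is a hypothetical name and the search domains are passed in explicitly.

```shell
# expand_queries NAME NDOTS SEARCH_DOMAIN...
# Prints, one per line, the names a glibc-style resolver would try, in order.
expand_queries() {
  local name=$1 ndots=$2 only_dots dots domain
  shift 2
  # A trailing dot marks the name as fully qualified: no search expansion.
  case $name in
    *.) printf '%s\n' "$name"; return 0 ;;
  esac
  # Count the dots by deleting every other character.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}
  # Enough dots: the name as written is tried before the search list.
  if [ "$dots" -ge "$ndots" ]; then printf '%s.\n' "$name"; fi
  for domain in "$@"; do printf '%s.%s.\n' "$name" "$domain"; done
  # Too few dots: the name as written is tried last.
  if [ "$dots" -lt "$ndots" ]; then printf '%s.\n' "$name"; fi
}
```

For a pod in the default namespace, `expand_queries api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local` lists three in-cluster candidates before api.example.com itself, while the same call with ndots set to 2 lists api.example.com first.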

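For multi-tenant clusters, use NodeLocal DNSCache to take load off CoreDNS and keep cached answers available on each node (CoreDNS remains the upstream for cache misses):

# Deploy NodeLocal DNSCache (reduces CoreDNS load, avoids the conntrack race).
# The manifest ships with __PILLAR__* placeholders; substitute them as described in the Kubernetes NodeLocal DNSCache docs before applying.
curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml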

Prevention in CI/CD

1. Validate CoreDNS Config in Pipeline

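# CoreDNS has no dry-run flag: start it briefly and treat "still running after 5s" as a valid Corefile.
# Caveat: a Corefile that uses the kubernetes plugin may fail to start outside a cluster, so test with that block stubbed out if needed.
timeout 5 docker run --rm -v "$(pwd)/Corefile:/Corefile" coredns/coredns:1.11.1 -conf /Corefile -dns.port 1053
[ $? -eq 124 ] && echo "Corefile valid" || { echo "Corefile invalid" >&2; exit 1; }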

2. OPA/Gatekeeper Policy: Block `forward . /etc/resolv.conf`

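# Written for plain OPA / Conftest, with the ConfigMap manifest as input.
# Gatekeeper needs this wrapped in a ConstraintTemplate, using violation[{"msg": msg}] and input.review.object.
package coredns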

deny[msg] {
  input.kind == "ConfigMap"
  input.metadata.name == "coredns"
  corefile := input.data.Corefile
  contains(corefile, "forward . /etc/resolv.conf")
  msg := "CoreDNS Corefile must not forward to /etc/resolv.conf — loop risk on nodes using systemd-resolved."
}
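
As a lighter-weight complement to the policy above, the same rule can be checked directly on a Corefile in a pipeline step. This is a sketch: forwards_to_resolv_conf is a hypothetical helper. Unlike a plain substring match, it tolerates extra whitespace and ignores commented-out lines.

```shell
# forwards_to_resolv_conf COREFILE
# Prints "LINE_NUMBER: line" for each forward directive that targets /etc/resolv.conf.
# Exit status is 0 if any offending line was found, 1 otherwise.
forwards_to_resolv_conf() {
  awk '
    $1 == "forward" {
      for (i = 3; i <= NF; i++)
        if ($i == "/etc/resolv.conf") { print NR ": " $0; bad = 1 }
    }
    END { exit !bad }
  ' "$1"
}
```

In CI this could gate a deploy with `if forwards_to_resolv_conf Corefile; then exit 1; fi`.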

3. Alerting: CoreDNS NXDOMAIN Rate Prometheus Alert

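# Search-domain expansion produces NXDOMAIN for healthy lookups too; tune the 0.05 threshold against your cluster's baseline.
- alert: CoreDNSHighNXDOMAINRate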
  expr: |
    sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m])) by (server, zone)
    /
    sum(rate(coredns_dns_responses_total[5m])) by (server, zone)
    > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "CoreDNS NXDOMAIN rate exceeds 5% — service discovery degraded"

4. Checkov / Kubesec Scan in CI

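checkov -f coredns-deployment.yaml --framework kubernetes
kubesec scan coredns-deployment.yaml

Catch missing or undersized resource requests and limits on CoreDNS pods (OOMKill = DNS outage) before they reach production. Scan the Deployment manifest, since that is where resource settings live.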