Why does my StatefulSet show Running pods but the headless Service still has no endpoints?

The Kubernetes Endpoints controller adds a pod to a Service's ready endpoint list only if the pod's labels contain every key-value pair in the Service's spec.selector AND the pod is Ready. Extra labels on the pod are fine, but if even one selector key or value is missing or different on the pod, the pod is invisible to that Service regardless of its Running status. Running is also not the same as Ready: a pod that fails its readiness probe is held back as a not-ready address, and DNS for the headless Service omits it unless publishNotReadyAddresses is set to true. Check with: kubectl get pods --show-labels and compare against kubectl describe svc my-service | grep Selector.
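
The matching rule can be sketched as a small shell function. This is an illustrative helper, not part of kubectl: the name selector_matches is hypothetical, and it assumes both arguments are comma-separated key=value lists in the form that kubectl get pods --show-labels and kubectl describe svc print them.

```shell
# selector_matches SELECTOR LABELS   (hypothetical helper, for illustration)
# Both arguments are comma-separated key=value lists, e.g. "app=my-service,tier=db".
# Succeeds only when every selector pair appears verbatim among the pod labels.
selector_matches() (
  selector="$1"
  labels="$2"
  # A Service without a selector gets no automatically managed endpoints.
  [ -n "$selector" ] || exit 1
  IFS=','   # split the selector on commas (local to this subshell)
  set -f    # disable globbing while the unquoted variable is split
  for pair in $selector; do
    case ",${labels}," in
      *",${pair},"*) ;;   # this key=value pair is present on the pod
      *) exit 1 ;;        # key missing, or value differs
    esac
  done
  exit 0
)
```

For example, selector_matches "app=my-svc" "app=my-service,tier=db" fails, while selector_matches "app=my-service" "app=my-service,tier=db" succeeds because extra pod labels do not matter.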

Does changing the headless Service selector on a live StatefulSet cause downtime?

Updating the Service selector does not disrupt the pods themselves; they keep running. However, there is a brief reconciliation window (typically a few seconds) while the Endpoints object is rebuilt, during which DNS may return stale or empty results, and clients that cache DNS answers can lag further behind. To limit impact, apply the corrected selector during a low-traffic window and monitor the endpoint list with: watch kubectl get endpoints my-service.
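
Rather than watching by eye, the reconciliation window can be polled from a script. A minimal sketch, assuming a fetch command that prints the endpoint IPs and prints nothing when there are none; wait_for_endpoints and fetch_endpoint_ips are hypothetical names, not kubectl features.

```shell
# wait_for_endpoints FETCH_CMD TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Polls FETCH_CMD until it prints a non-empty address list or the timeout expires.
wait_for_endpoints() (
  fetch_cmd="$1"
  timeout="$2"
  interval="${3:-2}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # A failing fetch (for example, the object does not exist yet) counts as empty.
    ips="$("$fetch_cmd" 2>/dev/null || true)"
    if [ -n "$ips" ]; then
      echo "$ips"
      exit 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  exit 1
)

# Assumed fetcher: prints ready endpoint IPs for my-service, or nothing.
fetch_endpoint_ips() {
  kubectl get endpoints my-service -o jsonpath='{.subsets[*].addresses[*].ip}'
}
```

Calling wait_for_endpoints fetch_endpoint_ips 60 right after applying the corrected selector prints the pod IPs as soon as they appear and returns non-zero if the list is still empty after 60 seconds.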

How do I validate that a headless Service selector matches a StatefulSet before applying to a cluster?

Use kubectl apply --dry-run=server to catch schema and admission-level rejections. It will not flag a selector mismatch on its own, because a mismatched selector is still a valid object. Then run Conftest with --combine and a Rego policy that cross-references the Service selector against the StatefulSet pod template labels in the same manifest directory. For Helm charts, use helm template <chart> | conftest test - to pipe rendered manifests through policy checks in CI before any cluster interaction.

How to Fix StatefulSet Headless Service Selector Mismatch Causing Zero Endpoints in Kubernetes

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5–10 mins

TL;DR

What broke: The headless Service's spec.selector does not match the labels in the StatefulSet's spec.template.metadata.labels, so the Endpoints controller registers zero backing pods and cluster DNS has no pod IPs to return.
How to fix it: Align the Service spec.selector labels exactly with the pod template labels in the StatefulSet. Verify with kubectl get endpoints my-service.

The Incident (What Does the Error Mean?)

$ kubectl get endpoints my-service
NAME         ENDPOINTS   AGE
my-service   <none>      8m

$ kubectl describe service my-service
....
Selector:          app=my-svc
Endpoints:         <none>

The Endpoints controller continuously reconciles pods against the Service selector. When zero pods match, the Endpoints object stays empty. Every DNS query for my-service.<namespace>.svc.cluster.local, or for a stable per-pod name such as my-service-0.my-service.<namespace>.svc.cluster.local, then returns NXDOMAIN or no A records. Inter-pod communication, client connections, and any startup or readiness logic that depends on peer DNS resolution all fail.
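
The DNS names involved follow a fixed pattern, sketched here so they can be generated for lookup tests. headless_dns_names is a hypothetical helper; it assumes the default cluster domain cluster.local and the standard StatefulSet pod naming of <statefulset-name>-<ordinal>.

```shell
# headless_dns_names STATEFULSET SERVICE NAMESPACE REPLICAS [CLUSTER_DOMAIN]
headless_dns_names() (
  sts="$1"
  svc="$2"
  ns="$3"
  replicas="$4"
  domain="${5:-cluster.local}"   # assumption: default cluster domain
  # The Service name resolves to the IPs of all ready pods behind it.
  echo "${svc}.${ns}.svc.${domain}"
  # Each pod also gets a stable name: <statefulset>-<ordinal>.<service>.<ns>.svc.<domain>
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${sts}-${i}.${svc}.${ns}.svc.${domain}"
    i=$((i + 1))
  done
)
```

Running headless_dns_names my-service my-service default 3 lists the Service name followed by the three per-pod names; each of these can be passed to nslookup from a pod inside the cluster.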

The Attack Vector / Blast Radius

This is a silent misconfiguration. The StatefulSet deploys successfully — pods reach Running state — and kubectl rollout status reports healthy. Nothing in the control plane throws a hard error. The failure only surfaces when a client or sidecar tries to resolve the headless DNS name.

Cascading failure chain:

1. StatefulSet peer discovery breaks: distributed systems such as Cassandra, Kafka, ZooKeeper, and etcd commonly rely on headless DNS for cluster membership. Zero endpoints can mean split-brain or a failed bootstrap on every replica.
2. Readiness probes that depend on peer pods fail: pods drop to NotReady, which stalls ordered rollouts. If liveness probes share the same dependency, pods also restart in a loop that never converges.
3. PersistentVolumeClaims stay bound to pods that never join the cluster: data nodes sit idle, incurring storage cost with zero throughput.
4. Monitoring gaps: because pods are Running, PagerDuty/Datadog pod-health alerts stay green. The outage is invisible until application-layer errors surface.
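
One way to close the monitoring gap is to alert on the combination of Running pods and an empty endpoint list. A minimal sketch: endpoint_gap is a hypothetical helper that takes the two counts as arguments, and the kubectl commands in the comments are one assumed way to obtain them.

```shell
# endpoint_gap RUNNING_PODS ENDPOINT_ADDRESSES   (hypothetical helper)
# Exits non-zero when pods are running but the Service has no addresses.
endpoint_gap() (
  running="$1"
  addresses="$2"
  if [ "$running" -gt 0 ] && [ "$addresses" -eq 0 ]; then
    echo "ALERT: ${running} Running pod(s) but 0 endpoint addresses"
    exit 1
  fi
  echo "OK: ${running} Running pod(s), ${addresses} endpoint address(es)"
)

# Assumed way to gather the counts (select pods by the StatefulSet's own labels):
#   running=$(kubectl get pods -l app=my-service-node \
#     --field-selector=status.phase=Running -o name | wc -l | tr -d ' ')
#   addresses=$(kubectl get endpoints my-service \
#     -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w | tr -d ' ')
#   endpoint_gap "$running" "$addresses"
```

Wiring the exit status into an existing alerting job surfaces this failure mode even while pod-level health checks stay green.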

How to Fix It

Basic Fix — Align the Selector

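The Service spec.selector must be a subset of the pod template labels. Every key-value pair in the selector must exist verbatim in spec.template.metadata.labels. On a live StatefulSet, prefer changing the Service side: the StatefulSet's spec.selector is immutable after creation, so the StatefulSet edit shown below is rejected in place and only applies if you delete and recreate the StatefulSet.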

# headless-service.yaml
 apiVersion: v1
 kind: Service
 metadata:
   name: my-service
 spec:
   clusterIP: None
   selector:
-    app: my-svc
+    app: my-service
   ports:
     - port: 9042
       targetPort: 9042

# statefulset.yaml
 apiVersion: apps/v1
 kind: StatefulSet
 metadata:
   name: my-service
 spec:
   serviceName: "my-service"
   selector:
     matchLabels:
-      app: my-service-node
+      app: my-service
   template:
     metadata:
       labels:
-        app: my-service-node
+        app: my-service

Verify immediately:

# Endpoints should now list pod IPs
kubectl get endpoints my-service -n <namespace>

# Confirm pod labels match
kubectl get pods -l app=my-service -n <namespace>

# Test headless DNS resolution from within the cluster
kubectl run dns-test --image=busybox:1.36 --restart=Never -it --rm \
  -- nslookup my-service.<namespace>.svc.cluster.local

Enterprise Best Practice — Enforce Label Contracts with a Shared Label Schema

The root cause is label values being defined in two places with no single source of truth. Fix this structurally:

1. Use Helm named templates to stamp labels from one definition:

# _helpers.tpl
+{{- define "mychart.selectorLabels" -}}
+app.kubernetes.io/name: {{ .Chart.Name }}
+app.kubernetes.io/instance: {{ .Release.Name }}
+{{- end }}

# service.yaml
 spec:
   selector:
-    app: my-svc
+    {{- include "mychart.selectorLabels" . | nindent 4 }}

# statefulset.yaml
 spec:
   selector:
     matchLabels:
-      app: my-service-node
+      {{- include "mychart.selectorLabels" . | nindent 6 }}
   template:
     metadata:
       labels:
-        app: my-service-node
+        {{- include "mychart.selectorLabels" . | nindent 8 }}

2. Run kubectl diff -f manifests/ in your CD pipeline to catch selector drift against the live objects before apply.

3. Add a pre-deploy validation step:

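# Server-side dry run validates the manifests without persisting them, so the
# endpoint check reflects what is currently deployed in staging.
# grep -q . fails when the address list is empty.
kubectl apply --dry-run=server -f manifests/ \
  && kubectl get endpoints my-service -n staging \
       -o jsonpath='{.subsets[*].addresses[*].ip}' | grep -q .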


Prevention in CI/CD

1. OPA/Gatekeeper — Enforce Selector Consistency at Admission

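Write a ConstraintTemplate that checks, at admission time, that the headless Service selector is a subset of the StatefulSet pod template labels. Because admission reviews one object at a time, the policy needs the counterpart object's data: configure Gatekeeper to sync Services and StatefulSets into its inventory so the rule can look up the other side. This rejects the misconfiguration before it is persisted in the cluster.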

2. Conftest + Rego Policy in Pull Requests

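# policy/statefulset_service_selector.rego
package kubernetes.statefulset

# Written for `conftest test --combine`, where input is a list of
# {path, contents} entries, one per manifest file.
deny[msg] {
  sts := input[_].contents
  sts.kind == "StatefulSet"
  svc := input[_].contents
  svc.kind == "Service"
  svc.spec.clusterIP == "None"
  svc.metadata.name == sts.spec.serviceName
  value := svc.spec.selector[key]
  # True when the label is missing or carries a different value
  not sts.spec.template.metadata.labels[key] == value
  msg := sprintf("Service selector '%v=%v' not found in StatefulSet pod template labels", [key, value])
}

Run in CI:

conftest test manifests/ --combine --policy policy/ --namespace kubernetes.statefulset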

3. Checkov — Static Analysis

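# Runs Checkov's built-in Kubernetes checks. These are per-resource hygiene checks
# (CKV_K8S_43, for example, covers image digests); matching a Service selector
# against StatefulSet labels needs a custom policy.
checkov -d . --framework kubernetes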

4. Kustomize `commonLabels` — Single Label Definition Across All Resources

# kustomization.yaml
commonLabels:
  app: my-service
  app.kubernetes.io/part-of: my-platform

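Kustomize stamps identical labels onto the Service selector, the StatefulSet selector, and the pod template, so the values cannot drift apart. Two caveats: recent Kustomize releases deprecate commonLabels in favor of the labels field with includeSelectors: true, and because the StatefulSet spec.selector is immutable, selector labels should be introduced when the workload is first created, not on a live StatefulSet.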

5. Post-Deploy Smoke Test in GitOps Pipeline

# ArgoCD post-sync hook or Flux health check
# grep -q . fails when the endpoint address list is empty.
kubectl wait --for=condition=Ready pod -l app=my-service -n production --timeout=120s \
  && kubectl get endpoints my-service -n production \
       -o jsonpath='{.subsets[*].addresses[*].ip}' | grep -q . \
  || { echo "ENDPOINT CHECK FAILED"; exit 1; }

Fail the pipeline if endpoints are empty 120 seconds post-deploy. This catches the regression even if OPA is misconfigured.