How to Fix Etcd Leader Election Loss and Network Partition in Kubernetes
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 20–45 mins
TL;DR
- What broke: An etcd member lost quorum visibility due to a network partition or I/O stall, triggering a leader re-election loop that blocks all Kubernetes API server write operations.
- How to fix it: Restore network connectivity or disk I/O, force-remove the partitioned member, restore quorum, and tune heartbeat/election timeouts to match your network's actual RTT.
The Incident (What Does the Error Mean?)
Raw etcd log output from the partitioned node:
raft2025/01/15 03:12:44 INFO: raft.node: 8e9e05c52164694d lost leader 8e9e05c52164694d at term 42
raft2025/01/15 03:12:44 INFO: raft.node: 8e9e05c52164694d is starting a new election at term 42
raft2025/01/15 03:12:44 WARN: etcdserver: failed to send out heartbeat on time; took too long, leader is overloaded
raft2025/01/15 03:12:44 WARN: etcdserver: server is likely overloaded
{"level":"warn","ts":"2025-01-15T03:12:45.112Z","caller":"etcdserver/server.go:1084","msg":"failed to reach the peer URL","peer-id":"a8266ecf031671f3","peer-url":"https://10.0.1.12:2380"}
{"level":"warn","ts":"2025-01-15T03:12:47.001Z","caller":"raft/raft.go:900","msg":"no leader elected"}
Immediate consequence: The Kubernetes API server returns etcdserver: leader changed or hangs on all mutating requests (kubectl apply, kubectl delete, pod scheduling). The cluster enters a read-only degraded state. No new pods schedule. Existing workloads continue running, but the control plane is effectively dead.
The Attack Vector / Blast Radius
This is a quorum failure, not a security exploit — but the blast radius is equivalent to a full control plane outage.
Cascade chain:
- A network partition or disk I/O spike delays heartbeats past --election-timeout on follower nodes.
- Followers declare the leader dead and start a new election. If the partitioned node can still see some peers but not all, you get a split-vote election loop in which no candidate reaches floor(N/2)+1 votes.
- kube-apiserver loses its etcd watch stream. All LIST/WATCH calls to etcd block or return errors.
- kube-scheduler and kube-controller-manager lose their leader leases and stop functioning.
- CoreDNS, ingress controllers, HPA: anything that writes to the API server queues up or fails.
- If the partition persists for more than 5 minutes, the partitioned member's WAL can diverge from the majority. A naive etcdctl member remove + rejoin on top of the stale data directory, without a wipe or a snapshot restore, can corrupt the cluster.
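The majority arithmetic behind that election loop can be sketched in a few lines of shell. The helper names quorum_size and fault_tolerance are illustrative only (they are not etcdctl commands), and the sketch assumes N counts voting members.

```shell
# Majority needed for a Raft election or commit: floor(N/2) + 1.
# quorum_size and fault_tolerance are hypothetical helper names.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for common cluster sizes.
for n in 1 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(fault_tolerance "$n")"
done
```

A 3-member cluster needs 2 votes and tolerates 1 failure. A 4-member cluster needs 3 votes and still tolerates only 1, which is why odd sizes are the usual choice.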
Compounding factor: On cloud VMs, a noisy-neighbor disk I/O spike on the etcd node's underlying host can stall WAL fsync long enough that the leader misses heartbeats and followers hit the election timeout. This is one of the most common causes of spurious etcd leader elections in AWS/GCP environments. etcd is highly sensitive to disk latency: p99 WAL fsync should stay under 10ms.
How to Fix It (The Solution)
Step 0 — Verify Quorum Status First
# Run from a healthy etcd member or a node with etcdctl access
export ETCDCTL_API=3
etcdctl --endpoints=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
endpoint status --write-out=table
# Check which member is partitioned (pass the same --endpoints and TLS flags as above)
etcdctl member list --write-out=table
If you see IS LEADER: false on all nodes and ERRORS: context deadline exceeded — quorum is lost.
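As a rough cross-check of the quorum verdict, the sketch below counts healthy endpoints and compares the count to the majority. The function name has_quorum is hypothetical, and the sketch assumes that etcdctl endpoint health prints one line per endpoint containing " is healthy" for healthy members; verify that against your etcdctl version.

```shell
# Hypothetical helper: reads `etcdctl endpoint health` output on stdin and
# succeeds only if a majority of the expected members report healthy.
# Assumption: each healthy endpoint produces a line containing " is healthy".
has_quorum() {
  total=$1
  # grep -c exits non-zero when it counts 0 matches, so mask that status.
  healthy=$(grep -c ' is healthy' || true)
  needed=$(( total / 2 + 1 ))
  echo "healthy=$healthy needed=$needed"
  [ "$healthy" -ge "$needed" ]
}
```

One possible invocation for a 3-member cluster, with the endpoint and TLS flags from the command above added: etcdctl endpoint health --cluster 2>&1 | has_quorum 3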
Basic Fix — Restore Quorum by Removing the Partitioned Member
Only do this if you have at least 2 of 3 members healthy (for a 3-node cluster).
# All etcdctl commands below need the same --endpoints and TLS flags as in Step 0.
# 1. From a healthy member, get the member ID of the partitioned node
etcdctl member list
# Output: a8266ecf031671f3, started, etcd-node3, https://10.0.1.12:2380, ...
# 2. From a healthy member, remove it
etcdctl member remove a8266ecf031671f3
# 3. On the recovered/replaced node, stop etcd and wipe the stale data dir
systemctl stop etcd
rm -rf /var/lib/etcd/member
# 4. From a healthy member, re-add the node as a new member
etcdctl member add etcd-node3 --peer-urls=https://10.0.1.12:2380
# 5. On the recovered node, start etcd with --initial-cluster-state=existing
systemctl start etcd
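To avoid copying the member ID by hand between steps 1 and 2, a small filter can pull it out of the member list output. This is a sketch: member_id_by_name is a hypothetical helper, and it assumes the default comma-separated output with the column order shown in the sample line above (ID, status, name, peer URL, ...).

```shell
# Hypothetical helper: reads `etcdctl member list` (default comma-separated
# output) on stdin and prints the ID of the member whose name matches $1.
# Assumes columns are: ID, status, name, peer URLs, ...
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

For example, etcdctl member list | member_id_by_name etcd-node3 would print the ID to pass to etcdctl member remove. If it prints nothing, the name did not match and nothing should be removed.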
Enterprise Best Practice — Fix the Root Cause: Timeout Tuning + Disk Isolation
The defaults, --heartbeat-interval=100 and --election-timeout=1000 (both in milliseconds), can be too aggressive for cloud environments with variable network RTT or shared disk I/O.
# /etc/etcd/etcd.conf or kubeadm etcd extraArgs
- --heartbeat-interval=100
- --election-timeout=1000
+ --heartbeat-interval=250
+ --election-timeout=2500
# Disk I/O isolation — etcd data on dedicated SSD/NVMe
- --data-dir=/var/lib/etcd
+ --data-dir=/mnt/etcd-ssd/etcd
# Increase snapshot frequency to reduce WAL replay time on rejoin
- --snapshot-count=10000
+ --snapshot-count=5000
# Explicit peer URLs — never use 0.0.0.0 for peer communication
- --listen-peer-urls=https://0.0.0.0:2380
+ --listen-peer-urls=https://10.0.1.10:2380
- --initial-advertise-peer-urls=https://0.0.0.0:2380
+ --initial-advertise-peer-urls=https://10.0.1.10:2380
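The timeout values in the diff above keep a 1:10 ratio between heartbeat and election timeout. The sketch below derives a pair from a measured peer RTT, assuming the common rule of thumb that the heartbeat interval should be roughly the peer round-trip time and the election timeout about ten times that. suggest_timeouts is a hypothetical helper, and its output is a starting point to validate under load, not a guaranteed-safe setting.

```shell
# Hypothetical helper: derive etcd timeout flags from a measured peer RTT (ms).
# Assumed rule of thumb: heartbeat ~ peer RTT, never below the 100 ms default;
# election timeout = 10x heartbeat.
suggest_timeouts() {
  rtt_ms=$1
  heartbeat=$rtt_ms
  if [ "$heartbeat" -lt 100 ]; then
    heartbeat=100
  fi
  election=$(( heartbeat * 10 ))
  echo "--heartbeat-interval=$heartbeat --election-timeout=$election"
}
```

With a measured RTT of 250 ms, suggest_timeouts 250 prints --heartbeat-interval=250 --election-timeout=2500. Any RTT below 100 ms falls back to the defaults.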
For kubeadm clusters, patch via kubeadm-config ConfigMap:
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
-     heartbeat-interval: "100"
-     election-timeout: "1000"
+     heartbeat-interval: "250"
+     election-timeout: "2500"
+     snapshot-count: "5000"
Disk latency validation — run this before and after moving etcd to a dedicated disk:
# Benchmark fsync latency — must be <10ms p99
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd \
--size=22m --bs=2300 --name=etcd-fio-test
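Once fio has run, the number to compare against the 10ms budget is the 99th percentile in its fsync/fdatasync section. The sketch below only does the comparison. fsync_within_budget is a hypothetical helper, and it assumes you have already read that percentile from the fio report and converted it to microseconds, since fio may print it in other units.

```shell
# Hypothetical helper: compare a measured p99 fdatasync latency, given in
# microseconds, against the 10 ms budget discussed above.
fsync_within_budget() {
  p99_us=$1
  budget_us=10000   # 10 ms expressed in microseconds
  if [ "$p99_us" -lt "$budget_us" ]; then
    echo "OK: p99 fsync ${p99_us}us is under 10ms"
  else
    echo "FAIL: p99 fsync ${p99_us}us is at or above 10ms"
    return 1
  fi
}
```

Running it before and after the disk move, for example fsync_within_budget 2343 and fsync_within_budget 15000, gives a pass/fail result that can be scripted.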
Prevention in CI/CD
1. Etcd Health Gate in CI
Add an etcd health check to your cluster provisioning pipeline before any control plane operation:
# In your Terraform/Ansible post-apply step
if etcdctl endpoint health --cluster 2>&1 | grep -v "is healthy"; then exit 1; fi
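A single health probe can fail during a brief, self-healing election, so a pipeline gate may want a few attempts before it gives up. The sketch below is one way to do that. retry is a hypothetical helper, and the usage assumes etcdctl endpoint health exits non-zero when any endpoint is unhealthy.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts. Succeeds as soon as one attempt succeeds.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed" >&2
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  return 1
}
```

A possible gate would be: retry 5 10 etcdctl endpoint health --cluster || exit 1. Keep the total wait short enough that a real partition still fails the pipeline quickly.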
2. Prometheus Alerting Rules
Deploy these alerts. If they are not in place before a partition happens, you're flying blind:
# prometheus-etcd-alerts.yaml
groups:
  - name: etcd.critical
    rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Etcd cluster has no leader - control plane writes halted"
      - alert: EtcdHighFsyncDuration
        expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Etcd WAL fsync p99 > 10ms - leader election risk"
      - alert: EtcdMemberCommunicationSlow
        expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Etcd peer RTT p99 > 150ms - election timeout risk"
3. Infrastructure Policy (OPA/Gatekeeper)
Enforce etcd node placement on dedicated instance types with local NVMe:
# opa-etcd-placement.rego
package kubernetes.admission
# The `in` operator requires this import on OPA versions before 1.0
import future.keywords.in
deny[msg] {
  input.request.kind.kind == "Node"
  input.request.object.metadata.labels["node-role.kubernetes.io/etcd"] == "true"
  not input.request.object.metadata.labels["node.kubernetes.io/instance-type"] in {"i3.xlarge", "i3en.xlarge", "n2-highmem-4"}
  msg := "Etcd nodes must run on NVMe-backed instance types to guarantee fsync latency SLA"
}
4. Terraform Validation
# Enforce etcd nodes on dedicated AZs; never co-locate all 3 in one AZ
resource "aws_instance" "etcd" {
  count             = 3
  availability_zone = element(["us-east-1a", "us-east-1b", "us-east-1c"], count.index)
  instance_type     = "i3.xlarge" # Local NVMe mandatory
  lifecycle {
    prevent_destroy = true # Never destroy etcd nodes without snapshot backup
  }
}
Run checkov -f etcd.tf --check CKV_AWS_8 to validate EBS encryption before apply. That check does not cover AZ placement, so review the availability_zone spread separately.