Fixing K3s 'Failed to Find Master Node' in HA Embedded Etcd Clusters
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: One or more K3s server nodes were started with the wrong bootstrap flag (--server, the join flag, where --cluster-init was needed, or the reverse), or pointed at an unreachable or wrong cluster URL. The embedded etcd cluster never forms, or a node spins forever looking for a bootstrap peer.
- How to fix it: The first server node must use --cluster-init. Subsequent nodes must use --server https://<FIRST_NODE_IP>:6443 with a matching K3S_TOKEN. Verify etcd membership with etcdctl member list (k3s etcd-snapshot ls only lists snapshots, not members).
- Shortcut: Use our Client-Side Sandbox above to auto-refactor your K3s service config: paste your unit file or config.yaml and get a corrected output without sending secrets to any server.
The Incident (What does the error mean?)
Raw log output from journalctl -u k3s -f:
FATA[0060] failed to find master node
FATA[0060] starting kubernetes: preparing server: failed to get CA certs: Get "https://10.0.0.11:6443/cacerts": dial tcp 10.0.0.11:6443: connect: connection refused
ERRO[0030] etcd cluster is unavailable
ERRO[0031] failed to reconcile etcd cluster membership
Immediate consequence: The K3s control plane on the affected node never reaches Ready state, and kubectl get nodes returns nothing or "connection refused". Nothing can be scheduled onto this node. In a 3-node HA setup, if 2 of the 3 etcd members are unavailable, etcd has no quorum and the API server stops serving requests. Pods that are already running keep running, but nothing can be rescheduled, scaled, or reconfigured.
The Attack Vector / Blast Radius
This is a split-brain / no-quorum failure, not a security exploit, but the blast radius is severe:
- Etcd quorum loss: Embedded etcd requires floor(n/2)+1 members to agree on every write. A 3-node cluster needs 2 healthy members. If nodes 2 and 3 joined and then dropped out, you are left with 1 of 3 members: writes are rejected and reads may return stale data. If they never joined at all, node 1 keeps running as a single-member cluster with zero fault tolerance, which is not the HA setup you think you have.
- Cascading control plane failure: The K3s API server, scheduler, and controller-manager are co-located with etcd on server nodes. No quorum = no API server = no scheduling, no ConfigMap/Secret reads, no service account token validation.
- Silent workload degradation: Running pods may continue briefly on worker nodes via cached state, but any pod restart, scaling event, or secret rotation will fail. This creates a deceptive partial outage that is harder to diagnose than a total blackout.
- Bootstrap token exposure risk: Operators under pressure often leave K3S_TOKEN in shell history or paste it into Slack to debug, so the outage itself creates a secondary credential-leak vector.
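The quorum arithmetic above can be sketched in a few lines of Python. This is an illustration only; the function names are hypothetical and are not part of K3s or etcd.

```python
def quorum_size(members: int) -> int:
    """Votes needed for a cluster of `members` voting members: floor(n/2) + 1."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority still exists."""
    return members - quorum_size(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True if `healthy` of `members` configured members form a majority."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} members: quorum={quorum_size(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

Note that a 2-member cluster tolerates no failures at all, which is why HA setups use an odd number of server nodes.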
How to Fix It (The Solution)
Root Cause Checklist
Before touching configs, verify:
# On each server node: confirm which flags are active
ps aux | grep '[k]3s'
# List etcd snapshots (run on the node that DID start); this does not show members
k3s etcd-snapshot ls
# Check the etcd member list. K3s does not bundle etcdctl, so install it separately first
etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list
# Verify port 6443 is reachable from node2/node3
curl -k https://<NODE1_IP>:6443/cacerts
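If you want to check the member list from a script rather than by eye, a minimal sketch like the following may help. It assumes etcdctl's default comma-separated output (ID, status, name, peer URLs, client URLs, learner flag); the helper names and the sample IDs and node names are hypothetical.

```python
def parse_member_list(output: str) -> list[dict[str, str]]:
    """Parse the default comma-separated output of `etcdctl member list`."""
    members = []
    for line in output.splitlines():
        # Fields are separated by ", " (comma + space) in the default format.
        fields = line.strip().split(", ")
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        members.append({
            "id": fields[0],
            "status": fields[1],
            "name": fields[2],
            "peer_urls": fields[3],
            "client_urls": fields[4],
        })
    return members


def started_members(members: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only members that etcd reports as started."""
    return [m for m in members if m["status"] == "started"]


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = (
        "1a2b3c4d5e6f7a8b, started, node1-aaaa1111, "
        "https://10.0.0.11:2380, https://10.0.0.11:2379, false\n"
        "2b3c4d5e6f7a8b9c, started, node2-bbbb2222, "
        "https://10.0.0.12:2380, https://10.0.0.12:2379, false\n"
    )
    members = parse_member_list(sample)
    print(f"{len(started_members(members))} of {len(members)} members started")
```

Feed it the captured stdout of the etcdctl command above; fewer started members than server nodes means at least one join did not complete.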
Basic Fix: Correct the Bootstrap Flags
Node 1 (bootstrap node) — must use --cluster-init:
# /etc/systemd/system/k3s.service (Node 1)
[Service]
- ExecStart=/usr/local/bin/k3s server --server https://10.0.0.11:6443
+ ExecStart=/usr/local/bin/k3s server --cluster-init
Environment="K3S_TOKEN=your-shared-secret"
Node 2 and Node 3 — must use --server, pointing at Node 1:
# /etc/systemd/system/k3s.service (Node 2 / Node 3)
[Service]
- ExecStart=/usr/local/bin/k3s server --cluster-init
+ ExecStart=/usr/local/bin/k3s server --server https://10.0.0.11:6443
Environment="K3S_TOKEN=your-shared-secret"
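After editing, reload systemd and restart K3s. If Node 2 or Node 3 previously started with --cluster-init, it has already bootstrapped its own single-member cluster, and that local state under /var/lib/rancher/k3s/server will likely conflict with the join. Back it up and clear it first, and check the K3s documentation for your version before deleting anything: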
systemctl daemon-reload && systemctl restart k3s
journalctl -u k3s -f # watch for 'etcd cluster member added'
Enterprise Best Practice: Use a VIP / Load Balancer + config.yaml
Hardcoding Node 1's IP into --server makes Node 1 a single point of failure (SPOF) for the join process: if Node 1 is down, new servers cannot join through it. Use a virtual IP (keepalived, kube-vip, or an external LB) and declarative config:
# /etc/rancher/k3s/config.yaml (Node 1)
- server: "https://10.0.0.11:6443"
+ cluster-init: true
+ tls-san:
+ - "10.0.0.100" # VIP
+ - "k3s-api.internal.example.com"
+ token-file: "/etc/k3s-token" # file-based token, not env var
# /etc/rancher/k3s/config.yaml (Node 2, Node 3)
- cluster-init: true
+ server: "https://10.0.0.100:6443" # VIP, not Node 1 direct IP
+ tls-san:
+ - "10.0.0.100"
+ - "k3s-api.internal.example.com"
+ token-file: "/etc/k3s-token"
Why --tls-san matters: If the SAN list doesn't include the IP or hostname that joining nodes use to reach the API, TLS verification fails with a certificate error that is easy to mistake for a network failure, a notorious time-waster during incident response.
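One way to catch this before an incident is to compare the host in the join URL with the explicit tls-san list. The sketch below is a hypothetical helper, not a K3s feature. It only looks at the tls-san entries you configured, so it may flag node-local addresses that K3s adds to the certificate on its own.

```python
from urllib.parse import urlsplit


def join_host(server_url: str) -> str:
    """Extract the host part of a join URL such as https://10.0.0.100:6443."""
    host = urlsplit(server_url).hostname
    if not host:
        raise ValueError(f"no host found in server URL: {server_url!r}")
    return host  # urlsplit already lowercases the hostname


def san_covers_join_url(server_url: str, tls_san: list[str]) -> bool:
    """True if the join URL's host appears in the configured tls-san list."""
    return join_host(server_url) in {entry.lower() for entry in tls_san}


if __name__ == "__main__":
    sans = ["10.0.0.100", "k3s-api.internal.example.com"]
    for url in ("https://10.0.0.100:6443", "https://10.0.0.11:6443"):
        verdict = "covered" if san_covers_join_url(url, sans) else "NOT in tls-san"
        print(f"{url}: {verdict}")
```

Running this against the config.yaml values above, the VIP is covered and Node 1's direct IP is not listed explicitly, which is the expected result once joins go through the VIP.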
Prevention in CI/CD
1. Validate K3s Config in IaC Before Apply
If you're provisioning with Terraform or Ansible, add a pre-flight check:
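# In your Ansible playbook pre-task block
- name: Assert exactly one server node has cluster-init enabled
  assert:
    that:
      # select('defined') drops hosts without the variable; the bare select keeps only truthy values
      - groups['k3s_servers'] | map('extract', hostvars, 'k3s_cluster_init') | select('defined') | select | list | length == 1
    fail_msg: "Exactly one server node must have k3s_cluster_init: true"
  run_once: true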
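The same cluster-wide invariant can be sketched outside Ansible, for example in a unit test over rendered config files. The function below is a hypothetical illustration: it assumes each node's config.yaml has already been loaded into a dict keyed by node name, and the node names are made up.

```python
def validate_ha_roles(configs: dict[str, dict]) -> list[str]:
    """Return a list of problems with the bootstrap/join roles, empty if none."""
    errors = []

    # Exactly one server may bootstrap the embedded etcd cluster.
    init_nodes = [name for name, cfg in configs.items()
                  if cfg.get("cluster-init") is True]
    if len(init_nodes) != 1:
        errors.append(
            f"expected exactly one cluster-init node, found {len(init_nodes)}: {init_nodes}"
        )

    # Every server must either bootstrap or join, never both and never neither.
    for name, cfg in configs.items():
        has_init = cfg.get("cluster-init") is True
        has_server = bool(cfg.get("server"))
        if has_init and has_server:
            errors.append(f"{name}: both cluster-init and server are set")
        if not has_init and not has_server:
            errors.append(f"{name}: neither cluster-init nor server is set")
    return errors


if __name__ == "__main__":
    cluster = {
        "node1": {"cluster-init": True},
        "node2": {"server": "https://10.0.0.100:6443"},
        "node3": {"server": "https://10.0.0.100:6443"},
    }
    print(validate_ha_roles(cluster) or "roles look consistent")
```

Unlike a per-file policy check, this looks at all nodes together, so it can catch two nodes that each look valid alone but both claim the bootstrap role.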
2. OPA/Conftest Policy for K3s config.yaml
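# policy/k3s_ha.rego
package k3s

# Enables the contains/if rule syntax; needs OPA 0.59 or newer
import rego.v1

# Keys containing hyphens need bracket notation in Rego
deny contains msg if {
    input["cluster-init"] == true
    input.server != ""
    msg := "A node cannot have both cluster-init and server set simultaneously."
}

deny contains msg if {
    not input["cluster-init"]
    not input.server
    msg := "Server node must have either cluster-init or server defined."
}

# object.get supplies a default so a missing tls-san key is denied too
deny contains msg if {
    count(object.get(input, "tls-san", [])) == 0
    msg := "tls-san must include at least the VIP or load balancer hostname."
}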
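Run in CI (Conftest evaluates only the main package by default, so name the k3s namespace explicitly):
conftest test /etc/rancher/k3s/config.yaml --policy policy/ --namespace k3s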
3. Smoke Test After Node Provisioning
#!/bin/bash
# Post-provision smoke test: fail fast before declaring the cluster healthy
set -e
# Match the STATUS column exactly; a plain grep -v Ready would also hide NotReady
NOT_READY=$(kubectl get nodes --no-headers | awk '$2 != "Ready"')
if [ -n "$NOT_READY" ]; then echo "FAIL: Not all nodes Ready"; exit 1; fi
# Assumes etcdctl is installed and configured (endpoints and certs) on this host
ETCD_MEMBERS=$(etcdctl member list 2>/dev/null | wc -l)
if [ "$ETCD_MEMBERS" -lt 3 ]; then echo "FAIL: expected 3 etcd members, found $ETCD_MEMBERS"; exit 1; fi
echo "PASS: Cluster healthy with $ETCD_MEMBERS etcd members"
Wire this into your Terraform null_resource provisioner or Ansible post_tasks so a broken HA join fails the pipeline — not production.