Fixing K3s 'Failed to Find Master Node' in HA Embedded Etcd Clusters

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

  • What broke: One or more K3s server nodes were started with the wrong bootstrap flag (--server where --cluster-init was needed, or the reverse), or were pointed at an unreachable or incorrect cluster URL. The embedded etcd cluster never forms, or a node loops forever looking for a bootstrap peer.
  • How to fix it: Exactly one server node (the first) must start with --cluster-init. Every subsequent server must use --server https://<FIRST_NODE_IP>:6443 with the same K3S_TOKEN. Confirm membership with etcdctl member list and kubectl get nodes. Note that k3s etcd-snapshot ls only lists snapshots and says nothing about peer health.

The Incident (What does the error mean?)

Raw log output from journalctl -u k3s -f:

FATA[0060] failed to find master node
FATA[0060] starting kubernetes: preparing server: failed to get CA certs: Get "https://10.0.0.11:6443/cacerts": dial tcp 10.0.0.11:6443: connect: connection refused
ERRO[0030] etcd cluster is unavailable
ERRO[0031] failed to reconcile etcd cluster membership

Immediate consequence: The K3s control plane on the affected node never reaches Ready state. kubectl get nodes returns nothing or "connection refused", and no workloads can be scheduled onto the node. In a 3-node HA setup, if 2 of the 3 etcd members are unavailable, etcd has no quorum: the whole control plane stops serving requests, and existing workloads lose API server access.


The Attack Vector / Blast Radius

This is a no-quorum failure, not a security exploit (etcd's Raft consensus rules out a true split-brain), but the blast radius is severe:

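  1. Etcd quorum loss: Embedded etcd requires floor(n/2)+1 healthy members to accept writes, so a 3-member cluster needs 2. If node 2 and node 3 are registered as members but cannot reach the cluster, only 1 of 3 members is healthy: writes are rejected, and only serializable (non-quorum) reads can still be served, possibly from stale data. A node that never joined at all is not counted as a member, so check the member list before assuming quorum is lost.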

  2. Cascading control plane failure: The K3s API server, scheduler, and controller-manager are co-located with etcd on server nodes. No quorum = no API server = no scheduling, no ConfigMap/Secret reads, no service account token validation.

  3. Silent workload degradation: Running pods may continue briefly on worker nodes via cached state, but any pod restart, scaling event, or secret rotation will fail. This creates a deceptive partial-outage that's harder to diagnose than a total blackout.

  4. Bootstrap token exposure risk: Operators under pressure often hardcode K3S_TOKEN in shell history or paste it into Slack to debug — the outage itself creates a secondary credential-leak vector.
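
The quorum arithmetic from item 1 can be sketched in a few lines of shell. This is only an illustration of the floor(n/2)+1 rule; the function names quorum and fault_tolerance are made up for this example and are not part of K3s or etcd.

```shell
#!/usr/bin/env bash
# Quorum arithmetic for an etcd cluster with n voting members.
#   quorum(n)          = floor(n/2) + 1
#   fault_tolerance(n) = n - quorum(n)

quorum() {
  local n=$1
  echo $(( n / 2 + 1 ))          # shell integer division already floors
}

fault_tolerance() {
  local n=$1
  echo $(( n - (n / 2 + 1) ))    # members that can fail while writes still succeed
}

for n in 1 2 3 4 5; do
  printf 'members=%d quorum=%d tolerates=%d failure(s)\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

The loop shows that 4 members tolerate no more failures than 3, which is why odd cluster sizes are the usual recommendation.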


How to Fix It (The Solution)

Root Cause Checklist

Before touching configs, verify:

# On each server node — confirm which flags are active
ps aux | grep k3s

# Check the etcd member list (run on the node that DID start).
# etcdctl is not bundled with K3s; install a matching release separately.
# Note: k3s etcd-snapshot ls only lists snapshots, it does not show members.
etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/client.key \
  member list -w table

# Verify port 6443 is reachable from node2/node3
curl -k https://<NODE1_IP>:6443/cacerts
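
When the reachability check fails, curl's exit status narrows down why. The sketch below maps the common exit codes to a likely cause. The helper names explain_curl_exit and probe_join_endpoint are hypothetical, and the cause descriptions are rules of thumb, not a definitive diagnosis.

```shell
#!/usr/bin/env bash
# Map a curl exit status to a likely cause when probing the K3s join endpoint.
explain_curl_exit() {
  case "$1" in
    0)     echo "reachable: something answered HTTPS on the port" ;;
    6)     echo "dns: the hostname did not resolve" ;;
    7)     echo "refused: nothing is listening, or a firewall rejected the connection" ;;
    28)    echo "timeout: packets are probably being dropped on the way" ;;
    35|60) echo "tls: handshake or certificate verification failed" ;;  # 60 only occurs without -k
    *)     echo "other: curl exit $1, see EXIT CODES in the curl man page" ;;
  esac
}

# probe_join_endpoint <host>: hypothetical wrapper around the check shown above.
probe_join_endpoint() {
  local host=$1 rc=0
  curl -sk --max-time 5 -o /dev/null "https://${host}:6443/cacerts" || rc=$?
  explain_curl_exit "$rc"
}
```

From node 2 or node 3, this would be called as probe_join_endpoint 10.0.0.11, using the Node 1 address from the log output above.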

Basic Fix: Correct the Bootstrap Flags

Node 1 (bootstrap node) — must use --cluster-init:

# /etc/systemd/system/k3s.service  (Node 1)
[Service]
- ExecStart=/usr/local/bin/k3s server --server https://10.0.0.11:6443
+ ExecStart=/usr/local/bin/k3s server --cluster-init
  Environment="K3S_TOKEN=your-shared-secret"

Node 2 and Node 3 — must use --server, pointing at Node 1:

# /etc/systemd/system/k3s.service  (Node 2 / Node 3)
[Service]
- ExecStart=/usr/local/bin/k3s server --cluster-init
+ ExecStart=/usr/local/bin/k3s server --server https://10.0.0.11:6443
  Environment="K3S_TOKEN=your-shared-secret"

After editing:

systemctl daemon-reload && systemctl restart k3s
journalctl -u k3s -f  # watch for 'etcd cluster member added'
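
To check the flags across several nodes, the rule above (one bootstrap node, all others joining) can be expressed as a small classifier. This is a sketch: has_flag and classify_k3s_flags are hypothetical names, and the function expects a single-line command string such as the ExecStart values shown above.

```shell
#!/usr/bin/env bash
# has_flag <flag> <command line>: succeeds if <flag> appears as a whole word,
# either on its own or in --flag=value form.
has_flag() {
  printf '%s\n' "$2" | grep -Eq -- "(^|[[:space:]])$1([[:space:]=]|\$)"
}

# classify_k3s_flags <command line> prints one of:
#   bootstrap  (--cluster-init only)
#   join       (--server only)
#   conflict   (both flags set)
#   neither    (no HA bootstrap flag at all)
classify_k3s_flags() {
  local init=no join=no
  if has_flag --cluster-init "$1"; then init=yes; fi
  if has_flag --server "$1"; then join=yes; fi
  case "$init/$join" in
    yes/no)  echo bootstrap ;;
    no/yes)  echo join ;;
    yes/yes) echo conflict ;;
    *)       echo neither ;;
  esac
}

classify_k3s_flags "/usr/local/bin/k3s server --cluster-init"
classify_k3s_flags "/usr/local/bin/k3s server --server https://10.0.0.11:6443"
```

A healthy three-server layout yields exactly one bootstrap and two join results. Any other combination is worth investigating before restarting services.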

Enterprise Best Practice: Use a VIP / Load Balancer + config.yaml

Hardcoding Node 1's IP into --server means Node 1 is a SPOF for the join process. Use a virtual IP (keepalived, kube-vip, or an external LB) and declarative config:

# /etc/rancher/k3s/config.yaml  (Node 1)
- server: "https://10.0.0.11:6443"
+ cluster-init: true
+ tls-san:
+   - "10.0.0.100"   # VIP
+   - "k3s-api.internal.example.com"
+ token-file: "/etc/k3s-token"   # file-based token, not env var

# /etc/rancher/k3s/config.yaml  (Node 2, Node 3)
- cluster-init: true
+ server: "https://10.0.0.100:6443"   # VIP, not Node 1 direct IP
+ tls-san:
+   - "10.0.0.100"
+   - "k3s-api.internal.example.com"
+ token-file: "/etc/k3s-token"

Why --tls-san matters: If the SAN list doesn't include the IP/hostname that joining nodes use to reach the API, TLS verification fails with a certificate error that looks identical to a network failure — a notorious time-waster during incident response.
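
One way to rule out a SAN mismatch is to inspect the certificate actually served on port 6443. The sketch below assumes OpenSSL 1.1.1 or newer (for the x509 -ext option). The helper names fetch_sans and san_lists are hypothetical.

```shell
#!/usr/bin/env bash
# fetch_sans <host>: print the subjectAltName extension of the certificate
# served on <host>:6443. Requires OpenSSL 1.1.1+ for "x509 -ext".
fetch_sans() {
  openssl s_client -connect "$1:6443" </dev/null 2>/dev/null \
    | openssl x509 -noout -ext subjectAltName
}

# san_lists <name>: read openssl's SAN text on stdin and succeed only if
# <name> appears as an exact DNS or IP entry.
san_lists() {
  local want=$1
  tr ',' '\n' \
    | sed -E -e 's/^[[:space:]]*(DNS|IP Address)://' -e 's/[[:space:]]*$//' \
    | grep -Fxq -- "$want"
}
```

For example, fetch_sans 10.0.0.11 | san_lists 10.0.0.100 && echo "VIP is covered" checks Node 1's certificate for the VIP used in the config above. The match is exact, so 10.0.0.10 does not match an entry for 10.0.0.100.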


Prevention in CI/CD

1. Validate K3s Config in IaC Before Apply

If you're provisioning with Terraform or Ansible, add a pre-flight check:

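# In your Ansible playbook pre_tasks block
- name: Assert exactly one server node has k3s_cluster_init set to true
  ansible.builtin.assert:
    that:
      # Count hosts where the variable is defined AND truthy, not merely defined
      - groups['k3s_servers'] | map('extract', hostvars) | selectattr('k3s_cluster_init', 'defined') | selectattr('k3s_cluster_init') | list | length == 1
    fail_msg: "Exactly one server node must have k3s_cluster_init: true"
  run_once: true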

2. OPA/Conftest Policy for K3s config.yaml

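# policy/k3s_ha.rego
# Pre-1.0 Rego syntax; OPA 1.0+ expects "deny contains msg if { ... }".
package k3s

deny[msg] {
  # Keys containing a hyphen need bracket notation in Rego
  input["cluster-init"] == true
  input.server != ""
  msg := "A node cannot have both cluster-init and server set simultaneously."
}

deny[msg] {
  not input["cluster-init"]
  not input.server
  msg := "Server node must have either cluster-init or server defined."
}

deny[msg] {
  # object.get supplies an empty list when tls-san is absent, so a missing key is denied too
  count(object.get(input, "tls-san", [])) == 0
  msg := "tls-san must include at least the VIP or load balancer hostname."
}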

Run in CI:

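# Conftest evaluates the "main" package by default, so name the k3s package explicitly
conftest test /etc/rancher/k3s/config.yaml --policy policy/ --namespace k3s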

3. Smoke Test After Node Provisioning

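#!/bin/bash
# post-provision smoke test: fail fast before declaring the cluster healthy (run on a server node)
set -euo pipefail
TLS=/var/lib/rancher/k3s/server/tls/etcd
# Match the STATUS column itself: "NotReady" contains "Ready", so grep -v Ready would miss it
NOT_READY=$(kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ {print $1}')
if [ -n "$NOT_READY" ]; then echo "FAIL: nodes not Ready: $NOT_READY"; exit 1; fi
# etcdctl needs the K3s etcd endpoint and client certs, otherwise it reports 0 members
ETCD_MEMBERS=$(etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert="$TLS/server-ca.crt" --cert="$TLS/client.crt" --key="$TLS/client.key" \
  member list | wc -l)
if [ "$ETCD_MEMBERS" -lt 3 ]; then echo "FAIL: expected 3 etcd members, found $ETCD_MEMBERS"; exit 1; fi
echo "PASS: Cluster healthy with $ETCD_MEMBERS etcd members"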

Wire this into your Terraform null_resource provisioner or Ansible post_tasks so a broken HA join fails the pipeline — not production.
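
Freshly provisioned nodes can take a while to report Ready, so a pipeline step may want to retry the smoke test a few times before failing. The wrapper below is a minimal sketch under that assumption. The function name retry and the script name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are exhausted.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

In a provisioner step this could be invoked as retry 30 10 ./smoke-test.sh, which allows roughly five minutes for the cluster to settle before the pipeline fails.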
