Why does Docker Swarm overlay split-brain happen after a node reboot?

When a node hard-reboots, the kernel VXLAN interfaces and their forwarding databases (FDB) are destroyed. If the node rejoins the swarm without a clean `docker swarm leave`, the manager still holds stale peer state. The gossip protocol (Serf/memberlist) may eventually reconcile membership, but it does NOT rebuild VXLAN FDB entries automatically — that requires the overlay network driver to re-establish VTEPs, which only happens reliably on a clean network removal and recreation.

How do I check if my overlay MTU mismatch is causing the split-brain?

Run `ping -M do -s 1422 <target_ip>` from inside a container (1422 bytes of payload + 28 bytes of IP and ICMP headers = 1450, the default overlay MTU). If this fails but `ping -s 1000 <target_ip>` succeeds, you have an MTU mismatch. The fix is to set `com.docker.network.driver.mtu` to at least 50 bytes below your underlay interface MTU. For a standard 1500-byte underlay (for example, AWS without jumbo frames), use 1450. Over VPN tunnels, use 1400 or lower.
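The arithmetic behind those numbers can be wrapped in two small helpers. This is a minimal sketch: the function names `df_ping_size` and `overlay_mtu_for` are hypothetical, and the constants assume an IPv4 underlay (20-byte IP header + 8-byte ICMP header for the ping, 50 bytes of VXLAN encapsulation overhead).

```shell
#!/bin/sh
# df_ping_size (hypothetical helper): largest ICMP payload that fits in one
# unfragmented IPv4 packet for the given MTU (20-byte IP + 8-byte ICMP header).
df_ping_size() {
  echo $(( $1 - 28 ))
}

# overlay_mtu_for (hypothetical helper): overlay MTU that leaves 50 bytes of
# headroom for VXLAN encapsulation on the given underlay MTU.
overlay_mtu_for() {
  echo $(( $1 - 50 ))
}
```

For example, `ping -M do -s "$(df_ping_size 1450)" <target_ip>` reproduces the 1422-byte probe for a 1450-byte overlay.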

Can I fix the overlay split-brain without downtime by restarting the Docker daemon?

Rarely, and it's unreliable. Restarting dockerd on the affected worker will cause it to re-register with the manager and attempt to rebuild overlay state, but stale VXLAN interfaces in the kernel may persist. The only guaranteed zero-data-loss approach (with downtime) is: scale services to zero, delete the overlay network, recreate it, and redeploy. If you need zero-downtime, pre-provision a second overlay network, migrate services to it via rolling update, then delete the broken one.
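The zero-downtime path can be sketched as a dry-run helper that only prints the commands, so the plan can be reviewed before anything is executed. The function name `print_migration_plan` and the network and service names in the usage line are hypothetical; the printed commands use the standard `--network-add` and `--network-rm` flags of `docker service update`.

```shell
#!/bin/sh
# print_migration_plan (hypothetical helper): prints, without executing, the
# commands that move services from a broken overlay to a fresh one.
# Usage: print_migration_plan <old_net> <new_net> <service>...
print_migration_plan() {
  old_net=$1
  new_net=$2
  shift 2
  echo "docker network create --driver overlay $new_net"
  for svc in "$@"; do
    # A single update swaps the attachment, so tasks are replaced according
    # to the service's rolling-update policy.
    echo "docker service update --network-add $new_net --network-rm $old_net $svc"
  done
  echo "docker network rm $old_net"
}
```

Calling `print_migration_plan my_overlay my_overlay_v2 api worker` prints four commands: one create, one update per service, and the final removal. Add `--subnet` and MTU options to the create line as needed for your environment.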

Fixing Docker Swarm Overlay Network Split Brain: Containers Can't Ping Each Other

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 20–45 mins

TL;DR

What broke: The Swarm overlay network's VXLAN/gossip state diverged between nodes — containers on different hosts are isolated, service discovery is dead, and inter-service traffic is silently dropped.
How to fix it: Force-remove the stale overlay network, purge dangling VXLAN interfaces, re-advertise node state, and recreate the network with explicit subnet pinning.

The Incident (What Does the Error Mean?)

You're staring at this on a worker node:

$ docker exec -it service_replica_1 ping 10.0.1.5
ping: connect: Network is unreachable

$ docker network inspect my_overlay
[
  {
    "Name": "my_overlay",
    "Id": "q7k2p...",
    "Peers": [
      { "Name": "node1", "IP": "192.168.1.10" }
    ]
    // node2, node3 MISSING from Peers list
  }
]

The Peers list is incomplete. Node2 and Node3 aren't visible to Node1's overlay driver. VXLAN tunnel endpoints (VTEPs) were never established or have gone stale. The docker_gwbridge is up, the physical network is fine — but the overlay's control plane has diverged. Containers on different hosts are in separate, invisible L2 segments.

Immediate consequence: All cross-node service mesh traffic — internal DNS resolution via embedded DNS (127.0.0.11), service VIP routing, and direct container IPs — is dead for affected replicas.

The Attack Vector / Blast Radius

This isn't a graceful degradation. The failure mode is silent packet drop, which is the worst kind:

Services don't crash — they hang on connection timeouts. Your health checks may still pass if they're local-only.
Cascading timeouts: Any service doing synchronous inter-service calls (gRPC, HTTP) will exhaust its thread pool waiting on dead connections. This cascades upstream.
Split-brain amplification: If this happened during a rolling update or node drain, you now have two isolated clusters of replicas both believing they are the authoritative instance — potential for dual-write corruption in stateful services.
Root causes that trigger this:
1. Node rejoined the swarm after a hard reboot without a clean docker swarm leave
2. Network partition between managers caused gossip (Serf/memberlist) to mark nodes as failed, then the partition healed but VXLAN state was never reconciled
3. Kernel VXLAN module was reloaded or ip_vs state was flushed (common after kernel upgrades)
4. MTU mismatch between the overlay (default 1450) and an underlay whose MTU is below 1500 (e.g., a VPN tunnel or a cloud network with a reduced MTU), which leaves less than the 50 bytes of headroom that VXLAN encapsulation needs

How to Fix It

Step 0: Confirm the Split-Brain

# Run on EACH node — compare Peers lists
docker network inspect <overlay_name> --format '{{json .Peers}}' | python3 -m json.tool

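# Check VXLAN interfaces exist
# (Docker normally keeps the device inside the overlay's network namespace under
# /var/run/docker/netns; if the host shows none, rerun these commands inside it
# with nsenter --net=<netns_path>)
ip -d link show type vxlan

# Check if the FDB (forwarding database) has entries for other nodes
bridge fdb show dev vx-<network_short_id>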

If bridge fdb shows no entries for remote node MACs, the VTEP has no path. This confirms split-brain.
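Comparing Peers lists by eye gets error-prone beyond a few nodes. Here is a minimal sketch of a helper that reports which expected node IPs are absent from a Peers list. The name `missing_peers` is hypothetical, and it assumes the compact JSON that `--format '{{json .Peers}}'` emits (no whitespace around the colon). It matches on IP rather than name because peer names may not equal the plain hostname.

```shell
#!/bin/sh
# missing_peers (hypothetical helper): reads the compact JSON produced by
#   docker network inspect <overlay_name> --format '{{json .Peers}}'
# on stdin and prints each expected node IP that is absent from it.
# Usage: ... | missing_peers <ip> [<ip>...]
missing_peers() {
  peers_json=$(cat)
  for ip in "$@"; do
    # Fixed-string match on the full "IP" field, so 192.168.1.1 does not
    # accidentally match 192.168.1.10.
    if ! printf '%s' "$peers_json" | grep -Fq "\"IP\":\"$ip\""; then
      echo "$ip"
    fi
  done
}
```

Run it on each node with the full list of node IPs you expect. Any output means that node's overlay driver cannot see the listed peers.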

Basic Fix: Force Network Reconciliation

- # Leaving the stale overlay attached and hoping swarm self-heals
- # (it will not — gossip convergence does not rebuild VXLAN FDB automatically)

+ # Step 1: Scale down all services using the broken network
+ docker service scale my_service=0
+
+ # Step 2: On ALL worker nodes, remove dangling VXLAN interfaces
+ # Find the vxlan interface name first:
+ ip link show type vxlan
+ # Then delete it:
+ sudo ip link delete vx-<short_network_id>
+
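+ # Step 3: Remove the overlay network from the manager
+ # If this fails with "network is in use by service", detach the service first:
+ #   docker service update --network-rm my_overlay my_service
+ docker network rm my_overlay
+
+ # Step 4: Recreate with explicit, pinned subnet
+ docker network create \
+   --driver overlay \
+   --subnet 10.0.9.0/24 \
+   --opt com.docker.network.driver.mtu=1450 \
+   my_overlay
+
+ # Step 5: Redeploy (if you detached the service in Step 3, re-attach it
+ # with docker service update --network-add my_overlay my_service)
+ docker service scale my_service=3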
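After redeploying, repeat the FDB check from Step 0 to confirm that remote VTEPs are back. The sketch below extracts the distinct remote addresses from `bridge fdb show` output. The function name `fdb_remote_vteps` is hypothetical, and it assumes the usual output shape in which the VTEP address follows the `dst` keyword.

```shell
#!/bin/sh
# fdb_remote_vteps (hypothetical helper): reads `bridge fdb show dev <vxlan_if>`
# output on stdin and prints each distinct remote VTEP address (the token
# following "dst"), one per line, sorted.
fdb_remote_vteps() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dst") print $(i + 1) }' | sort -u
}
```

Pipe the Step 0 command into it, for example `bridge fdb show dev vx-<network_short_id> | fdb_remote_vteps`. Every other node running a task on the network should be listed.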

Enterprise Best Practice: Prevent Gossip Desync at the Node Level

- # Default swarm init — no advertise-addr, no data-path-addr separation
- docker swarm init

+ # Explicitly separate control plane and data plane traffic
+ # Control plane (gossip, raft) on management NIC
+ # Data plane (VXLAN encap) on high-throughput NIC
+ docker swarm init \
+   --advertise-addr eth0:2377 \
+   --data-path-addr eth1 \
+   --data-path-port 4789
+
+ # On worker join:
+ docker swarm join \
+   --advertise-addr eth0 \
+   --data-path-addr eth1 \
+   --token <worker-token> \
+   <manager-ip>:2377

- # No MTU configuration in compose: the overlay falls back to its default (1450), which is too large when the underlay MTU is below 1500
- networks:
-   my_overlay:
-     driver: overlay

+ # Pin MTU explicitly — 1450 for standard VXLAN, 1400 if running over IPSec/WireGuard
+ networks:
+   my_overlay:
+     driver: overlay
+     driver_opts:
+       com.docker.network.driver.mtu: "1450"
+     ipam:
+       config:
+         - subnet: 10.0.9.0/24

Nuclear Option (Full Node Rejoin)

If the above doesn't resolve it — the node's swarm state is corrupted:

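# On the affected worker:
docker swarm leave --force
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/swarm
sudo systemctl start docker

# On a manager: remove the stale node entry left behind by the forced leave
docker node rm <node-id>

# Back on the worker:
docker swarm join --token <worker-token> <manager>:2377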

Prevention in CI/CD

1. MTU Validation in Pre-Deploy Checks

# Add to your deploy pipeline before any docker stack deploy
NODE_MTU=$(ip link show eth0 | grep -oP 'mtu \K[0-9]+')
if [ "$NODE_MTU" -lt 1500 ]; then
  echo "ERROR: Underlay MTU $NODE_MTU too low for default overlay. Set overlay MTU to $((NODE_MTU - 50))"
  exit 1
fi
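The check above fails for any underlay below 1500, even when the overlay MTU has already been lowered to match. A minimal sketch of a stricter comparison follows. The function name `check_overlay_mtu` is hypothetical, and the 50-byte headroom assumes VXLAN over an IPv4 underlay.

```shell
#!/bin/sh
# check_overlay_mtu (hypothetical helper): succeeds when the overlay MTU leaves
# at least 50 bytes of VXLAN headroom under the underlay MTU.
# Usage: check_overlay_mtu <underlay_mtu> <overlay_mtu>
check_overlay_mtu() {
  underlay=$1
  overlay=$2
  max_overlay=$(( underlay - 50 ))
  if [ "$overlay" -gt "$max_overlay" ]; then
    echo "ERROR: overlay MTU $overlay exceeds $max_overlay (underlay $underlay minus 50)"
    return 1
  fi
  echo "OK: overlay MTU $overlay fits underlay MTU $underlay"
}
```

In a pipeline this could be called as `check_overlay_mtu "$NODE_MTU" 1450 || exit 1`, with the second argument set to whatever MTU your compose file pins.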

2. OPA/Conftest Policy for Compose Files

# policy/overlay_mtu.rego
package docker.compose

deny[msg] {
  # Bind the map key so the message reports the network's name, not its body
  net := input.networks[name]
  net.driver == "overlay"
  not net.driver_opts["com.docker.network.driver.mtu"]
  msg := sprintf("Overlay network '%v' must explicitly set MTU via driver_opts", [name])
}

# In CI (conftest evaluates only the "main" namespace unless told otherwise):
conftest test docker-compose.yml --policy policy/ --namespace docker.compose

3. Swarm Node Health Monitor (Cron/Prometheus)

#!/bin/bash
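# Run on a manager that has a task attached to my_overlay. Alerts if the overlay peer count drops below expected.
# Peers lists only nodes running tasks on this network, so adjust EXPECTED_PEERS if not every worker does.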
EXPECTED_PEERS=$(docker node ls --filter role=worker -q | wc -l)
ACTUAL_PEERS=$(docker network inspect my_overlay --format '{{len .Peers}}')

if [ "$ACTUAL_PEERS" -lt "$EXPECTED_PEERS" ]; then
  echo "ALERT: Overlay split-brain detected. Expected $EXPECTED_PEERS peers, found $ACTUAL_PEERS"
  # Push to PagerDuty/Alertmanager here
fi

4. Kernel/Docker Version Pinning

VXLAN behavior changed significantly between kernel 4.x and 5.x, and between Docker Engine 20.x and 24.x. Pin both in your AMI/base image build and gate kernel upgrades behind a full swarm network regression test.

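# In your node base image Dockerfile
FROM ubuntu:22.04
# docker-ce comes from Docker's apt repository, which must be added before this step
RUN apt-get update && apt-get install -y docker-ce=5:24.0.7* linux-image-5.15.0-91-generic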