Fixing Docker Swarm Overlay Network Split Brain: Containers Can't Ping Each Other
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 20–45 mins
TL;DR
- What broke: The Swarm overlay network's VXLAN/gossip state diverged between nodes — containers on different hosts are isolated, service discovery is dead, and inter-service traffic is silently dropped.
- How to fix it: Force-remove the stale overlay network, purge dangling VXLAN interfaces, re-advertise node state, and recreate the network with explicit subnet pinning.
- CTA: Use our Client-Side Sandbox above to paste your docker network inspect output and get auto-generated remediation commands without leaking your config to a third-party AI.
The Incident (What Does the Error Mean?)
You're staring at this on a worker node:
$ docker exec -it service_replica_1 ping 10.0.1.5
ping: connect: Network is unreachable
$ docker network inspect my_overlay
[
{
"Name": "my_overlay",
"Id": "q7k2p...",
"Peers": [
{ "Name": "node1", "IP": "192.168.1.10" }
]
// node2, node3 MISSING from Peers list
}
]
The Peers list is incomplete. Node2 and Node3 aren't visible to Node1's overlay driver. VXLAN tunnel endpoints (VTEPs) were never established or have gone stale. The docker_gwbridge is up, the physical network is fine — but the overlay's control plane has diverged. Containers on different hosts are in separate, invisible L2 segments.
Immediate consequence: All cross-node service mesh traffic — internal DNS resolution via embedded DNS (127.0.0.11), service VIP routing, and direct container IPs — is dead for affected replicas.
The Attack Vector / Blast Radius
This isn't a graceful degradation. The failure mode is silent packet drop, which is the worst kind:
- Services don't crash — they hang on connection timeouts. Your health checks may still pass if they're local-only.
- Cascading timeouts: Any service doing synchronous inter-service calls (gRPC, HTTP) will exhaust its thread pool waiting on dead connections. This cascades upstream.
- Split-brain amplification: If this happened during a rolling update or node drain, you now have two isolated clusters of replicas both believing they are the authoritative instance — potential for dual-write corruption in stateful services.
- Root causes that trigger this:
  - Node rejoined the swarm after a hard reboot without a clean docker swarm leave
  - Network partition between managers caused gossip (Serf/memberlist) to mark nodes as failed, then the partition healed but VXLAN state was never reconciled
  - Kernel VXLAN module was reloaded or ip_vs state was flushed (common after kernel upgrades)
  - MTU mismatch between the overlay (default 1450) and the underlay (e.g., a cloud or tunneled underlay with an MTU below 1500, which leaves no headroom for the 50 bytes of VXLAN encapsulation overhead)
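Because local-only health checks keep passing during this failure, it can help to probe a peer on another node as well. The following is a minimal sketch, not a drop-in health check: `probe_tcp` is a hypothetical helper name, it assumes bash with `/dev/tcp` support and the coreutils `timeout` command are available where it runs, and the address and port in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# probe_tcp HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened within the timeout, non-zero
# otherwise. A silent overlay drop shows up as a timeout (exit status 124).
probe_tcp() {
  local host=$1 port=$2 timeout_s=${3:-2}
  # $0 and $1 inside the child shell are the host and port passed after the script.
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (placeholder address and port of a replica on ANOTHER node):
#   probe_tcp 10.0.1.5 8080 2 || echo "cross-node path is down"
```

Running a probe like this from the health check of at least one replica per node makes the silent-drop case visible instead of leaving callers to time out.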
How to Fix It
Step 0: Confirm the Split-Brain
# Run on EACH node — compare Peers lists
docker network inspect <overlay_name> --format '{{json .Peers}}' | python3 -m json.tool
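# Check VXLAN interfaces. Healthy ones normally live inside the overlay's
# network namespace (/var/run/docker/netns/1-<network_short_id>); a vx-* link
# visible in the host namespace is usually a dangling one.
ip -d link show type vxlan
# Check if the FDB (forwarding database) has entries for other nodes
bridge fdb show dev vx-<network_short_id>
# If the device is not found on the host, look inside the overlay namespace:
sudo nsenter --net=/var/run/docker/netns/1-<network_short_id> bridge fdb show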
If bridge fdb shows no entries for remote node MACs, the VTEP has no path. This confirms split-brain.
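Comparing Peers lists by eye gets error-prone beyond a few nodes. Here is a small sketch that reports which expected nodes are absent from one node's Peers output. `missing_peers` is a hypothetical helper, the node names are the ones from the example above, and the JSON is parsed with plain text tools on the assumption that node names contain no spaces or quotes.

```shell
#!/usr/bin/env bash
# missing_peers EXPECTED_NAME...
# Reads the JSON printed by
#   docker network inspect <overlay_name> --format '{{json .Peers}}'
# on stdin and prints every expected node name that is not in the list.
missing_peers() {
  local peers_json seen name
  peers_json=$(cat)
  # Collect the value of every "Name" key; whitespace is stripped first so
  # pretty-printed JSON works too.
  seen=$(printf '%s' "$peers_json" | tr -d ' \n\t' \
    | grep -o '"Name":"[^"]*"' | cut -d'"' -f4)
  for name in "$@"; do
    if ! printf '%s\n' "$seen" | grep -qxF -- "$name"; then
      printf '%s\n' "$name"
    fi
  done
}

# Example (run on each node):
#   docker network inspect my_overlay --format '{{json .Peers}}' \
#     | missing_peers node1 node2 node3
```

An empty result on every node means the Peers lists agree with the expected membership; any printed name points at the node pair to inspect with bridge fdb.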
Basic Fix: Force Network Reconciliation
- # Leaving the stale overlay attached and hoping swarm self-heals
- # (it will not — gossip convergence does not rebuild VXLAN FDB automatically)
+ # Step 1: Scale down all services using the broken network, then detach them
+ # (swarm refuses to remove a network that a service still references)
+ docker service scale my_service=0
+ docker service update --network-rm my_overlay my_service
+
+ # Step 2: On ALL nodes (managers included if they run tasks), remove dangling VXLAN interfaces
+ # Find the vxlan interface name first:
+ ip link show type vxlan
+ # Then delete it:
+ sudo ip link delete vx-<short_network_id>
+
+ # Step 3: Remove the overlay network from the manager
+ docker network rm my_overlay
+
+ # Step 4: Recreate with explicit, pinned subnet
+ docker network create \
+ --driver overlay \
+ --subnet 10.0.9.0/24 \
+ --opt com.docker.network.driver.mtu=1450 \
+ my_overlay
+
+ # Step 5: Reattach the services and redeploy
+ docker service update --network-add my_overlay my_service
+ docker service scale my_service=3
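Step 2 has to be repeated for every leftover interface on every node. One way to script the lookup is sketched below. `vxlan_iface_names` is a hypothetical helper, and it assumes the dangling interfaces follow the vx- naming used in this guide and that the input is the one-line-per-interface output of `ip -o link show type vxlan`.

```shell
#!/usr/bin/env bash
# vxlan_iface_names
# Reads `ip -o link show type vxlan` output on stdin and prints the names of
# the interfaces that start with "vx-", one per line.
vxlan_iface_names() {
  awk -F': ' '{
    split($2, parts, "@")            # drop any "@ifN" peer suffix
    if (parts[1] ~ /^vx-/) print parts[1]
  }'
}

# Dry run: print the delete commands instead of executing them.
#   ip -o link show type vxlan | vxlan_iface_names \
#     | sed 's/^/sudo ip link delete /'
```

Printing the commands first is a deliberate choice: review the list, then pipe it to a shell only after confirming that no listed interface belongs to a network you want to keep.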
Enterprise Best Practice: Prevent Gossip Desync at the Node Level
- # Default swarm init — no advertise-addr, no data-path-addr separation
- docker swarm init
+ # Explicitly separate control plane and data plane traffic
+ # Control plane (gossip, raft) on management NIC
+ # Data plane (VXLAN encap) on high-throughput NIC
+ docker swarm init \
+ --advertise-addr eth0:2377 \
+ --data-path-addr eth1 \
+ --data-path-port 4789
+
+ # On worker join:
+ docker swarm join \
+ --advertise-addr eth0 \
+ --data-path-addr eth1 \
+ --token <worker-token> \
+ <manager-ip>:2377
- # No MTU configuration in compose: the overlay silently uses the driver default (1450), which may not fit the underlay
- networks:
- my_overlay:
- driver: overlay
+ # Pin MTU explicitly — 1450 for standard VXLAN, 1400 if running over IPSec/WireGuard
+ networks:
+ my_overlay:
+ driver: overlay
+ driver_opts:
+ com.docker.network.driver.mtu: "1450"
+ ipam:
+ config:
+ - subnet: 10.0.9.0/24
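The MTU values pinned above come from subtracting encapsulation overhead from the underlay MTU. The sketch below shows that arithmetic. `overlay_mtu` is a hypothetical helper, and the extra tunnel allowance is an assumption you should replace with the measured overhead of your own IPSec or WireGuard setup; the 50 used in the second call is only the value that reproduces the 1400 figure.

```shell
#!/usr/bin/env bash
# VXLAN over IPv4 consumes 50 bytes of each underlay packet:
#   outer IPv4 (20) + UDP (8) + VXLAN header (8) + inner Ethernet header (14)
VXLAN_OVERHEAD=50

# overlay_mtu UNDERLAY_MTU [EXTRA_TUNNEL_OVERHEAD]
# EXTRA_TUNNEL_OVERHEAD is an assumed allowance for a second tunnel carrying
# the VXLAN traffic; it defaults to 0.
overlay_mtu() {
  local underlay_mtu=$1 extra=${2:-0}
  echo $(( underlay_mtu - VXLAN_OVERHEAD - extra ))
}

overlay_mtu 1500       # plain VXLAN on a 1500-byte underlay
overlay_mtu 1500 50    # same underlay with 50 bytes reserved for a tunnel
```

The two calls print 1450 and 1400. If the underlay MTU is smaller than 1500, pass the real value so the pinned overlay MTU shrinks with it.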
Nuclear Option (Full Node Rejoin)
If the above doesn't resolve it, the node's swarm state is likely corrupted:
# On the affected worker:
docker swarm leave --force
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/swarm
sudo systemctl start docker
docker swarm join --token <worker-token> <manager>:2377
# On a manager: remove the stale "Down" entry left by the node's old identity
docker node rm <old-node-id>
Prevention in CI/CD
1. MTU Validation in Pre-Deploy Checks
# Add to your deploy pipeline before any docker stack deploy
NODE_MTU=$(ip link show eth0 | grep -oP 'mtu \K[0-9]+')
if [ "$NODE_MTU" -lt 1500 ]; then
echo "ERROR: Underlay MTU $NODE_MTU too low for default overlay. Set overlay MTU to $((NODE_MTU - 50))"
exit 1
fi
2. OPA/Conftest Policy for Compose Files
# policy/overlay_mtu.rego
package docker.compose
deny[msg] {
net := input.networks[name]
net.driver == "overlay"
not net.driver_opts["com.docker.network.driver.mtu"]
msg := sprintf("Overlay network '%v' must explicitly set MTU via driver_opts", [name])
}
# In CI (conftest only evaluates the "main" package unless a namespace is given):
conftest test docker-compose.yml --policy policy/ --namespace docker.compose
3. Swarm Node Health Monitor (Cron/Prometheus)
#!/bin/bash
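# Run on manager — alert if any node's overlay peer count drops below expected
# Note: .Peers only lists nodes that currently run a task on my_overlay, and it
# is only populated on such a node; adjust EXPECTED_PEERS if that differs from
# the worker count.
EXPECTED_PEERS=$(docker node ls --filter role=worker -q | wc -l)
ACTUAL_PEERS=$(docker network inspect my_overlay --format '{{len .Peers}}')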
if [ "$ACTUAL_PEERS" -lt "$EXPECTED_PEERS" ]; then
echo "ALERT: Overlay split-brain detected. Expected $EXPECTED_PEERS peers, found $ACTUAL_PEERS"
# Push to PagerDuty/Alertmanager here
fi
4. Kernel/Docker Version Pinning
VXLAN behavior changed significantly between kernel 4.x and 5.x, and between Docker Engine 20.x and 24.x. Pin both in your AMI/base image build and gate kernel upgrades behind a full swarm network regression test.
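# In your node base image Dockerfile
# (Docker's apt repository must be added before this step for docker-ce to resolve)
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y docker-ce=5:24.0.7* linux-image-5.15.0-91-generic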