How to Fix Docker Swarm 'Update Out of Sync' Error During Service Updates

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–30 mins


TL;DR

  • What broke: docker service update detected a divergence between the manager's desired state record and the actual task/slot state on worker nodes, halting the rollout mid-flight.
  • How to fix it: Force-reset the update state with docker service update --force or roll back with docker service rollback, then identify and evict the stuck task slots causing the desync.

The Incident (What Does the Error Mean?)

Raw error output from the Docker daemon or CLI:

Error response from daemon: rpc error: code = Unknown
desc = update out of sync

Or during a docker stack deploy:

failed to update service mystack_api: Error response from daemon: update out of sync

Immediate consequence: The service updater on the Raft-leader manager node has detected that the version of the service spec it holds does not match what was last committed to the Raft log, or that in-flight task slots are in a state the reconciler cannot advance. The rolling update is completely halted. No new tasks are scheduled. Existing tasks may be in Running, Failed, or Shutdown states simultaneously — your service is partially updated and in an undefined state.


The Attack Vector / Blast Radius

This is a distributed state consistency failure, not a simple CLI error. The blast radius:

  • Partial rollout lock: Some replicas run the new image, some run the old. The Swarm ingress routing mesh sends traffic to both. If the new image carries a breaking schema change, two incompatible application versions now answer behind a single VIP.
  • Cascading update block: Any subsequent docker service update on the same service will also fail until the stuck state is cleared. Automated CD pipelines (Jenkins, GitLab CI, ArgoCD with Swarm plugins) will loop and retry, potentially hammering the manager API and degrading Raft performance for the entire cluster.
  • Manager leader instability: If the desync was caused by a manager failover mid-update (the most common root cause), the new leader may have an incomplete Raft log entry for the update transaction. Forcing another update before clearing this can corrupt the service spec in the Raft store — requiring a full service delete and recreate.
  • Quorum risk: On a 3-manager cluster, if one manager is already degraded (causing the original failover), you are now one manager failure away from losing quorum and freezing the entire cluster's control plane.
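
To gauge how far a partial rollout has spread, it can help to count running tasks per image. The following is a minimal sketch, not an official tool: count_task_images is a hypothetical helper name, and it assumes the .Image placeholder of docker service ps --format.

```shell
# count_task_images: hypothetical helper, not part of Docker.
# Reads one image reference per line on stdin and prints "<count> <image>"
# for each distinct image, most common first.
count_task_images() {
  sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Intended use on a manager node:
#   docker service ps <service_name> --filter desired-state=running \
#     --format '{{.Image}}' | count_task_images
```

More than one line of output means old and new replicas are serving traffic side by side.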

Root causes ranked by frequency:

  1. Manager leader election occurred during an active service update transaction
  2. Concurrent docker service update calls on the same service (CI pipeline race condition)
  3. Worker node eviction while tasks were being rescheduled mid-update
  4. Clock skew between manager nodes causing Raft log version conflicts
  5. Corrupted service spec from a partial docker stack deploy

How to Fix It

Step 1: Inspect the current service state

docker service inspect --pretty <service_name>
docker service ps <service_name> --no-trunc

Look for tasks in Preparing, Starting, or Failed state that are not converging. Note the UpdateStatus.State field in the inspect output — it will show updating or paused when stuck.
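
For scripting, the UpdateStatus.State value can be mapped to a coarse verdict. This is a sketch under assumptions: classify_update_state and its verdict labels are hypothetical, while the state names are the ones docker service inspect reports. The template guard covers services that have never been updated and therefore have no UpdateStatus.

```shell
# classify_update_state: hypothetical helper mapping an UpdateStatus.State
# value to a verdict label of this sketch's own choosing.
classify_update_state() {
  case "$1" in
    completed|rollback_completed) echo "settled" ;;
    updating|rollback_started)    echo "in-progress" ;;
    paused|rollback_paused)       echo "stuck" ;;
    "")                           echo "never-updated" ;;
    *)                            echo "unknown" ;;
  esac
}

# Intended use on a manager node:
#   state=$(docker service inspect <service_name> \
#     --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}')
#   classify_update_state "$state"
```

An "in-progress" verdict that does not change across several checks deserves the same attention as "stuck".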

Step 2: Basic Fix — Force rollback

If the new image/config is suspect or you need to restore service immediately:

docker service rollback <service_name>

This instructs the manager to revert to the previous service spec version stored in Raft. This is the safest first action in a production outage.
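
Rollback needs a previous spec to revert to, which a service that has never been updated does not have. A small pre-check can be sketched as follows; has_previous_spec is a hypothetical helper, and it assumes the PreviousSpec field exposed by docker service inspect.

```shell
# has_previous_spec: hypothetical helper. Exits 0 only if the service
# carries a PreviousSpec, 1 if it does not, and 2 if inspect itself fails.
has_previous_spec() {
  hps_answer=$(docker service inspect "$1" \
    --format '{{if .PreviousSpec}}yes{{else}}no{{end}}') || return 2
  [ "$hps_answer" = "yes" ]
}

# Intended use on a manager node:
#   has_previous_spec <service_name> && docker service rollback <service_name>
```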

Step 3: Basic Fix — Force re-sync with --force

If rollback is not acceptable (e.g., you need the new image deployed):

docker service update --force <service_name>

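--force re-creates the service's tasks even when the spec is otherwise unchanged. The forced update is written as a new service version, creating a new Raft log entry that supersedes the stuck transaction and triggers a full task reconciliation cycle. Expect every replica to be replaced, batch by batch, according to update_config.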

Step 4: Enterprise Best Practice — Controlled re-deployment with update config

The underlying problem is often an aggressive or unguarded update config. Refactor your service spec:

 version: '3.8'
 services:
   api:
     image: myrepo/api:2.1.0
     deploy:
       replicas: 6
       update_config:
-        parallelism: 6
-        failure_action: continue
-        order: start-first
+        parallelism: 2
+        failure_action: rollback
+        monitor: 30s
+        max_failure_ratio: 0.2
+        order: start-first
+      rollback_config:
+        parallelism: 2
+        failure_action: pause
+        monitor: 20s
       restart_policy:
         condition: on-failure
+        delay: 5s
+        max_attempts: 3

Why this matters:

  • parallelism: 6 on a 6-replica service updates all tasks simultaneously — one Raft hiccup mid-flight desyncs the entire update
  • failure_action: continue means the updater never pauses to let the manager reconcile, accelerating the desync
  • monitor: 30s sets how long each updated task is watched for failure after it starts, so early crashes count toward max_failure_ratio and trigger failure_action instead of going unnoticed
  • rollback_config ensures a clean, controlled revert path exists in Raft before the update begins
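
The effect of parallelism can be worked out with integer arithmetic. This sketch uses two hypothetical helper names and assumes parallelism is at least 1 (Swarm treats 0 as "all at once", which would divide by zero here).

```shell
# rollout_batches <replicas> <parallelism>
# Prints ceil(replicas / parallelism): the number of update batches.
rollout_batches() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# in_flight_percent <replicas> <parallelism>
# Prints the share of replicas being replaced at once, rounded down.
in_flight_percent() {
  echo $(( $2 * 100 / $1 ))
}

rollout_batches 6 6     # prints 1: all six replicas change in one batch
rollout_batches 6 2     # prints 3
in_flight_percent 6 6   # prints 100
in_flight_percent 6 2   # prints 33
```

With parallelism: 2, an interruption mid-update touches at most a third of the replicas rather than all of them.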

Step 5: If the service spec is corrupted in Raft

Last resort — only if rollback and --force both fail:

# Capture current spec
docker service inspect <service_name> > service_backup.json

# Remove and recreate (causes downtime — drain connections first)
docker service rm <service_name>
docker stack deploy -c docker-compose.yml <stack_name>
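
Because the remove step is destructive, it may be worth confirming that the backup actually captured a spec before running it. A rough sketch, with backup_looks_usable as a hypothetical helper; it only checks that the file is non-empty and mentions a "Spec" key, and does not validate the JSON.

```shell
# backup_looks_usable <file>
# Hypothetical pre-flight check: succeeds only if the file is non-empty
# and contains a "Spec" key, as a real inspect dump does.
backup_looks_usable() {
  [ -s "$1" ] && grep -q '"Spec"' "$1"
}

# Intended use on a manager node:
#   docker service inspect <service_name> > service_backup.json
#   backup_looks_usable service_backup.json && docker service rm <service_name>
```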


Prevention in CI/CD

1. Serialize service updates in your pipeline

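Never allow concurrent docker service update calls on the same service. In GitHub Actions, a concurrency group queues runs instead of letting them overlap (GitLab CI offers resource_group for the same purpose). Note that a group keyed on github.ref only serializes runs for the same ref; deployments from different branches to the same service need a shared group name.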

# GitHub Actions — use concurrency groups to serialize Swarm deployments
concurrency:
  group: swarm-deploy-${{ github.ref }}
  cancel-in-progress: false
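
For pipelines without a built-in concurrency control, the same serialization can be approximated with a lock directory. This is a sketch under stated assumptions: with_deploy_lock is a hypothetical wrapper, all deploy jobs are assumed to run on one host with a shared local filesystem, and a job killed mid-run leaves a stale lock that must be removed by hand.

```shell
# with_deploy_lock <lock_dir> <command> [args...]
# Hypothetical wrapper: runs the command while holding a lock directory.
# mkdir either creates the directory or fails, so two callers cannot both
# acquire the lock. Returns 75 if the lock is already held, otherwise the
# command's own exit status.
with_deploy_lock() {
  wdl_lock_dir=$1
  shift
  if ! mkdir "$wdl_lock_dir" 2>/dev/null; then
    echo "deploy lock $wdl_lock_dir is held, refusing to run" >&2
    return 75
  fi
  "$@"
  wdl_status=$?
  rmdir "$wdl_lock_dir"
  return "$wdl_status"
}

# Intended use on a single deploy host:
#   with_deploy_lock /tmp/swarm-deploy-mystack_api.lock \
#     docker service update --image myrepo/api:2.1.0 mystack_api
```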

2. Poll for convergence before proceeding

Do not fire-and-forget docker service update. Block the pipeline until convergence:

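#!/bin/bash
# wait-for-service-stable.sh: block until a service update converges.
# Usage: wait-for-service-stable.sh <service_name>
SERVICE=${1:?usage: $0 <service_name>}
TIMEOUT=120   # seconds to wait before rolling back
ELAPSED=0

while true; do
  # UpdateStatus is absent on a service that has never been updated,
  # so guard the template instead of dereferencing a missing field.
  STATE=$(docker service inspect "$SERVICE" \
    --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}')
  case "$STATE" in
    completed) exit 0 ;;
    paused|rollback_*)
      # These states will not become "completed"; fail fast.
      echo "ERROR: update of $SERVICE is in state '$STATE'" >&2
      exit 1 ;;
  esac
  if [ "$ELAPSED" -ge "$TIMEOUT" ]; then
    echo "ERROR: Service update did not converge in ${TIMEOUT}s" >&2
    docker service rollback "$SERVICE"
    exit 1
  fi
  sleep 5
  ELAPSED=$((ELAPSED + 5))
done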

3. Validate Compose files with docker-compose config in CI

# Pre-deploy lint step
- name: Validate stack config
  run: |
    docker-compose -f docker-compose.prod.yml config --quiet
    # Fail fast on spec errors before touching the live cluster

4. OPA/Conftest policy to enforce safe update_config

# policy/swarm_update_config.rego
package swarm

# Compose files key services by name and the service object has no
# "name" field, so bind the map key as `name` for the messages.
deny[msg] {
  svc := input.services[name]
  svc.deploy.update_config.failure_action == "continue"
  msg := sprintf("Service '%v': failure_action must not be 'continue'; use 'rollback' or 'pause'", [name])
}

deny[msg] {
  svc := input.services[name]
  svc.deploy.update_config.parallelism > 2
  svc.deploy.replicas <= 4
  msg := sprintf("Service '%v': parallelism too high relative to replica count; risk of full-service desync", [name])
}

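Run in CI with conftest test docker-compose.prod.yml --policy policy/ --namespace swarm. Conftest evaluates only the main package by default, so the namespace flag is needed for rules declared under package swarm.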

5. Monitor manager health proactively

# Count healthy managers; node Status alone does not show manager reachability
docker node ls --filter role=manager --format '{{.ManagerStatus}}' | grep -cxE 'Leader|Reachable'
# If < 3 on a 3-manager cluster, page immediately: one more manager failure loses quorum (2 of 3)
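
The quorum arithmetic behind that alert is simple enough to script. A minimal sketch with hypothetical helper names, assuming the usual Raft majority rule (more than half of the managers must be reachable):

```shell
# quorum_size <managers>: smallest majority of the manager set.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# failures_tolerated <managers>: managers that can fail with quorum intact.
failures_tolerated() {
  echo $(( ($1 - 1) / 2 ))
}

# count_healthy_managers: reads ManagerStatus values, one per line, and
# counts exact "Leader" or "Reachable" lines ("Unreachable" is excluded).
count_healthy_managers() {
  grep -cxE 'Leader|Reachable'
}

quorum_size 3          # prints 2
failures_tolerated 3   # prints 1
failures_tolerated 5   # prints 2
```

A 3-manager cluster with one manager already down has used its entire failure budget, which is why the alert should fire before the count reaches the quorum size.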
