Fixing Docker Swarm 'unless-stopped' Restart Policy Being Silently Ignored in Services

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5 mins


TL;DR

  • What broke: Docker Swarm silently discards restart_policy: condition: unless-stopped because the Swarm orchestrator does not recognize it — only any, on-failure, and none are valid.
  • How to fix it: Replace unless-stopped with any (or on-failure) in your stack's deploy.restart_policy.condition block and redeploy the service.

The Incident (What Does the Error Mean?)

When you migrate a standalone docker-compose.yml to a Swarm stack, Docker silently drops unsupported restart policy values. You will not get a hard error. The service deploys, appears healthy, and then behaves incorrectly under failure conditions.

Raw symptom output from docker service inspect:

"RestartPolicy": {
    "Condition": "any",
    "Delay": 0,
    "MaxAttempts": 0
}

Note the Condition value: it reads any, not the unless-stopped you declared. Swarm has silently coerced your policy (or, in some versions, simply defaulted to any with no cap) because unless-stopped is a Docker Engine concept: the local daemon remembers that a container was stopped by an explicit docker stop and leaves it down. Swarm keeps no such per-container state. It reconciles running tasks against the declared replica count, so there is no "stopped by the user" signal at the orchestration layer, and the concept has no meaning in a distributed scheduler.

Immediate consequence: Services you intended to leave stopped after a manual halt will restart automatically. Maintenance windows, deliberate scale-to-zero operations, or controlled drains get undermined silently.
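To confirm what Swarm actually stored, the JSON printed by docker service inspect can be parsed rather than read by eye. The following is a minimal Python sketch, not an official tool: the function name restart_conditions and the SAMPLE payload are illustrative assumptions, and the sample is trimmed to the fields shown above.

```python
import json

def restart_conditions(inspect_json: str) -> dict[str, str]:
    """Map each service name to the restart condition stored in its spec."""
    result = {}
    for svc in json.loads(inspect_json):
        spec = svc.get("Spec", {})
        policy = spec.get("TaskTemplate", {}).get("RestartPolicy") or {}
        # Assumption: a missing Condition is treated as "any", Swarm's documented default.
        result[spec.get("Name", "<unnamed>")] = policy.get("Condition", "any")
    return result

# Trimmed, illustrative sample shaped like `docker service inspect api` output.
SAMPLE = (
    '[{"Spec": {"Name": "api", "TaskTemplate": {"RestartPolicy": '
    '{"Condition": "any", "Delay": 0, "MaxAttempts": 0}}}}]'
)

print(restart_conditions(SAMPLE))
```

In practice the input would come from piping docker service inspect for the services of interest into the script. Any service reporting any where the stack file says unless-stopped is affected.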


The Attack Vector / Blast Radius

This is not a CVE, but the operational blast radius is significant:

  1. Runaway restart loops during incidents. You manually stop a misbehaving service replica during a production incident. Swarm immediately restarts it. You are now fighting the orchestrator during an outage.

  2. Cascading dependency failures. A service dependent on an external resource (DB migration lock, external API rate limit) keeps restarting and hammering the dependency. Without MaxAttempts set, this is unbounded.

  3. Invisible policy drift from Compose migrations. Teams assume Swarm honored their Compose file. Audits show the running policy is not what the source-controlled file declares. Your Git repo lies about your production state.

  4. Silent coercion means no alert fires. No docker stack deploy error, no warning in docker service ps, no event in docker events. Engineers discover the mismatch only during post-mortems.
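Because nothing alerts on the mismatch, drift has to be detected by comparing the declared policy with the running one. A rough Python sketch of that comparison follows. It assumes the stack file has already been parsed into a dict (for example with a YAML library) and that the running conditions were collected separately. The names declared_conditions and policy_drift and the stack name "prod" are hypothetical.

```python
def declared_conditions(compose: dict) -> dict[str, str]:
    """Collect deploy.restart_policy.condition per service from a parsed stack file."""
    found = {}
    for name, svc in (compose.get("services") or {}).items():
        deploy = (svc or {}).get("deploy") or {}
        policy = deploy.get("restart_policy") or {}
        if "condition" in policy:
            found[name] = policy["condition"]
    return found

def policy_drift(compose: dict, running: dict[str, str], stack: str) -> list[str]:
    """Report services whose running condition differs from the declared one."""
    problems = []
    for name, declared in declared_conditions(compose).items():
        # `docker stack deploy` names services "<stack>_<service>".
        actual = running.get(f"{stack}_{name}")
        if actual is not None and actual != declared:
            problems.append(f"{name}: declared {declared!r}, running {actual!r}")
    return problems

# Illustrative inputs only.
compose = {"services": {"api": {"deploy": {"restart_policy": {"condition": "unless-stopped"}}}}}
running = {"prod_api": "any"}

for line in policy_drift(compose, running, "prod"):
    print(line)
```

Run on a schedule or as a post-deploy step, a check like this turns the silent mismatch into a visible failure.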


How to Fix It (The Solution)

Basic Fix

Replace the invalid unless-stopped condition with any and add explicit bounds.

 services:
   api:
     image: myorg/api:latest
     deploy:
       restart_policy:
-        condition: unless-stopped
+        condition: any
         delay: 5s
+        max_attempts: 5
+        window: 120s

condition: any — Swarm restarts the task on any exit, which is the closest behavioral equivalent to unless-stopped in a stateless orchestrator context.
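If many stack files need the same substitution, it can be scripted. The sketch below is one possible approach in Python, using a line-anchored regular expression instead of a YAML parser so that comments and formatting in the rest of the file are left untouched. The function name replace_unless_stopped is an assumption. Lines that carry a trailing comment after the value are deliberately not matched and need a manual edit.

```python
import re

# Matches a whole `condition: unless-stopped` line (optionally quoted) and
# captures everything up to the value so the indentation is preserved.
_PATTERN = re.compile(
    r"""^([ \t]*condition:[ \t]*)["']?unless-stopped["']?[ \t]*$""",
    re.MULTILINE,
)

def replace_unless_stopped(text: str, replacement: str = "any") -> str:
    """Return the stack file text with unless-stopped swapped for a Swarm condition."""
    if replacement not in {"any", "on-failure", "none"}:
        raise ValueError(f"not a Swarm restart condition: {replacement!r}")
    return _PATTERN.sub(lambda m: m.group(1) + replacement, text)

before = (
    "      restart_policy:\n"
    "        condition: unless-stopped\n"
    "        delay: 5s\n"
)
print(replace_unless_stopped(before))
```

The script only changes the condition value. Bounds such as max_attempts and window still have to be added by hand, since the right values depend on the service.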

Enterprise Best Practice

For services that must not auto-restart after a deliberate operator halt (the original intent of unless-stopped), model the behavior explicitly:

 services:
   worker:
     image: myorg/worker:latest
     deploy:
       replicas: 3
       restart_policy:
-        condition: unless-stopped
+        condition: on-failure
+        delay: 10s
+        max_attempts: 3
+        window: 60s
+      update_config:
+        order: start-first
+        failure_action: rollback
+      rollback_config:
+        failure_action: pause

  • on-failure restarts a task only when it exits with a non-zero code, so a clean shutdown (exit 0) stays down. That is the closest semantic match to an operator-initiated stop in Swarm.
  • max_attempts: 3 prevents unbounded restart loops hammering downstream dependencies.
  • rollback_config: failure_action: pause stops automated rollback loops from compounding the incident.
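The difference between the three valid conditions can be stated as a small decision function. This Python sketch is a deliberately simplified model: it looks only at the condition and the exit code and ignores delay, max_attempts, and window, which Swarm also applies. The name should_restart is illustrative.

```python
VALID_CONDITIONS = ("none", "on-failure", "any")

def should_restart(condition: str, exit_code: int) -> bool:
    """Simplified model: decide from the condition and exit code alone."""
    if condition not in VALID_CONDITIONS:
        raise ValueError(f"not a Swarm restart condition: {condition!r}")
    if condition == "none":
        return False
    if condition == "on-failure":
        return exit_code != 0
    return True  # "any" restarts regardless of exit code

# Print the decision for a clean exit, a generic failure, and a SIGKILL exit code.
for condition in VALID_CONDITIONS:
    print(condition, {code: should_restart(condition, code) for code in (0, 1, 137)})
```

The model also shows why unless-stopped has no slot here: the decision never has access to who caused the exit, only to how the process exited.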

For true "do not restart after manual stop" semantics, the Swarm-native pattern is scaling replicas to 0:

# Operator-controlled "stop" in Swarm
docker service scale api=0

# Resume
docker service scale api=3

This is the architecturally correct replacement for unless-stopped in a distributed orchestrator.
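The stop and resume pair is easy to get wrong under pressure if the previous replica count is not recorded. Below is a hedged Python sketch that wraps the two docker commands shown above and remembers the count. The helper names (scale_command, current_replicas, halt) are assumptions, and it only applies to replicated services, since global services have no replica count.

```python
import subprocess

def scale_command(service: str, replicas: int) -> list[str]:
    """Build the `docker service scale` invocation without running it."""
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    return ["docker", "service", "scale", f"{service}={replicas}"]

def current_replicas(service: str) -> int:
    """Read the declared replica count of a replicated service."""
    out = subprocess.run(
        ["docker", "service", "inspect",
         "--format", "{{.Spec.Mode.Replicated.Replicas}}", service],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip())

def halt(service: str) -> int:
    """Scale to zero and return the previous count so it can be restored later."""
    previous = current_replicas(service)
    subprocess.run(scale_command(service, 0), check=True)
    return previous

# Dry run: show the commands for a stop and a later resume of "api".
print(" ".join(scale_command("api", 0)))
print(" ".join(scale_command("api", 3)))
```

Calling halt("api") on a manager node returns the count to pass to scale_command later, so the resume step does not depend on anyone remembering that the service ran three replicas.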



Prevention in CI/CD

This class of silent misconfiguration is entirely preventable with static analysis in your pipeline.

1. Checkov — Swarm Policy Validation

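# .checkov.yml
check:
  - CKV_CUSTOM_SWARM_1
external-checks-dir:
  - checkov/custom
# The custom check for unless-stopped in deploy blocks is loaded from checkov/custom

Checkov's built-in CKV_DOCKER_* checks target Dockerfiles and do not inspect Compose deploy blocks. Add a custom policy for Swarm-specific constraints: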

# checkov/custom/SwarmRestartPolicy.py
# NOTE: the BaseYamlCheck import path and constructor arguments differ between
# Checkov releases. Verify both against the version pinned in your pipeline.
from checkov.common.models.enums import CheckResult
from checkov.yaml_runner.checks.base_yaml_check import BaseYamlCheck

# Engine-only values. Swarm accepts any, on-failure, and none.
INVALID_SWARM_CONDITIONS = {"unless-stopped", "always"}

class SwarmRestartPolicyCheck(BaseYamlCheck):
    def __init__(self):
        super().__init__(
            name="Ensure Swarm restart_policy condition is valid",
            check_id="CKV_CUSTOM_SWARM_1",
            supported_entities=["services"],
            block_type="services"
        )

    def scan_resource_conf(self, conf):
        for svc in conf.values():
            # Any of these sections may be absent or null in a stack file.
            deploy = (svc or {}).get("deploy") or {}
            policy = deploy.get("restart_policy") or {}
            if policy.get("condition") in INVALID_SWARM_CONDITIONS:
                return CheckResult.FAILED
        return CheckResult.PASSED

2. OPA / Conftest Gate

# policy/swarm_restart.rego
package swarm

DENY_CONDITIONS := {"unless-stopped", "always"}

deny[msg] {
  svc := input.services[name]
  condition := svc.deploy.restart_policy.condition
  DENY_CONDITIONS[condition]
  msg := sprintf("Service '%v': restart_policy condition '%v' is not valid in Swarm mode. Use 'any', 'on-failure', or 'none'.", [name, condition])
}
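
# In your CI pipeline. conftest evaluates only the "main" package by default,
# so the policy's package name is passed explicitly.
conftest test docker-compose.yml --policy policy/ --namespace swarm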

3. Pre-Deploy Hook (Shell)

#!/bin/bash
# pre-deploy-check.sh — fails the pipeline on invalid Swarm restart policies
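# [[:space:]] is portable across grep implementations; the optional quote catches "unless-stopped" too.
if grep -nE "condition:[[:space:]]*[\"']?unless-stopped" docker-compose*.yml; then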
  echo "ERROR: 'unless-stopped' is not a valid Swarm restart policy condition."
  echo "Replace with 'any' or 'on-failure'. See runbook: wiki/swarm-restart-policies"
  exit 1
fi

Add this as a required CI step before any docker stack deploy command. It needs only bash and grep and adds negligible time to the pipeline.
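The same gate can be written in Python when the pipeline already runs Python tooling. The sketch below is an illustrative alternative, not a replacement for the shell hook: it reports line numbers, also flags always (the other Engine-only value), and tolerates quotes and trailing comments. The names invalid_conditions and INVALID are assumptions, and the scan is line-based, so it does not verify that the key sits inside a deploy block.

```python
import re

INVALID = ("unless-stopped", "always")

# A `condition:` line with an optionally quoted value and an optional trailing comment.
_LINE = re.compile(r"""^\s*condition:\s*["']?(?P<value>[A-Za-z-]+)["']?\s*(#.*)?$""")

def invalid_conditions(text: str) -> list[tuple[int, str]]:
    """Return (line number, value) for each Engine-only restart condition found."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = _LINE.match(line)
        if match and match.group("value") in INVALID:
            hits.append((number, match.group("value")))
    return hits

stack_file = (
    "services:\n"
    "  api:\n"
    "    deploy:\n"
    "      restart_policy:\n"
    "        condition: unless-stopped  # copied from Compose\n"
)

for number, value in invalid_conditions(stack_file):
    print(f"line {number}: '{value}' is not a valid Swarm restart condition")
```

A CI wrapper would read each stack file, call invalid_conditions on its contents, and exit non-zero if the list is not empty.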
