Fixing Docker Swarm 'unless-stopped' Restart Policy Being Silently Ignored in Services
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 5 mins
TL;DR
- What broke: Docker Swarm silently discards restart_policy: condition: unless-stopped because the Swarm orchestrator does not recognize it. Only any, on-failure, and none are valid.
- How to fix it: Replace unless-stopped with any (or on-failure) in your stack's deploy.restart_policy.condition block and redeploy the service.
- Use our Client-Side Sandbox above to auto-refactor this: paste your docker-compose.yml and get a corrected stack file without sending your config to a third-party server.
The Incident (What Does the Error Mean?)
When you migrate a standalone docker-compose.yml to a Swarm stack, Docker silently drops unsupported restart policy values. You will not get a hard error. The service deploys, appears healthy, and then behaves incorrectly under failure conditions.
Raw symptom output from docker service inspect:
"RestartPolicy": {
    "Condition": "any",
    "Delay": 0,
    "MaxAttempts": 0
}
Swarm has silently coerced your policy (or, in some versions, simply defaulted to any with no cap) because unless-stopped is a Docker Engine concept: the local daemon remembers that a container was stopped explicitly with docker stop and does not restart it. Swarm schedules tasks across nodes and keeps no such daemon-local state per container, so there is no "stopped by the user" signal at the orchestration layer. The concept has no meaning in a distributed scheduler.
Immediate consequence: Services you intended to leave stopped after a manual halt will restart automatically. Maintenance windows, deliberate scale-to-zero operations, or controlled drains get undermined silently.
The Attack Vector / Blast Radius
This is not a CVE, but the operational blast radius is significant:
- Runaway restart loops during incidents. You manually stop a misbehaving service replica during a production incident. Swarm immediately restarts it. You are now fighting the orchestrator during an outage.
- Cascading dependency failures. A service dependent on an external resource (DB migration lock, external API rate limit) keeps restarting and hammering the dependency. Without MaxAttempts set, this is unbounded.
- Invisible policy drift from Compose migrations. Teams assume Swarm honored their Compose file. Audits show the running policy is not what the source-controlled file declares. Your Git repo lies about your production state.
- Silent coercion means no alert fires. No docker stack deploy error, no warning in docker service ps, no event in docker events. Engineers discover the mismatch only during post-mortems.
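The policy drift described above can be detected mechanically. The following is a minimal Python sketch, assuming the stack file has already been parsed into a dict (for example with a YAML library) and that the JSON text comes from docker service inspect for a single service. The helper names declared_condition, running_condition, and policy_drift are hypothetical and not part of any Docker tooling.

```python
import json


def declared_condition(compose: dict, service: str):
    """Return deploy.restart_policy.condition for a service, or None if unset."""
    svc = (compose.get("services") or {}).get(service) or {}
    policy = (svc.get("deploy") or {}).get("restart_policy") or {}
    return policy.get("condition")


def running_condition(inspect_json: str):
    """Read the condition from `docker service inspect` output (a JSON list)."""
    spec = json.loads(inspect_json)[0]["Spec"]
    policy = spec.get("TaskTemplate", {}).get("RestartPolicy") or {}
    return policy.get("Condition")


def policy_drift(compose: dict, service: str, inspect_json: str):
    """Return a message when the declared and running conditions differ."""
    declared = declared_condition(compose, service)
    running = running_condition(inspect_json)
    # No declared policy means the file makes no claim to compare against.
    if declared is None or declared == running:
        return None
    return f"{service}: declared {declared!r}, running {running!r}"
```

Running a check like this on a schedule, and alerting on any non-None result, is one way to surface the mismatch before a post-mortem does.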
How to Fix It (The Solution)
Basic Fix
Replace the invalid unless-stopped condition with any and add explicit bounds.
 services:
   api:
     image: myorg/api:latest
     deploy:
       restart_policy:
-        condition: unless-stopped
-        delay: 5s
+        condition: any
+        delay: 5s
+        max_attempts: 5
+        window: 120s
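condition: any tells Swarm to restart the task on any exit, which is the closest behavioral equivalent to unless-stopped in a stateless orchestrator context. max_attempts caps how many times Swarm tries to restart a task before giving up, and window sets how long Swarm waits before deciding whether a restart has succeeded.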
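If many stack files need the same change, the rewrite can be scripted. This is a minimal sketch that operates on an already-parsed Compose dict; the function name fix_restart_policies and its default bounds (taken from the example above) are assumptions for illustration, and loading or writing YAML is left to whatever library you already use.

```python
import copy

# Conditions accepted by deploy.restart_policy.condition in Swarm mode.
VALID_CONDITIONS = {"any", "on-failure", "none"}


def fix_restart_policies(compose, replacement="any", max_attempts=5, window="120s"):
    """Return (fixed_copy, changed_service_names) without mutating the input."""
    if replacement not in VALID_CONDITIONS:
        raise ValueError(f"replacement must be one of {sorted(VALID_CONDITIONS)}")
    fixed = copy.deepcopy(compose)
    changed = []
    for name, svc in (fixed.get("services") or {}).items():
        policy = ((svc or {}).get("deploy") or {}).get("restart_policy")
        if not isinstance(policy, dict) or "condition" not in policy:
            continue  # nothing declared, nothing to fix
        if policy["condition"] in VALID_CONDITIONS:
            continue
        policy["condition"] = replacement
        # Add explicit bounds only where the file does not already set them.
        policy.setdefault("max_attempts", max_attempts)
        policy.setdefault("window", window)
        changed.append(name)
    return fixed, changed
```

Returning a copy keeps the original dict available for a before/after diff in review.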
Enterprise Best Practice
For services that must not auto-restart after a deliberate operator halt (the original intent of unless-stopped), model the behavior explicitly:
 services:
   worker:
     image: myorg/worker:latest
     deploy:
       replicas: 3
       restart_policy:
-        condition: unless-stopped
+        condition: on-failure
+        delay: 10s
+        max_attempts: 3
+        window: 60s
+      update_config:
+        order: start-first
+        failure_action: rollback
+      rollback_config:
+        failure_action: pause
- on-failure only restarts on non-zero exit codes, preventing restarts after clean shutdowns (exit 0), which is the closest semantic match to operator-initiated stops in Swarm.
- max_attempts: 3 prevents unbounded restart loops hammering downstream dependencies.
- rollback_config: failure_action: pause stops automated rollback loops from compounding the incident.
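To make the difference between the three valid conditions concrete, here is a simplified model in Python. It only illustrates the semantics described above and is not Swarm's implementation: the function name task_should_restart is hypothetical, and the model deliberately ignores delay, max_attempts, and window.

```python
VALID_CONDITIONS = ("any", "on-failure", "none")


def task_should_restart(condition: str, exit_code: int) -> bool:
    """Simplified model: would a task with this exit code be restarted?"""
    if condition not in VALID_CONDITIONS:
        # unless-stopped and always land here: they are not Swarm conditions.
        raise ValueError(f"not a valid Swarm restart condition: {condition!r}")
    if condition == "any":
        return True             # restart regardless of exit code
    if condition == "on-failure":
        return exit_code != 0   # clean exits (0) are left stopped
    return False                # "none": never restart
```

Under this model a worker that exits 0 after a graceful shutdown stays down with on-failure but comes back with any, which is the behavioral gap the fix above relies on.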
For true "do not restart after manual stop" semantics, the Swarm-native pattern is scaling replicas to 0:
# Operator-controlled "stop" in Swarm
docker service scale api=0
# Resume
docker service scale api=3
This is the architecturally correct replacement for unless-stopped in a distributed orchestrator.
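One practical wrinkle with scale-to-zero is remembering the replica count to restore afterwards. The sketch below wraps the two commands in Python so the previous count is captured before stopping. It assumes a replicated (not global) service and a docker CLI on PATH; the helper names scale_command, current_replicas_command, and stop_service are hypothetical, and the run parameter exists only so the subprocess call can be swapped out in tests.

```python
import subprocess


def scale_command(service: str, replicas: int) -> list:
    """Build the argv for `docker service scale SERVICE=REPLICAS`."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return ["docker", "service", "scale", f"{service}={replicas}"]


def current_replicas_command(service: str) -> list:
    """Build the argv that prints the desired replica count of a replicated service."""
    return [
        "docker", "service", "inspect",
        "--format", "{{.Spec.Mode.Replicated.Replicas}}",
        service,
    ]


def stop_service(service: str, run=subprocess.run) -> int:
    """Scale a service to zero and return the replica count it had before."""
    result = run(current_replicas_command(service),
                 check=True, capture_output=True, text=True)
    previous = int(result.stdout.strip())
    run(scale_command(service, 0), check=True)
    return previous
```

The returned count can be stored in the incident ticket or a file and later passed back to scale_command to resume the service.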
Prevention in CI/CD
This class of silent misconfiguration is entirely preventable with static analysis in your pipeline.
1. Checkov — Swarm Policy Validation
# .checkov.yml
# Checkov config keys mirror its CLI flags, so the list of checks to run goes under "check"
check:
  - CKV_DOCKER_*
# Add custom check for unless-stopped in deploy blocks
Checkov's built-in CKV_DOCKER_* checks target Dockerfiles and, as far as we can tell, do not validate Swarm restart policies. Add a custom policy for Swarm-specific constraints:
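# checkov/custom/SwarmRestartPolicy.py
# NOTE: Checkov's module layout and the BaseYamlCheck constructor arguments vary
# between releases; confirm both against your installed version.
from checkov.common.models.enums import CheckResult
from checkov.yaml_runner.checks.base_yaml_check import BaseYamlCheck

# Values accepted by Compose's top-level restart key but not by deploy.restart_policy.condition
INVALID_SWARM_CONDITIONS = {"unless-stopped", "always"}


class SwarmRestartPolicyCheck(BaseYamlCheck):
    def __init__(self):
        super().__init__(
            name="Ensure Swarm restart_policy condition is valid",
            check_id="CKV_CUSTOM_SWARM_1",
            supported_entities=["services"],
            block_type="services"
        )

    def scan_resource_conf(self, conf):
        # conf maps service names to their definitions
        for svc in conf.values():
            # "or {}" guards against keys that are present but set to null
            deploy = svc.get("deploy") or {}
            policy = deploy.get("restart_policy") or {}
            if policy.get("condition", "") in INVALID_SWARM_CONDITIONS:
                return CheckResult.FAILED
        return CheckResult.PASSED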
2. OPA / Conftest Gate
# policy/swarm_restart.rego
package swarm

DENY_CONDITIONS := {"unless-stopped", "always"}

deny[msg] {
    svc := input.services[name]
    condition := svc.deploy.restart_policy.condition
    DENY_CONDITIONS[condition]
    msg := sprintf("Service '%v': restart_policy condition '%v' is not valid in Swarm mode. Use 'any', 'on-failure', or 'none'.", [name, condition])
}
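# In your CI pipeline
# Conftest evaluates only the "main" package by default, so name the swarm namespace explicitly
conftest test docker-compose.yml --policy policy/ --namespace swarm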
3. Pre-Deploy Hook (Shell)
#!/bin/bash
# pre-deploy-check.sh: fails the pipeline on invalid Swarm restart policies
# [[:space:]] is used instead of \s so the pattern also works with non-GNU grep
if grep -nE 'condition:[[:space:]]*unless-stopped' docker-compose*.yml; then
  echo "ERROR: 'unless-stopped' is not a valid Swarm restart policy condition."
  echo "Replace with 'any' or 'on-failure'. See runbook: wiki/swarm-restart-policies"
  exit 1
fi
Add this as a required CI step before any docker stack deploy command. It needs nothing beyond bash and grep and adds negligible time to the pipeline.
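The grep pattern matches commented-out lines and misses quoted values. Where Python is available in the pipeline, a slightly stricter text scan is possible with the standard library alone. This is a sketch rather than a YAML-aware validator: the function name find_invalid_conditions is hypothetical, and it treats both unless-stopped and always as invalid, matching the policy checks earlier in this section.

```python
import re

# Matches lines such as:  condition: unless-stopped   or   condition: "always"
# Lines starting with "#" do not match because "condition:" must come first.
_INVALID_CONDITION = re.compile(
    r"""^\s*condition:\s*["']?(unless-stopped|always)["']?\s*(#.*)?$"""
)


def find_invalid_conditions(text: str) -> list:
    """Return (line_number, value) pairs for invalid restart conditions."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = _INVALID_CONDITION.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits
```

A CI wrapper would read each stack file, call find_invalid_conditions on its contents, print the hits, and exit non-zero if any are found.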