Fixing the Docker cgroupfs vs systemd Cgroup Driver Conflict on Linux
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10 mins
TL;DR
- What broke: Docker is using cgroupfs as its cgroup driver while systemd (PID 1) owns the cgroup hierarchy. Two managers are fighting over the same resource, causing OOM kills and kubelet node registration failures.
- How to fix it: Set "exec-opts": ["native.cgroupdriver=systemd"] in /etc/docker/daemon.json, restart Docker, and align the kubelet --cgroup-driver flag.
- Shortcut: Use our Client-Side Sandbox below to auto-refactor your daemon.json and kubelet config without pasting secrets into a third-party AI.
The Incident (What Does the Error Mean?)
You'll see one or more of these in your logs:
# journalctl -u docker.service
Failed to create cgroup: cannot enter cgroupv2 namespace: no such file or directory
# journalctl -u kubelet
failed to run Kubelet: failed to create kubelet:
misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
# dmesg
Out of memory: Kill process 3412 (containerd-shim) score 847 or sacrifice child
Immediate consequence: On a systemd host, systemd creates and manages the top-level cgroup hierarchy. When Docker manages cgroups directly through the cgroupfs driver, the host ends up with two cgroup managers that hold two different views of available and in-use resources. Under memory pressure those views diverge, limits and accounting become unreliable, and the OOM killer can fire in ways that take down unrelated workloads on the same node.
On Kubernetes, this mismatch causes kubelet to fail node registration entirely. The node never reaches Ready state.
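For quick triage, the two driver names can be pulled straight out of the kubelet mismatch message. This is a minimal sketch that assumes the log line has the shape shown above; the function name parse_driver_mismatch is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract both driver names from a kubelet mismatch line.
# Reads log text on stdin; prints "kubelet=<driver> docker=<driver>" per match.
parse_driver_mismatch() {
  sed -n 's/.*kubelet cgroup driver: "\([a-z]*\)" is different from docker cgroup driver: "\([a-z]*\)".*/kubelet=\1 docker=\2/p'
}

# Usage (assumes the kubelet logs to journald):
#   journalctl -u kubelet --no-pager | parse_driver_mismatch | sort -u
```

Lines that do not match the pattern are dropped, so an empty result means this particular error is not in the log.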
The Attack Vector / Blast Radius
This isn't a misconfiguration you can defer. The blast radius escalates in three stages:
Stage 1: Memory accounting becomes unreliable. Docker's cgroupfs driver and systemd each manage their own part of the cgroup tree, so the node ends up with two views of available and in-use memory. A limit such as a 512Mi container cap is still written to the kernel, but neither manager can reason about it against the node's real headroom, and the numbers they report stop agreeing under pressure.
Stage 2: OOM kills become non-deterministic. The kernel OOM killer fires based on actual memory pressure, not your container limits. It kills whatever process scores highest, which is often your highest-memory container rather than the runaway one. Stateful workloads (Postgres, Redis, Kafka) get killed seemingly at random.
Stage 3: Kubernetes node instability. Kubelet's eviction manager relies on cgroup-reported metrics via cAdvisor. With a split hierarchy, eviction thresholds are evaluated against stale or incorrect data. Nodes enter NotReady, pods are rescheduled cluster-wide, and if multiple nodes share the same base image or config, you get a cascading multi-node failure.
How to Fix It
Basic Fix — Align Docker to systemd
Verify your current state first:
# Check host init system
ps -p 1 -o comm=
# Expected: systemd
# Check Docker's current cgroup driver
docker info | grep -i cgroup
# Bad output: Cgroup Driver: cgroupfs
# Good output: Cgroup Driver: systemd
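The two checks above can be combined into a single pass/fail test. This is a minimal sketch under one stated assumption: a host whose PID 1 is systemd should use the systemd driver, and any other init system is treated as expecting cgroupfs. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the name of PID 1 to the cgroup driver it implies.
# Assumption: anything other than systemd is treated as a cgroupfs host.
expected_cgroup_driver() {
  case "$1" in
    systemd) echo systemd ;;
    *)       echo cgroupfs ;;
  esac
}

# Hypothetical helper: succeed only if the actual driver matches the expected one.
cgroup_driver_aligned() {
  local init_name="$1" actual_driver="$2"
  [ "$(expected_cgroup_driver "$init_name")" = "$actual_driver" ]
}

# Usage on a real host (requires a running Docker daemon):
#   cgroup_driver_aligned "$(ps -p 1 -o comm=)" \
#     "$(docker info --format '{{.CgroupDriver}}')" \
#     && echo "aligned" || echo "MISMATCH"
```

Keeping the comparison in a function that takes plain strings makes it easy to reuse for the kubelet's driver as well.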
Edit /etc/docker/daemon.json:
  {
-   "exec-opts": [],
+   "exec-opts": ["native.cgroupdriver=systemd"],
+   "log-driver": "json-file",
+   "log-opts": {
+     "max-size": "100m"
+   },
    "storage-driver": "overlay2"
  }
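If daemon.json already contains other settings, hand-editing risks dropping a key or leaving invalid JSON behind. The sketch below shows one way to apply the change mechanically. It assumes jq is installed, and the function name set_systemd_cgroup_driver is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a copy of a daemon.json with the systemd cgroup
# driver set, preserving every other key. Assumes jq is available.
set_systemd_cgroup_driver() {
  local config_file="$1"
  # Drop any existing native.cgroupdriver entry, then append the systemd one.
  jq '
    ."exec-opts" = (
      ((."exec-opts" // [])
        | map(select(startswith("native.cgroupdriver=") | not)))
      + ["native.cgroupdriver=systemd"]
    )
  ' "$config_file"
}

# Usage: write to a scratch file first and review before replacing the original.
#   set_systemd_cgroup_driver /etc/docker/daemon.json > /tmp/daemon.json.new
#   diff /etc/docker/daemon.json /tmp/daemon.json.new
```

Writing to a scratch file rather than editing in place means a jq failure never leaves Docker with an empty or truncated config.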
Apply and verify:
sudo systemctl daemon-reload
sudo systemctl restart docker
docker info | grep -i cgroup
# Must return: Cgroup Driver: systemd
⚠️ Restarting Docker stops all running containers. Schedule this during a maintenance window or drain the node first if it's in a Kubernetes cluster.
Enterprise Best Practice — Kubernetes Node Alignment (kubeadm / kubelet)
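For kubeadm-provisioned clusters, the kubelet's cgroup driver must match the container runtime's. The old --cgroup-driver command-line flag is deprecated; set the cgroupDriver field in the kubelet config file instead.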
Kubelet config file (/var/lib/kubelet/config.yaml or kubeadm KubeletConfiguration):
 apiVersion: kubelet.config.k8s.io/v1beta1
 kind: KubeletConfiguration
-cgroupDriver: cgroupfs
+cgroupDriver: systemd
 evictionHard:
   memory.available: "200Mi"
   nodefs.available: "10%"
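To confirm what a node's kubelet is actually configured with, the field can be read back from the config file. This is a minimal sketch using a line-based match rather than a real YAML parser, so it assumes cgroupDriver appears as an unindented top-level key. The function name kubelet_cgroup_driver is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the top-level cgroupDriver value from a kubelet
# config file. Prints nothing if the field is absent.
kubelet_cgroup_driver() {
  local config_file="${1:-/var/lib/kubelet/config.yaml}"
  # Match only an unindented key; tolerate optional quotes and trailing comments.
  sed -n 's/^cgroupDriver:[[:space:]]*"\{0,1\}\([A-Za-z]*\)"\{0,1\}.*$/\1/p' "$config_file"
}

# Usage:
#   kubelet_cgroup_driver                       # reads /var/lib/kubelet/config.yaml
#   kubelet_cgroup_driver ./my-kubelet-config.yaml
```

An empty result means the field is unset, in which case the kubelet falls back to its built-in default rather than to whatever Docker is using.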
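kubeadm cluster init (prevent the problem at bootstrap). Pass the driver as a KubeletConfiguration document in the kubeadm config file rather than as a kubelet flag; kubeadm v1.22 and later default cgroupDriver to systemd when the field is unset:
 apiVersion: kubeadm.k8s.io/v1beta3
 kind: ClusterConfiguration
 ---
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+cgroupDriver: systemd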
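For existing nodes:
# Drain the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Apply the daemon.json fix above and set cgroupDriver: systemd in the kubelet config, then:
sudo systemctl restart kubelet
# Allow pods to be scheduled on the node again (drain leaves it cordoned)
kubectl uncordon <node-name>
# Verify
kubectl get node <node-name>
# STATUS must return: Ready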
Prevention in CI/CD
1. Packer / cloud-init golden image validation
Bake the correct daemon.json into your base AMI or cloud-init template. Add a validation step that fails the image build if the driver is wrong:
# In your Packer post-processor or cloud-init runcmd
docker info --format '{{.CgroupDriver}}' | grep -q systemd || { echo "FATAL: cgroup driver mismatch"; exit 1; }
2. Checkov / OPA policy for Kubernetes manifests
If you manage kubelet configs as code (GitOps), add a Checkov custom check or Conftest OPA policy:
# policy/cgroup_driver.rego
package kubernetes.kubelet

deny[msg] {
  input.kind == "KubeletConfiguration"
  input.cgroupDriver != "systemd"
  msg := "KubeletConfiguration must use cgroupDriver: systemd on systemd hosts"
}
3. Node conformance test in CI
Run kubeadm preflight checks as part of your node provisioning pipeline:
kubeadm init phase preflight --ignore-preflight-errors=all 2>&1 | grep -i cgroup
# Any output here = your pipeline should fail the deployment
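As written, the pipeline above exits 0 exactly when grep finds a match, so a CI job would pass on the failure case. One way to invert that is sketched below. The function name fail_on_cgroup_driver_warning is hypothetical, and narrowing the match to the phrase "cgroup driver" is an assumption intended to skip unrelated cgroup messages.

```shell
#!/usr/bin/env bash
# Hypothetical CI gate: read preflight output on stdin, echo any line that
# mentions a cgroup driver, and return non-zero if at least one was found.
fail_on_cgroup_driver_warning() {
  if grep -i 'cgroup driver'; then
    echo "FATAL: preflight reported a cgroup driver problem" >&2
    return 1
  fi
  return 0
}

# Usage in a provisioning pipeline (same kubeadm invocation as above):
#   kubeadm init phase preflight --ignore-preflight-errors=all 2>&1 \
#     | fail_on_cgroup_driver_warning || exit 1
```

The matching lines are still printed, so the build log shows what tripped the gate.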
4. Monitoring — alert before the OOM fires
Add a Prometheus alert on cgroup driver inconsistency, exposed as a custom metric through node-exporter's textfile collector, or simply alert on kube_node_status_condition{condition="Ready",status="false"} (exported by kube-state-metrics) with a runbook link pointing to this fix.
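A custom metric for the driver can be produced by a small script that writes a file for node-exporter's textfile collector. This is a minimal sketch: the metric name node_cgroup_driver_info and the output path are assumptions, not standard names.

```shell
#!/usr/bin/env bash
# Hypothetical exporter: print a Prometheus gauge describing the cgroup driver.
# The metric name node_cgroup_driver_info is an assumption, not a standard metric.
emit_cgroup_driver_metric() {
  local driver="$1"
  echo '# HELP node_cgroup_driver_info Cgroup driver reported by the container runtime.'
  echo '# TYPE node_cgroup_driver_info gauge'
  printf 'node_cgroup_driver_info{driver="%s"} 1\n' "$driver"
}

# Usage from cron or a systemd timer. The directory is an assumption and must
# match node-exporter's --collector.textfile.directory setting:
#   emit_cgroup_driver_metric "$(docker info --format '{{.CgroupDriver}}')" \
#     > /var/lib/node_exporter/textfile/cgroup_driver.prom
```

With this in place, an alert expression along the lines of node_cgroup_driver_info{driver!="systemd"} == 1 would flag a misconfigured node before it starts failing under memory pressure.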