Fixing the Docker cgroupfs vs systemd Cgroup Driver Conflict on Linux

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10 mins

TL;DR

  • What broke: Docker is using cgroupfs as its cgroup driver while systemd (PID 1) owns the cgroup hierarchy — two managers fighting over the same resource, causing OOM kills and kubelet node registration failures.
  • How to fix it: Set "exec-opts": ["native.cgroupdriver=systemd"] in /etc/docker/daemon.json, restart Docker, and set cgroupDriver: systemd in the kubelet config so both sides match.

The Incident (What Does the Error Mean?)

You'll see one or more of these in your logs:

# journalctl -u docker.service
Failed to create cgroup: cannot enter cgroupv2 namespace: no such file or directory

# journalctl -u kubelet
failed to run Kubelet: failed to create kubelet: 
misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"

# dmesg
Out of memory: Kill process 3412 (containerd-shim) score 847 or sacrifice child

Immediate consequence: On a systemd host, systemd creates and manages the top-level cgroup hierarchy. When Docker runs its own cgroupfs manager alongside it, the node has two cgroup managers with two separate views of the same resources. The kernel still enforces whatever limits are written to the cgroup files, but systemd's accounting and Docker's no longer agree. Under memory pressure the node becomes unstable and the OOM killer fires, often taking down unrelated workloads on the same node.

On Kubernetes, this mismatch causes kubelet to fail node registration entirely. The node never reaches Ready state.


The Attack Vector / Blast Radius

This isn't a misconfiguration you can defer. The blast radius escalates in three stages:

Stage 1: Memory accounting breaks. cgroupfs and systemd each keep their own view of which cgroups exist and how much memory has been handed out. Per-container limits are still enforced by the kernel, but the slices systemd manages and the cgroups Docker creates directly are budgeted independently, so the node can promise more memory than it actually has.

Stage 2: OOM kills become non-deterministic. The kernel OOM killer responds to actual memory pressure on the node, not to the limits you configured. It kills the process with the highest OOM score, which is often your largest-memory container rather than the runaway one. Stateful workloads (Postgres, Redis, Kafka) can be killed with no obvious pattern.

Stage 3: Kubernetes node instability. Kubelet's eviction manager relies on cgroup-reported metrics via cAdvisor. With a split hierarchy, eviction thresholds are computed from stale or incorrect data. Nodes enter NotReady, pods are rescheduled cluster-wide, and if multiple nodes share the same base image or config, you get a cascading multi-node failure.
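To gauge how far along these stages a node already is, it can help to count OOM kills in the kernel log. The following is a minimal sketch, not a definitive detector: the function name count_oom_kills is hypothetical, and it assumes the kernel writes either "Out of memory: Kill process" (older kernels) or "Out of memory: Killed process" (newer kernels).

```shell
# Count kernel OOM-kill lines read from stdin.
# Assumption: the kernel logs "Out of memory: Kill process ..." or
# "Out of memory: Killed process ..." depending on its version.
count_oom_kills() {
  # grep -c prints the count but exits 1 when there are no matches,
  # so mask that status to keep the function usable under `set -e`.
  grep -Eci 'out of memory: kill(ed)? process' || true
}

# Example (reading the kernel log usually needs root):
#   dmesg | count_oom_kills
```

A count that keeps climbing while container limits look sane is a hint that the pressure is node-wide rather than confined to one container.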


How to Fix It

Basic Fix — Align Docker to systemd

Verify your current state first:

# Check host init system
ps -p 1 -o comm=
# Expected: systemd

# Check Docker's current cgroup driver
docker info | grep -i cgroup
# Bad output: Cgroup Driver: cgroupfs
# Good output: Cgroup Driver: systemd
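It is also worth knowing which cgroup version the host runs, since the errors above read differently on v1 and v2. A minimal sketch, assuming the commonly documented check that /sys/fs/cgroup reports filesystem type cgroup2fs on cgroup v2 and tmpfs on cgroup v1; the function name cgroup_version_for_fstype is hypothetical.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version label.
# Assumption: cgroup2fs means cgroup v2 and tmpfs means cgroup v1.
cgroup_version_for_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# Example on a live host:
#   cgroup_version_for_fstype "$(stat -fc %T /sys/fs/cgroup/)"
```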

Edit /etc/docker/daemon.json:

{
-  "exec-opts": [],
+  "exec-opts": ["native.cgroupdriver=systemd"],
+  "log-driver": "json-file",
+  "log-opts": {
+    "max-size": "100m"
+  },
   "storage-driver": "overlay2"
}
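Before restarting anything, you may want a backup of the file and a quick check that the edit landed. This is a sketch under stated assumptions: both function names are hypothetical, and the check is a plain text match on the option string, not a JSON parse, so it will not catch a syntactically broken file.

```shell
# Return 0 if the given daemon.json contains native.cgroupdriver=systemd,
# 1 if it does not, and 2 if the file cannot be read.
# Note: this is a text match only; it does not validate the JSON.
daemon_json_uses_systemd() {
  local file="$1"
  [ -r "$file" ] || return 2
  grep -Eq '"native\.cgroupdriver=systemd"' "$file"
}

# Copy the file next to itself with a timestamp suffix and print the backup path.
backup_daemon_json() {
  local file="$1" backup
  backup="${file}.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$file" "$backup" && echo "$backup"
}

# Example:
#   sudo sh -c '. ./helpers.sh; backup_daemon_json /etc/docker/daemon.json'
#   daemon_json_uses_systemd /etc/docker/daemon.json || echo "edit not applied"
```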

Apply and verify:

sudo systemctl daemon-reload
sudo systemctl restart docker
docker info | grep -i cgroup
# Must return: Cgroup Driver: systemd

⚠️ Restarting Docker stops all running containers. Schedule this during a maintenance window or drain the node first if it's in a Kubernetes cluster.


Enterprise Best Practice — Kubernetes Node Alignment (kubeadm / kubelet)

For kubeadm-provisioned clusters, the kubelet config must match. The --cgroup-driver kubelet flag is deprecated in favor of the cgroupDriver field in the kubelet config file.

Kubelet config file (/var/lib/kubelet/config.yaml or kubeadm KubeletConfiguration):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
-cgroupDriver: cgroupfs
+cgroupDriver: systemd
 evictionHard:
   memory.available: "200Mi"
   nodefs.available: "10%"
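To confirm what a node is actually configured with, you can read the field back and compare it with Docker's driver. A minimal sketch: the function names are hypothetical, the parser only handles a top-level cgroupDriver key on one line, and the fallback to cgroupfs when the key is missing is an assumption about the kubelet's default.

```shell
# Print the top-level cgroupDriver value from a kubelet config file.
# Assumption: when the key is absent the kubelet falls back to cgroupfs.
kubelet_cgroup_driver() {
  local file="$1" value
  value=$(sed -n 's/^cgroupDriver:[[:space:]]*//p' "$file" \
    | head -n 1 \
    | sed 's/[[:space:]]*#.*$//' \
    | tr -d "\"' \t\r")
  echo "${value:-cgroupfs}"
}

# Succeed only when both drivers are the same string.
drivers_match() {
  [ "$1" = "$2" ]
}

# Example on a node:
#   docker_driver=$(docker info --format '{{.CgroupDriver}}')
#   kubelet_driver=$(kubelet_cgroup_driver /var/lib/kubelet/config.yaml)
#   drivers_match "$docker_driver" "$kubelet_driver" || echo "MISMATCH"
```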

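kubeadm cluster init (prevent the problem at bootstrap). Pass a KubeletConfiguration document in the kubeadm config file rather than a kubelet flag. From v1.22, kubeadm defaults cgroupDriver to systemd when the field is unset, so this simply makes the choice explicit:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
---
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+cgroupDriver: systemd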

For existing nodes:

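# Drain the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# On the node: apply the daemon.json fix above (including the Docker restart),
# set cgroupDriver: systemd in /var/lib/kubelet/config.yaml, then:
sudo systemctl restart kubelet

# Verify
kubectl get node <node-name>
# STATUS must include: Ready (SchedulingDisabled is expected after a drain)

# Return the node to service
kubectl uncordon <node-name>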
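A drained node reports its status as a comma-separated list such as Ready,SchedulingDisabled, so a script that compares STATUS against the bare word Ready will misjudge it. The sketch below classifies one line of kubectl output; the function name node_state is hypothetical, and it assumes the default column order (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Read one line of `kubectl get node <name> --no-headers` on stdin and print
# "<ready|not-ready> <schedulable|cordoned>".
# Assumption: STATUS is the second whitespace-separated column.
node_state() {
  awk '{
    n = split($2, parts, ",")
    ready = "not-ready"; sched = "schedulable"
    for (i = 1; i <= n; i++) {
      if (parts[i] == "Ready") ready = "ready"
      if (parts[i] == "SchedulingDisabled") sched = "cordoned"
    }
    print ready, sched
    exit
  }'
}

# Example:
#   kubectl get node <node-name> --no-headers | node_state
```

If the result is "ready cordoned", the kubelet is healthy and the node is only waiting for kubectl uncordon.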


Prevention in CI/CD

1. Packer / cloud-init golden image validation

Bake the correct daemon.json into your base AMI or cloud-init template. Add a validation step that fails the image build if the driver is wrong:

# In your Packer post-processor or cloud-init runcmd
docker info --format '{{.CgroupDriver}}' | grep -q systemd || { echo "FATAL: cgroup driver mismatch"; exit 1; }

2. Checkov / OPA policy for Kubernetes manifests

If you manage kubelet configs as code (GitOps), add a Checkov custom check or Conftest OPA policy:

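# policy/cgroup_driver.rego
# Conftest reads the "main" package by default; run this one with
#   conftest test --namespace kubernetes.kubelet <kubelet-config.yaml>
package kubernetes.kubelet

import rego.v1

# Deny when cgroupDriver is missing or set to anything other than "systemd".
# object.get supplies "" for a missing field so the rule still fires.
deny contains msg if {
  input.kind == "KubeletConfiguration"
  object.get(input, "cgroupDriver", "") != "systemd"
  msg := "KubeletConfiguration must use cgroupDriver: systemd on systemd hosts"
}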

3. Node conformance test in CI

Run kubeadm preflight checks as part of your node provisioning pipeline:

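# Fail the pipeline if preflight emits a cgroup-related warning or error
if kubeadm init phase preflight --ignore-preflight-errors=all 2>&1 | grep -iE '(warning|error).*cgroup'; then
  echo "FATAL: kubeadm preflight reported a cgroup problem"
  exit 1
fi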

4. Monitoring — alert before the OOM fires

Add a Prometheus alert on cgroup driver inconsistency, exposed as a custom metric through node-exporter's textfile collector, or simply alert on kube_node_status_condition{condition="Ready",status="false"} == 1 (a kube-state-metrics series) with a runbook link pointing to this fix.
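One way to produce such a custom metric is a small script that writes a gauge into the directory node-exporter scans with its textfile collector. This is a sketch only: the metric name node_cgroup_driver_mismatch, its label names, and the function name are all hypothetical, and the output path in the example is a placeholder for whatever directory you pass to --collector.textfile.directory.

```shell
# Print a gauge in Prometheus text exposition format:
# 1 when the two drivers differ, 0 when they match.
# The metric and label names are illustrative, not an existing convention.
render_cgroup_driver_metric() {
  local docker_driver="$1" kubelet_driver="$2" mismatch=0
  [ "$docker_driver" = "$kubelet_driver" ] || mismatch=1
  printf '# HELP node_cgroup_driver_mismatch 1 if Docker and kubelet cgroup drivers differ.\n'
  printf '# TYPE node_cgroup_driver_mismatch gauge\n'
  printf 'node_cgroup_driver_mismatch{docker_driver="%s",kubelet_driver="%s"} %d\n' \
    "$docker_driver" "$kubelet_driver" "$mismatch"
}

# Example (write to a temp file, then rename so node-exporter never reads a
# half-written file):
#   out=/var/lib/node_exporter/textfile/cgroup_driver.prom
#   render_cgroup_driver_metric "$(docker info --format '{{.CgroupDriver}}')" systemd > "$out.tmp"
#   mv "$out.tmp" "$out"
```

An alert on node_cgroup_driver_mismatch == 1 then fires as soon as a misconfigured node is scraped, before memory pressure builds.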
