Why does 'nvidia-smi' work on the host but Docker --gpus all still fails?

nvidia-smi passing means the kernel driver is loaded, but Docker GPU access requires two additional layers: nvidia-container-toolkit must be installed, and the Docker daemon must be configured with the nvidia runtime in daemon.json. If either is missing, the container runtime cannot initialize the GPU even though the host driver is functional. Run 'docker info | grep -i runtime' to confirm nvidia appears in the runtime list.

What is the difference between nvidia-docker2 and nvidia-container-toolkit?

nvidia-docker2 is the legacy package: it shipped a thin nvidia-docker wrapper script around the Docker CLI and registered the nvidia runtime in daemon.json. nvidia-container-toolkit (libnvidia-container + nvidia-container-runtime) is the current implementation and integrates directly with the OCI runtime spec. On any host running Docker 19.03+, use nvidia-container-toolkit exclusively. nvidia-docker2 is deprecated and conflicts with toolkit installations on some distros.

How do I fix the nvidia-container-cli initialization error on Ubuntu 22.04 with cgroupv2?

Ubuntu 22.04 uses cgroupv2 by default. Older nvidia-container-toolkit versions (pre-1.12) fail silently with cgroupv2 when no-cgroups is set to false. Fix: update nvidia-container-toolkit to 1.14+ via the official NVIDIA repo, then set no-cgroups = true in /etc/nvidia-container-runtime/config.toml, and restart Docker. Alternatively, you can force cgroupv1 at boot with systemd.unified_cgroup_hierarchy=0 in GRUB, but this is not recommended for new deployments.
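
To compare an installed toolkit version against the 1.14 threshold mentioned above, a version-aware comparison is safer than a plain string comparison (which would rank 1.9 above 1.14). The following is a minimal sketch: the helper name version_ge is hypothetical, it relies on the -V (version sort) flag of GNU sort, and it assumes you pass in bare version strings such as the one reported by nvidia-ctk --version.

```shell
# version_ge HAVE WANT: succeed if version HAVE >= version WANT.
# Hypothetical helper; relies on the -V (version sort) flag of GNU sort.
version_ge() {
  local have="$1" want="$2"
  # The lower version sorts first, so HAVE >= WANT exactly when WANT
  # is the first line of the sorted pair.
  [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$want" ]
}

# Usage (the version string is an example, not captured output):
#   version_ge "1.14.3" "1.14" && echo "toolkit is new enough"
```

The function only compares strings, so it can be exercised without a GPU or the toolkit installed.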

How to Fix 'nvidia-container-cli: initialization error' When Running Docker with --gpus all

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–30 mins depending on driver state

TL;DR

What broke: Docker cannot initialize the NVIDIA container runtime because the host is missing the NVIDIA kernel driver or nvidia-container-toolkit, or because the Docker daemon is not configured to use the nvidia runtime.
How to fix it: Install the correct NVIDIA driver for your kernel, install nvidia-container-toolkit, and register the NVIDIA runtime with the Docker daemon.

The Incident (What Does the Error Mean?)

Raw error output from docker run --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
# or, when the toolkit is installed but the kernel driver is not loaded:
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.

Or the variant when the driver's user-space libraries are missing (libnvidia-ml.so.1 ships with the driver, not the toolkit):

docker: Error response from daemon: failed to create shim task: OCI runtime create failed:
container_linux.go:380: starting container process caused: process_linux.go:545:
running hook #0 caused: error running hook: exit status 1, stdout: , stderr:
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1:
cannot open shared object file: no such file or directory: unknown.

Immediate consequence: Every GPU workload — model inference, training jobs, CUDA-accelerated pipelines — is dead on arrival. The container never starts. If this is a Kubernetes node, the entire node's GPU capacity is offline.

The Attack Vector / Blast Radius

This is a hard infrastructure misconfiguration, not a soft warning. The blast radius depends on your workload:

ML Training Cluster: A single misconfigured node silently accepts GPU pod scheduling from Kubernetes but fails every container start. Jobs retry, burn queue time, and produce zero output. You may not notice for hours if health checks only validate pod scheduling, not runtime execution.
Inference API: Cold-start latency becomes infinite. If your orchestrator has no fallback, the service is fully down.
CI/CD GPU Pipelines: Every build that runs GPU tests fails. If your pipeline doesn't distinguish this error from a test failure, infrastructure faults are misreported as failing tests.
Root cause cascade: There are four distinct failure modes that produce this same error. Treating the symptom without identifying which layer is broken wastes time and risks a kernel panic if you force-install mismatched drivers.

The four failure layers (check in this order):

1. NVIDIA kernel driver not installed or not loaded (nvidia-smi fails)
2. nvidia-container-toolkit (the successor to nvidia-docker2) not installed
3. Docker daemon not configured to use the nvidia runtime
4. cgroup version mismatch (cgroupv2 on older toolkit versions)
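
Because several layers surface under the same "initialization error" banner, it can help to map the error text to a likely layer before touching anything. The sketch below is one way to do that. The function name classify_gpu_error and its output labels are hypothetical, and the message-to-layer mapping is an assumption based on the sample errors in this article (libnvidia-ml.so.1 is the NVML library that ships with the driver), not an exhaustive catalogue.

```shell
# classify_gpu_error: read Docker's error output on stdin and print a
# guess at the failing layer. Hypothetical helper; the mapping is an
# assumption drawn from the sample errors in this article.
classify_gpu_error() {
  local log
  log="$(cat)"
  case "$log" in
    *"could not select device driver"*)
      echo "toolkit-or-runtime" ;;  # Docker has no driver registered for the gpu capability
    *"driver not loaded"*)
      echo "kernel-driver" ;;       # NVML could not reach the kernel module
    *"libnvidia-ml.so.1"*)
      echo "driver-libraries" ;;    # NVML shared library not found on the host
    *"initialization error"*)
      echo "unknown-init" ;;        # none of the above; check cgroup settings next
    *)
      echo "not-gpu-related" ;;
  esac
}

# Usage:
#   docker run --rm --gpus all IMAGE nvidia-smi 2>&1 | classify_gpu_error
```

A CI wrapper could use the printed label to mark a run as an infrastructure failure instead of a test failure.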

How to Fix It (The Solution)

Step 0: Triage — Identify Which Layer Is Broken

# Test layer 1: kernel driver
nvidia-smi

# Test layer 2 & 3: container runtime
docker info | grep -i runtime

# Test layer 4: cgroup version (prints cgroup2fs on cgroupv2, tmpfs on cgroupv1)
stat -fc %T /sys/fs/cgroup/
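
The triage commands print raw text, so a small amount of interpretation helps when scripting them. The following sketch shows one possible reading of the layer 2/3 and layer 4 outputs. Both function names are hypothetical, and has_nvidia_runtime assumes the human-readable docker info layout, where registered runtimes appear on a single "Runtimes:" line.

```shell
# cgroup_version FSTYPE: map the filesystem type printed by
# `stat -fc %T /sys/fs/cgroup/` to a cgroup version.
cgroup_version() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# has_nvidia_runtime: read `docker info` output on stdin and succeed if
# the "Runtimes:" line lists nvidia as a whole word. The "Default Runtime:"
# line is deliberately not matched.
has_nvidia_runtime() {
  grep -i '^ *runtimes:' | grep -qw 'nvidia'
}

# Usage:
#   cgroup_version "$(stat -fc %T /sys/fs/cgroup/)"
#   docker info 2>/dev/null | has_nvidia_runtime || echo "nvidia runtime missing"
```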

Basic Fix — Ubuntu/Debian (Most Common Case)

Layer 1: Install NVIDIA Driver

# The proprietary driver conflicts with the nouveau kernel module, so avoid
# hand-picking driver packages; use ubuntu-drivers or the official NVIDIA repo
sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot

Post-reboot, nvidia-smi must return a valid driver version and GPU table before proceeding.

Layer 2 & 3: Install nvidia-container-toolkit and configure Docker

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Enterprise Best Practice — daemon.json Hardening

Do not rely on nvidia-ctk auto-configuration in production. Pin the runtime explicitly, and set the cgroup driver and log rotation in the same file.

# /etc/docker/daemon.json
{
-  "runtimes": {}
+  "default-runtime": "nvidia",
+  "runtimes": {
+    "nvidia": {
+      "path": "/usr/bin/nvidia-container-runtime",
+      "runtimeArgs": []
+    }
+  },
+  "exec-opts": ["native.cgroupdriver=systemd"],
+  "log-driver": "json-file",
+  "log-opts": {
+    "max-size": "100m",
+    "max-file": "3"
+  }
}
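
A hand-edited daemon.json is easy to get subtly wrong, so a quick check before restarting Docker can save a failed restart. The sketch below is a rough, grep-based check and not a JSON parser. The function name check_daemon_json is hypothetical, and it assumes the file follows the layout shown above, with an "nvidia" object under "runtimes" and a "default-runtime" key.

```shell
# check_daemon_json FILE: rough text check that FILE registers an "nvidia"
# runtime and makes it the default. Hypothetical helper; it matches text
# patterns only, so use a real JSON parser for anything stricter.
check_daemon_json() {
  local file="$1"
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  # An "nvidia" key whose value is an object, as in the runtimes block.
  grep -Eq '"nvidia"[[:space:]]*:[[:space:]]*[{]' "$file" \
    || { echo "no nvidia entry under runtimes" >&2; return 1; }
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$file" \
    || { echo "default-runtime is not nvidia" >&2; return 1; }
  echo "daemon.json looks OK"
}

# Usage:
#   check_daemon_json /etc/docker/daemon.json && sudo systemctl restart docker
```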

For cgroupv2 hosts (Ubuntu 22.04+, RHEL 9+):

# /etc/nvidia-container-runtime/config.toml
[nvidia-container-cli]
- no-cgroups = false
+ no-cgroups = true

This is the single most common missed step on modern kernels. With cgroupv2 and no-cgroups = false, initialization fails with an error that does not mention cgroups and looks identical to a missing driver.
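
To see which value is actually in effect, the setting can be read back from the file. The following is a minimal sketch: the function name no_cgroups_setting is hypothetical, and it does line-based matching, not TOML parsing, so it ignores section headers and treats a commented-out key as unset.

```shell
# no_cgroups_setting FILE: print the no-cgroups value found in a
# config.toml-style file, or "unset" when the key is absent or commented
# out. Hypothetical helper; line-based matching, not a TOML parser.
no_cgroups_setting() {
  local value
  value="$(sed -nE 's/^[[:space:]]*no-cgroups[[:space:]]*=[[:space:]]*(true|false).*/\1/p' "$1" | tail -n 1)"
  echo "${value:-unset}"
}

# Usage:
#   no_cgroups_setting /etc/nvidia-container-runtime/config.toml
```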

Verification:

docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
# Must return GPU table. Exit code must be 0.

Prevention in CI/CD

1. Pre-flight GPU node validation script (run at instance bootstrap, not at job time):

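#!/bin/bash
# /usr/local/bin/gpu-preflight.sh
set -euo pipefail

die() { echo "FATAL: $*" >&2; exit 1; }

# Layer 1: the kernel driver responds
nvidia-smi > /dev/null 2>&1 || die "NVIDIA driver not loaded"
# Layers 2-3: capture first; piping into grep -q under pipefail can fail
# spuriously if grep exits before docker info finishes writing
docker_info="$(docker info 2>/dev/null)" || die "cannot reach the Docker daemon"
grep -q 'nvidia' <<< "$docker_info" || die "nvidia runtime not registered"
# End to end: a container can see the GPU
docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi > /dev/null 2>&1 \
  || die "GPU container execution failed"
echo "GPU preflight PASSED"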

2. Packer / cloud-init: Bake the driver + toolkit into your AMI or GCP image. Never install drivers at runtime on a live node. Run the GPU preflight script as the final provisioner step in packer build so that a failed validation fails the image build.

3. Kubernetes — Node Feature Discovery + GPU Operator:

# Use NVIDIA GPU Operator — do not manage drivers manually on K8s nodes
# helm install gpu-operator nvidia/gpu-operator \
#   --set driver.enabled=true \
#   --set toolkit.enabled=true

# Validate node labels post-install:
kubectl get nodes -l nvidia.com/gpu.present=true
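
Node labels only show that a GPU was detected. As a complementary check, the allocatable nvidia.com/gpu count shows what the scheduler believes it can place on each node. The sketch below filters a two-column node listing for nodes that advertise no GPUs. The function name nodes_without_gpu is hypothetical, and it assumes input in the NAME/GPU custom-columns layout shown in the usage comment, where kubectl prints <none> for a missing value.

```shell
# nodes_without_gpu: read "NAME GPU" rows on stdin (header line first) and
# print the names of nodes whose allocatable GPU count is missing or zero.
# Hypothetical helper; non-numeric values such as <none> evaluate to 0.
nodes_without_gpu() {
  awk 'NR > 1 && $2 + 0 == 0 { print $1 }'
}

# Usage:
#   kubectl get nodes -l nvidia.com/gpu.present=true \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#     | nodes_without_gpu
```

This does not prove that containers start on a node; the preflight script remains the end-to-end test.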

4. Checkov / OPA policy — flag GPU workloads missing resource limits:

# OPA Rego: deny GPU containers without explicit resource limits
deny[msg] {
  container := input.spec.containers[_]
  container.resources.requests["nvidia.com/gpu"]
  not container.resources.limits["nvidia.com/gpu"]
  msg := sprintf("Container '%v' requests GPU without an explicit resource limit", [container.name])
}

5. GitHub Actions self-hosted runner validation:

- name: GPU Runtime Preflight
  run: /usr/local/bin/gpu-preflight.sh
  # Fail fast before any CUDA workload step

Never let a GPU job reach the training or inference step without a validated preflight. The error surfaces late, wastes compute budget, and produces confusing logs that obscure the real infrastructure failure.