Fixing Containerd CRI 'Failed to Create Pod Sandbox' Caused by Image Pull Secret Mismatch
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins
TL;DR
- What broke: Containerd's CRI shim cannot pull the pause/sandbox image (or the app image) because the imagePullSecret referenced in the Pod spec either doesn't exist in the correct namespace, has a malformed .dockerconfigjson, or references the wrong registry host.
- How to fix it: Verify the Secret exists in the same namespace as the Pod, decode and validate the .dockerconfigjson payload, and ensure the registry hostname in the secret matches the image reference exactly.
The Incident (What Does the Error Mean?)
You'll see this in kubectl describe pod <pod-name> or in the kubelet journal:
Warning Failed 3s kubelet Failed to create pod sandbox:
rpc error: code = Unknown
desc = failed to pull image "registry.internal.corp/pause:3.9":
failed to pull and unpack image "registry.internal.corp/pause:3.9":
failed to resolve reference "registry.internal.corp/pause:3.9":
unexpected status code 401 Unauthorized
or:
Warning Failed 2s kubelet Failed to create pod sandbox:
rpc error: code = Unknown
desc = failed to get sandbox image "registry.internal.corp/pause:3.9":
error getting credentials - err: docker-credential-ecr-login:
resolving credentials: secret "regcred" not found
Immediate consequence: The Pod never leaves ContainerCreating. No init containers run. No app containers start. If this is a Deployment rollout, the new ReplicaSet stalls, and depending on your maxUnavailable setting, the old pods may have already been terminated — you're now at zero healthy replicas.
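As a rough triage aid, the two event messages above can be told apart with a small shell helper. This is only a sketch: the function name and the labels it prints are illustrative, and the patterns cover just the two messages quoted in this article.

```shell
# Map a "Failed to create pod sandbox" event message to a likely cause.
# classify_sandbox_error and its output labels are hypothetical names.
classify_sandbox_error() {
  case "$1" in
    # The registry answered, but rejected the credentials (or none were sent).
    *'401 Unauthorized'*) echo "auth-rejected" ;;
    # The referenced Secret could not be found at all.
    *'secret "'*'" not found'*) echo "secret-missing" ;;
    *) echo "unknown" ;;
  esac
}
```

For example, passing the text copied from kubectl describe pod prints auth-rejected for the first message and secret-missing for the second, which points at Step 2 or Step 1 of the fix respectively.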
The Attack Vector / Blast Radius
This failure mode has two distinct blast radii:
1. Operational (most common): A namespace-scoped Secret was created in default but the workload runs in production. The kubelet resolves imagePullSecrets only within the Pod's own namespace and hands the resulting credentials to containerd over CRI. The lookup in production finds nothing, the pull goes out without valid credentials, the registry rejects it, and the sandbox never initializes. Every new Pod in that Deployment fails the same way once the rollout is in progress.
2. Security regression (the dangerous one): Teams under pressure "fix" this by switching private images to public mirrors, or by attaching a pull secret with wildcard registry access to a shared service account. A misconfigured dockerconfigjson with "auths": {"https://index.docker.io/v1/": {}} (empty credentials) silently falls back to unauthenticated pulls on some runtimes, meaning your image supply chain is now unverified. Worse, if the secret contains credentials for *.amazonaws.com ECR and overly broad RBAC lets workloads read Secrets in the namespace, a compromised pod can use them to enumerate and pull any image in that registry account.
Cascading risk: HPA-triggered scale-out events will repeatedly attempt and fail sandbox creation, generating thundering-herd kubelet log spam and elevated API server load from credential resolution retries.
How to Fix It
Step 1 — Confirm the secret exists in the right namespace
kubectl get secret regcred -n production
# If NotFound — that's your problem.
# Check what namespace it actually lives in:
kubectl get secrets --all-namespaces | grep regcred
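When the Pod references more than one pull secret, the same check can be looped. The sketch below assumes kubectl is already pointed at the right cluster; the function name missing_pull_secrets is illustrative and not part of kubectl.

```shell
# Print every imagePullSecret a Pod references that is absent from its namespace.
# Usage: missing_pull_secrets <namespace> <pod-name>
missing_pull_secrets() {
  local ns="$1" pod="$2" name
  # Names listed under .spec.imagePullSecrets, space-separated.
  for name in $(kubectl get pod "$pod" -n "$ns" \
      -o jsonpath='{.spec.imagePullSecrets[*].name}'); do
    # 'kubectl get' exits non-zero when the object does not exist.
    if ! kubectl get secret "$name" -n "$ns" >/dev/null 2>&1; then
      echo "$name"
    fi
  done
}
```

Running missing_pull_secrets production app-worker prints nothing when every referenced Secret is present, and one name per line otherwise.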
Step 2 — Decode and validate the dockerconfigjson
kubectl get secret regcred -n production \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
You must see the exact registry hostname that matches your image reference:
{
"auths": {
"registry.internal.corp": {
"username": "svc-k8s-pull",
"password": "<token>",
"auth": "<base64(user:pass)>"
}
}
}
If your image is registry.internal.corp/pause:3.9 but the secret has https://registry.internal.corp/v2/ — that mismatch is the bug. Containerd does exact-prefix matching on registry hosts.
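The comparison described here can be sketched in plain shell. The helpers below are illustrative names, not part of any tool. They follow the usual Docker reference convention (the first path component is a registry only if it contains a dot or colon, or is localhost) and apply the strict "bare hostname" rule this article recommends. Docker Hub's conventional https://index.docker.io/v1/ key is a special case this sketch does not handle.

```shell
# Registry host implied by an image reference.
image_registry_host() {
  local ref="$1" first
  case "$ref" in
    */*) first="${ref%%/*}" ;;          # text before the first slash
    *)   echo "docker.io"; return ;;    # no slash: an implicit Docker Hub image
  esac
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;; # looks like a host (dot, port, or localhost)
    *) echo "docker.io" ;;              # e.g. "library/nginx"
  esac
}

# Host portion of an "auths" key, with any scheme and path stripped.
auth_key_host() {
  local key="${1#*://}"
  echo "${key%%/*}"
}

# Succeeds only when the key is exactly the bare host the image uses.
# Usage: key_matches_image <auths-key> <image-reference>
key_matches_image() {
  [ "$1" = "$(image_registry_host "$2")" ]
}
```

With these, key_matches_image "https://registry.internal.corp/v2/" "registry.internal.corp/pause:3.9" fails, while auth_key_host on the same key shows the bare host the secret should have used.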
Basic Fix — Recreate the secret with the correct hostname
kubectl create secret docker-registry regcred \
--docker-server=registry.internal.corp \
--docker-username=svc-k8s-pull \
--docker-password="$(cat /vault/secrets/registry-token)" \
-n production \
--dry-run=client -o yaml | kubectl apply -f -
Enterprise Best Practice — Attach the secret to the namespace's default ServiceAccount
Instead of adding imagePullSecrets to every Pod spec (which gets missed), patch it onto the ServiceAccount so every Pod that runs under that ServiceAccount inherits it automatically:
 apiVersion: v1
 kind: ServiceAccount
 metadata:
   name: default
   namespace: production
+imagePullSecrets:
+- name: regcred
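The same ServiceAccount change can also be applied imperatively with kubectl patch. The helper below only builds the JSON patch body; its name is illustrative. It assumes the patch may replace any existing imagePullSecrets list on the ServiceAccount, so pass every secret name you want to keep.

```shell
# Build a patch body listing one or more pull secret names.
# Usage: pull_secret_patch <secret-name>...
pull_secret_patch() {
  local sep="" name
  printf '{"imagePullSecrets":['
  for name in "$@"; do
    printf '%s{"name":"%s"}' "$sep" "$name"
    sep=","
  done
  printf ']}\n'
}

# Example invocation (requires cluster access, so it is left commented out):
# kubectl patch serviceaccount default -n production -p "$(pull_secret_patch regcred)"
```

Only Pods created after the patch pick up the secret; existing Pods stuck in ContainerCreating need to be deleted or rolled.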
Pod spec correction:
 apiVersion: v1
 kind: Pod
 metadata:
   name: app-worker
   namespace: production
 spec:
+  imagePullSecrets:   # previously missing, or pointing at the wrong secret name
+  - name: regcred
   containers:
   - name: app
     image: registry.internal.corp/app:v2.1.0
Secret manifest (correct form):
 apiVersion: v1
 kind: Secret
 metadata:
-  name: reg-cred        # wrong name: Pod spec referenced 'regcred'
+  name: regcred
-  namespace: default    # wrong namespace
+  namespace: production
 type: kubernetes.io/dockerconfigjson
 data:
-  .dockerconfigjson: eyJhdXRocyI6eyJodHRwczovL3JlZ2lzdHJ5LmludGVybmFsLmNvcnAvdjIvIjp7fX19
-  # ^ decoded: registry URL has https:// prefix + /v2/ suffix, which containerd won't match
+  .dockerconfigjson: eyJhdXRocyI6eyJyZWdpc3RyeS5pbnRlcm5hbC5jb3JwIjp7InVzZXJuYW1lIjoic3ZjLWs4cy1wdWxsIiwicGFzc3dvcmQiOiJ0b2tlbiIsImF1dGgiOiJiYXNlNjRlbmNvZGVkIn19fQ==
+  # ^ decoded: bare hostname 'registry.internal.corp', correct for containerd
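To see how the decoded payload is assembled, here is a minimal shell sketch. The auth field is the base64 encoding of user:pass, and the whole JSON document is what ends up base64-encoded under .dockerconfigjson. Both function names are illustrative, and the sketch does no JSON escaping, so it assumes credentials without quotes or backslashes.

```shell
# base64("user:pass"), without the trailing newline that echo would add.
registry_auth_field() {
  printf '%s:%s' "$1" "$2" | base64 | tr -d '\n'
}

# Plain-text dockerconfigjson document for one registry host.
# Usage: dockerconfigjson <bare-host> <username> <password>
dockerconfigjson() {
  local host="$1" user="$2" pass="$3"
  printf '{"auths":{"%s":{"username":"%s","password":"%s","auth":"%s"}}}\n' \
    "$host" "$user" "$pass" "$(registry_auth_field "$user" "$pass")"
}
```

In practice kubectl create secret docker-registry builds this for you; the sketch is mainly useful for comparing against what you decoded in Step 2.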
Prevention in CI/CD
1. OPA/Gatekeeper policy — enforce imagePullSecrets presence:
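package k8srequiredimagepullsecrets

violation[{"msg": msg}] {
  input.review.object.kind == "Pod"
  # object.get supplies [] when the field is absent; a bare count() on a
  # missing field is undefined, so the rule would silently never fire.
  secrets := object.get(input.review.object.spec, "imagePullSecrets", [])
  count(secrets) == 0
  # No serviceAccountName test: by validation time the ServiceAccount admission
  # plugin has defaulted that field and copied any SA-level pull secrets in.
  msg := sprintf("Pod '%v' in namespace '%v' has no imagePullSecrets",
    [input.review.object.metadata.name, input.review.object.metadata.namespace])
}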
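2. Checkov scan in your pipeline. The built-in CKV_K8S_35 checks that secrets are mounted as files rather than exposed as environment variables; it does not inspect imagePullSecrets, so this needs a custom policy loaded from your own directory:
checkov -f pod.yaml --external-checks-dir ./checkov-policies --check CKV2_CUSTOM_PULLSECRET # hypothetical ID of your custom check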
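3. Helm chart render-time guard:
# Put this at the top of a rendered template such as templates/deployment.yaml to fail fast at render time.
# Helm does not render underscore-prefixed files, so a bare guard in _helpers.tpl would not run.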
{{- if and .Values.image.registry (not .Values.imagePullSecrets) }}
{{- fail "imagePullSecrets must be set when using a private registry" }}
{{- end }}
4. Namespace bootstrap automation: Use a Namespace provisioning controller (e.g., Hierarchical Namespace Controller or a simple operator) that automatically copies the regcred secret into every new namespace and patches the default ServiceAccount. Never rely on humans remembering to do this during incident-driven namespace creation.
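Until such a controller is in place, the copy step can be sketched as a shell filter over the Secret's YAML. The function name is illustrative, and the sed expressions assume the two-space metadata indentation that kubectl get -o yaml emits; a real controller would also need to handle owner references and later updates.

```shell
# Rewrite a Secret manifest read from stdin so it can be applied to another
# namespace: swap metadata.namespace and drop server-assigned fields.
# Usage: retarget_secret <target-namespace>
retarget_secret() {
  local target_ns="$1"
  sed -e "s/^  namespace: .*/  namespace: ${target_ns}/" \
      -e '/^  resourceVersion:/d' \
      -e '/^  uid:/d' \
      -e '/^  creationTimestamp:/d'
}

# Example invocation (requires cluster access, so it is left commented out):
# kubectl get secret regcred -n default -o yaml | retarget_secret production | kubectl apply -f -
```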
5. Validate secret format in CI before deploy:
# In your GitHub Actions / GitLab CI pre-deploy step:
kubectl create secret docker-registry regcred \
  --docker-server="$REGISTRY_HOST" \
  --docker-username="$REGISTRY_USER" \
  --docker-password="$REGISTRY_PASS" \
  --dry-run=client -o json | \
  jq -r '.data[".dockerconfigjson"]' | \
  base64 -d | jq -e '.auths | keys | all(test("^[a-z0-9.-]+(:[0-9]+)?$"))' >/dev/null \
  || { echo "FATAL: Registry hostname has invalid format for containerd" >&2; exit 1; }
This regex rejects https:// prefixes and trailing slashes before the secret ever reaches the cluster, while still allowing an optional port such as registry.internal.corp:5000. Using all(...) checks every key under auths rather than only the last one.