
Fixing Containerd CRI 'Failed to Create Pod Sandbox' Caused by Image Pull Secret Mismatch

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins


TL;DR

  • What broke: Containerd's CRI plugin cannot pull the pause/sandbox image (or the app image) because the imagePullSecret referenced in the Pod spec either doesn't exist in the correct namespace, has a malformed .dockerconfigjson, or references the wrong registry host.
  • How to fix it: Verify the Secret exists in the same namespace as the Pod, decode and validate the .dockerconfigjson payload, and ensure the registry hostname in the secret matches the image reference exactly.

The Incident (What Does the Error Mean?)

You'll see this in kubectl describe pod <pod-name> or in the kubelet journal:

Warning  Failed     3s    kubelet  Failed to create pod sandbox:
  rpc error: code = Unknown
  desc = failed to pull image "registry.internal.corp/pause:3.9":
  failed to pull and unpack image "registry.internal.corp/pause:3.9":
  failed to resolve reference "registry.internal.corp/pause:3.9":
  unexpected status code 401 Unauthorized

or:

Warning  Failed  2s  kubelet  Failed to create pod sandbox:
  rpc error: code = Unknown
  desc = failed to get sandbox image "registry.internal.corp/pause:3.9":
  error getting credentials - err: docker-credential-ecr-login:
  resolving credentials: secret "regcred" not found

Immediate consequence: The Pod never leaves ContainerCreating. No init containers run. No app containers start. If this is a Deployment rollout, the new ReplicaSet stalls, and depending on your maxUnavailable setting, the old pods may have already been terminated — you're now at zero healthy replicas.
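How much capacity survives a stalled rollout depends on how maxUnavailable resolves for your replica count. The sketch below is a minimal illustration, assuming a RollingUpdate strategy and ignoring maxSurge; the function name and the sample numbers are hypothetical. It applies the rule that a percentage value for maxUnavailable is rounded down.

```python
def min_ready_during_rollout(replicas: int, max_unavailable) -> int:
    """Lowest number of Pods kept available while a RollingUpdate is in progress.

    max_unavailable is either an int or a percentage string such as "25%".
    Percentages are rounded down, as Kubernetes does for maxUnavailable.
    """
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        unavailable = replicas * percent // 100  # integer division rounds down
    else:
        unavailable = int(max_unavailable)
    return max(replicas - unavailable, 0)


if __name__ == "__main__":
    for replicas, setting in [(4, "25%"), (3, "50%"), (2, "100%"), (1, 1)]:
        ready = min_ready_during_rollout(replicas, setting)
        print(f"replicas={replicas} maxUnavailable={setting} -> at least {ready} ready")
```

With a conservative setting the old Pods keep serving while the new ReplicaSet is stuck; the zero-replica outcome only appears when maxUnavailable is allowed to cover every replica.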


The Attack Vector / Blast Radius

This failure mode has two distinct blast radii:

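1. Operational (most common): A namespace-scoped Secret was created in default but the workload runs in production. The kubelet resolves imagePullSecrets from the Pod's own namespace and passes the resulting credentials to containerd in the CRI image-pull request. When the Secret is not found, no credentials are sent, the registry answers 401, and the Pod never starts. (The pause image itself is pulled by containerd with node-level registry credentials from its own configuration, so a private sandbox image needs those set as well.) Every new Pod in the Deployment fails the same way, so an in-progress rollout stalls across all replicas at once.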

2. Security regression (the dangerous one): Teams under pressure "fix" this by switching private images to public mirrors, or by attaching a pull secret with registry-wide access to a widely shared service account. A misconfigured dockerconfigjson with "auths": {"https://index.docker.io/v1/": {}} (empty credentials) silently falls back to unauthenticated pulls on some runtimes, so images may come from a source you never authenticated against. Worse, if the secret holds credentials for an ECR registry (*.amazonaws.com) and the pod's service account has RBAC broad enough to read Secrets, a compromised pod can read those credentials and use them to enumerate and pull any image in that registry account.
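The empty-credentials case described above is easy to detect before a secret ships. A minimal Python sketch, assuming the input is the base64 value stored under the Secret's .dockerconfigjson key; the function name is hypothetical, and less common fields such as identitytoken are deliberately not considered:

```python
import base64
import json


def empty_auth_entries(dockerconfigjson_b64: str) -> list[str]:
    """Return the registry keys under .auths that carry no usable credentials."""
    config = json.loads(base64.b64decode(dockerconfigjson_b64))
    empty = []
    for registry, entry in config.get("auths", {}).items():
        has_auth = bool(entry.get("auth"))
        has_user_pass = bool(entry.get("username")) and bool(entry.get("password"))
        if not (has_auth or has_user_pass):
            empty.append(registry)
    return empty


if __name__ == "__main__":
    # Illustrative payload: one empty entry, one entry with an auth string.
    sample = {
        "auths": {
            "https://index.docker.io/v1/": {},
            "registry.internal.corp": {"auth": "dXNlcjpwYXNz"},
        }
    }
    encoded = base64.b64encode(json.dumps(sample).encode()).decode()
    print(empty_auth_entries(encoded))
```

Any key this returns is a registry the workload would be talking to without credentials.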

Cascading risk: HPA-triggered scale-out events will repeatedly attempt and fail sandbox creation, generating thundering-herd kubelet log spam and elevated API server load from credential resolution retries.


How to Fix It

Step 1 — Confirm the secret exists in the right namespace

kubectl get secret regcred -n production
# If NotFound — that's your problem.

# Check what namespace it actually lives in:
kubectl get secrets --all-namespaces | grep regcred
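
The grep above matches substrings, so a secret named regcred-old would also show up. As a stricter alternative, here is a minimal Python sketch that does an exact-name lookup. It assumes its input is the output of kubectl get secrets --all-namespaces -o json; the function names are hypothetical.

```python
import json


def namespaces_with_secret(secret_list_json: str, name: str) -> list[str]:
    """List the namespaces that hold a Secret with exactly this name."""
    items = json.loads(secret_list_json).get("items", [])
    return sorted(
        item["metadata"]["namespace"]
        for item in items
        if item["metadata"]["name"] == name
    )


def diagnose(secret_list_json: str, name: str, pod_namespace: str) -> str:
    """Summarise whether the Secret is where the Pod expects it."""
    found = namespaces_with_secret(secret_list_json, name)
    if pod_namespace in found:
        return f"ok: {name} exists in {pod_namespace}"
    if found:
        return f"mismatch: {name} exists in {', '.join(found)} but not in {pod_namespace}"
    return f"missing: {name} does not exist in any namespace"


if __name__ == "__main__":
    import sys

    # Example: kubectl get secrets --all-namespaces -o json | python check.py regcred production
    print(diagnose(sys.stdin.read(), sys.argv[1], sys.argv[2]))
```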

Step 2 — Decode and validate the dockerconfigjson

kubectl get secret regcred -n production \
  -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .

You must see the exact registry hostname that matches your image reference:

{
  "auths": {
    "registry.internal.corp": {
      "username": "svc-k8s-pull",
      "password": "<token>",
      "auth": "<base64(user:pass)>"
    }
  }
}

If your image is registry.internal.corp/pause:3.9 but the secret's key is https://registry.internal.corp/v2/, that mismatch is the bug. Use the bare registry host (plus port, if any) as the key so it matches the image reference exactly.
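
The hostname comparison can be automated. The following Python sketch is a deliberately strict check that enforces this guide's bare-hostname convention; it is not a reimplementation of how the kubelet or containerd match credentials, and both function names are hypothetical. It assumes the usual Docker convention that the first path component of an image reference is a registry only if it contains a dot or a colon, or is localhost.

```python
import base64
import json


def image_registry_host(image: str) -> str:
    """Extract the registry host (with port, if any) from an image reference."""
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    # No explicit registry component: the image lives on Docker Hub.
    return "docker.io"


def secret_covers_image(dockerconfigjson_b64: str, image: str) -> bool:
    """True only if a key under .auths is exactly the image's registry host."""
    config = json.loads(base64.b64decode(dockerconfigjson_b64))
    return image_registry_host(image) in config.get("auths", {})


if __name__ == "__main__":
    def encode(cfg: dict) -> str:
        return base64.b64encode(json.dumps(cfg).encode()).decode()

    image = "registry.internal.corp/pause:3.9"
    bare = encode({"auths": {"registry.internal.corp": {"auth": "placeholder"}}})
    url = encode({"auths": {"https://registry.internal.corp/v2/": {"auth": "placeholder"}}})
    print(secret_covers_image(bare, image), secret_covers_image(url, image))
```

Because the check is exact, it will also reject Docker Hub secrets keyed as https://index.docker.io/v1/; treat it as a lint for private registries rather than a general validator.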

Basic Fix — Recreate the secret with the correct hostname

kubectl create secret docker-registry regcred \
  --docker-server=registry.internal.corp \
  --docker-username=svc-k8s-pull \
  --docker-password="$(cat /vault/secrets/registry-token)" \
  -n production \
  --dry-run=client -o yaml | kubectl apply -f -
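
For reference, the payload that ends up under .dockerconfigjson has the same shape as the JSON shown in Step 2. A minimal Python sketch of that structure, useful when a pipeline has to render the manifest without kubectl; the function name is hypothetical and the credentials in the example are placeholders:

```python
import base64
import json


def build_dockerconfigjson(server: str, username: str, password: str) -> str:
    """Return the base64 value for the Secret's .dockerconfigjson data key."""
    # "auth" is base64("username:password"), as in the Step 2 example.
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    config = {
        "auths": {
            server: {"username": username, "password": password, "auth": auth}
        }
    }
    return base64.b64encode(json.dumps(config).encode()).decode()


if __name__ == "__main__":
    value = build_dockerconfigjson("registry.internal.corp", "svc-k8s-pull", "example-token")
    # Decode again to confirm the registry key is the bare hostname.
    print(json.loads(base64.b64decode(value))["auths"].keys())
```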

Enterprise Best Practice — Attach the secret to the namespace's default ServiceAccount

Instead of adding imagePullSecrets to every Pod spec (which gets missed), attach it to the ServiceAccount so that every new pod running under that ServiceAccount inherits it automatically:

 apiVersion: v1
 kind: ServiceAccount
 metadata:
   name: default
   namespace: production
+imagePullSecrets:
+- name: regcred

Pod spec correction:

 apiVersion: v1
 kind: Pod
 metadata:
   name: app-worker
   namespace: production
 spec:
+  imagePullSecrets:
+  - name: regcred
   containers:
   - name: app
     image: registry.internal.corp/app:v2.1.0
-  # imagePullSecrets was missing or pointed to wrong secret name

Secret manifest (correct form):

 apiVersion: v1
 kind: Secret
 metadata:
-  name: reg-cred          # wrong name — Pod spec referenced 'regcred'
+  name: regcred
-  namespace: default      # wrong namespace
+  namespace: production
 type: kubernetes.io/dockerconfigjson
 data:
-  .dockerconfigjson: eyJhdXRocyI6eyJodHRwczovL3JlZ2lzdHJ5LmludGVybmFsLmNvcnAvdjIvIjp7fX19
-  # ^ decoded: registry URL has https:// prefix + /v2/ suffix — containerd won't match this
+  .dockerconfigjson: eyJhdXRocyI6eyJyZWdpc3RyeS5pbnRlcm5hbC5jb3JwIjp7InVzZXJuYW1lIjoic3ZjLWs4cy1wdWxsIiwicGFzc3dvcmQiOiJ0b2tlbiIsImF1dGgiOiJiYXNlNjRlbmNvZGVkIn19fQ==
+  # ^ decoded: bare hostname 'registry.internal.corp' — correct for containerd



Prevention in CI/CD

1. OPA/Gatekeeper policy — enforce imagePullSecrets presence:

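package k8srequiredimagepullsecrets

violation[{"msg": msg}] {
  input.review.object.kind == "Pod"
  # object.get supplies [] when the field is absent, so a missing list and an
  # empty list are both caught. count() on a missing field is undefined, which
  # would make the rule silently not fire.
  count(object.get(input.review.object.spec, "imagePullSecrets", [])) == 0
  # No separate ServiceAccount check: secrets attached to the ServiceAccount are
  # normally copied into the Pod spec by the ServiceAccount admission plugin
  # before validating webhooks run.
  msg := sprintf("Pod '%v' in namespace '%v' has no imagePullSecrets",
    [input.review.object.metadata.name, input.review.object.metadata.namespace])
}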

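2. Checkov scan in your pipeline. Confirm the check ID with checkov --list before relying on it: the built-in CKV_K8S_35 covers secrets exposed as environment variables, so enforcing imagePullSecrets most likely needs a custom policy:

checkov -f pod.yaml --external-checks-dir ./checkov-policies  # ./checkov-policies is a placeholder for your custom policy directory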

3. Helm chart values.yaml guard:

# In your chart's _helpers.tpl, fail fast at render time:
{{- if and .Values.image.registry (not .Values.imagePullSecrets) }}
  {{- fail "imagePullSecrets must be set when using a private registry" }}
{{- end }}

4. Namespace bootstrap automation: Use a Namespace provisioning controller (e.g., Hierarchical Namespace Controller or a simple operator) that automatically copies the regcred secret into every new namespace and patches the default ServiceAccount. Never rely on humans remembering to do this during incident-driven namespace creation.
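
What such a bootstrap step has to produce can be sketched in a few lines. The Python below is a minimal illustration, not the behaviour of any particular controller; both function names are hypothetical. It builds a clean copy of the Secret for the new namespace and the patch body that attaches it to the default ServiceAccount.

```python
import copy


def secret_for_namespace(source: dict, target_namespace: str) -> dict:
    """Build a copy of a Secret manifest that can be applied to another namespace.

    Only name, type and data are carried over. Server-populated metadata
    (uid, resourceVersion, creationTimestamp, ownerReferences) is dropped.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": source["metadata"]["name"],
            "namespace": target_namespace,
        },
        "type": source["type"],
        "data": copy.deepcopy(source["data"]),
    }


def service_account_patch(secret_name: str) -> dict:
    """Patch body that attaches the pull secret to a ServiceAccount."""
    return {"imagePullSecrets": [{"name": secret_name}]}


if __name__ == "__main__":
    import json

    original = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "regcred", "namespace": "production", "uid": "example-uid"},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": "placeholder"},
    }
    print(json.dumps(secret_for_namespace(original, "staging"), indent=2))
    print(json.dumps(service_account_patch("regcred")))
```

The patch body can be passed to kubectl patch serviceaccount default -n <namespace> -p '<json>'.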

5. Validate secret format in CI before deploy:

# In your GitHub Actions / GitLab CI pre-deploy step:
# all() tests every key under .auths; a bare keys[] stream would let
# jq -e judge only the last key. An optional :port suffix is allowed.
kubectl create secret docker-registry regcred \
  --docker-server="$REGISTRY_HOST" \
  --docker-username="$REGISTRY_USER" \
  --docker-password="$REGISTRY_PASS" \
  --dry-run=client -o json | \
  jq -r '.data[".dockerconfigjson"]' | \
  base64 -d | \
  jq -e 'all(.auths | keys[]; test("^[a-z0-9.-]+(:[0-9]+)?$"))' > /dev/null \
  || { echo "FATAL: Registry hostname has invalid format for containerd" >&2; exit 1; }

This regex rejects https:// prefixes and trailing slashes before the secret ever reaches the cluster.
