Fixing Rook Ceph 'OSD Not Found' After Node Drain: Recovery Guide for Degraded PGs

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on PG recovery state

TL;DR

  • What broke: kubectl drain evicted the OSD pod without Ceph completing PG rebalancing, leaving OSDs in a down/out state and PGs undersized or degraded.
  • How to fix it: Set noout and norebalance, uncordon the node, restart the OSD pod, and verify that PG recovery completes before unsetting the flags or continuing with maintenance.

The Incident (What Does the Error Mean?)

You ran kubectl drain <node> for routine maintenance. Now:

$ ceph status
  cluster:
    health: HEALTH_ERR

  services:
    osd: 3 osds: 2 up, 2 in; 1 OSD down

  data:
    pools: 3 pools, 96 pgs
    pgs: 24 undersized
         24 undersized+degraded
         48 active+clean

  io:
    client: 0 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr

And in the Rook operator logs:

ERROR: OSD id=2 not found on node worker-2, pod rook-ceph-osd-2 evicted
Failed to find OSD deployment for osd.2

Immediate consequence: With a replica-size-3 pool now running on 2 OSDs, any second OSD failure causes full data unavailability. Writes may block depending on min_size configuration. This is a partial outage.


The Blast Radius

This is not a cosmetic warning. Here is the failure cascade:

  1. kubectl drain does not consult Ceph quorum. It evicts pods based on PodDisruptionBudgets (PDBs). If your Rook deployment does not have a correctly configured PDB with maxUnavailable: 0 for OSD pods, drain proceeds immediately.
  2. Ceph marks the OSD down after about 20s of missed heartbeats, but only marks it out after mon_osd_down_out_interval (default 600s). Until it goes out, PGs are undersized. After 600s, Ceph starts backfilling to the remaining OSDs, adding recovery traffic on top of client I/O on the surviving disks.
  3. If backfill completes before you recover the OSD, the returning OSD holds stale copies, and marking it back in triggers a second rebalance wave.
  4. With min_size=1 pools (misconfigured dev clusters promoted to prod), degraded PGs will still serve I/O from a single surviving copy, so losing that copy loses acknowledged writes and leaves only stale replicas.
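To check whether any pool is exposed to the min_size=1 case, the output of ceph osd pool ls detail can be filtered. The following is a minimal sketch, not part of the original runbook: the function name pools_below_min_size is hypothetical, and it assumes each pool line has the usual form "pool <id> '<name>' ... min_size <n> ...".

```shell
# List pools whose min_size is below 2.
# Reads `ceph osd pool ls detail` text on stdin.
# Assumption: pool lines look like
#   pool 1 'replicapool' replicated size 3 min_size 2 crush_rule 1 ...
pools_below_min_size() {
  awk '$1 == "pool" {
         for (i = 1; i < NF; i++)
           if ($i == "min_size" && $(i + 1) + 0 < 2) print $3
       }' | tr -d "'"
}

# Example (inside the toolbox pod):
#   ceph osd pool ls detail | pools_below_min_size
```

An empty result means no pool matched. Pool names that contain spaces would not be handled by this sketch.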

The security angle: A degraded cluster under I/O pressure is the worst time to discover your mon_allow_pool_delete is true or that a runaway process has ceph osd pool delete permissions.


How to Fix It

Step 1 — Immediately Pause the OSD Out Timer

Exec into the Rook toolbox pod:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

Set the noout and norebalance flags so that Ceph neither marks the down OSD out nor starts moving data away from the drained node:

ceph osd set noout
ceph osd set norebalance

Do this within the first 600 seconds or you will trigger a full backfill.
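Before moving on, it is worth confirming that both flags actually took effect. The sketch below assumes the flags line printed by ceph osd dump is a comma-separated list such as "flags noout,norebalance,sortbitwise"; the helper name osd_flags_set is hypothetical.

```shell
# Succeed only if every named flag appears in the `flags` line.
# Reads `ceph osd dump` text on stdin.
osd_flags_set() {
  local flags flag
  flags=$(awk '$1 == "flags" { print $2 }')
  for flag in "$@"; do
    case ",$flags," in
      *",$flag,"*) ;;   # flag present, check the next one
      *) echo "flag not set: $flag" >&2; return 1 ;;
    esac
  done
}

# Example (inside the toolbox pod):
#   ceph osd dump | osd_flags_set noout norebalance && echo "flags OK"
```

A non-zero exit names the first missing flag on stderr, which makes the check easy to drop into a drain script.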

Step 2 — Identify the Missing OSD

ceph osd tree
ceph osd status

Note the OSD ID that is down. Cross-reference with:

kubectl -n rook-ceph get pods -o wide | grep osd
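On larger clusters, picking the down OSD out of the tree by eye gets tedious. As a rough sketch, the text output of ceph osd tree can be filtered for OSD names followed by a down status. The function name list_down_osds is hypothetical, and the parsing assumes the default plain-text column layout.

```shell
# Print the names of OSDs whose STATUS column reads "down".
# Reads `ceph osd tree` text on stdin.
list_down_osds() {
  awk '{
         # Look for an osd.N token immediately followed by the status,
         # so the filter works whether or not the CLASS column is filled.
         for (i = 1; i < NF; i++)
           if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
       }'
}

# Example (inside the toolbox pod):
#   ceph osd tree | list_down_osds
```

Each printed name (for example osd.2) carries the ID to look for in the kubectl pod listing.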

Step 3 — Uncordon the Node and Restart the OSD Pod

kubectl uncordon <node-name>
kubectl -n rook-ceph delete pod <rook-ceph-osd-N-pod-name>

The OSD Deployment recreates the pod, which can now be scheduled on the uncordoned node. Watch recovery:

watch ceph status
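For unattended maintenance, watching the screen can be replaced by a polling check. The sketch below assumes ceph pg stat prints a summary of the form "96 pgs: 96 active+clean; ..."; the helper name all_pgs_clean is hypothetical. It is deliberately strict: PGs in states such as active+clean+scrubbing count as not yet clean.

```shell
# Succeed only when every PG is reported as exactly active+clean.
# Reads `ceph pg stat` text on stdin.
all_pgs_clean() {
  awk '{
         for (i = 2; i <= NF; i++) {
           if ($i == "pgs:") total = $(i - 1)                     # total PG count
           if ($i ~ /^active\+clean[;,]?$/) clean = $(i - 1)      # clean PG count
         }
       }
       END { exit !(total > 0 && total == clean) }'
}

# Example (inside the toolbox pod): poll every 10 seconds until recovery finishes.
#   until ceph pg stat | all_pgs_clean; do sleep 10; done
```

Gating the flag removal on this check keeps noout and norebalance in place until recovery has finished.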

Step 4 — Unset Flags After OSD is Back Up

Only after ceph status shows all PGs active+clean:

ceph osd unset noout
ceph osd unset norebalance

Basic Fix vs. Enterprise Best Practice

Basic Fix: Manually set noout before every drain. Fragile, human-error prone.

Enterprise Best Practice: Enforce a pre-drain hook via a MachineConfig or drain automation script, AND fix the CephCluster CR to deploy proper PodDisruptionBudgets.

# CephCluster CR — rook-ceph-cluster.yaml
 apiVersion: ceph.rook.io/v1
 kind: CephCluster
 metadata:
   name: rook-ceph
   namespace: rook-ceph
 spec:
   disruptionManagement:
-    managePodBudgets: false
+    managePodBudgets: true
+    osdMaintenanceTimeout: 30
+    pgHealthCheckTimeout: 10
   storage:
     useAllNodes: true
     useAllDevices: false
+  # Ensure replica pools never go below min_size=2
+  # Configure this at pool level, not cluster level
# Pre-drain wrapper script — drain-node.sh
+#!/bin/bash
+set -euo pipefail
+# Fail with a usage message when no node name is given.
+NODE="${1:?usage: drain-node.sh <node-name>}"
+# -o name yields pod/<name>, which kubectl exec accepts directly.
+TOOLBOX_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | head -1)
+
+echo "[+] Setting Ceph noout flag before drain..."
+kubectl -n rook-ceph exec "$TOOLBOX_POD" -- ceph osd set noout
+kubectl -n rook-ceph exec "$TOOLBOX_POD" -- ceph osd set norebalance
+
+echo "[+] Draining node $NODE..."
+kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
+
+echo "[+] Drain complete. Monitor PG health before unsetting flags."
+echo "    Run: ceph osd unset noout && ceph osd unset norebalance"
-# Previously: kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data
-# No Ceph coordination. Caused OSD eviction mid-rebalance.


Prevention in CI/CD

1. OPA/Gatekeeper Policy — Block Drain Without Noout

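Enforce via an admission check that requires the ceph.rook.io/noout-set annotation on Node updates. This is advanced but achievable with a custom admission webhook or a Kyverno policy. Keep in mind that kubectl drain is not a single API operation: it cordons the Node (an UPDATE that sets spec.unschedulable) and then evicts pods. The policy below matches every Node UPDATE, so scope it to cordon requests before enforcing it in production: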

# kyverno-policy-drain-guard.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ceph-noout-before-drain
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-noout-annotation
    match:
      resources:
        kinds: [Node]
        operations: [UPDATE]
    validate:
      message: "Node drain requires annotation ceph.rook.io/noout-set=true"
      pattern:
        metadata:
          annotations:
            ceph.rook.io/noout-set: "true"

2. Checkov / Helm Chart Linting

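Add to your CI pipeline. Note that CKV_K8S_35 is a generic Kubernetes check (it flags secrets exposed as environment variables) and does not inspect the CephCluster disruption settings, so treat it as baseline linting and rely on the PDB check for this failure mode: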

checkov -f rook-ceph-cluster.yaml --check CKV_K8S_35
# Also validate PDB existence:
kubectl -n rook-ceph get pdb -l app=rook-ceph-osd

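If no PDB exists for OSD pods, your pipeline should fail. kubectl get exits 0 even when it finds no resources, so the pipeline has to test the output rather than the exit code.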
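One way to turn the PDB lookup into a real gate is to test the listing itself. This is a sketch under stated assumptions: the function name require_osd_pdb is hypothetical, and it assumes the OSD budgets managed by Rook have names containing "rook-ceph-osd", which should be confirmed against your cluster.

```shell
# Fail (non-zero) unless a PodDisruptionBudget whose name contains
# "rook-ceph-osd" is listed.
# Reads `kubectl -n rook-ceph get pdb -o name` output on stdin.
require_osd_pdb() {
  if grep -q 'rook-ceph-osd'; then
    return 0
  fi
  echo "no OSD PodDisruptionBudget found" >&2
  return 1
}

# Example CI step:
#   kubectl -n rook-ceph get pdb -o name | require_osd_pdb
```

Matching on the name avoids depending on labels, which can differ between Rook releases.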

3. Alerting — Catch It Before It Cascades

Prometheus alert that fires once an OSD has been down for one minute, well before the 600s out timer expires:

- alert: CephOSDDown
  expr: ceph_osd_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "OSD {{ $labels.ceph_daemon }} is DOWN — set noout immediately"
    runbook: "https://your-wiki/rook-ceph-osd-recovery"

4. Terraform / GitOps Enforcement

If managing Rook via Helm/Terraform, pin disruptionManagement.managePodBudgets: true (the key as it appears in the CephCluster spec) in your values file and enforce it via a terraform plan check in your PR pipeline. Any PR that sets this to false should require a storage team approval gate.
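A lightweight PR check can catch the setting before any plan runs. The sketch below uses plain text matching rather than a YAML parser, so it only suits simple values files; the function name check_manage_pod_budgets and the file name in the example are hypothetical.

```shell
# Fail unless the first managePodBudgets entry in the file is set to true.
# Plain text matching: comments and nesting are not interpreted.
check_manage_pod_budgets() {
  local file=$1 value
  value=$(sed -n 's/^[[:space:]]*managePodBudgets:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$file" | head -n 1)
  if [ "$value" = "true" ]; then
    return 0
  fi
  echo "managePodBudgets is '${value:-unset}' in $file" >&2
  return 1
}

# Example CI step (values.yaml is a placeholder path):
#   check_manage_pod_budgets values.yaml
```

A missing key is treated as a failure too, on the assumption that the pipeline should require the setting to be pinned explicitly.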
