Fixing Rook Ceph 'OSD Not Found' After Node Drain: Recovery Guide for Degraded PGs
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on PG recovery state
TL;DR
- What broke: kubectl drain evicted the OSD pod without Ceph completing PG rebalancing, leaving OSDs in a down/out state and PGs undersized or degraded.
- How to fix it: Force the OSD back in, restart the OSD pod, and verify PG recovery completes before proceeding with maintenance.
- Shortcut: Use our Client-Side Sandbox above: paste your ceph status or CephCluster YAML and get auto-generated recovery commands without leaking your cluster config.
The Incident (What Does the Error Mean?)
You ran kubectl drain <node> for routine maintenance. Now:
$ ceph status
  cluster:
    health: HEALTH_ERR
  services:
    osd: 3 osds: 2 up, 2 in; 1 OSD down
  data:
    pools:   3 pools, 96 pgs
    pgs:     24 undersized
             24 undersized+degraded
             48 active+clean
  io:
    client:   0 B/s rd, 0 B/s wr, 0 op/s rd, 0 op/s wr
And in the Rook operator logs:
ERROR: OSD id=2 not found on node worker-2, pod rook-ceph-osd-2 evicted
Failed to find OSD deployment for osd.2
Immediate consequence: With a replica-size-3 pool now running on 2 OSDs, any second OSD failure causes full data unavailability. Writes may block depending on min_size configuration. This is a partial outage.
The Blast Radius
This is not a cosmetic warning. Here is the failure cascade:
- kubectl drain does not consult Ceph quorum. It evicts pods based on PodDisruptionBudgets (PDBs). If your Rook deployment does not have a correctly configured PDB with maxUnavailable: 0 for OSD pods, drain proceeds immediately.
- Ceph marks the OSD down after about 20s, but does not mark it out until mon_osd_down_out_interval expires (default 600s). Until it goes out, PGs are undersized. After 600s, Ceph starts backfilling to the remaining OSDs, which adds heavy recovery writes on the surviving disks.
- If backfill completes before you recover the OSD, the original OSD rejoins and triggers a second rebalance wave as data moves back onto it.
- With min_size=1 pools (misconfigured dev clusters promoted to prod), PGs keep serving I/O even when only a single replica remains, so a write acknowledged by that one copy is lost if that copy fails too.
The security angle: A degraded cluster under I/O pressure is the worst time to discover your mon_allow_pool_delete is true or that a runaway process has ceph osd pool delete permissions.
How to Fix It
Step 1 — Immediately Pause the OSD Out Timer
Exec into the Rook toolbox pod:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
Set the noout and norebalance flags to stop Ceph from marking the OSD out and rebalancing away from the drained node:
ceph osd set noout
ceph osd set norebalance
Do this within the first 600 seconds or you will trigger a full backfill.
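Before moving on, it is worth confirming that both flags actually took effect. The sketch below assumes the flags line of `ceph osd dump` has the usual form `flags noout,norebalance,...`; the helper name `flags_are_set` is hypothetical and not part of Ceph or Rook.

```shell
#!/usr/bin/env bash
# Returns 0 only if every flag named in $2.. appears in the comma-separated
# flags string $1 (the value from the "flags" line of `ceph osd dump`).
flags_are_set() {
  local flags=",$1,"   # pad with commas so each flag matches as a whole word
  shift
  local want
  for want in "$@"; do
    case "$flags" in
      *",$want,"*) ;;        # flag present, keep checking
      *) return 1 ;;         # flag missing
    esac
  done
  return 0
}

# Usage inside the toolbox (not executed here):
#   current=$(ceph osd dump | awk '$1 == "flags" {print $2}')
#   flags_are_set "$current" noout norebalance || echo "flags missing" >&2
```

Matching on comma-delimited whole words avoids a false positive if a flag name happens to be a prefix of another.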
Step 2 — Identify the Missing OSD
ceph osd tree
ceph osd status
Note the OSD ID that is down. Cross-reference with:
kubectl -n rook-ceph get pods -o wide | grep osd
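Picking the down OSD out of the tree by eye is error-prone on larger clusters. A minimal sketch of a filter follows, assuming the plain-text `ceph osd tree` layout in which the STATUS column directly follows the `osd.N` name; `down_osds` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Reads `ceph osd tree` text on stdin and prints the name of every OSD
# whose STATUS column is "down", one per line.
down_osds() {
  awk '{
    # Scan each row for an osd.N field immediately followed by "down".
    for (i = 1; i < NF; i++)
      if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
  }'
}

# Usage inside the toolbox (not executed here):
#   ceph osd tree | down_osds
```

Scanning for the name field rather than a fixed column number keeps the filter working when the CLASS column is empty.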
Step 3 — Uncordon the Node and Restart the OSD Pod
kubectl uncordon <node-name>
kubectl -n rook-ceph delete pod <rook-ceph-osd-N-pod-name>
The OSD Deployment recreates the pod, which can now be scheduled on the uncordoned node. Watch recovery:
watch ceph status
Step 4 — Unset Flags After OSD is Back Up
Only after ceph status shows all PGs active+clean:
ceph osd unset noout
ceph osd unset norebalance
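The "all PGs active+clean" condition can be checked by a script instead of by watching the screen. Checking for HEALTH_OK does not work here, because the noout flag itself keeps the cluster at HEALTH_WARN. The sketch below assumes the `ceph pg stat` summary line has the form `96 pgs: 96 active+clean; ...`; `all_pgs_clean` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Returns 0 if the `ceph pg stat` summary line in $1 reports every PG as
# exactly active+clean, for example "96 pgs: 96 active+clean; 1.2 GiB data".
all_pgs_clean() {
  local line=$1 total states
  total=${line%% pgs:*}     # text before " pgs:" (the total PG count)
  states=${line#*pgs: }     # text after "pgs: "
  states=${states%%;*}      # drop everything from the first ";" onward
  [ "$states" = "$total active+clean" ]
}

# Usage inside the toolbox (not executed here):
#   until all_pgs_clean "$(ceph pg stat)"; do sleep 10; done
#   ceph osd unset noout && ceph osd unset norebalance
```

Requiring the state list to be a single `active+clean` entry equal to the total means any other state, such as undersized or backfilling, keeps the loop waiting.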
Basic Fix vs. Enterprise Best Practice
Basic Fix: Manually set noout before every drain. Fragile, human-error prone.
Enterprise Best Practice: Enforce a pre-drain hook via a MachineConfig or drain automation script, AND fix the CephCluster CR to deploy proper PodDisruptionBudgets.
# CephCluster CR — rook-ceph-cluster.yaml
 apiVersion: ceph.rook.io/v1
 kind: CephCluster
 metadata:
   name: rook-ceph
   namespace: rook-ceph
 spec:
   disruptionManagement:
-    managePodBudgets: false
+    managePodBudgets: true
+    osdMaintenanceTimeout: 30
+    pgHealthCheckTimeout: 10
   storage:
     useAllNodes: true
     useAllDevices: false
+    # Ensure replica pools never go below min_size=2
+    # Configure this at pool level, not cluster level
# Pre-drain wrapper script — drain-node.sh
+#!/bin/bash
+set -euo pipefail
+# Node to drain; abort with a usage message if it is missing.
+NODE="${1:?usage: drain-node.sh <node-name>}"
+TOOLBOX_POD=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | head -1)
+# Stop early if the toolbox is not deployed, rather than draining unprotected.
+[ -n "$TOOLBOX_POD" ] || { echo "[-] No rook-ceph-tools pod found" >&2; exit 1; }
+
+echo "[+] Setting Ceph noout and norebalance flags before drain..."
+kubectl -n rook-ceph exec "$TOOLBOX_POD" -- ceph osd set noout
+kubectl -n rook-ceph exec "$TOOLBOX_POD" -- ceph osd set norebalance
+
+echo "[+] Draining node $NODE..."
+kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
+
+echo "[+] Drain complete. Monitor PG health before unsetting flags."
+echo "    Run: ceph osd unset noout && ceph osd unset norebalance"
-# Previously: kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data
-# No Ceph coordination. Caused OSD eviction mid-rebalance.
Prevention in CI/CD
1. OPA/Gatekeeper Policy — Block Drain Without Noout
Enforce via a validating admission policy that requires the ceph.rook.io/noout-set annotation on Node updates (kubectl drain starts by cordoning the Node, which is an UPDATE). This is advanced but achievable with a custom admission webhook or a Kyverno policy:
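# kyverno-policy-drain-guard.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ceph-noout-before-drain
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-noout-annotation
      match:
        resources:
          kinds: [Node]
          operations: [UPDATE]
      # Only evaluate updates that leave the Node cordoned; without this,
      # every Node update would need the annotation.
      preconditions:
        all:
          - key: "{{ request.object.spec.unschedulable || `false` }}"
            operator: Equals
            value: true
      validate:
        message: "Node drain requires annotation ceph.rook.io/noout-set=true"
        pattern:
          metadata:
            annotations:
              ceph.rook.io/noout-set: "true"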
2. Checkov / Helm Chart Linting
Add to your CI pipeline:
checkov -f rook-ceph-cluster.yaml --check CKV_K8S_35
# Also validate PDB existence:
kubectl -n rook-ceph get pdb -l app=rook-ceph-osd
If no PDB exists for OSD pods, your pipeline should fail.
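Note that `kubectl get` exits with status 0 even when the selector matches nothing, so the command alone will not fail a pipeline. One way to turn an empty result into a failure is sketched below; `require_osd_pdb` is a hypothetical helper name, and the label selector is the one used above.

```shell
#!/usr/bin/env bash
# Fails (returns 1) when the newline-separated list of PDB names in $1 is
# empty; otherwise reports how many were found.
require_osd_pdb() {
  local names=$1
  if [ -z "$names" ]; then
    echo "FAIL: no PodDisruptionBudget found for OSD pods" >&2
    return 1
  fi
  # Count one PDB per line; tr strips the padding some wc builds add.
  echo "OK: found $(printf '%s\n' "$names" | wc -l | tr -d ' ') PDB(s)"
}

# Usage in a CI step (not executed here):
#   require_osd_pdb "$(kubectl -n rook-ceph get pdb -l app=rook-ceph-osd -o name)"
```

Passing the names in as an argument keeps the check testable without cluster access.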
3. Alerting — Catch It Before It Cascades
Prometheus alert that fires within a minute of an OSD going down, well before the 600s out timer:
- alert: CephOSDDown
  expr: ceph_osd_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "OSD {{ $labels.ceph_daemon }} is DOWN: set noout immediately"
    runbook: "https://your-wiki/rook-ceph-osd-recovery"
4. Terraform / GitOps Enforcement
If managing Rook via Helm/Terraform, pin disruptionManagement.managePodBudgets: true in your values file and enforce it via a terraform plan check in your PR pipeline. Any PR that sets this to false should require a storage team approval gate.
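A lightweight version of that PR check can be a plain text match on the values file. The sketch below is an assumption-laden illustration, not a YAML-aware validator: `pod_budgets_enabled` is a hypothetical helper, and the file path passed to it depends on your repository layout.

```shell
#!/usr/bin/env bash
# Returns 0 if file $1 contains an uncommented "managePodBudgets: true" line.
# This is a text match only; it does not verify where the key sits in the YAML.
pod_budgets_enabled() {
  grep -Eq '^[[:space:]]*managePodBudgets:[[:space:]]*true[[:space:]]*(#.*)?$' "$1"
}

# Usage in a PR pipeline (path is hypothetical, not executed here):
#   pod_budgets_enabled values.yaml || { echo "managePodBudgets must be true" >&2; exit 1; }
```

A parser-based check is stricter, but a grep gate like this is often enough to flag the change for storage team review.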