Can I recover data from a Longhorn volume with zero replicas?

Possibly. If the underlying disk is still physically present on a node, use the Longhorn UI 'Request Last Replica Salvage' option on the volume before deleting any replicas. This attempts to promote the last faulted replica to active so data can be read. If the disk is gone and you have no Longhorn backup configured, data is unrecoverable. This is why `reclaimPolicy: Retain` and recurring backup jobs are non-negotiable in production.

Why does Longhorn show replica count zero after a node reboot?

On reboot, if the node's disk is not re-registered before the Longhorn engine gives up on its replicas (controlled by `engine-replica-timeout`), replicas on that disk are marked Error. If you only had one replica (`numberOfReplicas: 1`), the count immediately drops to zero. Fix: always run 3 replicas, and ensure node disk tags are preserved across reboots via the Longhorn node config. Separately, raise `staleReplicaTimeout` to at least 2880 minutes for production; this StorageClass parameter controls how long an errored replica is kept before Longhorn cleans it up, so a longer value leaves more time to salvage it.

How do I prevent Longhorn from scheduling all replicas on the same node?

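Keep `replica-soft-anti-affinity` set to `false` in Longhorn global settings. With soft anti-affinity off, Longhorn refuses to place two replicas of the same volume on one node and blocks scheduling if anti-affinity cannot be satisfied; setting it to `true` permits co-location as a fallback when no other node is available. `replica-zone-soft-anti-affinity` works the same way at the zone level: `true` prefers separate zones but allows sharing one, while `false` enforces hard zone separation, which is safer for multi-AZ clusters that have enough zones for every replica. Also use `nodeSelector` and `diskSelector` in your StorageClass parameters to pin replicas to tagged, dedicated storage nodes.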

How to Fix Longhorn 'Volume Controller Failed to Attach' When Replica Count Is Zero

Threat/Impact Level: CRITICAL | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on data recovery path

TL;DR

What broke: Longhorn has zero schedulable replicas for the volume — all replicas are in Error or Stopped state, so the volume controller cannot attach to any node.
How to fix it: Attempt a salvage of the last faulted replica first, verify node/disk scheduling is enabled, delete replicas that cannot be recovered, and trigger a replica rebuild or restore from backup.

The Incident (What Does the Error Mean?)

Raw error surface — typically seen in kubectl describe pod or Longhorn Manager logs:

Warning  FailedAttachVolume  kubelet  AttachVolume.Attach failed for volume "pvc-a1b2c3d4":
  rpc error: code = Internal desc = failed to attach volume: volume controller failed to attach,
  no healthy replica available, replica count: 0

And in Longhorn Manager (kubectl logs -n longhorn-system -l app=longhorn-manager):

time="..." level=error msg="failed to attach volume pvc-a1b2c3d4"
  reason="replica count is 0, no replica available for scheduling"

Immediate consequence: Every pod mounting this PVC is stuck in ContainerCreating or Pending. StatefulSets, databases, and any workload with a volumeMount referencing this PVC are fully down. Kubernetes will not reschedule the pod to another node — it will sit and retry indefinitely.

The Attack Vector / Blast Radius

This is not a transient blip. Zero replicas means Longhorn has no copy of the volume data it can serve. How you got here matters:

Scenario A (node mass eviction or cloud provider spot interruption): All nodes hosting replicas were terminated simultaneously. Longhorn marks replicas Error. If numberOfReplicas: 1 was set (anti-pattern), you have zero redundancy and zero recovery path without a backup.

Scenario B (disk pressure or disabled scheduling): Disks ran out of schedulable space, nodes are missing the node.longhorn.io/create-default-disk label (this is a node label that controls default disk creation, not a taint), or disks were manually disabled in the Longhorn UI. The scheduler cannot place new replicas, so replicas that fail on those disks are never rebuilt elsewhere.

Scenario C (Longhorn upgrade gone wrong): A manager rollout left Replica custom resources in an inconsistent state. The engine image mismatch causes replicas to fail health checks.

Blast radius: In a StatefulSet with 10 pods across 10 PVCs, if your storage class has numberOfReplicas: 1 and a single AZ goes down, all 10 volumes hit zero replicas simultaneously. Your entire stateful tier is offline. This cascades to connection pool exhaustion in upstream services within seconds.

How to Fix It

Step 1 — Triage: Identify replica state

# Get volume status
kubectl get volume.longhorn.io -n longhorn-system pvc-a1b2c3d4 -o yaml

# List all replicas for this volume
kubectl get replicas.longhorn.io -n longhorn-system \
  -l longhornvolume=pvc-a1b2c3d4 -o wide

# Check node/disk scheduling status
kubectl get nodes.longhorn.io -n longhorn-system -o yaml | \
  grep -A5 'allowScheduling'
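
With many replicas the raw listing is hard to scan, so a per-state count can help decide whether anything is left to salvage. The helper below is a minimal sketch: `count_replica_states` is a hypothetical name, and the field paths in the usage comment (`.status.currentState`, `.spec.nodeID`) are assumptions that should be confirmed against `-o yaml` output on your Longhorn version.

```shell
# Reads rows of "NAME STATE NODE" (no header) on stdin and prints one
# "state=count" line per distinct state, sorted by state name.
count_replica_states() {
  awk 'NF >= 2 { count[$2]++ } END { for (s in count) printf "%s=%d\n", s, count[s] }' | sort
}

# Usage against a live cluster (field paths are assumptions, verify with -o yaml):
#   kubectl get replicas.longhorn.io -n longhorn-system \
#     -l longhornvolume=pvc-a1b2c3d4 --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATE:.status.currentState,NODE:.spec.nodeID \
#     | count_replica_states
```

If no replica is reported in a running state, treat the volume as a salvage-or-restore case rather than a simple rebuild.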

Step 2 — Basic Fix (Re-enable scheduling and force replica rebuild)

# Re-enable disk scheduling on affected node (if disabled)
kubectl patch node.longhorn.io <node-name> -n longhorn-system \
  --type=json \
  -p='[{"op":"replace","path":"/spec/disks/<disk-id>/allowScheduling","value":true}]'

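# Delete replicas to force rescheduling. WARNING: this selector matches EVERY replica
# of the volume, not only Error/Stopped ones, so attempt salvage first if any data may remain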
kubectl delete replicas.longhorn.io -n longhorn-system \
  -l longhornvolume=pvc-a1b2c3d4

# Longhorn will attempt to schedule new replicas — watch the volume
kubectl get volume.longhorn.io -n longhorn-system pvc-a1b2c3d4 -w

If the volume has salvageable data on a faulted replica, use the Longhorn UI: Volume → Request Last Replica Salvage before deleting replicas.

Step 3 — Enterprise Best Practice (Fix the root cause in the StorageClass)

The real fix is never running numberOfReplicas: 1 for stateful workloads and enforcing node anti-affinity for replica placement.

 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: longhorn-production
 provisioner: driver.longhorn.io
 parameters:
-  numberOfReplicas: "1"
+  numberOfReplicas: "3"
-  staleReplicaTimeout: "30"
+  staleReplicaTimeout: "2880"
+  replicaAutoBalance: "best-effort"
+  diskSelector: "ssd"
+  nodeSelector: "storage"
   fromBackup: ""
 reclaimPolicy: Retain
+volumeBindingMode: WaitForFirstConsumer

And for the Longhorn default settings (applied via kubectl edit settings.longhorn.io -n longhorn-system):

 # Setting: default-replica-count
- value: "1"
+ value: "3"

 # Setting: replica-soft-anti-affinity (false = never place two replicas of a volume on one node)
- value: "true"
+ value: "false"

 # Setting: replica-zone-soft-anti-affinity  
- value: "false"
+ value: "true"
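
For pipelines where an interactive `kubectl edit` is impractical, the same change can probably be expressed as a patch. The sketch below only prints the command so it can be reviewed first. It assumes each Longhorn Setting object keeps its value in a top-level `value` string field; `longhorn_setting_patch_cmd` is a hypothetical helper name.

```shell
# Prints (does not run) the kubectl command that would set one Longhorn setting.
# Assumes the Setting object stores its value in a top-level "value" string field.
longhorn_setting_patch_cmd() {
  name="$1"
  value="$2"
  printf "kubectl -n longhorn-system patch settings.longhorn.io %s --type=merge -p '{\"value\":\"%s\"}'\n" "$name" "$value"
}

# Example: review the printed command, then paste it into a shell to apply it.
longhorn_setting_patch_cmd default-replica-count 3
```

Printing instead of executing keeps the helper safe to run anywhere and leaves the final decision with the operator.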

Prevention in CI/CD

OPA/Gatekeeper Policy — Block single-replica StorageClasses

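# Written in conftest style: deny[msg] evaluated against the raw manifest.
# A Gatekeeper ConstraintTemplate would instead define violation[{"msg": msg}]
# and read the object from input.review.object.
package longhorn.storagepolicy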

deny[msg] {
  input.kind == "StorageClass"
  input.provisioner == "driver.longhorn.io"
  to_number(input.parameters.numberOfReplicas) < 2
  msg := sprintf(
    "StorageClass '%v' sets numberOfReplicas < 2. Minimum is 2 for non-dev workloads.",
    [input.metadata.name]
  )
}

Checkov / Helm pre-flight check

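# Add to your CI pipeline before helm upgrade.
# CKV_LONGHORN_REPLICA_COUNT is not a built-in Checkov check; it has to be a custom
# check you maintain and load with --external-checks-dir (placeholder path below).
# -d expects a directory; use -f instead to scan a single file.
checkov -d ./helm \
  --external-checks-dir ./checkov-custom \
  --check CKV_LONGHORN_REPLICA_COUNT \
  --compact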
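
Where no policy tooling is available, the same replica-count gate can be approximated in plain shell. This is a rough sketch, not a replacement for a real policy engine: `check_min_replicas` is a hypothetical helper, and it assumes the rendered manifest has each `numberOfReplicas: "N"` pair on its own line.

```shell
# Exits non-zero if any numberOfReplicas value in the given file is below the minimum.
# Assumes one 'numberOfReplicas: "N"' pair per line in a rendered manifest.
check_min_replicas() {
  file="$1"
  min="${2:-2}"   # default minimum of 2, matching the policy message above
  awk -v min="$min" '
    /^[[:space:]]*numberOfReplicas:/ {
      n = $2
      gsub(/[^0-9]/, "", n)   # strip the quotes around the number
      if (n + 0 < min) {
        printf "%s:%d: numberOfReplicas=%s is below %d\n", FILENAME, FNR, n, min
        bad = 1
      }
    }
    END { exit bad }
  ' "$file"
}

# Usage in CI (the file name is a placeholder):
#   helm template ./helm > rendered.yaml && check_min_replicas rendered.yaml 2
```

A line-based check like this will miss values set through unusual YAML layouts, so it works best as a cheap first gate in front of the policy above.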

Longhorn Backup Policy (non-negotiable for production)

# Recurring job: back up every 6h and retain 56 backups (about 14 days at 4 per day).
# An hourly snapshot schedule needs its own RecurringJob with task: snapshot.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: production-backup
  namespace: longhorn-system
spec:
  cron: "0 */6 * * *"
  task: backup
  groups:
    - default
  retain: 56
  concurrency: 2

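Label all production PVCs with recurring-job-group.longhorn.io/default: enabled so this job picks them up automatically. Depending on your Longhorn version, PVC labels may only be synced to the volume when the PVC also carries recurring-job.longhorn.io/source: enabled; if in doubt, confirm the label appears on the Longhorn volume itself. Without this, zero replicas plus no backup means unrecoverable data loss.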
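
The `retain` value in the job above follows from simple arithmetic: the number of days to keep, times the number of runs per day. A small sketch of that calculation, with `backup_retain_count` as a hypothetical helper name, makes it easy to recompute when the schedule changes.

```shell
# Number of backups to keep so the oldest one is roughly <days> old,
# given one backup every <interval_hours> hours (integer division).
backup_retain_count() {
  days="$1"
  interval_hours="$2"
  echo $(( days * 24 / interval_hours ))
}

backup_retain_count 14 6   # prints 56
```

Check the result against any maximum retention your Longhorn version enforces before applying it.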

Alerting — Don't find out from users

# Prometheus alert rule
- alert: LonghornVolumeReplicaCountCritical
  expr: longhorn_volume_robustness == 3  # 3 = Faulted
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Longhorn volume {{ $labels.volume }} is Faulted — replica count may be zero"
    runbook: "https://your-wiki/longhorn-replica-recovery"
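
When reading the raw metric during an incident, it helps to translate the numeric value back into a state name. The mapping below is a sketch based on the documented `longhorn_volume_robustness` values (only 3 = Faulted appears in the rule above; the other codes are taken from the Longhorn metrics reference and should be verified for your version). `robustness_name` is a hypothetical helper.

```shell
# Maps a longhorn_volume_robustness value to its state name.
robustness_name() {
  case "$1" in
    0) echo unknown ;;
    1) echo healthy ;;
    2) echo degraded ;;
    3) echo faulted ;;
    *) echo "unrecognized($1)" ;;
  esac
}

robustness_name 3   # prints faulted
```

A second, lower-severity alert on the degraded value would give earlier warning, before the last replica is lost.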