What is the default etcd storage quota and why is it so small?

The default `--quota-backend-bytes` is 2,147,483,648 bytes (2GB). It was set conservatively because etcd is designed as a small, consistent metadata store, not a general-purpose database. High-churn Kubernetes clusters with many CRDs, frequent HPA events, or GitOps controllers can exhaust it. The etcd documentation suggests 8GB as the maximum for normal environments, and etcd logs a warning at startup if the configured quota is higher; beyond that, bolt DB performance tends to degrade.
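
The two quota values used throughout this guide are plain powers of two, which is easy to confirm. A minimal sketch; the helper name `gib_to_quota_bytes` is hypothetical and not part of etcd:

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def gib_to_quota_bytes(gib: int) -> int:
    """Return the value to pass to --quota-backend-bytes for a size in GiB."""
    if gib <= 0:
        raise ValueError("quota must be a positive number of GiB")
    return gib * GIB


default_quota = gib_to_quota_bytes(2)  # the default quota
raised_quota = gib_to_quota_bytes(8)   # the value used later in this guide
print(default_quota, raised_quota)
```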

Why does defragging only one etcd member not fix the problem?

The NOSPACE alarm is raised by the member that crosses its quota, but it is replicated to the whole cluster, so every member rejects writes until `alarm disarm` is called. Defragmenting one member only shrinks that member's database file. Any member that is still over quota raises the alarm again as soon as writes resume, so defragment ALL members before you disarm.

Will compacting etcd revisions cause data loss?

No. Compaction removes old MVCC revision history — the intermediate states of objects that have since been updated or deleted. The latest revision of every live object is fully preserved. The only operational impact is that `kubectl get --watch` clients that were watching from a very old resource version will receive a 'compacted' error and need to re-list, which Kubernetes controllers handle automatically.
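
To see why compaction is safe, here is a toy model of a multi-version store. It is an illustration only, not etcd's implementation: the `ToyMVCC` class is hypothetical and it ignores deletes and tombstones.

```python
from collections import defaultdict


class ToyMVCC:
    """Toy multi-version store: every put appends a (revision, value) pair."""

    def __init__(self):
        self.revision = 0
        self.history = defaultdict(list)  # key -> [(revision, value), ...]

    def put(self, key, value):
        self.revision += 1
        self.history[key].append((self.revision, value))

    def get(self, key):
        """Return the latest value of a key, or None if it was never written."""
        versions = self.history.get(key)
        return versions[-1][1] if versions else None

    def compact(self, rev):
        """For each key, keep the newest version at or below rev plus all later ones."""
        for key, versions in list(self.history.items()):
            older = [v for v in versions if v[0] <= rev]
            newer = [v for v in versions if v[0] > rev]
            self.history[key] = older[-1:] + newer

    def stored_versions(self):
        return sum(len(v) for v in self.history.values())


store = ToyMVCC()
for i in range(5):
    store.put("/registry/configmaps/default/app", f"v{i}")
store.put("/registry/secrets/default/token", "s0")

before = store.stored_versions()
store.compact(store.revision)  # compact to the current revision
after = store.stored_versions()
print(before, after, store.get("/registry/configmaps/default/app"))
```

Six stored versions shrink to two, one per live key, and reads of the current state are unchanged.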

How to Fix 'Etcd mvcc: database space exceeded' and Restore Your Kubernetes Control Plane

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: Etcd hit its --quota-backend-bytes ceiling (default 2GB). All Kubernetes API write operations — deployments, configmaps, secrets — are now rejected with rpc error: code = ResourceExhausted.
How to fix it: Compact old MVCC revisions, run etcdctl defrag on every member, disarm the alarm, then raise the quota and restart etcd.

The Incident (What does the error mean?)

Raw error surface — you'll see this in kube-apiserver logs and from kubectl itself:

etcdserver: mvcc: database space exceeded
rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded

Etcd stores its data in an MVCC (Multi-Version Concurrency Control) keyspace backed by a bolt database. Every write creates a new revision; old revisions are NOT removed until compaction runs, and the space they occupied is only returned to the filesystem by defragmentation. When the bolt DB file on disk crosses the --quota-backend-bytes threshold, etcd raises a NOSPACE alarm and accepts only reads and deletes. At that point:

kubectl apply fails.
New pod scheduling stops.
The Kubernetes controller manager cannot reconcile any resources.
Your cluster is effectively frozen.
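
The toy model below shows why recovery needs compaction, defragmentation, and a disarm together. It is a sketch under a stated assumption, namely that the quota is compared against the size of the file on disk as described above; `ToyBackend` and `NoSpaceError` are hypothetical names, and the byte counts are arbitrary.

```python
class NoSpaceError(RuntimeError):
    """Stands in for 'etcdserver: mvcc: database space exceeded'."""


class ToyBackend:
    def __init__(self, quota_bytes):
        self.quota_bytes = quota_bytes
        self.file_size = 0   # size of the file on disk, compared to the quota
        self.in_use = 0      # bytes held by revisions not yet compacted
        self.alarm = False

    def put(self, nbytes):
        if self.alarm or self.file_size + nbytes > self.quota_bytes:
            self.alarm = True
            raise NoSpaceError("mvcc: database space exceeded")
        free = self.file_size - self.in_use
        self.file_size += max(0, nbytes - free)  # reuse free pages first
        self.in_use += nbytes

    def compact(self, nbytes):
        """Mark old revisions as free pages; the file itself does not shrink."""
        self.in_use = max(0, self.in_use - nbytes)

    def defrag(self):
        """Rewrite the file without its free pages."""
        self.file_size = self.in_use

    def disarm(self):
        self.alarm = False


backend = ToyBackend(quota_bytes=100)
for _ in range(9):
    backend.put(10)  # file grows to 90 of 100 bytes
print(backend.file_size, backend.in_use)
```

In this model a further `put(20)` raises the alarm. Calling `compact(80)` and `disarm()` is not enough, because the file is still 90 bytes and the next write trips the alarm again. Only after `defrag()` followed by `disarm()` does the write succeed.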

The Attack Vector / Blast Radius

This is a full control-plane outage vector. The cascading failure path:

Etcd alarm raised → etcd stops accepting writes.
Kube-apiserver returns 503 or ResourceExhausted to all mutating requests.
Controller Manager and Scheduler lose the ability to write status updates — running pods continue, but no new scheduling decisions are persisted.
Operators and Helm releases begin timing out and retrying, generating even more write pressure on recovery.
In multi-member clusters, if you defrag only one member, the others stay over quota and the alarm comes back after you disarm it. This common mistake extends the outage.

High-churn workloads (HPA thrashing, frequent ConfigMap/Secret rotations, high-volume CRD controllers like Flux or ArgoCD) are the primary drivers. A single runaway controller writing status patches every few seconds can exhaust 2GB in hours.
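
A back-of-the-envelope estimate makes the "hours" claim concrete. This is a rough sketch: it assumes every write stores a full copy of the object, that no compaction or defrag runs in the meantime, and it ignores bolt's own overhead. The function name and the sample numbers (50 KiB objects, 10 writes per second) are illustrative assumptions.

```python
def hours_until_quota(bytes_per_write, writes_per_second,
                      quota_bytes=2 * 1024 ** 3, current_bytes=0):
    """Estimate hours until the DB reaches the quota at a steady write rate."""
    if bytes_per_write <= 0 or writes_per_second <= 0:
        raise ValueError("write size and rate must be positive")
    remaining = max(0, quota_bytes - current_bytes)
    return remaining / (bytes_per_write * writes_per_second) / 3600


# 50 KiB status objects patched 10 times per second against the default quota
estimate = hours_until_quota(50 * 1024, 10)
print(f"{estimate:.2f} hours")
```

With these sample numbers the default quota lasts a little over an hour, which is why alerting on growth rate matters as much as alerting on absolute size.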

How to Fix It (The Solution)

Step 1 — Confirm the alarm

etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  alarm list
# Expected: memberID:XXXXXXXX alarm:NOSPACE
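
If you script this check, the output can be parsed directly. A minimal sketch, assuming the `memberID:... alarm:...` line format shown above; the function names are hypothetical and the member ID in the example is made up.

```python
import re

ALARM_LINE = re.compile(r"memberID:(\S+)\s+alarm:(\S+)")


def parse_alarms(output):
    """Return (member_id, alarm_type) pairs found in `etcdctl alarm list` output."""
    return [(m.group(1), m.group(2)) for m in ALARM_LINE.finditer(output)]


def has_nospace(output):
    """True if any member reports a NOSPACE alarm."""
    return any(kind == "NOSPACE" for _, kind in parse_alarms(output))


sample = "memberID:8630161756594109333 alarm:NOSPACE\n"
print(parse_alarms(sample), has_nospace(sample))
```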

Step 2 — Compact old revisions

# Get current revision
REV=$(etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out="json" | jq '.[0].Status.header.revision')

# Compact to current revision
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  compact "$REV"
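
The same revision lookup can be done without jq. A small sketch that mirrors the `.[0].Status.header.revision` path used above; the function name is hypothetical and the sample JSON is trimmed to the fields it needs.

```python
import json


def current_revision(endpoint_status_json):
    """Extract the revision from `etcdctl endpoint status --write-out=json` output."""
    statuses = json.loads(endpoint_status_json)
    if not statuses:
        raise ValueError("no endpoints in status output")
    return int(statuses[0]["Status"]["header"]["revision"])


sample = '[{"Endpoint": "https://127.0.0.1:2379", "Status": {"header": {"revision": 4821}}}]'
print(current_revision(sample))
```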

Step 3 — Defrag ALL members (critical — do not skip members)

etcdctl --endpoints=https://etcd1:2379,https://etcd2:2379,https://etcd3:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  defrag --cluster
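
To confirm that every member was actually defragmented, compare each member's file size with the bytes it has in use. This is a sketch under assumptions: it expects the `Endpoint`, `dbSize`, and `dbSizeInUse` fields in the JSON from `etcdctl endpoint status --write-out=json`, and older etcd versions may not report `dbSizeInUse`. The function name is hypothetical.

```python
import json


def reclaimable_by_member(endpoint_status_json):
    """Map each endpoint to the bytes a defrag could still return."""
    report = {}
    for entry in json.loads(endpoint_status_json):
        status = entry["Status"]
        db_size = status["dbSize"]
        in_use = status.get("dbSizeInUse", db_size)  # field may be absent
        report[entry["Endpoint"]] = db_size - in_use
    return report


sample = json.dumps([
    {"Endpoint": "https://etcd1:2379", "Status": {"dbSize": 2000, "dbSizeInUse": 500}},
    {"Endpoint": "https://etcd2:2379", "Status": {"dbSize": 1800}},
])
print(reclaimable_by_member(sample))
```

A member that still shows a large reclaimable figure after Step 3 was skipped or failed to defragment.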

Step 4 — Disarm the alarm

etcdctl ... alarm disarm
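
Steps 1 to 4 can be assembled into one ordered command list so that nothing is skipped under pressure. A sketch only: it assumes the disarm command takes the same endpoint and TLS flags as the earlier steps, and `recovery_commands` is a hypothetical helper. It builds the commands without running them.

```python
TLS_FLAGS = [
    "--cacert=/etc/kubernetes/pki/etcd/ca.crt",
    "--cert=/etc/kubernetes/pki/etcd/server.crt",
    "--key=/etc/kubernetes/pki/etcd/server.key",
]


def recovery_commands(endpoints, revision):
    """Return the etcdctl invocations for the recovery, in the order to run them."""
    base = ["etcdctl", "--endpoints=" + ",".join(endpoints)] + TLS_FLAGS
    return [
        base + ["alarm", "list"],
        base + ["compact", str(revision)],
        base + ["defrag", "--cluster"],
        base + ["alarm", "disarm"],
    ]


commands = recovery_commands(["https://etcd1:2379", "https://etcd2:2379"], 4821)
for cmd in commands:
    print(" ".join(cmd))
```

Each entry is an argument list, so it can be passed to `subprocess.run(cmd, check=True)` once you have reviewed the printed output.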

Basic Fix — Raise the quota

# /etc/kubernetes/manifests/etcd.yaml (static pod)
  - command:
    - etcd
-   - --quota-backend-bytes=2147483648
+   - --quota-backend-bytes=8589934592

Warning: 8GB is the suggested maximum rather than a hard limit. Etcd warns at startup when the quota is set higher, and performance tends to degrade beyond it. If you're hitting 8GB, you have an architectural problem, not a quota problem.

Enterprise Best Practice — Enable automatic compaction

# /etc/kubernetes/manifests/etcd.yaml
  - command:
    - etcd
+   - --auto-compaction-mode=periodic
+   - --auto-compaction-retention=1h
-   # No compaction configured — revisions accumulate indefinitely
    - --quota-backend-bytes=8589934592
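
If you patch manifests programmatically, the flag changes above reduce to "replace the flag if present, append it if not". A minimal sketch that works on the container's command list once you have loaded the manifest; `ensure_flags` is a hypothetical helper, and the `--data-dir` value in the example is only a placeholder.

```python
def ensure_flags(command, flags):
    """Return a copy of `command` with each flag set, replacing any old value."""
    names = {"--" + name for name in flags}
    kept = [arg for arg in command if arg.split("=", 1)[0] not in names]
    return kept + [f"--{name}={value}" for name, value in flags.items()]


command = ["etcd", "--quota-backend-bytes=2147483648", "--data-dir=/var/lib/etcd"]
patched = ensure_flags(command, {
    "quota-backend-bytes": "8589934592",
    "auto-compaction-mode": "periodic",
    "auto-compaction-retention": "1h",
})
print(patched)
```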

For kubeadm-managed clusters, patch via ClusterConfiguration:

# kubeadm-config.yaml
 apiVersion: kubeadm.k8s.io/v1beta3
 kind: ClusterConfiguration
 etcd:
   local:
     extraArgs:
+      auto-compaction-mode: "periodic"
+      auto-compaction-retention: "1h"
+      quota-backend-bytes: "8589934592"
-      # missing compaction config

Prevention in CI/CD

1. Alert before you hit the wall. Prometheus rule:

- alert: EtcdDatabaseQuotaUsageHigh
  expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Etcd DB usage above 80% on {{ $labels.instance }}"
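
The same ratio can be checked ad hoc from a scrape of etcd's metrics endpoint, for example during an incident when Prometheus itself is struggling. A sketch that assumes the plain Prometheus text format with one unlabeled sample per metric; the function name and sample values are illustrative.

```python
WANTED = ("etcd_mvcc_db_total_size_in_bytes", "etcd_server_quota_backend_bytes")


def quota_usage(metrics_text):
    """Return DB size divided by quota, read from Prometheus text-format metrics."""
    values = {}
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in WANTED:
            values[parts[0]] = float(parts[1])
    return values[WANTED[0]] / values[WANTED[1]]


sample = """\
# HELP etcd_mvcc_db_total_size_in_bytes Total size of the underlying database.
etcd_mvcc_db_total_size_in_bytes 1.8e+09
etcd_server_quota_backend_bytes 2.147483648e+09
"""
ratio = quota_usage(sample)
print(f"{ratio:.2f}", "ALERT" if ratio > 0.80 else "ok")
```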

2. OPA/Gatekeeper policy — reject CRDs from controllers that write high-frequency status patches without resourceVersion guards.

3. Enforce compaction in your etcd Helm chart or Ansible role — make auto-compaction-retention a required, non-defaultable parameter in your IaC. Fail the pipeline if it's absent:

# Checkov custom check (pseudocode)
if 'auto-compaction-retention' not in etcd_args:
    raise CheckovCheckFailure("Etcd compaction not configured")
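
The pseudocode above can be turned into a small gate in plain Python. This sketch deliberately does not use the Checkov API; it only inspects a list of etcd arguments that your pipeline has already extracted from the manifest. The names `REQUIRED_FLAGS` and `missing_etcd_flags` are hypothetical.

```python
REQUIRED_FLAGS = ("--auto-compaction-retention",)


def missing_etcd_flags(etcd_args, required=REQUIRED_FLAGS):
    """Return the required flags that do not appear in the etcd argument list."""
    present = {arg.split("=", 1)[0] for arg in etcd_args}
    return [flag for flag in required if flag not in present]


good = ["etcd", "--auto-compaction-mode=periodic", "--auto-compaction-retention=1h"]
bad = ["etcd", "--quota-backend-bytes=8589934592"]
print(missing_etcd_flags(good), missing_etcd_flags(bad))
```

In a pipeline, exit non-zero when the returned list is not empty so the job fails with the missing flag names in its output.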

4. Regular defrag job. Run etcdctl defrag as a CronJob in a maintenance window, one member at a time (a member blocks reads and writes while it defragments), not reactively during an outage.

5. Snapshot + size audit in CI — after every major deployment, run etcdctl snapshot save and assert db_size < threshold as a pipeline gate.
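
The size assertion itself is a one-line file check once the snapshot exists. A minimal sketch, assuming the snapshot was already written to disk by etcdctl snapshot save; the function name, the path in the usage note, and the 80% budget are illustrative choices.

```python
import os


def snapshot_within_budget(path, quota_bytes, max_fraction=0.80):
    """True if the snapshot file is at or below the given fraction of the quota."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    return os.path.getsize(path) <= quota_bytes * max_fraction
```

For example, `snapshot_within_budget("/tmp/etcd-snapshot.db", 8589934592)` returns False once the snapshot exceeds 80% of an 8GB quota, and the pipeline step can fail on that result.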