How to Fix PostgreSQL Active Replication Slot Missing WAL Files (Slot Bloat & Recovery Guide)
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: A replication slot (
pg_replication_slots) is markedactivebut itsrestart_lsnpoints to WAL segments already removed frompg_wal/— either bypg_archivecleanup, manual deletion, orwal_keep_sizebeing too low. - How to fix it: Identify the slot, confirm no live consumer exists, drop the slot, then re-initialize replication from a fresh base backup or re-create the slot at the current LSN.
- Use our Client-Side Sandbox below to auto-refactor this — paste your
pg_replication_slotsquery output and Postgres logs and get the exact DROP/recreate commands.
The Incident (What Does the Error Mean?)
You'll see this in postgresql.log or pg_basebackup output:
FATAL: requested WAL segment 000000010000001B00000042 has already been removed
ERROR: replication slot "pglogical_sub_prod" is active but required WAL is missing
LOG: starting point-in-time recovery to latest
FATAL: could not find file "pg_wal/000000010000001B00000042"
What this means operationally:
- The slot's
restart_lsnis behind the oldest WAL segment still on disk. - PostgreSQL cannot stream the missing segments to the replica or logical subscriber.
- The standby or subscriber is now completely disconnected and cannot self-heal without manual intervention.
- If the slot remains, Postgres will not recycle WAL past
restart_lsn, causingpg_wal/to grow unbounded until disk is full — taking down the primary.
The Attack Vector / Blast Radius
This is a dual-failure mode: data loss risk + primary node disk exhaustion.
Cascade path:
- Replica falls behind (network blip, subscriber crash, slow consumer).
wal_keep_sizeor archive cleanup removes segments the slot still needs.- Slot is now permanently broken — the consumer cannot reconnect and catch up.
- Because the slot is still
active(orinactivebut not dropped), the WAL retention lock is still held. pg_wal/grows at the WAL generation rate of your write workload — on a busy OLTP system this is gigabytes per hour.- Disk fills → PostgreSQL panics → primary goes down. Replica was already down. You now have zero nodes.
Secondary blast radius: If this is a logical replication slot feeding a downstream analytics system or CDC pipeline (Debezium, pglogical, AWS DMS), that pipeline has a gap it cannot bridge. You will need a full re-snapshot.
How to Fix It
Step 1 — Confirm the Damage
-- Run on PRIMARY
SELECT
slot_name,
slot_type,
active,
active_pid,
restart_lsn,
confirmed_flush_lsn,
pg_current_wal_lsn() AS current_lsn,
pg_current_wal_lsn() - restart_lsn AS lag_bytes
FROM pg_replication_slots
ORDER BY restart_lsn;
If lag_bytes is in the hundreds of gigabytes, or if the WAL file derived from restart_lsn is absent from pg_wal/, the slot is unrecoverable.
# Derive the WAL filename from restart_lsn (replace with your LSN)
pg_walfile_name('0/1B000000') -- run in psql
# Then check:
ls /var/lib/postgresql/data/pg_wal/000000010000001B*
# If empty output → file is gone. Slot is dead.
Basic Fix — Drop the Broken Slot
- -- Do NOT leave a broken slot running. It will fill your disk.
- SELECT * FROM pg_replication_slots WHERE slot_name = 'pglogical_sub_prod';
+ -- First, kill any zombie walsender/consumer holding the slot
+ SELECT pg_terminate_backend(active_pid)
+ FROM pg_replication_slots
+ WHERE slot_name = 'pglogical_sub_prod' AND active_pid IS NOT NULL;
+
+ -- Then drop it
+ SELECT pg_drop_replication_slot('pglogical_sub_prod');
+
+ -- Confirm pg_wal is now being reclaimed
+ SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
⚠️ Before dropping: confirm with your team that the subscriber/replica is truly dead and no in-flight transactions depend on this slot. For logical slots, this means a full re-snapshot of the subscriber.
Enterprise Best Practice — Prevent Silent Slot Bloat
postgresql.conf hardening:
- # No guardrails — slot can grow forever
- max_replication_slots = 10
- wal_keep_size = 0
+ # Hard-limit WAL retained by any single slot
+ max_slot_wal_keep_size = 10GB # PostgreSQL 13+
+
+ # Keep enough WAL for planned maintenance windows
+ wal_keep_size = 2GB
+
+ # Alert before disk fills, not after
+ # Wire pg_replication_slots lag_bytes into your monitoring
Re-establish physical standby from scratch:
# On standby — full re-sync from primary
pg_basebackup \
-h primary-host \
-U replicator \
-D /var/lib/postgresql/data \
--wal-method=stream \
--checkpoint=fast \
--progress
Re-establish logical slot at current LSN (no historical data):
-- On PRIMARY — create fresh slot
SELECT pg_create_logical_replication_slot('pglogical_sub_prod', 'pgoutput');
-- Subscriber must do a full initial table sync (re-snapshot)
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Monitoring Alert (Prometheus + postgres_exporter)
# alerts.yml
- alert: ReplicationSlotWALLagCritical
expr: pg_replication_slots_pg_wal_lsn_diff > 10737418240 # 10GB
for: 5m
labels:
severity: critical
annotations:
summary: "Slot {{ $labels.slot_name }} is {{ $value | humanize1024 }}B behind. WAL accumulating."
2. Automated Slot Reaper (cron / Kubernetes CronJob)
#!/bin/bash
# Drop any inactive slot lagging > 5GB — run every 15 min
psql -U postgres -c "
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE active = false
AND (pg_current_wal_lsn() - restart_lsn) > 5368709120;
"
3. Terraform / Ansible Guard
If you provision read replicas via Terraform (RDS, CloudSQL, self-managed), enforce max_slot_wal_keep_size as a non-negotiable parameter block. Use Checkov rule CKV_PG_1 or write a custom OPA policy to reject any PostgreSQL instance config missing this parameter.
- # RDS parameter group — no WAL slot cap
- resource "aws_db_parameter_group" "pg" {
- family = "postgres15"
- }
+ resource "aws_db_parameter_group" "pg" {
+ family = "postgres15"
+ parameter {
+ name = "max_slot_wal_keep_size"
+ value = "10240" # 10GB in MB
+ }
+ }
4. Pre-deploy Check in CI
# Fail pipeline if any slot is lagging before a deploy
SLOT_LAG=$(psql -At -c "SELECT MAX(pg_current_wal_lsn() - restart_lsn) FROM pg_replication_slots;")
if [ "$SLOT_LAG" -gt 5368709120 ]; then
echo "FATAL: Replication slot lag exceeds 5GB. Aborting deploy."
exit 1
fi