Initializing Enclave...

How to Fix PostgreSQL Active Replication Slot Missing WAL Files (Slot Bloat & Recovery Guide)

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

  • What broke: A replication slot (pg_replication_slots) is marked active but its restart_lsn points to WAL segments already removed from pg_wal/ — either by pg_archivecleanup, manual deletion, or wal_keep_size being too low.
  • How to fix it: Identify the slot, confirm no live consumer exists, drop the slot, then re-initialize replication from a fresh base backup or re-create the slot at the current LSN.
  • Use our Client-Side Sandbox below to auto-refactor this — paste your pg_replication_slots query output and Postgres logs and get the exact DROP/recreate commands.

The Incident (What Does the Error Mean?)

You'll see this in postgresql.log or pg_basebackup output:

FATAL:  requested WAL segment 000000010000001B00000042 has already been removed
ERROR:  replication slot "pglogical_sub_prod" is active but required WAL is missing
LOG:   starting point-in-time recovery to latest
FATAL:  could not find file "pg_wal/000000010000001B00000042"

What this means operationally:

  • The slot's restart_lsn is behind the oldest WAL segment still on disk.
  • PostgreSQL cannot stream the missing segments to the replica or logical subscriber.
  • The standby or subscriber is now completely disconnected and cannot self-heal without manual intervention.
  • If the slot remains, Postgres will not recycle WAL past restart_lsn, causing pg_wal/ to grow unbounded until disk is full — taking down the primary.

The Attack Vector / Blast Radius

This is a dual-failure mode: data loss risk + primary node disk exhaustion.

Cascade path:

  1. Replica falls behind (network blip, subscriber crash, slow consumer).
  2. wal_keep_size or archive cleanup removes segments the slot still needs.
  3. Slot is now permanently broken — the consumer cannot reconnect and catch up.
  4. Because the slot is still active (or inactive but not dropped), the WAL retention lock is still held.
  5. pg_wal/ grows at the WAL generation rate of your write workload — on a busy OLTP system this is gigabytes per hour.
  6. Disk fills → PostgreSQL panics → primary goes down. Replica was already down. You now have zero nodes.

Secondary blast radius: If this is a logical replication slot feeding a downstream analytics system or CDC pipeline (Debezium, pglogical, AWS DMS), that pipeline has a gap it cannot bridge. You will need a full re-snapshot.


How to Fix It

Step 1 — Confirm the Damage

-- Run on PRIMARY
SELECT
  slot_name,
  slot_type,
  active,
  active_pid,
  restart_lsn,
  confirmed_flush_lsn,
  pg_current_wal_lsn() AS current_lsn,
  pg_current_wal_lsn() - restart_lsn AS lag_bytes
FROM pg_replication_slots
ORDER BY restart_lsn;

If lag_bytes is in the hundreds of gigabytes, or if the WAL file derived from restart_lsn is absent from pg_wal/, the slot is unrecoverable.

# Derive the WAL filename from restart_lsn (replace with your LSN)
pg_walfile_name('0/1B000000')  -- run in psql
# Then check:
ls /var/lib/postgresql/data/pg_wal/000000010000001B*
# If empty output → file is gone. Slot is dead.

Basic Fix — Drop the Broken Slot

- -- Do NOT leave a broken slot running. It will fill your disk.
- SELECT * FROM pg_replication_slots WHERE slot_name = 'pglogical_sub_prod';

+ -- First, kill any zombie walsender/consumer holding the slot
+ SELECT pg_terminate_backend(active_pid)
+ FROM pg_replication_slots
+ WHERE slot_name = 'pglogical_sub_prod' AND active_pid IS NOT NULL;
+
+ -- Then drop it
+ SELECT pg_drop_replication_slot('pglogical_sub_prod');
+
+ -- Confirm pg_wal is now being reclaimed
+ SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

⚠️ Before dropping: confirm with your team that the subscriber/replica is truly dead and no in-flight transactions depend on this slot. For logical slots, this means a full re-snapshot of the subscriber.

Enterprise Best Practice — Prevent Silent Slot Bloat

postgresql.conf hardening:

- # No guardrails — slot can grow forever
- max_replication_slots = 10
- wal_keep_size = 0

+ # Hard-limit WAL retained by any single slot
+ max_slot_wal_keep_size = 10GB   # PostgreSQL 13+
+
+ # Keep enough WAL for planned maintenance windows
+ wal_keep_size = 2GB
+
+ # Alert before disk fills, not after
+ # Wire pg_replication_slots lag_bytes into your monitoring

Re-establish physical standby from scratch:

# On standby — full re-sync from primary
pg_basebackup \
  -h primary-host \
  -U replicator \
  -D /var/lib/postgresql/data \
  --wal-method=stream \
  --checkpoint=fast \
  --progress

Re-establish logical slot at current LSN (no historical data):

-- On PRIMARY — create fresh slot
SELECT pg_create_logical_replication_slot('pglogical_sub_prod', 'pgoutput');
-- Subscriber must do a full initial table sync (re-snapshot)

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Monitoring Alert (Prometheus + postgres_exporter)

# alerts.yml
- alert: ReplicationSlotWALLagCritical
  expr: pg_replication_slots_pg_wal_lsn_diff > 10737418240  # 10GB
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Slot {{ $labels.slot_name }} is {{ $value | humanize1024 }}B behind. WAL accumulating."

2. Automated Slot Reaper (cron / Kubernetes CronJob)

#!/bin/bash
# Drop any inactive slot lagging > 5GB — run every 15 min
psql -U postgres -c "
  SELECT pg_drop_replication_slot(slot_name)
  FROM pg_replication_slots
  WHERE active = false
    AND (pg_current_wal_lsn() - restart_lsn) > 5368709120;
"

3. Terraform / Ansible Guard

If you provision read replicas via Terraform (RDS, CloudSQL, self-managed), enforce max_slot_wal_keep_size as a non-negotiable parameter block. Use Checkov rule CKV_PG_1 or write a custom OPA policy to reject any PostgreSQL instance config missing this parameter.

- # RDS parameter group — no WAL slot cap
- resource "aws_db_parameter_group" "pg" {
-   family = "postgres15"
- }

+ resource "aws_db_parameter_group" "pg" {
+   family = "postgres15"
+   parameter {
+     name  = "max_slot_wal_keep_size"
+     value = "10240"  # 10GB in MB
+   }
+ }

4. Pre-deploy Check in CI

# Fail pipeline if any slot is lagging before a deploy
SLOT_LAG=$(psql -At -c "SELECT MAX(pg_current_wal_lsn() - restart_lsn) FROM pg_replication_slots;")
if [ "$SLOT_LAG" -gt 5368709120 ]; then
  echo "FATAL: Replication slot lag exceeds 5GB. Aborting deploy."
  exit 1
fi

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →