Can I drop an 'active' replication slot in PostgreSQL?

Yes, but you must first terminate the backend holding it. Run `SELECT pg_terminate_backend(active_pid) FROM pg_replication_slots WHERE slot_name = 'your_slot';` then immediately follow with `SELECT pg_drop_replication_slot('your_slot');`. Dropping an active slot will disconnect the consumer — ensure the downstream replica or subscriber is already down or you accept the disconnect.

What is max_slot_wal_keep_size and why is it critical?

Introduced in PostgreSQL 13, `max_slot_wal_keep_size` sets a hard ceiling (in MB) on how much WAL a single replication slot can retain. Without it, a lagging or orphaned slot will hold WAL indefinitely until disk is full and the primary crashes. Set it to a value your disk can tolerate — typically 10–50GB — and monitor slot lag separately to catch consumers falling behind before they hit the cap.

After dropping a broken logical replication slot, do I need a full re-snapshot?

Yes, always. A logical replication slot encodes the LSN position from which changes are streamed. Once the slot is dropped, the subscriber has no consistent starting point. You must re-initialize the subscription with `CREATE SUBSCRIPTION ... WITH (copy_data = true)` or trigger a full initial sync in your CDC tool (Debezium, pglogical, DMS). There is no way to replay the missed changes from a dropped slot.

How to Fix PostgreSQL Active Replication Slot Missing WAL Files (Slot Bloat & Recovery Guide)

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: A replication slot (pg_replication_slots) is marked active but its restart_lsn points to WAL segments already removed from pg_wal/ — either by pg_archivecleanup, manual deletion, or wal_keep_size being too low.
How to fix it: Identify the slot, confirm no live consumer exists, drop the slot, then re-initialize replication from a fresh base backup or re-create the slot at the current LSN.
Use our Client-Side Sandbox below to auto-refactor this — paste your pg_replication_slots query output and Postgres logs and get the exact DROP/recreate commands.

The Incident (What Does the Error Mean?)

You'll see this in postgresql.log or pg_basebackup output:

FATAL:  requested WAL segment 000000010000001B00000042 has already been removed
ERROR:  replication slot "pglogical_sub_prod" is active but required WAL is missing
LOG:   starting point-in-time recovery to latest
FATAL:  could not find file "pg_wal/000000010000001B00000042"

What this means operationally:

The slot's restart_lsn is behind the oldest WAL segment still on disk.
PostgreSQL cannot stream the missing segments to the replica or logical subscriber.
The standby or subscriber is now completely disconnected and cannot self-heal without manual intervention.
If the slot remains, Postgres will not recycle WAL past restart_lsn, causing pg_wal/ to grow unbounded until disk is full — taking down the primary.

The Attack Vector / Blast Radius

This is a dual-failure mode: data loss risk + primary node disk exhaustion.

Cascade path:

Replica falls behind (network blip, subscriber crash, slow consumer).
wal_keep_size or archive cleanup removes segments the slot still needs.
Slot is now permanently broken — the consumer cannot reconnect and catch up.
Because the slot is still active (or inactive but not dropped), the WAL retention lock is still held.
pg_wal/ grows at the WAL generation rate of your write workload — on a busy OLTP system this is gigabytes per hour.
Disk fills → PostgreSQL panics → primary goes down. Replica was already down. You now have zero nodes.

Secondary blast radius: If this is a logical replication slot feeding a downstream analytics system or CDC pipeline (Debezium, pglogical, AWS DMS), that pipeline has a gap it cannot bridge. You will need a full re-snapshot.

How to Fix It

Step 1 — Confirm the Damage

-- Run on PRIMARY
SELECT
  slot_name,
  slot_type,
  active,
  active_pid,
  restart_lsn,
  confirmed_flush_lsn,
  pg_current_wal_lsn() AS current_lsn,
  pg_current_wal_lsn() - restart_lsn AS lag_bytes
FROM pg_replication_slots
ORDER BY restart_lsn;

If lag_bytes is in the hundreds of gigabytes, or if the WAL file derived from restart_lsn is absent from pg_wal/, the slot is unrecoverable.

# Derive the WAL filename from restart_lsn (replace with your LSN)
pg_walfile_name('0/1B000000')  -- run in psql
# Then check:
ls /var/lib/postgresql/data/pg_wal/000000010000001B*
# If empty output → file is gone. Slot is dead.

Basic Fix — Drop the Broken Slot

- -- Do NOT leave a broken slot running. It will fill your disk.
- SELECT * FROM pg_replication_slots WHERE slot_name = 'pglogical_sub_prod';

+ -- First, kill any zombie walsender/consumer holding the slot
+ SELECT pg_terminate_backend(active_pid)
+ FROM pg_replication_slots
+ WHERE slot_name = 'pglogical_sub_prod' AND active_pid IS NOT NULL;
+
+ -- Then drop it
+ SELECT pg_drop_replication_slot('pglogical_sub_prod');
+
+ -- Confirm pg_wal is now being reclaimed
+ SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

⚠️ Before dropping: confirm with your team that the subscriber/replica is truly dead and no in-flight transactions depend on this slot. For logical slots, this means a full re-snapshot of the subscriber.

Enterprise Best Practice — Prevent Silent Slot Bloat

postgresql.conf hardening:

- # No guardrails — slot can grow forever
- max_replication_slots = 10
- wal_keep_size = 0

+ # Hard-limit WAL retained by any single slot
+ max_slot_wal_keep_size = 10GB   # PostgreSQL 13+
+
+ # Keep enough WAL for planned maintenance windows
+ wal_keep_size = 2GB
+
+ # Alert before disk fills, not after
+ # Wire pg_replication_slots lag_bytes into your monitoring

Re-establish physical standby from scratch:

# On standby — full re-sync from primary
pg_basebackup \
  -h primary-host \
  -U replicator \
  -D /var/lib/postgresql/data \
  --wal-method=stream \
  --checkpoint=fast \
  --progress

Re-establish logical slot at current LSN (no historical data):

-- On PRIMARY — create fresh slot
SELECT pg_create_logical_replication_slot('pglogical_sub_prod', 'pgoutput');
-- Subscriber must do a full initial table sync (re-snapshot)

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.

Prevention in CI/CD

1. Monitoring Alert (Prometheus + postgres_exporter)

# alerts.yml
- alert: ReplicationSlotWALLagCritical
  expr: pg_replication_slots_pg_wal_lsn_diff > 10737418240  # 10GB
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Slot {{ $labels.slot_name }} is {{ $value | humanize1024 }}B behind. WAL accumulating."

2. Automated Slot Reaper (cron / Kubernetes CronJob)

#!/bin/bash
# Drop any inactive slot lagging > 5GB — run every 15 min
psql -U postgres -c "
  SELECT pg_drop_replication_slot(slot_name)
  FROM pg_replication_slots
  WHERE active = false
    AND (pg_current_wal_lsn() - restart_lsn) > 5368709120;
"

3. Terraform / Ansible Guard

If you provision read replicas via Terraform (RDS, CloudSQL, self-managed), enforce max_slot_wal_keep_size as a non-negotiable parameter block. Use Checkov rule CKV_PG_1 or write a custom OPA policy to reject any PostgreSQL instance config missing this parameter.

- # RDS parameter group — no WAL slot cap
- resource "aws_db_parameter_group" "pg" {
-   family = "postgres15"
- }

+ resource "aws_db_parameter_group" "pg" {
+   family = "postgres15"
+   parameter {
+     name  = "max_slot_wal_keep_size"
+     value = "10240"  # 10GB in MB
+   }
+ }

4. Pre-deploy Check in CI

# Fail pipeline if any slot is lagging before a deploy
SLOT_LAG=$(psql -At -c "SELECT MAX(pg_current_wal_lsn() - restart_lsn) FROM pg_replication_slots;")
if [ "$SLOT_LAG" -gt 5368709120 ]; then
  echo "FATAL: Replication slot lag exceeds 5GB. Aborting deploy."
  exit 1
fi