How to Fix PostgreSQL PANIC: could not write to file pg_wal (WAL Disk Full Crisis)
Threat/Impact Level: CRITICAL | Downtime Risk: HIGH — full write halt, cluster PANIC restart required | Time to Fix: 15–45 mins
TL;DR
- What broke: PostgreSQL exhausted the disk partition hosting pg_wal/, triggering a PANIC-level error inside PostgreSQL (not a kernel panic). All transactions are dead. The postmaster has exited.
- How to fix it: Free disk space immediately, identify what is pinning WAL segments (stale replication slots or a failing archive_command), drop the stale slots or repair the archiver, then tune wal_keep_size and max_slot_wal_keep_size.
- Fast path: Drop your postgresql.conf, current pg_replication_slots output, and pg_wal/ du output into the Client-Side Sandbox above; it will auto-diagnose the pinned slot and generate the exact ALTER SYSTEM commands without sending your config off-machine.
The Incident (What does the error mean?)
Raw error from PostgreSQL logs:
PANIC: could not write to file "pg_wal/xlogtemp.7F4A2B": No space left on device
LOG: startup process (PID 1234) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
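PostgreSQL tried to write a WAL segment file under $PGDATA/pg_wal/ and received ENOSPC from the kernel. This is not a graceful error: a failed WAL write is raised at PANIC level, so the process that hit it aborts instead of returning an error to the client. In the log above that process is the startup process (terminated by signal 6), which means crash recovery cannot complete and the postmaster shuts down. There is no partial recovery while the disk is full. Every in-flight transaction is lost. Your application is throwing connection refused errors right now.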
The pg_wal/ directory is on a dedicated partition (or shares one with $PGDATA). Either way, it is at 100% utilization.
The Attack Vector / Blast Radius
This is a cascading storage failure, not a single-point event. The usual kill chain:
- Stale replication slot: A physical or logical replication slot exists for a replica that is lagging or dead. PostgreSQL cannot remove WAL segments that the slot's restart_lsn hasn't consumed. Segments accumulate indefinitely.
- Broken archive_command: If archive_mode = on and your archive_command (S3, NFS, whatever) starts failing silently, PostgreSQL retains every unarchived segment. pg_wal/ grows until disk death.
- wal_keep_size set too aggressively: A well-intentioned but oversized value forces retention of gigabytes of segments regardless of replication state.
- Blast radius: Once PANIC fires, all databases on the cluster are offline, connection poolers (PgBouncer, RDS Proxy) start queuing or dropping connections, downstream services begin throwing 500s, and any replica in streaming replication enters a reconnect loop. If this is a primary in a HA pair, your failover automation may promote a replica that is itself behind, meaning potential data divergence.
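Check this first, as soon as the server accepts connections again (stale slots are the #1 cause):
SELECT slot_name, slot_type, active, restart_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
-- Sort on the numeric byte count; ordering by retained_wal would compare the pretty-printed text
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;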
If retained_wal is in the tens or hundreds of GB and active = false — that slot is your culprit.
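To sanity-check the retained_wal figure by hand, it helps to know that an LSN prints as two hexadecimal numbers separated by a slash: the high 32 bits and the low 32 bits of a 64-bit byte position. The sketch below reproduces the subtraction that pg_wal_lsn_diff performs; the function names and the sample LSNs are made up for illustration.

```shell
#!/bin/sh
# lsn_to_bytes LSN
# Convert an LSN such as "16/B374D848" into an absolute byte position.
# ${1%%/*} is the hex text before the slash, ${1##*/} the hex text after it.
lsn_to_bytes() {
  echo $(( (0x${1%%/*} << 32) + 0x${1##*/} ))
}

# retained_wal_bytes CURRENT_LSN RESTART_LSN
# Bytes of WAL between a slot's restart_lsn and the current write position.
retained_wal_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Hypothetical values: current LSN 2/0, slot restart_lsn 1/F0000000.
retained_wal_bytes "2/0" "1/F0000000"   # prints 268435456 (256 MiB)
```

With the default 16 MB segment size, 256 MiB of retained WAL corresponds to 16 segment files sitting in pg_wal/.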
How to Fix It
Step 0 — Emergency Disk Relief (Do This First)
Before PostgreSQL can restart, the disk needs headroom. Do not delete random files blindly.
# Identify pg_wal size
du -sh $PGDATA/pg_wal/
du -sh $PGDATA/pg_wal/* | sort -rh | head -20
# Check overall disk
df -h $PGDATA
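If you have a pg_wal/archive_status/ directory full of .ready files, your archiver is backed up. Fix the archive destination first, or temporarily set archive_mode = off to unblock. Changing archive_mode only takes effect at server start, and any segments removed while it is off leave a gap in the archive, so take a fresh base backup after re-enabling it.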
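A quick way to quantify the archiver backlog is to count the .ready markers. This is a minimal sketch, assuming a standard layout where archive_status/ sits directly under pg_wal/; the function name is hypothetical.

```shell
#!/bin/sh
# count_ready_segments PG_WAL_DIR
# Print how many WAL segments are still waiting to be archived.
# Prints 0 if the archive_status directory does not exist.
count_ready_segments() {
  find "$1/archive_status" -maxdepth 1 -type f -name '*.ready' 2>/dev/null \
    | wc -l | tr -d ' '
}

# Only report when PGDATA is set in the environment.
if [ -n "${PGDATA:-}" ]; then
  echo "segments waiting for archive: $(count_ready_segments "$PGDATA/pg_wal")"
fi
```

A count that keeps climbing between runs suggests the archive_command is failing rather than merely slow.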
Basic Fix — Drop the Stale Replication Slot
-- Confirm the slot is inactive and consuming WAL
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
-- Drop it (this is irreversible — the replica must re-baseline)
SELECT pg_drop_replication_slot('your_stale_slot_name');
After the drop, the segments the slot was pinning are no longer protected, but PostgreSQL only removes or recycles old WAL at the next checkpoint. Run CHECKPOINT; to force one, then watch pg_wal/ shrink.
Enterprise Best Practice — Cap WAL Retention at the Slot Level
The real fix is ensuring this can never silently accumulate again.
# postgresql.conf
- # max_slot_wal_keep_size = -1 # default: unlimited — this is the footgun
+ max_slot_wal_keep_size = 10GB # hard cap; slot is invalidated before disk dies
- wal_keep_size = 10240 # 10GB blanket retention — too aggressive for most
+ wal_keep_size = 1024 # 1GB; replicas should use slots, not this knob
- archive_command = 'cp %p /mnt/wal_archive/%f' # no error handling
+ archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f || exit 1'
# Fail loudly so pg_wal doesn't silently fill
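If the one-line archive_command grows unwieldy, the same rules (never overwrite an existing segment, return non-zero on any failure) can live in a small wrapper. The sketch below is one possible shape, not a vetted archiving tool; the function name archive_wal and its argument order are assumptions made for illustration.

```shell
#!/bin/sh
# archive_wal SRC DEST_DIR NAME
# Hypothetical helper: copy one WAL segment into DEST_DIR without ever
# overwriting an existing file. Any failure returns non-zero so that
# PostgreSQL keeps the segment and retries later.
archive_wal() {
  if [ -e "$2/$3" ]; then
    echo "archive_wal: $3 already exists in $2" >&2
    return 1
  fi
  # Copy under a temporary name and rename afterwards, so a partial copy
  # is never visible under the final segment name.
  if cp "$1" "$2/$3.tmp" && mv "$2/$3.tmp" "$2/$3"; then
    return 0
  fi
  rm -f "$2/$3.tmp"
  echo "archive_wal: failed to archive $3" >&2
  return 1
}
```

Saved as a script that ends with archive_wal "$@", it could be wired in as archive_command = '/path/to/archive_wal.sh %p /mnt/wal_archive %f' (the script path is a placeholder). Failing loudly does not free space by itself: a failing command still makes PostgreSQL retain segments, so it needs to be paired with alerting.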
Apply without full restart:
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
ALTER SYSTEM SET wal_keep_size = '1024';
SELECT pg_reload_conf();
For logical replication slots specifically — if the subscriber is gone for good:
-- List logical slots
SELECT * FROM pg_replication_slots WHERE slot_type = 'logical';
-- Drop with extreme prejudice if subscriber is decommissioned
SELECT pg_drop_replication_slot('logical_slot_name');
Prevention in CI/CD
1. Alerting — Non-Negotiable Thresholds
# Prometheus alerting rule (alerts.yml)
groups:
  - name: postgres_wal
    rules:
      - alert: PgWalDiskUsageCritical
        expr: |
          (
            node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"} -
            node_filesystem_free_bytes{mountpoint="/var/lib/postgresql"}
          ) / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"} > 0.80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "pg_wal partition >80% full: replication slot audit required"
      - alert: PgStaleReplicationSlot
        # The metric name depends on how your exporter is configured; adjust to match.
        expr: pg_replication_slots_pg_wal_lsn_diff_bytes{active="false"} > 5368709120  # 5GB
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Inactive replication slot retaining >5GB WAL"
2. Automated Slot Hygiene (Cron / Kubernetes CronJob)
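#!/bin/bash
# Drop inactive replication slots that retain more than 2GB of WAL.
# The query has no time condition; how long a slot may sit idle is bounded by the cron schedule.
set -euo pipefail
psql -U postgres -v ON_ERROR_STOP=1 -c "
SELECT slot_name, pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE active = false -- no walsender is attached (pg_stat_replication has no slot_name column to check against)
AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 2147483648; -- 2GB
"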
3. Terraform / IaC — Enforce Disk Separation
# Self-managed PostgreSQL on EC2: always isolate pg_wal on its own EBS volume (an aws_ebs_volume cannot be attached to an RDS instance)
- # pg_wal co-located with $PGDATA on single 100GB gp3 volume
+ resource "aws_ebs_volume" "pg_wal" {
+ availability_zone = var.az
+ size = 200
+ type = "gp3"
+ iops = 6000
+ tags = { Name = "postgres-pg-wal-dedicated" }
+ }
4. Checkov Policy (IaC Scan)
Add a custom Checkov check asserting that any PostgreSQL instance definition includes a max_slot_wal_keep_size parameter — block the Terraform plan in CI if it's absent or set to -1.
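Until such a Checkov check exists, the same assertion can be approximated with a plain shell step in the pipeline. This is a stand-in sketch, not a Checkov policy: it assumes the setting lives in a postgresql.conf-style file that uses the name = value form, and the function name check_slot_cap is made up.

```shell
#!/bin/sh
# check_slot_cap CONF_FILE
# Succeeds only if CONF_FILE sets max_slot_wal_keep_size to something other
# than -1. Commented-out lines count as absent; the last occurrence wins.
check_slot_cap() {
  value=$(sed -n \
    's/^[[:space:]]*max_slot_wal_keep_size[[:space:]]*=[[:space:]]*\([^#[:space:]]*\).*/\1/p' \
    "$1" | tail -n 1)
  # Strip optional single quotes around the value, e.g. '10GB'.
  value=$(printf '%s' "$value" | tr -d "'")
  [ -n "$value" ] && [ "$value" != "-1" ]
}
```

A CI job could run check_slot_cap postgresql.conf || exit 1 before the plan step, which blocks the pipeline when the cap is missing or unlimited.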