
How to Fix PostgreSQL PANIC: could not write to file pg_wal (WAL Disk Full Crisis)

Threat/Impact Level: CRITICAL | Downtime Risk: HIGH — full write halt, cluster PANIC restart required | Time to Fix: 15–45 mins

TL;DR

  • What broke: The partition hosting pg_wal/ ran out of space, so a WAL write failed with ENOSPC and PostgreSQL raised a PANIC. The PANIC comes from PostgreSQL itself, not the kernel. All sessions are terminated, and because crash recovery also has to write WAL, the server cannot come back up until space is freed.
  • How to fix it: Free disk space first so the server can start. Then find what is pinning WAL segments (a stale replication slot or a failing archive_command), fix it, and tune wal_keep_size and max_slot_wal_keep_size.

The Incident (What does the error mean?)

Raw error from PostgreSQL logs:

PANIC: could not write to file "pg_wal/xlogtemp.7F4A2B": No space left on device
LOG:  startup process (PID 1234) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure

PostgreSQL tried to create or write a WAL segment under $PGDATA/pg_wal/ and received ENOSPC from the kernel. This is not a graceful error: a failed WAL write is escalated to PANIC, the affected process aborts, and the postmaster terminates every other session. In-flight transactions are lost and are rolled back by crash recovery. In the log above, the startup process (which performs crash recovery) hit the same error, so the server could not restart. Your application is getting connection refused errors right now.

The pg_wal/ directory is on a dedicated partition (or shares one with $PGDATA). Either way, it is at 100% utilization.


The Attack Vector / Blast Radius

This is a cascading storage failure, not a single-point event. The usual kill chain:

  1. Stale replication slot — A physical or logical replication slot exists for a replica that is lagging or dead. PostgreSQL cannot remove WAL segments that the slot's restart_lsn hasn't consumed. Segments accumulate indefinitely.
  2. Broken archive_command — If archive_mode = on and your archive_command (S3, NFS, whatever) starts failing silently, PostgreSQL retains every unarchived segment. pg_wal/ grows until disk death.
  3. wal_keep_size set too aggressively — A well-intentioned but oversized value forces retention of gigabytes of segments regardless of replication state.
  4. Blast radius: Once PANIC fires — all databases on the cluster are offline, connection poolers (PgBouncer, RDS Proxy) start queuing or dropping connections, downstream services begin throwing 500s, and any replica in streaming replication enters a reconnect loop. If this is a primary in a HA pair, your failover automation may promote a replica that is itself behind — meaning potential data divergence.

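Check this first once the server accepts connections again (see Step 0). Stale slots are the #1 cause:

SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
-- Sort on the numeric byte count; sorting on the pretty-printed text would be alphabetical
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;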

If retained_wal is in the tens or hundreds of GB and active = false — that slot is your culprit.
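
With many slots, the same check can be scripted. The following is a minimal sketch, assuming psql can connect as the postgres user; the function name flag_stale_slots and the 10 GiB default threshold are illustrative choices, not PostgreSQL features.

```shell
#!/bin/bash
# flag_stale_slots [THRESHOLD_BYTES]
# Reads "slot_name|active|retained_bytes" rows on stdin (the unaligned,
# tuples-only format produced by psql -At) and prints every inactive slot
# whose retained WAL exceeds the threshold (assumed default: 10 GiB).
flag_stale_slots() {
  threshold="${1:-10737418240}"
  awk -F'|' -v limit="$threshold" \
    '$2 == "f" && $3 + 0 > limit + 0 { print $1, $3 }'
}

# Usage (needs a running server, so it is shown as a comment):
# psql -U postgres -At -c "
#   SELECT slot_name, active,
#          pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#   FROM pg_replication_slots" | flag_stale_slots
```

Each output line is a slot name followed by the retained byte count. Slots with a NULL restart_lsn produce an empty third field and are never flagged.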


How to Fix It

Step 0 — Emergency Disk Relief (Do This First)

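Before PostgreSQL can restart, the disk needs headroom. Never delete files from pg_wal/ by hand: crash recovery needs them, and removing the wrong segment can leave the cluster unrecoverable. Free space by removing unrelated files on the same volume (old server logs, stale dumps), by growing the volume, or by moving pg_wal/ to a larger volume.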

# Identify pg_wal size
du -sh $PGDATA/pg_wal/
du -sh $PGDATA/pg_wal/* | sort -rh | head -20

# Check overall disk
df -h $PGDATA

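If pg_wal/archive_status/ is full of .ready files, your archiver is backed up. Fix the archive destination first. As a last resort you can set archive_mode = off, which only takes effect at server start (the server is down anyway). Doing so leaves a gap in the WAL archive, so take a fresh base backup once archiving is restored.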
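
The size of the archiver backlog can be estimated before deciding how to proceed. This is a rough sketch: each .ready file in archive_status/ marks one segment still waiting for archive_command, and the estimate assumes the default 16 MB segment size. The helper names count_ready and ready_backlog_mb are illustrative.

```shell
#!/bin/bash
# count_ready PG_WAL_DIR
# Prints the number of *.ready marker files in PG_WAL_DIR/archive_status.
count_ready() {
  find "$1/archive_status" -maxdepth 1 -type f -name '*.ready' 2>/dev/null \
    | wc -l | tr -d ' '
}

# ready_backlog_mb PG_WAL_DIR [SEGMENT_MB]
# Approximate size of unarchived WAL in MB (assumes 16 MB segments
# unless a different size is passed as the second argument).
ready_backlog_mb() {
  echo $(( $(count_ready "$1") * ${2:-16} ))
}

# Usage:
# ready_backlog_mb "$PGDATA/pg_wal"
```

If the backlog accounts for most of pg_wal/, the archiver is the cause rather than a replication slot.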

Basic Fix — Drop the Stale Replication Slot

-- Confirm the slot is inactive and consuming WAL
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- Drop it (this is irreversible — the replica must re-baseline)
SELECT pg_drop_replication_slot('your_stale_slot_name');

After dropping, the segments are not removed instantly. PostgreSQL removes or recycles WAL that is no longer needed at the next checkpoint. Run CHECKPOINT; to force one, then watch pg_wal/ shrink.
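
To confirm that space is actually being released, the segment count can be compared before and after a checkpoint. This is a small sketch; count_wal_segments is an illustrative helper that relies only on the fact that WAL segment file names are 24 hexadecimal characters.

```shell
#!/bin/bash
# count_wal_segments PG_WAL_DIR
# Counts WAL segment files, ignoring archive_status/, *.partial,
# *.history and temporary files.
count_wal_segments() {
  ls "$1" 2>/dev/null | grep -Ec '^[0-9A-F]{24}$'
}

# Usage (needs a running server, so it is shown as a comment):
# before=$(count_wal_segments "$PGDATA/pg_wal")
# psql -U postgres -c "CHECKPOINT;"
# after=$(count_wal_segments "$PGDATA/pg_wal")
# echo "segments: $before -> $after"
```

The count may not fall all the way down, because PostgreSQL keeps some recycled segments around for reuse (governed by min_wal_size).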

Enterprise Best Practice — Cap WAL Retention at the Slot Level

The real fix is ensuring this can never silently accumulate again.

# postgresql.conf (wal_keep_size and max_slot_wal_keep_size exist in PostgreSQL 13 and later)

- # max_slot_wal_keep_size = -1   # default: unlimited — this is the footgun
+ max_slot_wal_keep_size = 10GB   # hard cap; slot is invalidated before disk dies

- wal_keep_size = 10240           # 10GB blanket retention — too aggressive for most
+ wal_keep_size = 1024            # 1GB; replicas should use slots, not this knob

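- archive_command = 'cp %p /mnt/wal_archive/%f'   # overwrites an existing archive file without complaint
+ archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f || exit 1'
# A non-zero exit makes PostgreSQL keep the segment and retry it.
# Failures must be monitored, or pg_wal will still fill.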
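
When the one-liner grows too long, archive_command can call a small wrapper script instead. This is a minimal sketch rather than a hardened archiver: the function name archive_wal and the script path in the comment are assumptions, and the sketch does not fsync the copied file.

```shell
#!/bin/bash
# archive_wal SRC DEST_DIR NAME
# Copies one WAL segment into DEST_DIR and returns non-zero on any failure,
# so PostgreSQL keeps the segment and retries later.
archive_wal() {
  src="$1"; dest_dir="$2"; name="$3"
  # Refuse to overwrite a segment that was already archived.
  if [ -e "$dest_dir/$name" ]; then
    echo "archive_wal: $dest_dir/$name already exists" >&2
    return 1
  fi
  # Copy under a temporary name, then rename, so a partial copy is never
  # visible under the final name.
  if ! cp "$src" "$dest_dir/$name.tmp"; then
    rm -f "$dest_dir/$name.tmp"
    return 1
  fi
  mv "$dest_dir/$name.tmp" "$dest_dir/$name"
}

# To use it as a script, save it (hypothetical path below) and add a final
# line that calls: archive_wal "$@"
# archive_command = '/usr/local/bin/archive_wal.sh %p /mnt/wal_archive %f'
```

The rename step matters on network filesystems, where a copy interrupted halfway would otherwise leave a truncated file that looks like a valid archive entry.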

Apply without full restart:

ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
ALTER SYSTEM SET wal_keep_size = '1GB';  -- same as 1024; a bare integer is read as megabytes
SELECT pg_reload_conf();
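
A bare integer for wal_keep_size is interpreted as megabytes, so it helps to translate the value into a number of segment files. The arithmetic below is a small sketch assuming the default 16 MB segment size; mb_to_segments is an illustrative helper.

```shell
#!/bin/bash
# mb_to_segments SIZE_MB [SEGMENT_MB]
# Prints how many WAL segment files a retention size corresponds to
# (assumes 16 MB segments unless a second argument is given).
mb_to_segments() {
  echo $(( $1 / ${2:-16} ))
}

mb_to_segments 10240   # the old 10 GB setting
mb_to_segments 1024    # the new 1 GB setting

# To confirm the reload took effect (needs a running server):
# psql -U postgres -c "SELECT name, setting, unit, pending_restart
#   FROM pg_settings
#   WHERE name IN ('wal_keep_size', 'max_slot_wal_keep_size');"
```

The two calls print 640 and 64, which is the number of segments each setting keeps in pg_wal/ regardless of replication state.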

For logical replication slots specifically — if the subscriber is gone for good:

-- List logical slots
SELECT * FROM pg_replication_slots WHERE slot_type = 'logical';

-- Drop with extreme prejudice if subscriber is decommissioned
SELECT pg_drop_replication_slot('logical_slot_name');



Prevention in CI/CD

1. Alerting — Non-Negotiable Thresholds

# Prometheus alerting rule (alerts.yml)
groups:
  - name: postgres_wal
    rules:
      - alert: PgWalDiskUsageCritical
        # Set mountpoint to the volume that actually holds pg_wal
        expr: |
          (
            node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"} -
            node_filesystem_free_bytes{mountpoint="/var/lib/postgresql"}
          ) / node_filesystem_size_bytes{mountpoint="/var/lib/postgresql"} > 0.80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "pg_wal partition >80% full — replication slot audit required"

      - alert: PgStaleReplicationSlot
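        # The metric name depends on the exporter and its query configuration; confirm it exists in your setup
        expr: pg_replication_slots_pg_wal_lsn_diff_bytes{active="false"} > 5368709120  # 5GB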
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Inactive replication slot retaining >5GB WAL"
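
On hosts without node_exporter, the same 80% threshold can be checked locally from cron. This is a minimal sketch that parses the POSIX output of df -P; the function names are illustrative, and what happens on failure (mail, pager, log line) is left to the caller.

```shell
#!/bin/bash
# usage_pct
# Reads `df -P` output on stdin and prints the Capacity column of the
# first data row as a bare integer (for example 83).
usage_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# check_wal_disk THRESHOLD_PCT
# Reads `df -P` output on stdin; succeeds only if usage is at or below
# the threshold. Unparseable input counts as a failure.
check_wal_disk() {
  pct=$(usage_pct)
  [ "$pct" -le "$1" ] 2>/dev/null
}

# Usage:
# df -P "$PGDATA/pg_wal" | check_wal_disk 80 \
#   || echo "pg_wal volume above 80% full" >&2
```

Pointing df at $PGDATA/pg_wal rather than $PGDATA follows a pg_wal symlink, so the check covers the volume that actually holds the WAL.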

2. Automated Slot Hygiene (Cron / Kubernetes CronJob)

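#!/bin/bash
# Drop inactive replication slots that retain more than 2 GiB of WAL.
# The check is size-based only; it does not look at how long a slot has been inactive.
# Dropping is irreversible: run the query without pg_drop_replication_slot() first
# and review the list before automating it.
psql -U postgres -v ON_ERROR_STOP=1 -c "
  SELECT slot_name, pg_drop_replication_slot(slot_name)
  FROM pg_replication_slots
  WHERE active = false
    AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 2147483648;
"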

3. Terraform / IaC — Enforce Disk Separation

# Self-managed PostgreSQL on EC2: isolate pg_wal on its own EBS volume
# (an aws_ebs_volume cannot be attached to an RDS instance)
- # pg_wal co-located with $PGDATA on single 100GB gp3 volume
+ resource "aws_ebs_volume" "pg_wal" {
+   availability_zone = var.az
+   size              = 200
+   type              = "gp3"
+   iops              = 6000
+   tags = { Name = "postgres-pg-wal-dedicated" }
+ }
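
Once the dedicated volume is mounted, pg_wal/ can be moved onto it and replaced by a symlink, the same layout that initdb --waldir creates. The sketch below assumes PostgreSQL is stopped and that the target is an empty directory on the new volume (use a subdirectory, since a fresh ext4 mount point contains lost+found). The function name relocate_pg_wal is illustrative.

```shell
#!/bin/bash
# relocate_pg_wal PGDATA TARGET_DIR
# Moves pg_wal to TARGET_DIR (absolute path) and symlinks it back.
# Run as the postgres OS user with the server stopped.
relocate_pg_wal() {
  pgdata="$1"; target="$2"
  # A postmaster.pid usually means the server is running. A stale file left
  # by a crash has to be verified and removed by hand.
  if [ -f "$pgdata/postmaster.pid" ]; then
    echo "postmaster.pid present: stop PostgreSQL first" >&2; return 1
  fi
  if [ ! -d "$pgdata/pg_wal" ] || [ -L "$pgdata/pg_wal" ]; then
    echo "pg_wal is missing or already a symlink" >&2; return 1
  fi
  mkdir -p "$target" || return 1
  if [ -n "$(ls -A "$target")" ]; then
    echo "target directory is not empty" >&2; return 1
  fi
  # Copy with ownership and modes preserved, keep the original as a
  # fallback, then point pg_wal at the new location.
  cp -a "$pgdata/pg_wal/." "$target/" || return 1
  mv "$pgdata/pg_wal" "$pgdata/pg_wal.old" || return 1
  ln -s "$target" "$pgdata/pg_wal"
}

# Usage:
# relocate_pg_wal "$PGDATA" /mnt/pg_wal/wal
```

The original directory is kept as pg_wal.old, so it still occupies space on the old volume. Remove it only after the server has started cleanly from the new location.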

4. Checkov Policy (IaC Scan)

Add a custom Checkov check asserting that any PostgreSQL instance definition includes a max_slot_wal_keep_size parameter — block the Terraform plan in CI if it's absent or set to -1.
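
Until such a policy exists, a plain CI step can enforce the same rule on a postgresql.conf kept in the repository. The sketch below is not a Checkov check; it is a simpler text gate, check_slot_cap is an illustrative name, and it ignores include files and postgresql.auto.conf.

```shell
#!/bin/bash
# check_slot_cap CONF_FILE
# Succeeds only if CONF_FILE contains an uncommented max_slot_wal_keep_size
# whose last occurrence (the one PostgreSQL honors) is not -1.
check_slot_cap() {
  value=$(sed -n \
    's/^[[:space:]]*max_slot_wal_keep_size[[:space:]]*=\{0,1\}[[:space:]]*\([^#[:space:]]*\).*/\1/p' \
    "$1" | tail -n 1 | tr -d "'")
  [ -n "$value" ] && [ "$value" != "-1" ]
}

# Usage in a CI job (the file path is an assumption):
# check_slot_cap config/postgresql.conf \
#   || { echo "max_slot_wal_keep_size missing or unlimited" >&2; exit 1; }
```

Taking the last occurrence mirrors how PostgreSQL resolves duplicate settings within a file, so a later -1 cannot hide behind an earlier valid value.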
