Will PostgreSQL lose data if archive_command keeps failing?

Not immediately. Postgres retains every unarchived WAL segment in pg_wal/ and keeps retrying the same segment until archive_command succeeds. A plain crash and restart does not lose those segments, because they are still on disk and archiving resumes afterwards. The data is at risk if the primary's storage is lost before the backlog is archived: those segments exist nowhere else, and your PITR chain then has an unrecoverable gap. The more immediate risk is disk exhaustion: pg_wal/ grows without bound until the volume fills and Postgres stops with a PANIC-level error, taking the primary offline.

How do I find which WAL segment is stuck and why?

Run: SELECT last_archived_wal, last_archived_time, last_failed_wal, last_failed_time, failed_count FROM pg_stat_archiver; The last_failed_wal column shows which segment is blocked. Then reproduce the failure manually as the postgres OS user: copy the exact command from the DETAIL: line in the server log (pg_log) and run it in a shell. The stderr output reveals the actual error (permission denied, no route to host, disk full on destination, etc.). The LOG line itself reports only the exit code; the command's stderr reaches the server log only when the server's stderr is being captured, for example with logging_collector = on.

Can I safely clear the pg_wal directory to free disk space during an active archive failure?

No. Never manually delete files from pg_wal/ on a running primary. Postgres manages this directory exclusively. Deleting WAL segments that have not been archived will corrupt your recovery chain and may crash the running instance if it attempts to read a deleted segment. If disk pressure is critical, your only safe options are: (1) fix the archive destination immediately so Postgres can archive and self-clean, (2) expand the volume, or (3) as a last resort on a non-HA instance, copy the unarchived backlog to safe storage, then temporarily set archive_mode = off and restart Postgres so it can recycle WAL at the next checkpoint, and re-enable archiving once the destination is fixed. Option 3 creates a PITR gap, so take a fresh base backup afterwards.

How to Fix PostgreSQL 'archive command failed: return code X' and Restore WAL Archiving

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

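What broke: PostgreSQL executed archive_command for a WAL segment, the command exited with a non-zero return code, and Postgres now keeps retrying that same segment until it succeeds (archive_timeout only forces segment switches; it does not control retries). WAL segments are accumulating in pg_wal/ and your PITR pipeline is stalled, along with any standby that restores from the archive.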
How to fix it: Reproduce the failing archive_command manually as the postgres OS user, resolve the underlying permission, path, or network mount error, then confirm pg_stat_archiver.last_archived_wal advances.

The Incident (What Does the Error Mean?)

The raw log entry looks like this:

LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: rsync -a /var/lib/postgresql/14/main/pg_wal/000000010000000000000001 [email protected]:/mnt/wal_archive/000000010000000000000001
LOG:  archive command failed with exit code 23
DETAIL:  The failed archive command was: cp /var/lib/postgresql/14/main/pg_wal/000000010000000000000023 /mnt/nfs/wal_archive/000000010000000000000023

Immediate consequence: PostgreSQL will not remove or recycle a WAL segment in pg_wal/ until it has been archived successfully, and every write keeps producing new WAL. Within hours, sometimes minutes on write-heavy clusters, pg_wal/ fills the data volume, Postgres stops with PANIC: could not write to file "pg_wal/xlogtemp", and the primary goes down hard. Standbys that depend on the archive are starved of WAL, and one that falls behind past what the primary still holds requires a full pg_basebackup rebuild.
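To see how large the backlog already is, you can count the `.ready` marker files that the server keeps in pg_wal/archive_status/ for segments still waiting to be archived. The following is a minimal sketch; the function name count_pending_wal is hypothetical, and the path in the usage note is taken from the log example above and may differ on your system.

```shell
# count_pending_wal: hypothetical helper, not part of PostgreSQL.
# Prints how many segments under <pg_wal_dir>/archive_status are still marked .ready.
count_pending_wal() {
  status_dir="$1/archive_status"
  if [ ! -d "$status_dir" ]; then
    echo "no archive_status directory under $1" >&2
    return 1
  fi
  # find prints one line per .ready marker; wc counts the lines.
  find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}
```

Run it as a user that can read the data directory, for example `count_pending_wal /var/lib/postgresql/14/main/pg_wal`. A number that keeps climbing between runs means archiving is still stuck.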

The Attack Vector / Blast Radius

This is not a soft warning. The blast radius is:

Disk exhaustion on the primary. pg_wal/ has no internal size cap when archiving is broken. A 500 GB data volume can fill completely in under 30 minutes on a busy OLTP cluster. Postgres will PANIC and refuse connections.
Complete PITR gap. Any WAL segment that never reaches the archive is a hole in your recovery timeline. If the primary's storage is lost before the segment is archived, that data cannot be recovered by restoring a base backup and replaying archived WAL. Your RPO is now undefined.
Streaming standby divergence. Physical standbys that stream directly from the primary keep receiving WAL, but a standby that falls behind and relies on restore_command to catch up cannot fetch segments that were never archived. If the primary no longer holds those segments either (for example after an emergency cleanup), the standby enters a walreceiver error loop and must be rebuilt. In a Patroni/repmgr HA cluster, this can trigger a false failover.
Silent failure window. archive_command failures are logged but do not raise a Postgres error to application clients. Without active monitoring on pg_stat_archiver.failed_count, this can go undetected for hours.
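The time you have left before the volume fills is free space divided by the WAL generation rate. A rough sketch of that arithmetic follows; minutes_until_full is a hypothetical helper, and the rate has to be measured on your own cluster (for instance by sampling pg_current_wal_lsn() a minute apart and passing both values to pg_wal_lsn_diff()).

```shell
# minutes_until_full: hypothetical helper for back-of-the-envelope planning.
# Args: free space in MB, WAL generation rate in MB per minute (both integers).
minutes_until_full() {
  free_mb="$1"
  rate_mb_per_min="$2"
  if [ "$rate_mb_per_min" -le 0 ]; then
    echo "rate must be a positive integer" >&2
    return 1
  fi
  # Integer division rounds down, which errs on the pessimistic side.
  echo $(( free_mb / rate_mb_per_min ))
}
```

For example, `minutes_until_full 204800 1024` prints 200: with 200 GB free and 1 GB of WAL per minute you have a little over three hours. For comparison, filling 500 GB in 30 minutes corresponds to roughly 17 GB of WAL per minute.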

Common root causes by exit code:

Exit Code	Most Likely Cause
`1`	Generic rsync/cp error — check stderr in pg_log
`11`	rsync: error in file I/O (for example, a full disk on the destination)
`23`	rsync: partial transfer — destination path missing or permission denied
`255`	SSH connection refused or host key mismatch
`126`	Archive command binary not executable by `postgres` user
`127`	Binary not found in `$PATH` at Postgres startup environment

How to Fix It

Step 0: Reproduce the Failure Manually

Always do this first. Postgres runs archive_command through the shell as the postgres OS user, using the server process's environment, which is often much sparser than an interactive login. Reproduce it as closely as you can:

# Switch to the postgres OS user
su - postgres

# Copy the exact command from DETAIL: in pg_log and run it manually
rsync -av /var/lib/postgresql/14/main/pg_wal/000000010000000000000001 \
  [email protected]:/mnt/wal_archive/000000010000000000000001

# Check the exit code
echo $?

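The stderr output here is your actual error. The LOG line reports only the exit code; the command's stderr shows up in the server log only if the server's stderr is being captured (for example with logging_collector = on).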
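Because su - postgres gives you a full login environment, a command can succeed by hand and still fail under the server, typically with exit code 127 when a binary is only on your interactive PATH. A hedged sketch of a closer reproduction is to run the command through sh with an almost empty environment. run_like_archiver is a hypothetical helper and the PATH value is an assumption; adjust it to whatever your service manager actually gives the server.

```shell
# run_like_archiver: hypothetical helper that approximates a server-side run.
# Runs the given command string through sh with a nearly empty environment,
# lets stdout/stderr through unchanged, then reports the exit code on stderr.
run_like_archiver() {
  cmd="$1"
  # env -i clears the environment; the PATH below is an assumed minimal value.
  env -i PATH=/usr/bin:/bin sh -c "$cmd"
  rc=$?
  echo "exit code: $rc" >&2
  return "$rc"
}
```

Call it as the postgres user with the command string copied from the DETAIL: line, wrapped in single quotes. If it fails here but works in your login shell, compare the two environments before touching the archive destination.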

Basic Fix: Correct the archive_command

The most common failure pattern is a missing destination directory or wrong permissions. Fix the destination, then update postgresql.conf:

# postgresql.conf

- archive_command = 'cp %p /mnt/wal_archive/%f'
+ archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

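Why test ! -f matters: Without this guard, cp silently overwrites whatever already sits at the destination, which can destroy a good segment written earlier or by another cluster. With the guard, the command exits non-zero when the file exists, so nothing is overwritten and Postgres keeps the segment and retries. This is a refuse-to-overwrite guard, not idempotency: if a failed attempt left a partial file behind, the command keeps failing until you inspect and remove that file. The PostgreSQL documentation recommends returning zero for a pre-existing file only when its contents are identical to the segment being archived, and non-zero when they differ.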
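One way to get both properties (never overwrite, yet succeed when an identical copy is already in place) is to compare before copying and to copy under a temporary name. The following is a sketch for a local or mounted archive directory only; archive_segment is a hypothetical function, and it does not fsync, so it is not a substitute for a real archiving tool.

```shell
# archive_segment: hypothetical local-filesystem archiver, for illustration only.
# Usage: archive_segment <source_path> <archive_dir> <file_name>
# Exit 0: file archived now, or an identical copy already exists.
# Exit 1: a different file with the same name exists, or the copy failed.
archive_segment() {
  src="$1"; dest_dir="$2"; name="$3"
  dest="$dest_dir/$name"
  if [ -f "$dest" ]; then
    # Identical content means an earlier attempt succeeded; report success.
    cmp -s "$src" "$dest" && return 0
    echo "refusing to overwrite differing $dest" >&2
    return 1
  fi
  tmp="$dest_dir/.$name.tmp.$$"
  # Copy under a temporary name, then rename, so an interrupted copy
  # never leaves a truncated file under the final name.
  if cp "$src" "$tmp" && mv "$tmp" "$dest"; then
    return 0
  fi
  rm -f "$tmp"
  return 1
}
```

Wrapped in a script, this could be called from archive_command with %p, the archive directory, and %f as its three arguments.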

For rsync over SSH:

- archive_command = 'rsync -a %p [email protected]:/mnt/wal_archive/%f'
+ archive_command = 'rsync -a --ignore-existing %p [email protected]:/mnt/wal_archive/%f'

After editing postgresql.conf, reload the configuration (archive_command does not require a restart):

psql -U postgres -c "SELECT pg_reload_conf();"

# Force a WAL switch to trigger immediate archiving attempt
psql -U postgres -c "SELECT pg_switch_wal();"

# Verify archiving recovered
psql -U postgres -c "SELECT last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;"
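If you want to script the final check rather than eyeball it, one option is to record last_archived_wal before the switch and poll until it moves forward. A sketch under assumptions: wait_for_wal_advance is a hypothetical helper, the fetch command is passed in as arguments, and the comparison assumes plain 24-character segment names (timeline history or backup label entries would need extra handling).

```shell
# wait_for_wal_advance: hypothetical polling helper.
# Usage: wait_for_wal_advance <baseline_name> <timeout_seconds> <fetch command...>
# Succeeds (and prints the new name) once the fetch command reports a name
# that sorts after the baseline; fails when the timeout expires.
wait_for_wal_advance() {
  before="$1"; timeout_s="$2"
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while [ "$(date +%s)" -le "$deadline" ]; do
    current="$("$@" 2>/dev/null)" || current=""
    if [ -n "$current" ] && [ "$current" != "$before" ]; then
      # Segment names are fixed-width hex, so byte-wise sort order
      # matches sequence order.
      newest="$(printf '%s\n%s\n' "$before" "$current" | LC_ALL=C sort | tail -n 1)"
      if [ "$newest" = "$current" ]; then
        printf '%s\n' "$current"
        return 0
      fi
    fi
    sleep 1
  done
  return 1
}
```

A typical call would capture the baseline with `psql -U postgres -At -c "SELECT last_archived_wal FROM pg_stat_archiver"`, run pg_switch_wal(), and then pass that same psql command as the fetch command with a timeout of, say, 60 seconds.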

Enterprise Best Practice: Hardened Shell Wrapper with Logging and Alerting

Never put complex logic directly in archive_command. Use a wrapper script with proper error handling, logging, and alerting hooks:

# postgresql.conf

- archive_command = 'rsync -a %p [email protected]:/mnt/wal_archive/%f'
+ archive_command = '/opt/postgres/bin/archive_wal.sh %p %f'

#!/bin/bash
# /opt/postgres/bin/archive_wal.sh
# Hardened WAL archive wrapper
# Must be owned by root, executable by postgres: chmod 750, chown root:postgres

set -euo pipefail

WAL_PATH="$1"
WAL_FILE="$2"
ARCHIVE_HOST="[email protected]"
ARCHIVE_DIR="/mnt/wal_archive"
LOG_FILE="/var/log/postgresql/wal_archive.log"
LOCKFILE="/tmp/wal_archive_${WAL_FILE}.lock"

log() {
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) [$$] $*" >> "$LOG_FILE"
}

# Prevent concurrent archiving of the same segment (rare but possible during crash recovery)
exec 9>"$LOCKFILE"
# Exit non-zero here: exit 0 would tell Postgres the segment is archived and safe to recycle.
flock -n 9 || { log "WARN: Lock held for $WAL_FILE, another attempt is in progress"; exit 1; }

log "INFO: Starting archive of $WAL_FILE"

# Idempotency check: if the file already exists at the destination, exit 0 immediately.
# This checks existence only; it does not verify that the contents match.
if ssh -o BatchMode=yes -o ConnectTimeout=10 "$ARCHIVE_HOST" \
     "test -f ${ARCHIVE_DIR}/${WAL_FILE}" 2>/dev/null; then
  log "INFO: $WAL_FILE already archived, skipping"
  exit 0
fi

# Perform the transfer
if rsync -a --timeout=60 "$WAL_PATH" "${ARCHIVE_HOST}:${ARCHIVE_DIR}/${WAL_FILE}"; then
  log "INFO: Successfully archived $WAL_FILE"
  # Optional: push metric to your monitoring system
  # curl -sf -X POST "http://pushgateway:9091/metrics/job/wal_archiver" \
  #   --data-binary "wal_archive_success_total 1" || true
  exit 0
else
  EXIT_CODE=$?
  log "ERROR: rsync failed for $WAL_FILE with exit code $EXIT_CODE"
  # PagerDuty/Alertmanager webhook call here
  exit $EXIT_CODE
fi

Permissions setup:

chown root:postgres /opt/postgres/bin/archive_wal.sh
chmod 750 /opt/postgres/bin/archive_wal.sh
mkdir -p /var/log/postgresql
chown postgres:postgres /var/log/postgresql


Prevention in CI/CD

1. Monitor `pg_stat_archiver` — The Only Reliable Signal

Add this as a Prometheus alerting rule, using the metrics exposed by postgres_exporter:

# prometheus/rules/postgres.yml
groups:
  - name: postgres_wal
    rules:
      - alert: WALArchivingFailing
        # failed_count is cumulative, so alert on recent growth rather than the raw value
        expr: increase(pg_stat_archiver_failed_count[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "WAL archiving has failed {{ $value }} times on {{ $labels.instance }}"
          runbook: "https://your-wiki/runbooks/postgres-wal-archive-failure"

      - alert: WALArchivingStalled
        # Metric names vary between exporter versions; confirm this one exists in your /metrics output
        expr: time() - pg_stat_archiver_last_archived_time_seconds > 300
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No WAL segment archived in >5 minutes on {{ $labels.instance }}"

2. Validate archive_command in CI Before Deployment

If you manage postgresql.conf via Ansible, Terraform, or Helm, add a pre-flight validation job:

# In your CI pipeline (GitHub Actions / GitLab CI)
- name: Validate WAL archive connectivity
  run: |
    # Test SSH key auth and destination writability before deploying config
    ssh -o BatchMode=yes -o ConnectTimeout=5 archive@$ARCHIVE_HOST \
      "mkdir -p /mnt/wal_archive && touch /mnt/wal_archive/.write_test && rm /mnt/wal_archive/.write_test"
    echo "Archive destination reachable and writable"

3. Use pgBackRest or WAL-G Instead of Raw archive_command

For any production cluster, stop writing raw cp/rsync archive commands. Both tools handle retries, compression, encryption, parallelism, and idempotency correctly:

# postgresql.conf — pgBackRest example

- archive_command = 'cp %p /mnt/wal_archive/%f'
+ archive_command = 'pgbackrest --stanza=main archive-push %p'
+ archive_mode = on

# postgresql.conf — WAL-G example (S3)

- archive_command = 'rsync -a %p archive@host:/mnt/wal_archive/%f'
+ archive_command = 'wal-g wal-push %p'

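Both tools are designed to return exit code 0 only after the segment has been stored in the repository, and they take over the retry, compression, and integrity handling that raw shell commands leave to you. Check each tool's documentation for its exact retry behavior and for monitoring integrations. Unlike archive_command, a change to archive_mode takes effect only after a server restart.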

4. Checkov / Ansible Lint Guardrails

If you provision Postgres config via Ansible:

# ansible/roles/postgres/tasks/validate_archive.yml
- name: Assert archive_command is not a raw cp or rsync
  assert:
    that:
      - "'pgbackrest' in postgresql_archive_command or 'wal-g' in postgresql_archive_command"
    fail_msg: >-
      POLICY VIOLATION: archive_command must use pgBackRest or WAL-G.
      Raw cp/rsync commands have no retry logic and will cause disk exhaustion.
      Current value: {{ postgresql_archive_command }}
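
For pipelines that do not run Ansible, the same policy can be approximated with a small shell check against the rendered postgresql.conf. This is a sketch only: archive_command_is_approved is a hypothetical name, and it does not follow include directives or settings applied with ALTER SYSTEM.

```shell
# archive_command_is_approved: hypothetical CI lint, mirroring the assert above.
# Usage: archive_command_is_approved <path/to/postgresql.conf>
# Succeeds only if the last uncommented archive_command line mentions
# pgbackrest or wal-g.
archive_command_is_approved() {
  conf="$1"
  # Take the last non-comment assignment; when a setting is repeated in
  # the file, the last occurrence is the one that applies.
  line="$(grep -E '^[[:space:]]*archive_command([[:space:]]|=)' "$conf" | tail -n 1)"
  if [ -z "$line" ]; then
    echo "archive_command is not set in $conf" >&2
    return 1
  fi
  case "$line" in
    *pgbackrest*|*wal-g*) return 0 ;;
    *) echo "POLICY VIOLATION: $line" >&2; return 1 ;;
  esac
}
```

In a CI job, call it on the generated file and let a non-zero exit status fail the build.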