
How to Fix PostgreSQL 'archive command failed with exit code X' and Restore WAL Archiving

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins


TL;DR

  • What broke: PostgreSQL ran archive_command for a completed WAL segment and the command exited with a non-zero code. The archiver keeps retrying the same segment (archive_timeout only forces segment switches; it does not control retries), so WAL accumulates in pg_wal/ and your PITR archive stops advancing.
  • How to fix it: Reproduce the failing archive_command manually as the postgres OS user, resolve the underlying permission, path, or network mount error, then confirm pg_stat_archiver.last_archived_wal advances.

The Incident (What Does the Error Mean?)

The raw log entry looks like this:

LOG:  archive command failed with exit code 1
DETAIL:  The failed archive command was: rsync -a /var/lib/postgresql/14/main/pg_wal/000000010000000000000001 [email protected]:/mnt/wal_archive/000000010000000000000001
LOG:  archive command failed with exit code 23
DETAIL:  The failed archive command was: cp /var/lib/postgresql/14/main/pg_wal/000000010000000000000023 /mnt/nfs/wal_archive/000000010000000000000023

Immediate consequence: PostgreSQL does not remove or recycle a WAL segment in pg_wal/ until it has been archived successfully. Every write keeps generating WAL, so on a write-heavy cluster pg_wal/ can fill the data volume within hours, sometimes sooner. When the volume is full, Postgres fails with a PANIC such as PANIC: could not write to file "pg_wal/xlogtemp" and the primary goes down hard. Any standby or PITR restore that reads WAL from the archive through restore_command stops receiving new segments from the moment archiving fails.


The Attack Vector / Blast Radius

This is not a soft warning. The blast radius is:

  1. Disk exhaustion on the primary. max_wal_size is only a soft limit, so pg_wal/ has no hard size cap while archiving is broken. How fast the volume fills depends on your WAL generation rate; on a busy OLTP cluster it can take hours or less. Once the volume is full, Postgres will PANIC and refuse connections.

  2. Complete PITR gap. Any WAL segment that fails to archive is a hole in your recovery timeline. If the primary's storage is lost before the segment is archived, the changes in that segment cannot be recovered by restoring a base backup and replaying archived WAL. (pg_restore does not apply here; it restores logical dumps.) Your RPO is now undefined.

  3. Standby impact. Streaming standbys are not starved directly, because the primary keeps every unarchived segment in pg_wal/. Standbys that fetch WAL from the archive via restore_command stop advancing. If the primary PANICs on a full disk, a Patroni or repmgr cluster will fail over, and a standby that depended on the archive may need a pg_basebackup rebuild.

  4. Silent failure window. archive_command failures are logged but do not raise a Postgres error to application clients. Without active monitoring on pg_stat_archiver.failed_count, this can go undetected for hours.
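The backlog described above can be measured directly. PostgreSQL marks each segment that is still waiting for archive_command with a <segment>.ready file under pg_wal/archive_status/ and renames it to .done after a successful archive. The sketch below counts those markers; the function name count_pending_wal is hypothetical, and the directory you pass depends on your data directory layout.

```shell
# count_pending_wal STATUS_DIR  (hypothetical helper name)
# Prints how many WAL segments are still waiting to be archived,
# judged by the *.ready marker files PostgreSQL keeps in archive_status/.
count_pending_wal() {
  status_dir="$1"
  # find prints one line per marker; wc -l counts them (0 when none exist).
  find "$status_dir" -maxdepth 1 -type f -name '*.ready' | wc -l | tr -d ' '
}
```

For the data directory used in the log excerpt above, the call would be count_pending_wal /var/lib/postgresql/14/main/pg_wal/archive_status. A number that keeps growing means the archiver is not making progress.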

Common root causes by exit code:

Exit Code   Most Likely Cause
1           cp: generic failure; rsync: syntax or usage error. Read the command's stderr in the server log
11          rsync: error in file I/O, for example a full or read-only archive filesystem
23          rsync: partial transfer due to error, typically a missing destination path or permission denied
126         Archive command binary found but not executable by the postgres user
127         Binary not found in the $PATH of the Postgres server's environment
255         SSH failure: connection refused, authentication failure, or host key mismatch
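For quick triage in a terminal or an alert handler, the same lookup can be written as a small shell function. This is a sketch only: the name explain_archive_exit is hypothetical, the rsync meanings follow the rsync man page, 126 and 127 are the usual shell conventions, and 255 is what ssh returns for its own errors. The function cannot tell which program produced the code, so treat its output as a hint.

```shell
# explain_archive_exit CODE  (hypothetical helper name)
# Maps an archive_command exit code to its most common meaning.
explain_archive_exit() {
  case "$1" in
    0)   echo "success" ;;
    11)  echo "rsync: error in file I/O" ;;
    12)  echo "rsync: error in protocol data stream" ;;
    23)  echo "rsync: partial transfer due to error" ;;
    30)  echo "rsync: timeout in data send/receive" ;;
    126) echo "shell: command found but not executable" ;;
    127) echo "shell: command not found" ;;
    255) echo "ssh: connection or authentication error" ;;
    *)   echo "unrecognized: read the command's stderr" ;;
  esac
}
```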

How to Fix It

Step 0: Reproduce the Failure Manually

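Always do this first. Postgres runs archive_command through the shell as the OS user that owns the server process (usually postgres), using the server's environment rather than your login environment. Reproduce it as closely as you can: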

# Switch to the postgres OS user
su - postgres

# Copy the exact command from DETAIL: in pg_log and run it manually
rsync -av /var/lib/postgresql/14/main/pg_wal/000000010000000000000001 \
  [email protected]:/mnt/wal_archive/000000010000000000000001

# Check the exit code
echo $?

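The stderr output here is your actual error. The Postgres LOG line records only the exit code; the command's own stderr goes to the server's stderr stream, so it reaches the log only if that stream is captured (for example by logging_collector).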
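A login shell started with su - postgres loads profile files that the server process never saw, so a command can succeed by hand and still fail under Postgres. The sketch below reruns a command string with an emptied environment and keeps its stderr in a file. The name run_stripped and the minimal PATH are assumptions; the real server environment is whatever your service manager provides, so this is a stricter approximation, not an exact replica.

```shell
# run_stripped COMMAND_STRING STDERR_FILE  (hypothetical helper name)
# Runs the command through /bin/sh with an emptied environment and a
# minimal PATH, saving stderr so the real error message is not lost.
run_stripped() {
  cmd="$1"
  err_file="$2"
  # env -i clears every inherited variable; the exit status is the command's.
  env -i PATH=/usr/bin:/bin /bin/sh -c "$cmd" 2>"$err_file"
}
```

Run it as the postgres user with the exact command from the DETAIL: line, then inspect $? and the stderr file. An exit code of 127 here, but not in your login shell, points at a PATH difference.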


Basic Fix: Correct the archive_command

The most common failure pattern is a missing destination directory or wrong permissions. Fix the destination, then update postgresql.conf:

# postgresql.conf

- archive_command = 'cp %p /mnt/wal_archive/%f'
+ archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

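Why test ! -f matters: Without this guard, cp silently overwrites a segment that is already in the archive. With it, the command refuses to overwrite and returns a non-zero status when the file exists, which is the behavior the PostgreSQL documentation recommends. The guard does not make the command idempotent: if a partial copy was left behind, or if Postgres re-sends a segment after a crash, the command keeps failing until you inspect the existing file and either remove it or confirm it is complete.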
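One way to both refuse overwrites and still succeed when the same segment is sent twice is to compare contents before deciding. The following is a minimal sketch for a locally mounted archive directory, not a hardened implementation: the function name archive_segment is hypothetical, and it does not fsync the copied file.

```shell
# archive_segment SRC DEST_DIR NAME  (hypothetical helper name)
# Copies SRC to DEST_DIR/NAME without ever overwriting an existing file.
archive_segment() {
  src="$1"
  dest="$2/$3"
  if [ -f "$dest" ]; then
    # Identical bytes already archived: success. Different bytes: fail.
    cmp -s "$src" "$dest"
    return $?
  fi
  # Copy under a temporary name, then rename, so a crash mid-copy
  # never leaves a truncated file under the final name.
  tmp="$dest.tmp.$$"
  cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
  mv "$tmp" "$dest"
}
```

Saved as a script that calls the function with its three arguments, it would be referenced from archive_command with %p as the source and %f as the name. A mismatch between the existing file and the new segment returns non-zero on purpose, because that case needs a human to look at it.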

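For rsync over SSH (note that --ignore-existing returns success whenever a file with that name already exists, without comparing contents):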

- archive_command = 'rsync -a %p [email protected]:/mnt/wal_archive/%f'
+ archive_command = 'rsync -a --ignore-existing %p [email protected]:/mnt/wal_archive/%f'

After editing postgresql.conf, reload, do not restart:

psql -U postgres -c "SELECT pg_reload_conf();"

# Force a WAL switch to trigger immediate archiving attempt
psql -U postgres -c "SELECT pg_switch_wal();"

# Verify archiving recovered
psql -U postgres -c "SELECT last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;"
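A single query shows only one point in time. To confirm that archiving has really resumed, you can poll until last_archived_wal changes. The sketch below takes the fetch command as a string, so it also works with a different user or host; the name wait_for_wal_advance, the one-second poll interval, and the return codes are assumptions of this example.

```shell
# wait_for_wal_advance FETCH_CMD TIMEOUT_SECONDS  (hypothetical helper name)
# Re-runs FETCH_CMD once per second until its output differs from the
# first reading. Returns 0 and prints the new value on change, 1 on
# timeout, 2 if FETCH_CMD itself fails.
wait_for_wal_advance() {
  fetch_cmd="$1"
  timeout="$2"
  start_value=$(sh -c "$fetch_cmd") || return 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    sleep 1
    elapsed=$((elapsed + 1))
    current=$(sh -c "$fetch_cmd") || return 2
    if [ "$current" != "$start_value" ]; then
      echo "$current"
      return 0
    fi
  done
  return 1
}
```

A typical call would be wait_for_wal_advance 'psql -U postgres -Atc "SELECT last_archived_wal FROM pg_stat_archiver"' 120, started before pg_switch_wal() is run in another session so that the first reading is the old value.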

Enterprise Best Practice: Hardened Shell Wrapper with Logging and Alerting

Never put complex logic directly in archive_command. Use a wrapper script with proper error handling, logging, and a hook for alerting:

# postgresql.conf

- archive_command = 'rsync -a %p [email protected]:/mnt/wal_archive/%f'
+ archive_command = '/opt/postgres/bin/archive_wal.sh %p %f'

The wrapper script, saved as /opt/postgres/bin/archive_wal.sh:

#!/bin/bash
# Hardened WAL archive wrapper
# Must be owned by root, executable by postgres: chmod 750, chown root:postgres

set -euo pipefail

WAL_PATH="$1"
WAL_FILE="$2"
ARCHIVE_HOST="[email protected]"
ARCHIVE_DIR="/mnt/wal_archive"
LOG_FILE="/var/log/postgresql/wal_archive.log"
LOCKFILE="/tmp/wal_archive_${WAL_FILE}.lock"

log() {
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) [$$] $*" >> "$LOG_FILE"
}

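# Prevent concurrent archiving of the same segment (rare but possible during crash recovery).
# Exit non-zero when the lock is held: exit 0 would tell Postgres the segment is archived,
# and Postgres could then recycle it before the other attempt has finished.
exec 9>"$LOCKFILE"
flock -n 9 || { log "WARN: Lock held for $WAL_FILE, will retry"; exit 1; }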

log "INFO: Starting archive of $WAL_FILE"

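# Idempotency check: if a file with this name is already in the archive, exit 0.
# This relies on rsync writing to a temporary name and renaming on completion;
# it does not compare file contents.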
if ssh -o BatchMode=yes -o ConnectTimeout=10 "$ARCHIVE_HOST" \
     "test -f ${ARCHIVE_DIR}/${WAL_FILE}" 2>/dev/null; then
  log "INFO: $WAL_FILE already archived, skipping"
  exit 0
fi

# Perform the transfer
if rsync -a --timeout=60 "$WAL_PATH" "${ARCHIVE_HOST}:${ARCHIVE_DIR}/${WAL_FILE}"; then
  log "INFO: Successfully archived $WAL_FILE"
  # Optional: push metric to your monitoring system
  # curl -sf -X POST "http://pushgateway:9091/metrics/job/wal_archiver" \
  #   --data-binary "wal_archive_success_total 1" || true
  exit 0
else
  EXIT_CODE=$?
  log "ERROR: rsync failed for $WAL_FILE with exit code $EXIT_CODE"
  # PagerDuty/Alertmanager webhook call here
  exit $EXIT_CODE
fi

Permissions setup:

chown root:postgres /opt/postgres/bin/archive_wal.sh
chmod 750 /opt/postgres/bin/archive_wal.sh
mkdir -p /var/log/postgresql
chown postgres:postgres /var/log/postgresql



Prevention in CI/CD

1. Monitor pg_stat_archiver — The Only Reliable Signal

Add this as a Prometheus alerting rule on metrics from postgres_exporter (the exact metric names depend on the exporter's query configuration):

# prometheus/rules/postgres.yml
groups:
  - name: postgres_wal
    rules:
      - alert: WALArchivingFailing
        # failed_count is a cumulative counter, so alert on recent growth rather than on > 0
        expr: increase(pg_stat_archiver_failed_count[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "WAL archiving has failed {{ $value }} times on {{ $labels.instance }}"
          runbook: "https://your-wiki/runbooks/postgres-wal-archive-failure"

      - alert: WALArchivingStalled
        # An idle cluster archives nothing unless archive_timeout is set; tune the threshold accordingly
        expr: time() - pg_stat_archiver_last_archived_time_seconds > 300
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No WAL segment archived in >5 minutes on {{ $labels.instance }}"

2. Validate archive_command in CI Before Deployment

If you manage postgresql.conf via Ansible, Terraform, or Helm, add a pre-flight validation job:

# In your CI pipeline (GitHub Actions / GitLab CI)
- name: Validate WAL archive connectivity
  run: |
    # Test SSH key auth and destination writability before deploying config
    ssh -o BatchMode=yes -o ConnectTimeout=5 archive@$ARCHIVE_HOST \
      "mkdir -p /mnt/wal_archive && touch /mnt/wal_archive/.write_test && rm /mnt/wal_archive/.write_test"
    echo "Archive destination reachable and writable"
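When the archive is a locally mounted path, such as the NFS mount in the cp example earlier, the same pre-flight idea can run on the database host itself. The sketch below checks that the directory exists, is writable by the current user, and has a minimum amount of free space. The name preflight_archive_dir and the threshold parameter are hypothetical, and it should be run as the postgres user to be meaningful.

```shell
# preflight_archive_dir DIR MIN_FREE_KB  (hypothetical helper name)
# Returns 0 only if DIR exists, is writable, and has enough free space.
preflight_archive_dir() {
  dir="$1"
  min_free_kb="$2"
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  probe="$dir/.write_test.$$"
  # Create and remove a probe file to prove write access.
  ( : > "$probe" ) 2>/dev/null || { echo "not writable: $dir" >&2; return 1; }
  rm -f "$probe"
  # df -Pk prints a fixed POSIX layout; column 4 is available space in KB.
  free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge "$min_free_kb" ] || { echo "low space: ${free_kb} KB free" >&2; return 1; }
}
```

A stale NFS handle typically shows up here as a missing or unwritable directory, which matches the cp failure seen in the log.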

3. Use pgBackRest or WAL-G Instead of Raw archive_command

For any production cluster, stop writing raw cp/rsync archive commands. Both tools handle retries, compression, encryption, parallelism, and idempotency correctly:

# postgresql.conf — pgBackRest example

- archive_command = 'cp %p /mnt/wal_archive/%f'
+ archive_command = 'pgbackrest --stanza=main archive-push %p'
+ archive_mode = on   # changing archive_mode requires a server restart, not just a reload

# postgresql.conf — WAL-G example (S3)

- archive_command = 'rsync -a %p archive@host:/mnt/wal_archive/%f'
+ archive_command = 'wal-g wal-push %p'

Both tools are designed to return exit code 0 only once the segment has been stored in the repository. Retry and monitoring features differ between the two and between versions (pgBackRest, for example, offers asynchronous pushing through its archive-async option), so confirm them in each tool's documentation rather than assuming that every transient network failure is handled for you.

4. Checkov / Ansible Lint Guardrails

If you provision Postgres config via Ansible:

# ansible/roles/postgres/tasks/validate_archive.yml
- name: Assert archive_command is not a raw cp or rsync
  assert:
    that:
      - "'pgbackrest' in postgresql_archive_command or 'wal-g' in postgresql_archive_command"
    fail_msg: >-
      POLICY VIOLATION: archive_command must use pgBackRest or WAL-G.
      Raw cp/rsync commands have no retry logic and will cause disk exhaustion.
      Current value: {{ postgresql_archive_command }}
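If the configuration is not templated through Ansible, the same policy can be checked against a rendered postgresql.conf in a plain CI shell step. This is a sketch under assumptions: the name lint_archive_command and its return codes are hypothetical, it reads a single file and does not follow include directives, and it uses the same substring policy as the assert above.

```shell
# lint_archive_command CONF_FILE  (hypothetical helper name)
# Returns 0 if archive_command uses pgBackRest or WAL-G, 1 if it uses
# anything else, 2 if no uncommented archive_command line is found.
lint_archive_command() {
  conf="$1"
  # The last uncommented setting wins, as it does when Postgres reads the file.
  line=$(grep -E '^[[:space:]]*archive_command[[:space:]=]' "$conf" | tail -n 1)
  [ -n "$line" ] || { echo "archive_command not set in $conf" >&2; return 2; }
  case "$line" in
    *pgbackrest*|*wal-g*) return 0 ;;
    *) echo "POLICY VIOLATION: $line" >&2; return 1 ;;
  esac
}
```

In a pipeline, a call such as lint_archive_command ./rendered/postgresql.conf (path hypothetical) fails the job before a raw cp or rsync command reaches a server.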
