How to Fix PostgreSQL 'number of requested standby connections exceeds max_wal_senders' Error
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10 mins
TL;DR
- What broke: max_wal_senders is set too low; PostgreSQL refused a new WAL streaming connection from a standby or replication tool (pgBackRest, Barman, logical replication slot consumer, read replica).
- How to fix it: Increase max_wal_senders in postgresql.conf to cover all concurrent WAL consumers, then restart PostgreSQL. The parameter can only be set at server start, so SELECT pg_reload_conf(); alone does not apply it.
The Incident (What Does the Error Mean?)
Raw error output:
FATAL: number of requested standby connections exceeds max_wal_senders (currently 10)
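PostgreSQL maintains a fixed pool of WAL sender processes, sized at server start by max_wal_senders. Every streaming replica, pg_basebackup job, logical replication publisher connection, and replication slot consumer occupies one WAL sender for as long as it stays connected (pg_basebackup with its default WAL streaming opens two connections). When that pool is full, the next connection attempt dies immediately with this FATAL. The new replica does not connect, and nothing queues the request: a standby keeps retrying and keeps being refused until a sender frees up or the limit is raised.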
If this is a primary→replica HA pair and the replica reconnects after a network blip, it will be refused — replica lag grows unbounded, and your failover target is now stale.
The Attack Vector / Blast Radius
This is a resource exhaustion / misconfiguration failure, not a security exploit, but the blast radius is severe:
- Replica starvation. Any standby that can't connect stops applying WAL. Under high write load, pg_wal on the primary grows until wal_keep_size is exceeded or a replication slot holds WAL indefinitely. Disk exhaustion on the primary is now on the table.
- Replication slot bloat. If you're using logical replication slots and the subscriber can't connect, the slot keeps accumulating WAL segments. This is one of the most common causes of unplanned primary disk-full outages.
- Backup pipeline failure. pg_basebackup and pgBackRest streaming backups also consume WAL sender slots. A full max_wal_senders means your backup job fails, and unless you alert on the job itself you discover this during a recovery drill, not before.
- Cascading HA failure. Patroni, repmgr, and Stolon all use streaming replication health as a fencing signal. A replica that can't connect may be incorrectly fenced or promoted, causing split-brain.
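To put a number on how much WAL a stalled slot is holding back, the arithmetic behind an LSN difference can be sketched in a few lines. This is a minimal illustration, assuming LSNs in PostgreSQL's usual text form of two hexadecimal halves separated by a slash; the function names and the sample LSNs are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into an absolute byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32 bits.
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between a slot's restart_lsn and the current write position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


# Hypothetical LSNs, for illustration only.
retained = retained_wal_bytes("2/10000000", "1/F0000000")
print(f"{retained / 1024 ** 2:.0f} MiB retained")  # prints "512 MiB retained"
```

A value that keeps growing between checks while the slot shows no connected consumer is the pattern described in the slot bloat bullet above.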
How to Fix It
Audit Your Current WAL Consumer Count First
-- Run on primary
SELECT client_addr, state, sent_lsn, write_lsn, application_name
FROM pg_stat_replication;
-- Count active + reserved slots
SELECT slot_name, slot_type, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
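Add up the streaming replicas, logical subscribers, and backup agents, plus any inactive replication slots whose consumers are expected to reconnect. An active slot already appears as a row in pg_stat_replication, so do not count it twice. That total is your floor.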
Basic Fix — Increase max_wal_senders
# postgresql.conf
- max_wal_senders = 10
+ max_wal_senders = 20
- max_replication_slots = 10
+ max_replication_slots = 20
Rule of thumb: max_wal_senders = (number of replicas) + (number of backup agents) + (number of logical subscribers) + 5 headroom
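The rule of thumb can be written as a small helper so the sizing is reproducible in review. This is only a sketch: the function name and the sample counts are hypothetical, and the default headroom of 5 is taken from the rule above.

```python
def recommended_max_wal_senders(replicas: int, backup_agents: int,
                                logical_subscribers: int, headroom: int = 5) -> int:
    """Apply the rule of thumb: sum of all WAL consumers plus spare headroom."""
    counts = (replicas, backup_agents, logical_subscribers, headroom)
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    return sum(counts)


# Example: 3 replicas, 2 backup agents, 1 logical subscriber.
print(recommended_max_wal_senders(3, 2, 1))  # prints 11
```

If a backup tool opens more than one replication connection per job, count each connection as its own consumer.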
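Apply with a restart. max_wal_senders and max_replication_slots can only be set at server start, so SELECT pg_reload_conf(); does not change them. On a hot standby, max_wal_senders must be at least as high as on the primary, so raise it on the standbys first.
# Restart command varies by platform; pg_ctl is shown, systemd and Patroni setups differ
pg_ctl restart -D "$PGDATA"
-- Verify after the restart
SHOW max_wal_senders;
-- pending_restart = t means the file changed but the server has not restarted yet
SELECT name, setting, pending_restart FROM pg_settings WHERE name IN ('max_wal_senders', 'max_replication_slots');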
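⚠️ max_replication_slots is sized independently of max_wal_senders; PostgreSQL does not require one to be ≥ the other. It must cover every physical and logical slot you create. If it is too low, slot creation fails with "all replication slots are in use", which is a separate failure from the one above.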
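Before restarting, it can help to check the edited file rather than trust the diff. The sketch below parses postgresql.conf text and flags two problems: max_wal_senders below the floor you calculated, and wal_level = minimal combined with a non-zero max_wal_senders, which PostgreSQL rejects at startup. The function names are hypothetical and the parser is deliberately simplistic: it ignores include directives and treats any '#' as the start of a comment.

```python
def parse_postgresql_conf(text: str) -> dict:
    """Parse 'name = value' lines; later entries override earlier ones."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # simplistic: also cuts '#' inside quotes
        if not line:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
        else:  # the equals sign is optional in postgresql.conf
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            key, value = parts
        settings[key.strip().lower()] = value.strip().strip("'")
    return settings


def check_wal_settings(settings: dict, required_senders: int) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    senders = int(settings.get("max_wal_senders", 10))  # 10 is the default since PostgreSQL 10
    wal_level = settings.get("wal_level", "replica")    # 'replica' is the default since PostgreSQL 10
    if senders < required_senders:
        problems.append(f"max_wal_senders={senders} is below the required {required_senders}")
    if wal_level == "minimal" and senders > 0:
        problems.append("wal_level=minimal is incompatible with max_wal_senders > 0")
    return problems


# Hypothetical config fragment, for illustration only.
SAMPLE = "max_wal_senders = 10   # old value\nwal_level = replica\n"
for problem in check_wal_settings(parse_postgresql_conf(SAMPLE), required_senders=12):
    print(problem)
```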
Enterprise Best Practice — Parameterize and Guard in Terraform + Ansible
# terraform/modules/rds_pg/variables.tf (AWS RDS example)
- parameter {
- name = "max_wal_senders"
- value = "10"
- }
+ parameter {
+ name = "max_wal_senders"
+ value = var.max_wal_senders # default = 20, override per env
+ apply_method = "pending-reboot" # static parameter on RDS; takes effect only after the instance reboots
+ }
+
+ parameter {
+ name = "max_replication_slots"
+ value = var.max_replication_slots # keep in sync with max_wal_senders
+ apply_method = "pending-reboot"
+ }
# ansible/roles/postgres/templates/postgresql.conf.j2
- max_wal_senders = 10
+ max_wal_senders = {{ postgres_max_wal_senders | default(20) }}
+ max_replication_slots = {{ postgres_max_replication_slots | default(20) }}
+ wal_level = replica # 'minimal' is incompatible with max_wal_senders > 0; use 'logical' if you run logical replication
Also verify pg_hba.conf has the replication entry:
+ host replication replicator 10.0.0.0/8 scram-sha-256
Without this line, increasing max_wal_senders fixes the process pool but the auth handshake still fails.
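A quick way to confirm the entry exists is to scan pg_hba.conf for a network rule whose database field is the replication keyword. The following is a rough sketch with a hypothetical function name: it does not evaluate address ranges, group roles (+role), quoted names, or include files, so treat a positive result as a hint rather than proof that authentication will succeed.

```python
def has_replication_entry(hba_text: str, user: str) -> bool:
    """Return True if some host* rule grants the 'replication' database to the user."""
    for raw in hba_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        # Network rules look like: type database user address method
        if len(fields) < 5 or not fields[0].startswith("host"):
            continue
        databases, users = fields[1].split(","), fields[2].split(",")
        # 'all' in the database column does not cover physical replication connections
        if "replication" not in databases:
            continue
        if "all" in users or user in users:
            return True
    return False


# Sample file contents; the second rule is the one from the text above.
HBA = """
host all         all        0.0.0.0/0   md5
host replication replicator 10.0.0.0/8  scram-sha-256
"""
print(has_replication_entry(HBA, "replicator"))  # prints True
```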
Prevention in CI/CD
1. Checkov Policy — Catch Undersized WAL Senders in IaC
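# checkov custom check: check_pg_wal_senders.py
from checkov.common.models.enums import CheckCategories, CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck


def _unwrap(value):
    # Checkov's HCL parser usually wraps attribute values in single-item lists
    return value[0] if isinstance(value, list) and value else value


class PgMaxWalSendersCheck(BaseResourceCheck):
    def __init__(self):
        super().__init__(
            name="Ensure max_wal_senders >= 20 for production RDS PG",
            id="CKV_CUSTOM_PG_001",
            categories=[CheckCategories.GENERAL_SECURITY],
            supported_resources=["aws_db_parameter_group"],
        )

    def scan_resource_conf(self, conf):
        for p in conf.get("parameter", []):
            if _unwrap(p.get("name")) != "max_wal_senders":
                continue
            try:
                value = int(_unwrap(p.get("value", 0)))
            except (TypeError, ValueError):
                # Unresolved variable reference: cannot be evaluated statically
                return CheckResult.UNKNOWN
            return CheckResult.PASSED if value >= 20 else CheckResult.FAILED
        # Parameter not declared at all: treat as a policy failure
        return CheckResult.FAILED


# Checkov registers the check when the module instantiates it
check = PgMaxWalSendersCheck()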
2. Prometheus Alert — Catch Saturation Before It Bites
# prometheus/alerts/postgres.yml
groups:
  - name: postgres_replication
    rules:
      - alert: WalSendersSaturation
        expr: |
          pg_stat_replication_count / on(instance)
          pg_settings_max_wal_senders > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "WAL senders at {{ $value | humanizePercentage }} capacity on {{ $labels.instance }}"
          runbook: "https://wiki.internal/runbooks/postgres-wal-senders"
3. Pre-deploy Smoke Test
#!/usr/bin/env bash
# ci/scripts/check_pg_capacity.sh
# Connection settings come from the standard PG* environment variables.
set -euo pipefail

MAX=$(psql -At -c "SHOW max_wal_senders;")
ACTIVE=$(psql -At -c "SELECT count(*) FROM pg_stat_replication;")
# Inactive slots stand in for consumers that are expected to reconnect
SLOTS=$(psql -At -c "SELECT count(*) FROM pg_replication_slots WHERE active = false;")

# Fail when fewer than 3 WAL senders would remain free
if (( ACTIVE + SLOTS >= MAX - 3 )); then
  echo "FAIL: WAL sender headroom critically low ($ACTIVE active, $SLOTS idle slots, max=$MAX)"
  exit 1
fi
echo "OK: WAL sender capacity nominal ($ACTIVE/$MAX used)"
Plug this into your GitHub Actions pre-deploy job or Spinnaker gate. It catches saturation before a new replica or backup agent is provisioned.