How to Fix PostgreSQL FATAL: role 'replication' does not have replication permission
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 5 mins
TL;DR
- What broke: The PostgreSQL role used for streaming replication was created without the
REPLICATIONattribute — PostgreSQL rejects the connection at the authentication layer before WAL streaming begins. - How to fix it: Run
ALTER ROLE replication REPLICATION LOGIN;on the primary and confirmpg_hba.confhas areplicationconnection entry for that role. - Fast path: Use our Client-Side Sandbox below to auto-refactor your
CREATE ROLEstatement orpg_hba.conf— paste it in and get corrected SQL instantly.
The Incident (What does the error mean?)
FATAL: role "replication" does not have replication permission
This fires on the standby, logged during pg_basebackup or recovery.conf / primary_conninfo connection attempts. PostgreSQL's WAL sender process on the primary validates the connecting role's rolreplication flag in pg_authid before it opens a replication slot or streams WAL. If that flag is false, the connection dies immediately.
Immediate consequence: The standby is receiving zero WAL. It is frozen at the LSN it last received. If the primary fails right now, your failover target is stale. Patroni, repmgr, or Pacemaker will either promote a lagging replica or refuse to promote entirely.
The Attack Vector / Blast Radius
This is a misconfigured privilege failure, not a network failure. The blast radius:
- HA is silently broken. Monitoring that only checks
pg_stat_replicationrow existence may not catch this — the row never appears because the connection never completes. - RPO is undefined. You have no idea how long the replica has been dark. In a failover, you promote a replica with an unknown data gap.
pg_replication_slotsleak. If a slot was pre-created and the subscriber can never connect,pg_wal_lsn_diffonpg_replication_slots.confirmed_flush_lsnwill grow unbounded, causing WAL accumulation on the primary disk — a separate outage vector.- Security note: Granting
SUPERUSERas a workaround (common in rushed fixes) gives the replication user full DDL/DML access. A compromised replication credential then becomes a full database takeover. Never do this.
How to Fix It (The Solution)
Basic Fix
On the primary, check the current state:
SELECT rolname, rolreplication, rolcanlogin FROM pg_authid WHERE rolname = 'replication';
If rolreplication = false, patch it:
- CREATE ROLE replication LOGIN PASSWORD 'secret';
+ CREATE ROLE replication REPLICATION LOGIN PASSWORD 'secret';
Or if the role already exists:
- -- role exists without REPLICATION flag
+ ALTER ROLE replication REPLICATION LOGIN;
No PostgreSQL restart required. The next connection attempt from the standby will succeed immediately.
pg_hba.conf — Verify the replication entry
- host all replication 10.0.1.0/24 md5
+ host replication replication 10.0.1.0/24 scram-sha-256
Two things matter here: the database column must be replication (literal keyword, not a DB name), and use scram-sha-256 over md5 on any PostgreSQL ≥ 14 cluster.
After editing pg_hba.conf, reload (no restart):
psql -c "SELECT pg_reload_conf();"
# or
systemctl reload postgresql
Enterprise Best Practice — Least-Privilege Replication Role
Do not reuse application roles or grant superuser. Isolate replication identity:
- CREATE ROLE app_user SUPERUSER LOGIN PASSWORD 'apppass'; -- used for replication too
+ CREATE ROLE replicator REPLICATION LOGIN PASSWORD '<vault-generated>' CONNECTION LIMIT 3;
+ -- Grant only if using logical replication (PG14+):
+ GRANT pg_read_all_data TO replicator; -- logical only, omit for physical
For logical replication, the role also needs SELECT on published tables or pg_read_all_data. For physical/streaming, REPLICATION LOGIN is sufficient — nothing else.
Store the password in Vault or AWS Secrets Manager. Rotate it via ALTER ROLE replicator PASSWORD '...' — active WAL sender connections survive a password rotation without disconnect.
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Terraform / Pulumi — enforce at provisioning time:
- resource "postgresql_role" "replicator" {
- name = "replicator"
- login = true
- password = var.replication_password
- }
+ resource "postgresql_role" "replicator" {
+ name = "replicator"
+ login = true
+ replication = true
+ password = var.replication_password
+ connection_limit = 5
+ }
2. Checkov policy — add a custom check asserting replication = true on any role whose name matches replicat*.
3. Post-deploy smoke test in your pipeline (runs against the primary after Terraform apply):
psql "$PRIMARY_DSN" -c \
"SELECT rolreplication FROM pg_authid WHERE rolname='replicator'" \
| grep -q 't' || { echo 'REPLICATION flag missing — failing pipeline'; exit 1; }
4. Prometheus alert — if pg_stat_replication has zero rows and pg_is_in_recovery() is false on the primary, fire a P1 alert. This catches this failure class even when the role issue is introduced outside Terraform (manual psql, Ansible drift).
- alert: PostgreSQLNoActiveReplica
expr: pg_stat_replication_count == 0 and pg_replication_is_replica == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Primary has no connected replicas — HA is broken"