Initializing Enclave...

How to Fix PostgreSQL 'Checkpoints Are Occurring Too Frequently' by Tuning max_wal_size

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15 mins

TL;DR

  • What broke: max_wal_size is too small for your write throughput. PostgreSQL is forcing a checkpoint before the checkpoint_completion_target window closes, causing continuous I/O spikes and WAL file churn.
  • How to fix it: Increase max_wal_size (start at 2–4 GB on any serious workload), widen checkpoint_completion_target to 0.9, and verify with pg_stat_bgwriter.
  • Fast path: Use our Client-Side Sandbox below to auto-refactor your postgresql.conf — paste it in, get a corrected diff back without sending your config to a third-party server.

The Incident (What Does the Error Mean?)

Raw log output:

LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".

PostgreSQL writes all changes to the Write-Ahead Log (WAL) first. A checkpoint is the moment Postgres flushes all dirty shared_buffers pages to the actual data files and advances the WAL recycle point. Checkpoints are supposed to happen every checkpoint_timeout seconds (default: 5 min) or when WAL accumulates to max_wal_size — whichever comes first.

When this error fires, max_wal_size is being hit repeatedly before checkpoint_timeout expires. Postgres is being forced into emergency checkpoints on a tight loop. Every checkpoint means a full fsync storm against your data volume.


The Attack Vector / Blast Radius

This is a cascading I/O failure, not a single event.

  1. Write amplification: Each forced checkpoint writes all dirty pages immediately rather than spreading the I/O across the full checkpoint_completion_target window. On a high-write OLTP system, this can saturate disk bandwidth entirely.
  2. WAL segment churn: Postgres recycles WAL files after each checkpoint. Frequent checkpoints mean frequent small WAL segments, which destroys streaming replication throughput and makes logical replication slots fall behind — replica lag spikes.
  3. shared_buffers thrashing: The buffer manager cannot amortize page writes. Cache hit ratio drops as dirty evictions increase. Query latency climbs non-linearly.
  4. Autovacuum contention: Autovacuum workers compete for the same I/O bandwidth during checkpoint storms. Dead tuple bloat accelerates. Table scans slow down.
  5. Connection pile-up: Slow queries hold locks longer. Connection pool exhaustion follows within minutes on busy systems. This is how a checkpoint misconfiguration becomes a full application outage.

How to Fix It

Basic Fix

Increase max_wal_size so checkpoints are driven by checkpoint_timeout instead of WAL volume.

# postgresql.conf

- max_wal_size = 1GB
+ max_wal_size = 4GB

- checkpoint_completion_target = 0.5
+ checkpoint_completion_target = 0.9

- checkpoint_timeout = 5min
+ checkpoint_timeout = 10min

Apply without full restart (requires superuser):

ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET checkpoint_timeout = '10min';
SELECT pg_reload_conf();

⚠️ max_wal_size is a soft limit — Postgres can exceed it briefly. It controls when a checkpoint is requested, not enforced. Setting it too high delays crash recovery time. Size it so checkpoints occur every 5–15 minutes under peak load.

Enterprise Best Practice

Tune based on measured WAL generation rate, not guesswork.

Step 1 — Measure current WAL generation rate:

-- Run twice, 60 seconds apart. Divide delta by interval.
SELECT pg_current_wal_lsn();
-- Wait 60s
SELECT pg_current_wal_lsn();
-- WAL rate (bytes/sec) = (lsn2 - lsn1) / 60

Step 2 — Inspect checkpoint frequency from pg_stat_bgwriter:

SELECT
  checkpoints_timed,
  checkpoints_req,           -- THIS should be near zero
  round(100.0 * checkpoints_req /
    NULLIF(checkpoints_timed + checkpoints_req, 0), 2) AS pct_forced,
  buffers_checkpoint,
  buffers_clean,
  buffers_backend            -- High = backend processes doing their own writes = bad
FROM pg_stat_bgwriter;

checkpoints_req should be < 10% of total checkpoints. If it's above that, max_wal_size is still too small.

Step 3 — Right-size max_wal_size:

# Target: checkpoints driven by checkpoint_timeout, not WAL volume
# Formula: max_wal_size >= (WAL rate bytes/sec) * checkpoint_timeout_seconds * 1.2

- max_wal_size = 1GB
+ max_wal_size = 8GB          # For high-write OLTP: 4–16 GB is normal

- wal_buffers = -1            # auto-sized to 1/32 of shared_buffers, often too small
+ wal_buffers = 64MB          # Cap at 64MB; larger gives no benefit

- checkpoint_completion_target = 0.5
+ checkpoint_completion_target = 0.9   # Spread I/O across 90% of checkpoint interval

- checkpoint_timeout = 5min
+ checkpoint_timeout = 15min           # Longer window = smoother I/O

Step 4 — For RDS/Aurora/Cloud SQL: These parameters are set via parameter groups, not flat files.

# AWS RDS Parameter Group
- max_wal_size: 1024          # MB
+ max_wal_size: 8192

- checkpoint_completion_target: 0.5
+ checkpoint_completion_target: 0.9

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Enforce minimum max_wal_size in config reviews with a shell check:

#!/bin/bash
# ci/check_postgres_config.sh
MAX_WAL=$(grep -E '^max_wal_size' postgresql.conf | awk '{print $3}' | tr -d 'GB')
if [ "${MAX_WAL}" -lt 2 ]; then
  echo "FAIL: max_wal_size is ${MAX_WAL}GB. Minimum required: 2GB"
  exit 1
fi

2. Use pgbadger in your CI pipeline to catch checkpoint warnings before they hit production:

# .github/workflows/db-audit.yml
- name: Parse PostgreSQL logs for checkpoint warnings
  run: |
    pgbadger --quiet --outfile report.html /var/log/postgresql/postgresql.log
    grep -i 'checkpoints are occurring too frequently' /var/log/postgresql/postgresql.log \
      && echo '::error::Checkpoint frequency alert detected' && exit 1 || true

3. Alert on checkpoints_req via Prometheus postgres_exporter:

# prometheus/alerts/postgres.yml
- alert: PostgreSQLForcedCheckpointRatioHigh
  expr: |
    rate(pg_stat_bgwriter_checkpoints_req_total[5m]) /
    (
      rate(pg_stat_bgwriter_checkpoints_req_total[5m]) +
      rate(pg_stat_bgwriter_checkpoints_timed_total[5m])
    ) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Forced checkpoints exceed 10% — increase max_wal_size"

4. Terraform / Helm chart enforcement for RDS:

# terraform/rds_parameter_group.tf
resource "aws_db_parameter_group" "postgres" {
  parameter {
    name  = "max_wal_size"
-   value = "1024"
+   value = "8192"
  }
  parameter {
    name  = "checkpoint_completion_target"
-   value = "0.5"
+   value = "0.9"
  }
}

Run checkov -f terraform/rds_parameter_group.tf with a custom check to enforce these lower bounds before any terraform apply.

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →