How to Fix PostgreSQL 'Too Many Clients Already' Error: max_connections Exceeded
Threat/Impact Level: CRITICAL | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: PostgreSQL hit its
max_connectionshard limit (default: 100). Every new connection attempt is immediately rejected with a fatal error, taking down your entire application. - How to fix it: Deploy a connection pooler (PgBouncer in
transactionmode) in front of Postgres. Do NOT just bumpmax_connections— that trades one outage for a slower memory-exhaustion death. - Fast path: Use our Client-Side Sandbox above to paste your
postgresql.confor app pool config and auto-generate the corrected PgBouncer and Postgres settings.
The Incident (What Does This Error Mean?)
Raw error output:
FATAL: sorry, too many clients already
SQLSTATE: 53300 (too_many_connections)
PostgreSQL enforces max_connections at the kernel level. When the count of active + idle client connections reaches the ceiling, the postmaster process refuses the TCP handshake entirely. There is no queue. There is no retry. The connection is dead on arrival. Every API request that needs a DB connection now returns a 500. Your app is down.
Reserved slots make it worse: PostgreSQL reserves superuser_reserved_connections (default: 3) slots exclusively for superuser logins. Your application's effective ceiling is max_connections - superuser_reserved_connections = 97 usable connections on a default install.
The Attack Vector / Blast Radius
This is a cascading resource exhaustion failure, not a one-shot event.
- Connection leak in app code (missing
finallyblock, ORM misconfiguration, forgottenclient.release()) slowly consumes slots over hours. - A traffic spike — a deploy, a cron job, a retry storm — consumes the remaining slots in seconds.
- Every microservice sharing this Postgres instance is now blind. Auth service down. Order service down. Read replicas don't help because the write primary is the one rejecting connections.
- Retry storms amplify the problem. Your load balancer's health checks fail → it marks the app unhealthy → orchestrator restarts pods → each new pod opens a fresh connection pool → you now have 10× the connections you had before the restart. This is the death spiral.
Memory consequence of blindly raising max_connections: Each Postgres backend process consumes ~5–10 MB of RAM. Setting max_connections=1000 on a 4 GB RDS instance allocates up to 10 GB for backends alone before a single query runs. You will OOM the instance and trigger a harder crash.
How to Fix It
Basic Fix: Raise the Ceiling (Buys Time Only)
Edit postgresql.conf and reload. This is a temporary mitigation, not a solution.
- max_connections = 100
+ max_connections = 200
- shared_buffers = 128MB
+ shared_buffers = 512MB
Then reload without a full restart:
SELECT pg_reload_conf();
# or
pg_ctlcluster 14 main reload
⚠️ Do not skip adjusting
shared_bufferswhen raisingmax_connections. The rule of thumb:shared_buffers= 25% of total RAM. Mismatched values cause buffer pool contention.
Enterprise Best Practice: PgBouncer in Transaction Pooling Mode
This is the correct fix. PgBouncer sits between your app and Postgres. Your app thinks it has 1000 connections. Postgres sees 20.
pgbouncer.ini — before (no pooler, direct connections):
- # No pooler. App connects directly to Postgres on port 5432.
- # 50 app instances × 5 pool_size = 250 connections → FATAL
pgbouncer.ini — after:
+ [databases]
+ myappdb = host=127.0.0.1 port=5432 dbname=myappdb
+
+ [pgbouncer]
+ listen_port = 6432
+ listen_addr = 0.0.0.0
+ auth_type = scram-sha-256
+ auth_file = /etc/pgbouncer/userlist.txt
+
+ # Transaction mode: connection returned to pool after each transaction.
+ # Required for most ORMs (Django, SQLAlchemy, ActiveRecord).
+ pool_mode = transaction
+
+ # Max connections PgBouncer holds open TO Postgres
+ default_pool_size = 20
+
+ # Max clients that can connect TO PgBouncer (your app-facing limit)
+ max_client_conn = 1000
+
+ # Queue clients when pool is exhausted instead of rejecting
+ reserve_pool_size = 5
+ reserve_pool_timeout = 3
+
+ server_idle_timeout = 600
+ client_idle_timeout = 0
Application pool config (Node.js pg / knex) — after:
const pool = new Pool({
host: 'pgbouncer-host',
- port: 5432,
+ port: 6432,
- max: 50,
+ max: 10, // per-pod; PgBouncer multiplexes across all pods
+ idleTimeoutMillis: 10000,
+ connectionTimeoutMillis: 3000,
});
postgresql.conf — tuned to match:
- max_connections = 100
+ max_connections = 50 # PgBouncer + superuser + monitoring agents only
+
- shared_buffers = 128MB
+ shared_buffers = 1GB # 25% of RAM on an 8GB instance
+
+ effective_cache_size = 3GB
+ work_mem = 16MB
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Diagnose Active Connections Right Now
Run these queries on the Postgres primary before you change anything:
-- Who is eating your connections?
SELECT client_addr, usename, application_name, state, count(*)
FROM pg_stat_activity
GROUP BY 1,2,3,4
ORDER BY count(*) DESC;
-- Are idle connections the culprit?
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
-- Kill idle connections older than 10 minutes (emergency drain)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < NOW() - INTERVAL '10 minutes'
AND pid <> pg_backend_pid();
Prevention in CI/CD
1. Terraform / IaC: Enforce max_connections and require a pooler
# terraform/rds.tf
resource "aws_db_parameter_group" "postgres" {
family = "postgres15"
- # No max_connections override — inherits dangerous default
+ parameter {
+ name = "max_connections"
+ value = "LEAST({DBInstanceClassMemory/9531392}, 5000)"
+ apply_method = "pending-reboot"
+ }
}
2. Checkov policy: Block any Postgres deployment without a sidecar pooler label
# .checkov/custom_checks/require_pgbouncer.yaml
metadata:
name: "CKV_CUSTOM_PG_001"
id: "CKV_CUSTOM_PG_001"
category: "GENERAL_SECURITY"
definition:
and:
- cond_type: "attribute"
resource_types: ["kubernetes_deployment"]
attribute: "spec.template.metadata.labels.pgbouncer-enabled"
operator: "equals"
value: "true"
3. Alert before you hit the wall (Prometheus + Alertmanager)
# prometheus/alerts/postgres.yml
groups:
- name: postgres_connections
rules:
- alert: PostgresConnectionsHigh
expr: |
sum(pg_stat_activity_count) by (instance)
/ pg_settings_max_connections by (instance) > 0.75
for: 2m
labels:
severity: warning
annotations:
summary: "Postgres connection usage above 75% on {{ $labels.instance }}"
- alert: PostgresConnectionsCritical
expr: |
sum(pg_stat_activity_count) by (instance)
/ pg_settings_max_connections by (instance) > 0.90
for: 1m
labels:
severity: critical
4. Integration test gate: Fail the pipeline if pool_size × replica_count > max_connections × 0.8
Add a pre-deploy script that reads your Helm values and your RDS parameter group and fails the pipeline with a non-zero exit if the math doesn't work. This catches the problem at PR time, not at 3 AM.