How to Fix PostgreSQL 'Broken Pipe' After a Network Partition: Root Cause & Production-Grade Solution
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: A network partition (firewall timeout, NAT table eviction, cloud VPC blip) silently killed the TCP socket. PostgreSQL's server-side process is gone. Your app still holds a stale file descriptor and writes into the void — kernel throws
EPIPE, surfaced ascould not send data to server: Broken pipe. - How to fix it: Enable TCP keepalives at the connection level, configure
tcp_keepalives_idle/interval/countin your connection string orpostgresql.conf, and add pre-ping/validation in your connection pool so dead sockets are evicted before a query runs. - Shortcut: Use our Client-Side Sandbox above to auto-refactor your connection config or pool initialization code — paste it in, get corrected output without sending your credentials anywhere.
The Incident (What Does the Error Mean?)
Raw error output:
FATAL: could not send data to server: Broken pipe
SSL connection has been closed unexpectedly
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or after processing the request.
What actually happened: the TCP session was severed — most commonly by an AWS Security Group idle timeout (350s default), a GCP Cloud NAT mapping expiry, an on-prem firewall dropping stateful entries, or a brief AZ-level network event. The PostgreSQL backend process (postgres: user db 10.x.x.x) received a SIGTERM or simply lost the socket. Your application's connection object has no idea. It attempts to flush a query buffer, the kernel finds the socket dead, and raises SIGPIPE / EPIPE. Any transaction in flight is lost. No rollback is guaranteed to have reached the server.
The Attack Vector / Blast Radius
This is not a one-connection problem in production. Here's the cascade:
- Connection pool poisoning. PgBouncer, HikariCP, SQLAlchemy pool, or
pgpool in Node — all of them cache connections. After a partition event, every idle connection in the pool is stale. The next N requests (where N = pool size) will all fail with broken pipe before the pool's eviction logic kicks in — if it's configured at all. - Silent data loss. If the pipe breaks mid-
COPYor mid-multi-statement transaction, the server may have executed partial work. Withoutsynchronous_commit = onand proper client-side retry logic, you can have partial writes with no client-side error until the next flush. - Thundering herd on recovery. Once the network restores, all pool threads simultaneously detect dead connections and attempt reconnect + re-auth at the same instant. Under high concurrency this spikes
pg_stat_activitypastmax_connections, triggeringFATAL: sorry, too many clients already. - Monitoring blindspot. APM tools record the broken pipe as an application error, not a network event. Root cause gets misattributed to the app layer. The real fix — keepalives and pool validation — gets skipped.
How to Fix It
Basic Fix — Enable TCP Keepalives in the Connection String
For libpq-based clients (psql, psycopg2, asyncpg, node-postgres), these parameters map directly to kernel SO_KEEPALIVE socket options:
- postgresql://user:pass@db-host:5432/mydb
+ postgresql://user:pass@db-host:5432/mydb?keepalives=1&keepalives_idle=60&keepalives_interval=10&keepalives_count=5
What these do:
keepalives_idle=60— send first keepalive probe after 60s of inactivity (beats AWS SG 350s timeout and most firewall idle limits)keepalives_interval=10— retry probe every 10skeepalives_count=5— declare connection dead after 5 missed probes (50s total)
Server-Side — postgresql.conf
- #tcp_keepalives_idle = 0
- #tcp_keepalives_interval = 0
- #tcp_keepalives_count = 0
+ tcp_keepalives_idle = 60
+ tcp_keepalives_interval = 10
+ tcp_keepalives_count = 5
No restart required — SELECT pg_reload_conf(); is sufficient.
Enterprise Best Practice — Pool-Level Dead Connection Eviction
HikariCP (Java/Spring Boot):
- hikari.connectionTimeout=30000
- hikari.idleTimeout=600000
- hikari.maxLifetime=1800000
+ hikari.connectionTimeout=30000
+ hikari.idleTimeout=300000
+ hikari.maxLifetime=600000
+ hikari.keepaliveTime=60000
+ hikari.connectionTestQuery=SELECT 1
+ hikari.validationTimeout=5000
SQLAlchemy (Python):
engine = create_engine(
DATABASE_URL,
- pool_size=10,
- max_overflow=5
+ pool_size=10,
+ max_overflow=5,
+ pool_pre_ping=True,
+ pool_recycle=300,
+ connect_args={
+ "keepalives": 1,
+ "keepalives_idle": 60,
+ "keepalives_interval": 10,
+ "keepalives_count": 5
+ }
)
pool_pre_ping=True issues a SELECT 1 before handing a connection to the caller. Dead connections are silently replaced. This is the single highest-impact line you can add.
PgBouncer (pgbouncer.ini):
- server_idle_timeout = 600
- client_idle_timeout = 0
- tcp_keepalive = 0
+ server_idle_timeout = 120
+ client_idle_timeout = 300
+ tcp_keepalive = 1
+ tcp_keepcnt = 5
+ tcp_keepidle = 60
+ tcp_keepintvl = 10
+ server_lifetime = 3600
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Checkov — Scan Terraform RDS/VPC for Missing Keepalive Params
Add a custom Checkov policy to flag any aws_db_instance or application config that omits keepalive settings:
# .checkov/custom_policies/pg_keepalive.yaml
metadata:
name: "PostgreSQL connection must enable TCP keepalives"
id: "CKV_CUSTOM_PG_001"
severity: HIGH
definition:
and:
- cond_type: "connection"
resource_types: ["aws_db_parameter_group"]
attribute: "parameter.tcp_keepalives_idle"
operator: "exists"
2. OPA/Conftest — Validate App Config at Deploy Time
package postgres.connection
deny[msg] {
input.pool_pre_ping != true
msg := "pool_pre_ping must be enabled to survive network partitions"
}
deny[msg] {
input.pool_recycle > 300
msg := sprintf("pool_recycle=%d exceeds 300s; connections will outlive firewall idle timeouts", [input.pool_recycle])
}
3. Integration Test — Simulate Partition in CI
Use tc (traffic control) or Toxiproxy in your pipeline to validate retry behavior:
# docker-compose.test.yml — inject Toxiproxy between app and Postgres
# Then in CI:
toxiproxy-cli toxic add pg-proxy --type=timeout --attribute=timeout=0 --downstream
sleep 5
toxiproxy-cli toxic remove pg-proxy
# Assert: app recovered without data loss, no unhandled broken pipe exceptions
4. Alert on pg_stat_activity Idle Connection Spike
-- Prometheus query via postgres_exporter
pg_stat_activity_count{state="idle"} > 0.8 * pg_settings_max_connections
This fires before you hit max_connections, giving ops time to drain the pool gracefully rather than reacting to a full outage.