Why does the broken pipe error only appear after the connection has been idle for several minutes?

Cloud infrastructure (AWS NAT Gateway, Security Groups, GCP Cloud NAT, Azure Load Balancer) silently drops TCP flows that have been idle past a threshold — typically 350 seconds for AWS SGs and 120 seconds for NAT. The kernel on both ends still thinks the socket is open. The broken pipe surfaces the moment your app tries to write to that dead socket. TCP keepalives prevent this by sending small probe packets during idle periods, keeping the NAT mapping alive and detecting dead connections before a real query hits them.

Does pool_pre_ping / connection test query add significant latency overhead?

In practice, negligible. A `SELECT 1` round-trip on a local VPC is sub-millisecond. The overhead is only paid when a connection is checked out of the pool, not on every query. The cost of one broken pipe error — failed transaction, retry logic, user-facing error — vastly exceeds the cumulative cost of pre-ping checks across the lifetime of the pool.

We use PgBouncer in transaction pooling mode. Does that change the keepalive strategy?

Yes, critically. In transaction pooling mode, the client-to-PgBouncer connection and the PgBouncer-to-PostgreSQL connection are separate TCP sessions. You must configure keepalives on both hops: `tcp_keepalive`, `tcp_keepidle`, `tcp_keepintvl`, and `tcp_keepcnt` in `pgbouncer.ini` cover the PgBouncer→Postgres leg. Your application's connection string keepalive params cover the app→PgBouncer leg. Missing either hop leaves you exposed to partition-induced broken pipes.

How to Fix PostgreSQL 'Broken Pipe' After a Network Partition: Root Cause & Production-Grade Solution

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: A network partition (firewall timeout, NAT table eviction, cloud VPC blip) silently killed the TCP socket. PostgreSQL's server-side process is gone. Your app still holds a stale file descriptor and writes into the void — kernel throws EPIPE, surfaced as could not send data to server: Broken pipe.
How to fix it: Enable TCP keepalives at the connection level, configure tcp_keepalives_idle/interval/count in your connection string or postgresql.conf, and add pre-ping/validation in your connection pool so dead sockets are evicted before a query runs.
Shortcut: Use our Client-Side Sandbox above to auto-refactor your connection config or pool initialization code — paste it in, get corrected output without sending your credentials anywhere.

The Incident (What Does the Error Mean?)

Raw error output:

FATAL: could not send data to server: Broken pipe
SSL connection has been closed unexpectedly
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or after processing the request.

What actually happened: the TCP session was severed — most commonly by an AWS Security Group idle timeout (350s default), a GCP Cloud NAT mapping expiry, an on-prem firewall dropping stateful entries, or a brief AZ-level network event. The PostgreSQL backend process (postgres: user db 10.x.x.x) received a SIGTERM or simply lost the socket. Your application's connection object has no idea. It attempts to flush a query buffer, the kernel finds the socket dead, and raises SIGPIPE / EPIPE. Any transaction in flight is lost. No rollback is guaranteed to have reached the server.

The Attack Vector / Blast Radius

This is not a one-connection problem in production. Here's the cascade:

Connection pool poisoning. PgBouncer, HikariCP, SQLAlchemy pool, or pg pool in Node — all of them cache connections. After a partition event, every idle connection in the pool is stale. The next N requests (where N = pool size) will all fail with broken pipe before the pool's eviction logic kicks in — if it's configured at all.
Silent data loss. If the pipe breaks mid-COPY or mid-multi-statement transaction, the server may have executed partial work. Without synchronous_commit = on and proper client-side retry logic, you can have partial writes with no client-side error until the next flush.
Thundering herd on recovery. Once the network restores, all pool threads simultaneously detect dead connections and attempt reconnect + re-auth at the same instant. Under high concurrency this spikes pg_stat_activity past max_connections, triggering FATAL: sorry, too many clients already.
Monitoring blindspot. APM tools record the broken pipe as an application error, not a network event. Root cause gets misattributed to the app layer. The real fix — keepalives and pool validation — gets skipped.

How to Fix It

Basic Fix — Enable TCP Keepalives in the Connection String

For libpq-based clients (psql, psycopg2, asyncpg, node-postgres), these parameters map directly to kernel SO_KEEPALIVE socket options:

- postgresql://user:pass@db-host:5432/mydb
+ postgresql://user:pass@db-host:5432/mydb?keepalives=1&keepalives_idle=60&keepalives_interval=10&keepalives_count=5

What these do:

keepalives_idle=60 — send first keepalive probe after 60s of inactivity (beats AWS SG 350s timeout and most firewall idle limits)
keepalives_interval=10 — retry probe every 10s
keepalives_count=5 — declare connection dead after 5 missed probes (50s total)

Server-Side — `postgresql.conf`

- #tcp_keepalives_idle = 0
- #tcp_keepalives_interval = 0
- #tcp_keepalives_count = 0
+ tcp_keepalives_idle = 60
+ tcp_keepalives_interval = 10
+ tcp_keepalives_count = 5

No restart required — SELECT pg_reload_conf(); is sufficient.

Enterprise Best Practice — Pool-Level Dead Connection Eviction

HikariCP (Java/Spring Boot):

- hikari.connectionTimeout=30000
- hikari.idleTimeout=600000
- hikari.maxLifetime=1800000
+ hikari.connectionTimeout=30000
+ hikari.idleTimeout=300000
+ hikari.maxLifetime=600000
+ hikari.keepaliveTime=60000
+ hikari.connectionTestQuery=SELECT 1
+ hikari.validationTimeout=5000

SQLAlchemy (Python):

  engine = create_engine(
      DATABASE_URL,
-     pool_size=10,
-     max_overflow=5
+     pool_size=10,
+     max_overflow=5,
+     pool_pre_ping=True,
+     pool_recycle=300,
+     connect_args={
+         "keepalives": 1,
+         "keepalives_idle": 60,
+         "keepalives_interval": 10,
+         "keepalives_count": 5
+     }
  )

pool_pre_ping=True issues a SELECT 1 before handing a connection to the caller. Dead connections are silently replaced. This is the single highest-impact line you can add.

PgBouncer (pgbouncer.ini):

- server_idle_timeout = 600
- client_idle_timeout = 0
- tcp_keepalive = 0
+ server_idle_timeout = 120
+ client_idle_timeout = 300
+ tcp_keepalive = 1
+ tcp_keepcnt = 5
+ tcp_keepidle = 60
+ tcp_keepintvl = 10
+ server_lifetime = 3600

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.

Prevention in CI/CD

1. Checkov — Scan Terraform RDS/VPC for Missing Keepalive Params

Add a custom Checkov policy to flag any aws_db_instance or application config that omits keepalive settings:

# .checkov/custom_policies/pg_keepalive.yaml
metadata:
  name: "PostgreSQL connection must enable TCP keepalives"
  id: "CKV_CUSTOM_PG_001"
  severity: HIGH
definition:
  and:
    - cond_type: "connection"
      resource_types: ["aws_db_parameter_group"]
      attribute: "parameter.tcp_keepalives_idle"
      operator: "exists"

2. OPA/Conftest — Validate App Config at Deploy Time

package postgres.connection

deny[msg] {
    input.pool_pre_ping != true
    msg := "pool_pre_ping must be enabled to survive network partitions"
}

deny[msg] {
    input.pool_recycle > 300
    msg := sprintf("pool_recycle=%d exceeds 300s; connections will outlive firewall idle timeouts", [input.pool_recycle])
}

3. Integration Test — Simulate Partition in CI

Use tc (traffic control) or Toxiproxy in your pipeline to validate retry behavior:

# docker-compose.test.yml — inject Toxiproxy between app and Postgres
# Then in CI:
toxiproxy-cli toxic add pg-proxy --type=timeout --attribute=timeout=0 --downstream
sleep 5
toxiproxy-cli toxic remove pg-proxy
# Assert: app recovered without data loss, no unhandled broken pipe exceptions

4. Alert on `pg_stat_activity` Idle Connection Spike

-- Prometheus query via postgres_exporter
pg_stat_activity_count{state="idle"} > 0.8 * pg_settings_max_connections

This fires before you hit max_connections, giving ops time to drain the pool gracefully rather than reacting to a full outage.