How to Fix PostgreSQL 'could not receive data from client: Connection timed out' – Root Cause & Production Fix
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: A PostgreSQL backend process was waiting for the next data packet from the client when the TCP connection went silent for longer than the idle timeout enforced by a NAT gateway, load balancer, or the kernel. The OS eventually returned ETIMEDOUT, and PostgreSQL logged FATAL and killed the session, rolling back any open transaction.
- How to fix it: Tune tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count in postgresql.conf (or the client DSN) so keepalive probes fire before the intermediate device drops the flow; at the same time, raise the LB idle timeout to match.
The Incident (What does the error mean?)
Raw server log:
2024-03-15 03:17:42 UTC [28901]: FATAL: could not receive data from client: Connection timed out
2024-03-15 03:17:42 UTC [28901]: DETAIL: Client connection lost while waiting for data.
PostgreSQL's backend process called recv() on the client socket. The kernel waited, received no ACK or data, exhausted its retransmit window, and returned ETIMEDOUT. PostgreSQL treats this as unrecoverable: the backend exits immediately, the transaction is rolled back, any advisory locks are released, and the application gets a broken pipe on its next write. Under connection pooling (PgBouncer, RDS Proxy) this also poisons the pooled socket, triggering a pool drain cycle.
The Attack Vector / Blast Radius
This is a silent infrastructure mismatch, not a code bug, which makes it far more dangerous in production:
- NAT/Load Balancer idle timeout is shorter than your longest query. AWS NLB default: 350 s. AWS ALB: 60 s. GCP Cloud SQL Proxy: 600 s. If a VACUUM ANALYZE, a bulk INSERT, or a reporting query runs longer, the LB silently drops the flow. PostgreSQL's backend has no idea until it tries to send the result — by then the client is gone.
- Cascading pool exhaustion. PgBouncer holds the server-side connection open. The broken client socket is not detected until the next recv(). Meanwhile the pool slot is occupied. Under load, all pool slots fill with zombie connections → new queries queue → application latency spikes → timeouts cascade.
- Data integrity risk. Any transaction that was mid-flight (multi-statement, COPY, two-phase commit) is rolled back. If the application lacks proper retry logic with idempotency keys, it may re-submit partial writes → duplicate rows or constraint violations.
- Invisible in APM. The error surfaces in PostgreSQL logs, not in application error rates, because the TCP drop happens below the application layer. Teams spend hours blaming slow queries before checking pg_log.
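The data integrity risk above comes down to retries that are not idempotent. The following is a minimal sketch of the idea, not a production pattern. It assumes a hypothetical payments table and uses Python's built-in sqlite3 as a stand-in for PostgreSQL, where the equivalent statement would be INSERT ... ON CONFLICT DO NOTHING. The function names and the simulated dropped acknowledgement are illustrative only.

```python
import sqlite3
import uuid


def make_db():
    """Create an in-memory database with a hypothetical payments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        "idempotency_key TEXT PRIMARY KEY, "
        "amount_cents INTEGER NOT NULL)"
    )
    return conn


def record_payment(conn, key, amount_cents):
    """Insert the payment once; a repeat with the same key is a no-op."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR IGNORE INTO payments (idempotency_key, amount_cents) "
            "VALUES (?, ?)",
            (key, amount_cents),
        )


def submit_with_retry(conn, amount_cents, attempts=3, drop_first_ack=True):
    """Retry the same logical write, reusing one idempotency key throughout."""
    key = str(uuid.uuid4())  # generated once, before the first attempt
    for attempt in range(attempts):
        try:
            record_payment(conn, key, amount_cents)
            if drop_first_ack and attempt == 0:
                # Simulated: the write committed, but the reply never arrived.
                raise ConnectionError("connection dropped before the ack")
            return key
        except ConnectionError:
            continue
    raise RuntimeError("gave up after repeated connection errors")


if __name__ == "__main__":
    db = make_db()
    submit_with_retry(db, 500)
    print(db.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # prints 1
```

Because the key is created before the first attempt and reused on every retry, the second attempt cannot create a duplicate row even though the first one had already committed.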
How to Fix It
Layer 1 — PostgreSQL Server: Enable TCP Keepalives
File: /etc/postgresql/15/main/postgresql.conf
- #tcp_keepalives_idle = 0
- #tcp_keepalives_interval = 0
- #tcp_keepalives_count = 0
+ tcp_keepalives_idle = 60 # send first probe after 60 s of silence
+ tcp_keepalives_interval = 10 # retransmit probe every 10 s
+ tcp_keepalives_count = 6 # drop after 6 missed probes (60+60s = 120s total)
No restart required — SELECT pg_reload_conf(); applies these live. These map directly to TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT socket options on the backend socket.
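The same socket options can be exercised directly, which may help when checking what a given platform supports. This is a rough Python sketch, not PostgreSQL's own code: the helper name enable_keepalive is hypothetical, and the per-socket timing constants are only applied where Python's socket module exposes them (they are present on Linux; other platforms may differ).

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Enable TCP keepalive and apply per-socket timing where available.

    Returns a dict of the options that were actually set, so callers can
    see which ones the current platform supports.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}
    timing = (
        ("TCP_KEEPIDLE", idle),       # seconds of silence before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in timing:
        option = getattr(socket, name, None)  # constant may be missing
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(enable_keepalive(s))
```

Returning the applied options makes it easy to spot a platform where only SO_KEEPALIVE took effect and the kernel-wide defaults are still governing the timing.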
Layer 2 — Kernel: Harden System-Wide TCP Keepalive
File: /etc/sysctl.d/99-postgres-keepalive.conf
- # (defaults: idle=7200, interval=75, count=9)
+ net.ipv4.tcp_keepalive_time = 60
+ net.ipv4.tcp_keepalive_intvl = 10
+ net.ipv4.tcp_keepalive_probes = 6
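Apply with sysctl -p /etc/sysctl.d/99-postgres-keepalive.conf. These values are the fallback timing for any socket that enables SO_KEEPALIVE without setting its own per-socket options; they do not turn keepalive on for sockets that never request it.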
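To see what the tuned values buy you, the worst-case time to notice a dead peer is roughly the idle time plus the probe interval multiplied by the probe count. A small sketch of that arithmetic (the function name is illustrative):

```python
def worst_case_detection_seconds(idle, interval, count):
    """Approximate seconds until a silent, dead peer is declared lost."""
    return idle + interval * count


if __name__ == "__main__":
    # Kernel defaults quoted above: idle=7200, interval=75, count=9
    print(worst_case_detection_seconds(7200, 75, 9))  # prints 7875
    # Tuned values from this article: idle=60, interval=10, count=6
    print(worst_case_detection_seconds(60, 10, 6))    # prints 120
```

With the defaults, a dead connection can linger for more than two hours; with the tuned values it is cleaned up in about two minutes.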
Layer 3 — Load Balancer: Match or Exceed Query Duration
| Platform | Setting | Recommended Value |
|---|---|---|
| AWS NLB | Listener attribute tcp.idle_timeout.seconds | 3600 s |
| AWS RDS Proxy | --idle-client-timeout | 1800 s |
| GCP TCP LB | Backend service timeoutSec | 3600 s |
| HAProxy | timeout client / timeout server | 1h |
HAProxy example:
defaults
- timeout client 1m
- timeout server 1m
+ timeout client 1h
+ timeout server 1h
+ timeout tunnel 1h
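Whether a keepalive setting is safe depends on the idle timeout of every device in the path: the first probe has to fire before the shortest timeout expires. The sketch below is a rough check using the default timeouts quoted earlier in this article. The dictionary and function name are illustrative, and real deployments should substitute their own measured values.

```python
# Default idle timeouts in seconds, as quoted earlier in this article.
LB_IDLE_TIMEOUTS = {
    "AWS NLB": 350,
    "AWS ALB": 60,
    "GCP Cloud SQL Proxy": 600,
}


def devices_that_drop_first(keepalive_idle, lb_idle_timeouts=LB_IDLE_TIMEOUTS):
    """Return devices whose idle timeout expires at or before the first probe."""
    return sorted(
        name
        for name, timeout in lb_idle_timeouts.items()
        if keepalive_idle >= timeout
    )


if __name__ == "__main__":
    print(devices_that_drop_first(7200))  # kernel default: every device listed
    print(devices_that_drop_first(60))    # prints ['AWS ALB']
    print(devices_that_drop_first(30))    # prints []
```

A keepalive idle of 60 s only ties with a 60 s device timeout, so it is treated as unsafe here; leaving some margin below the shortest timeout is the more conservative choice.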
Layer 4 — Application / Connection Pool: Client-Side DSN Options
Python (psycopg2/psycopg3):
- conn = psycopg2.connect("host=db dbname=app user=app password=secret")
+ conn = psycopg2.connect(
+ "host=db dbname=app user=app password=secret"
+ " keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=6"
+ )
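When many services build their own connection strings, a small helper can append the keepalive options consistently. This is a naive sketch with a hypothetical with_keepalives function: it assumes a space-separated key=value DSN with no quoted values containing spaces, and it leaves any keepalive option that is already present untouched.

```python
KEEPALIVE_OPTIONS = {
    "keepalives": 1,
    "keepalives_idle": 60,
    "keepalives_interval": 10,
    "keepalives_count": 6,
}


def with_keepalives(dsn, options=KEEPALIVE_OPTIONS):
    """Append missing keepalive options to a key=value DSN string."""
    # Naive parsing: assumes no quoted values that contain spaces.
    present = {part.split("=", 1)[0] for part in dsn.split() if "=" in part}
    extra = [f"{key}={value}" for key, value in options.items() if key not in present]
    return " ".join([dsn.strip()] + extra)


if __name__ == "__main__":
    print(with_keepalives("host=db dbname=app user=app"))
```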
JDBC (Java):
- jdbc:postgresql://db:5432/app
+ jdbc:postgresql://db:5432/app?tcpKeepAlive=true&socketTimeout=300
PgBouncer pgbouncer.ini:
- tcp_keepalive = 0
- tcp_keepcnt = 0
- tcp_keepidle = 0
- tcp_keepintvl = 0
+ tcp_keepalive = 1
+ tcp_keepcnt = 6
+ tcp_keepidle = 60
+ tcp_keepintvl = 10
+ server_idle_timeout = 600
+ client_idle_timeout = 0
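A quick way to confirm that a pgbouncer.ini carries these settings is to parse it and list anything unset or zero. The sketch below uses Python's standard configparser and assumes the settings live in the [pgbouncer] section with plain integer values and no inline comments. The function name is hypothetical, and it deliberately requires every key to be set explicitly instead of relying on PgBouncer's built-in defaults.

```python
import configparser

REQUIRED_NONZERO = ("tcp_keepalive", "tcp_keepcnt", "tcp_keepidle", "tcp_keepintvl")


def pgbouncer_keepalive_problems(ini_text):
    """Return a list of keepalive settings that are missing or set to 0."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section("pgbouncer"):
        return ["missing [pgbouncer] section"]
    section = parser["pgbouncer"]
    problems = []
    for key in REQUIRED_NONZERO:
        raw = section.get(key, "0").strip()
        # Anything that is not a positive integer counts as a problem.
        if not raw.isdigit() or int(raw) == 0:
            problems.append(f"{key} is unset or 0")
    return problems


if __name__ == "__main__":
    sample = "[pgbouncer]\ntcp_keepalive = 1\ntcp_keepidle = 0\n"
    for problem in pgbouncer_keepalive_problems(sample):
        print(problem)
```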
Enterprise Best Practice
- Never rely on a single keepalive layer. Set it at kernel, PostgreSQL, PgBouncer, and the application DSN. Defense-in-depth means even if one layer is reset by a platform upgrade, the others hold.
- Set statement_timeout and idle_in_transaction_session_timeout to bound runaway sessions independently of TCP:
+ idle_in_transaction_session_timeout = '10min'
+ statement_timeout = '30min'
- Use pg_stat_activity to monitor state_change timestamps on idle in transaction connections before they become zombie sockets.
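The pg_stat_activity check can be sketched as a query plus a small filter. The query string below uses real pg_stat_activity columns (pid, state, state_change), but how it is executed is left out. The stale_sessions function and the 5-minute threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Query to run against PostgreSQL with whatever driver the project uses.
IDLE_IN_TX_QUERY = """
SELECT pid, state, state_change
FROM pg_stat_activity
WHERE state = 'idle in transaction'
"""


def stale_sessions(rows, now, max_idle=timedelta(minutes=5)):
    """Return pids that have sat idle in transaction longer than max_idle.

    rows is an iterable of (pid, state, state_change) tuples, where
    state_change is a timezone-aware datetime.
    """
    return [
        pid
        for pid, state, state_change in rows
        if state == "idle in transaction" and now - state_change > max_idle
    ]


if __name__ == "__main__":
    now = datetime(2024, 3, 15, 3, 17, 42, tzinfo=timezone.utc)
    sample = [
        (101, "idle in transaction", now - timedelta(minutes=12)),
        (102, "idle in transaction", now - timedelta(seconds=30)),
        (103, "active", now - timedelta(hours=1)),
    ]
    print(stale_sessions(sample, now))  # prints [101]
```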
Prevention in CI/CD
1. Checkov — Flag Missing Keepalive in RDS Parameter Groups
Write a custom Checkov check (CKV_CUSTOM_PG_KEEPALIVE) that asserts tcp_keepalives_idle is set and non-zero in any aws_db_parameter_group Terraform resource.
resource "aws_db_parameter_group" "postgres" {
family = "postgres15"
+ parameter {
+ name = "tcp_keepalives_idle"
+ value = "60"
+ }
+ parameter {
+ name = "tcp_keepalives_interval"
+ value = "10"
+ }
+ parameter {
+ name = "tcp_keepalives_count"
+ value = "6"
+ }
}
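The core of such a custom check is a small predicate over the parsed resource. The sketch below shows only that predicate, so it runs without Checkov installed. In a real check it would presumably be called from the scan_resource_conf method of a BaseResourceCheck subclass. The shape of conf (attribute values wrapped in single-element lists) is an assumption about how Checkov parses Terraform and should be verified against your Checkov version.

```python
def _unwrap(value):
    """Strip single-element list wrappers, e.g. ['60'] -> '60'."""
    while isinstance(value, list) and len(value) == 1:
        value = value[0]
    return value


def keepalive_idle_is_set(conf):
    """Return True if tcp_keepalives_idle is present with a non-zero value.

    conf is the parsed aws_db_parameter_group resource, assumed to look like
    {'family': ['postgres15'], 'parameter': [{'name': [...], 'value': [...]}]}.
    """
    for param in conf.get("parameter", []):
        param = _unwrap(param)
        if not isinstance(param, dict):
            continue
        if _unwrap(param.get("name")) == "tcp_keepalives_idle":
            value = str(_unwrap(param.get("value", "0")))
            return value.isdigit() and int(value) > 0
    return False


if __name__ == "__main__":
    resource = {
        "family": ["postgres15"],
        "parameter": [{"name": ["tcp_keepalives_idle"], "value": ["60"]}],
    }
    print(keepalive_idle_is_set(resource))  # prints True
```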
2. OPA / Conftest Policy — Enforce LB Idle Timeout
# policy/lb_timeout.rego
package terraform.aws.lb
deny[msg] {
res := input.resource.aws_lb_target_group[name]
not res.config.deregistration_delay
msg := sprintf("LB target group '%v' missing explicit deregistration_delay — verify idle timeout alignment with PostgreSQL keepalive", [name])
}
3. GitHub Actions — Lint postgresql.conf in PRs
- name: Validate postgresql.conf keepalive settings
  run: |
    grep -E '^tcp_keepalives_idle\s*=\s*[1-9]' postgresql.conf || \
      (echo '::error::tcp_keepalives_idle not set' && exit 1)
    grep -E '^tcp_keepalives_interval\s*=\s*[1-9]' postgresql.conf || \
      (echo '::error::tcp_keepalives_interval not set' && exit 1)
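If the lint needs to cover all three settings and ignore commented-out lines, a short script can replace the grep calls. This is a sketch only: the function name is hypothetical, and it assumes plain integer values in seconds, so a value written with a unit such as 1min is not interpreted correctly.

```python
import re

# Matches an active (uncommented) keepalive setting with an integer value.
SETTING_RE = re.compile(
    r"^\s*(tcp_keepalives_(?:idle|interval|count))\s*=?\s*'?(\d+)"
)

REQUIRED = ("tcp_keepalives_idle", "tcp_keepalives_interval", "tcp_keepalives_count")


def missing_keepalive_settings(conf_text):
    """Return the required settings that are absent or set to 0."""
    values = {}
    for line in conf_text.splitlines():
        match = SETTING_RE.match(line)
        if match:
            # A later line overrides an earlier one for the same setting.
            values[match.group(1)] = int(match.group(2))
    return [name for name in REQUIRED if values.get(name, 0) == 0]


if __name__ == "__main__":
    sample = (
        "#tcp_keepalives_idle = 0\n"
        "tcp_keepalives_idle = 60   # first probe after 60 s\n"
        "tcp_keepalives_interval = 10\n"
    )
    print(missing_keepalive_settings(sample))  # prints ['tcp_keepalives_count']
```

In a CI job, a non-empty result would be the signal to fail the build.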
4. Alerting — Detect the Error Before Users Do
# Promtail pipeline stages (Loki) that count matching log lines
- match:
    selector: '{job="postgresql"}'
    stages:
      - regex:
          # Promtail's regex stage needs a named capture group
          expression: '(?P<client_timeout>FATAL.*could not receive data from client.*Connection timed out)'
      - metrics:
          pg_client_timeout_total:
            type: Counter
            description: PostgreSQL client receive timeout events
            source: client_timeout
            config:
              action: inc
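Alert when the counter rises by more than 5 within a 5-minute window, for example increase(promtail_custom_pg_client_timeout_total[5m]) > 5 (Promtail prefixes custom metrics with promtail_custom_ by default), and page on-call before pool exhaustion cascades.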
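The alert condition is a sliding-window count, which can be prototyped against a log file before wiring up Loki. The sketch below is illustrative and not part of any monitoring product: the TimeoutAlert class is hypothetical, timestamps are plain seconds, and the pattern is the same expression used in the pipeline above.

```python
import re
from collections import deque

PATTERN = re.compile(
    r"FATAL.*could not receive data from client.*Connection timed out"
)


class TimeoutAlert:
    """Fire when more than `threshold` matching lines fall inside the window."""

    def __init__(self, threshold=5, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()  # timestamps of matching lines, oldest first

    def observe(self, timestamp, line):
        """Record one log line; return True if the alert should fire."""
        if PATTERN.search(line):
            self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.threshold


if __name__ == "__main__":
    alert = TimeoutAlert()
    line = "FATAL: could not receive data from client: Connection timed out"
    fired = [alert.observe(t, line) for t in range(6)]
    print(fired[-1])  # prints True: six events within five minutes
```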