How to Fix PostgreSQL 'could not receive data from client: Connection timed out' – Root Cause & Production Fix

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

  • What broke: A PostgreSQL backend was waiting for the next packet from the client while the TCP connection sat silent longer than the idle timeout enforced by a NAT gateway, load balancer, or the kernel. The OS eventually returned ETIMEDOUT, and PostgreSQL logged the error and terminated the session, rolling back any open transaction.
  • How to fix it: Tune tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count inside postgresql.conf (or the client DSN) so keepalive probes fire before the intermediate device drops the flow; simultaneously raise or match the LB idle timeout.

The Incident (What does the error mean?)

Raw server log:

2024-03-15 03:17:42 UTC [28901]: FATAL:  could not receive data from client: Connection timed out
2024-03-15 03:17:42 UTC [28901]: DETAIL:  Client connection lost while waiting for data.

PostgreSQL's backend process called recv() on the client socket. The kernel waited, received no ACK or data, exhausted its retransmit window, and returned ETIMEDOUT. PostgreSQL treats this as unrecoverable: the backend exits immediately, the open transaction is rolled back, any session-level advisory locks are released, and the application gets a broken pipe on its next write. Behind a connection pooler (PgBouncer, RDS Proxy), the dead connection also occupies a pool slot until the pooler detects and discards it.


The Attack Vector / Blast Radius

This is a silent infrastructure mismatch, not a code bug, which makes it far more dangerous in production:

  1. NAT/Load Balancer idle timeout is shorter than your longest query. Commonly cited defaults are 350 s for AWS NLB, 60 s for AWS ALB, and 600 s for GCP Cloud SQL Proxy; check your provider's current documentation. If a VACUUM ANALYZE, a bulk INSERT, or a reporting query keeps the connection silent for longer than that, the LB silently drops the flow. PostgreSQL's backend has no idea until it tries to send the result, and by then the client is gone.
  2. Cascading pool exhaustion. PgBouncer holds the server-side connection open. The broken client socket is not detected until the next recv(). Meanwhile the pool slot is occupied. Under load, all pool slots fill with zombie connections → new queries queue → application latency spikes → timeouts cascade.
  3. Data integrity risk. Any transaction that was mid-flight (multi-statement, COPY, two-phase commit) is rolled back. If the application lacks proper retry logic with idempotency keys, it may re-submit partial writes → duplicate rows or constraint violations.
  4. Invisible in APM. The error surfaces in PostgreSQL logs, not in application error rates, because the TCP drop happens below the application layer. Teams spend hours blaming slow queries before checking pg_log.
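Item 3 depends on retries being safe to repeat. The following is a minimal Python sketch of that idea, not a drop-in implementation. It assumes a hypothetical `operation` callable that accepts an idempotency key and raises `ConnectionError` when the session is dropped; with a real driver you would catch its own error class instead (for example `psycopg2.OperationalError`). The names `run_with_retry` and `operation` are illustrative only.

```python
import time
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    operation: Callable[[str], T],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Run operation(key), retrying on ConnectionError with the SAME key.

    The key is generated once, so a write that did commit before the
    connection dropped can be recognised and skipped on the database side
    (for example through a unique constraint on the key column).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    key = str(uuid.uuid4())  # one idempotency key shared by every attempt
    for attempt in range(1, attempts + 1):
        try:
            return operation(key)
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")
```

Generating the key outside the loop is the important design choice: a fresh key per attempt would defeat deduplication and reintroduce the duplicate-row risk described above.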

How to Fix It

Layer 1 — PostgreSQL Server: Enable TCP Keepalives

File: /etc/postgresql/15/main/postgresql.conf

- #tcp_keepalives_idle = 0
- #tcp_keepalives_interval = 0
- #tcp_keepalives_count = 0
+ tcp_keepalives_idle = 60        # send first probe after 60 s of silence
+ tcp_keepalives_interval = 10   # retransmit probe every 10 s
+ tcp_keepalives_count = 6       # drop after 6 missed probes (60 s idle + 6 × 10 s = 120 s total)

No restart required — SELECT pg_reload_conf(); applies these live. These map directly to TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT socket options on the backend socket.
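The arithmetic behind these three settings can be checked with a few lines of Python. This is a sketch under a simplifying assumption: the first probe is sent after `idle` seconds of silence and the peer is declared dead after `count` further unanswered probes spaced `interval` seconds apart. The function names are illustrative.

```python
def keepalive_timeline(idle: int, interval: int, count: int) -> tuple[int, int]:
    """Return (seconds until first probe, seconds until a dead peer is declared)."""
    return idle, idle + interval * count


def survives_idle_timeout(idle: int, lb_idle_timeout: int) -> bool:
    """True when the first probe is sent before the middlebox forgets the flow."""
    return 0 < idle < lb_idle_timeout


first_probe, dead_after = keepalive_timeline(idle=60, interval=10, count=6)
print(first_probe, dead_after)           # 60 120
print(survives_idle_timeout(60, 350))    # True  (350 s NLB default cited above)
print(survives_idle_timeout(60, 60))     # False (probe would race a 60 s timeout)
```

The second check is the one that matters for this incident: the idle value has to be strictly below the shortest idle timeout on the path, otherwise the flow can be dropped before the first probe refreshes it.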

Layer 2 — Kernel: Harden System-Wide TCP Keepalive

File: /etc/sysctl.d/99-postgres-keepalive.conf

- # (defaults: idle=7200, interval=75, count=9)
+ net.ipv4.tcp_keepalive_time = 60
+ net.ipv4.tcp_keepalive_intvl = 10
+ net.ipv4.tcp_keepalive_probes = 6

Apply with sysctl -p /etc/sysctl.d/99-postgres-keepalive.conf. These kernel values are the default timings for any socket that enables SO_KEEPALIVE without setting its own; they do not turn keepalive on for sockets that never request it.
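How the kernel defaults relate to per-socket options can be illustrated with Python's standard `socket` module. This is a hedged sketch, not PostgreSQL's own code: `enable_keepalive` is a hypothetical helper, and the `TCP_KEEP*` constants are the Linux names, so the sketch skips any that the running platform does not expose.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60,
                     interval: int = 10, count: int = 6) -> None:
    """Turn on keepalive for one socket and override the kernel timings."""
    # Keepalive probes are only sent for sockets that opt in via SO_KEEPALIVE.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timings take precedence over the sysctl defaults.
    # Constants missing on this platform are skipped, leaving the defaults.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
```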

Layer 3 — Load Balancer: Match or Exceed Query Duration

Platform        Setting                                       Recommended Value
AWS NLB         listener attribute tcp.idle_timeout.seconds   3600 s
AWS RDS Proxy   --idle-client-timeout                         1800 s
GCP TCP LB      Backend service timeoutSec                    3600 s
HAProxy         timeout client / timeout server               1h

HAProxy example:

 defaults
-    timeout client  1m
-    timeout server  1m
+    timeout client  1h
+    timeout server  1h
+    timeout tunnel  1h

Layer 4 — Application / Connection Pool: Client-Side DSN Options

Python (psycopg2/psycopg3):

- conn = psycopg2.connect("host=db dbname=app user=app password=secret")
+ conn = psycopg2.connect(
+     "host=db dbname=app user=app password=secret"
+     " keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=6"
+ )

JDBC (Java):

- jdbc:postgresql://db:5432/app
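+ jdbc:postgresql://db:5432/app?tcpKeepAlive=true&socketTimeout=300

Note: socketTimeout is in seconds and aborts any socket read that blocks longer than that, so 300 cuts off queries that run past 5 minutes. Set it above your longest expected query.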

PgBouncer pgbouncer.ini:

- tcp_keepalive = 0
- tcp_keepcnt = 0
- tcp_keepidle = 0
- tcp_keepintvl = 0
+ tcp_keepalive = 1
+ tcp_keepcnt = 6
+ tcp_keepidle = 60
+ tcp_keepintvl = 10
+ server_idle_timeout = 600
+ client_idle_timeout = 0

Enterprise Best Practice

  • Never rely on a single keepalive layer. Set it at kernel, PostgreSQL, PgBouncer, and the application DSN. Defense-in-depth means even if one layer is reset by a platform upgrade, the others hold.
  • Set statement_timeout and idle_in_transaction_session_timeout to bound runaway sessions independently of TCP:
+ idle_in_transaction_session_timeout = '10min'
+ statement_timeout = '30min'
  • Use pg_stat_activity to monitor state_change timestamps on idle in transaction connections before they become zombie sockets.
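The pg_stat_activity check in the last bullet could look roughly like the sketch below. It assumes an already open DB-API cursor (for example from psycopg2) and the `%s` parameter style; `find_stale_sessions` and the 300-second default are illustrative choices, not recommendations from the PostgreSQL documentation.

```python
from typing import Any, List, Tuple

# Sessions that have sat "idle in transaction" longer than N seconds,
# oldest first. All columns are standard pg_stat_activity columns.
STALE_SESSIONS_SQL = """
SELECT pid, usename, application_name, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND state_change < now() - %s * interval '1 second'
ORDER BY state_change
"""


def find_stale_sessions(cursor: Any, max_idle_seconds: int = 300) -> List[Tuple]:
    """Run the query on an open DB-API cursor and return the matching rows."""
    cursor.execute(STALE_SESSIONS_SQL, (max_idle_seconds,))
    return cursor.fetchall()
```

Run on a schedule, the returned pids give an early list of sessions to investigate before idle_in_transaction_session_timeout terminates them.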


Prevention in CI/CD

1. Checkov — Flag Missing Keepalive in RDS Parameter Groups

Write a custom Checkov check (CKV_CUSTOM_PG_KEEPALIVE) that asserts tcp_keepalives_idle is set and non-zero in any aws_db_parameter_group Terraform resource.

 resource "aws_db_parameter_group" "postgres" {
   family = "postgres15"
+  parameter {
+    name  = "tcp_keepalives_idle"
+    value = "60"
+  }
+  parameter {
+    name  = "tcp_keepalives_interval"
+    value = "10"
+  }
+  parameter {
+    name  = "tcp_keepalives_count"
+    value = "6"
+  }
 }
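Checkov custom checks are written in Python, so the decision logic for such a check can be sketched on its own. The sketch below covers only that logic and leaves out the wiring into Checkov's check classes. It assumes the parsed resource arrives as a dict whose values may be wrapped in single-element lists; that shape and the helper names are assumptions, so verify them against the Checkov version you run.

```python
from typing import Any, Dict


def _unwrap(value: Any) -> Any:
    """Unwrap single-element lists (assumed shape of parsed HCL values)."""
    while isinstance(value, list) and len(value) == 1:
        value = value[0]
    return value


def keepalive_idle_is_set(conf: Dict[str, Any]) -> bool:
    """True if a parameter block sets tcp_keepalives_idle to a positive integer."""
    for raw_block in conf.get("parameter", []):
        block = _unwrap(raw_block)
        if not isinstance(block, dict):
            continue
        if _unwrap(block.get("name")) != "tcp_keepalives_idle":
            continue
        try:
            return int(_unwrap(block.get("value"))) > 0
        except (TypeError, ValueError):
            return False  # present but not an integer: treat as failing
    return False  # parameter not declared at all
```

Inside a real check, a True result would map to a passing outcome and False to a failing one.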

2. OPA / Conftest Policy — Enforce LB Idle Timeout

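# policy/lb_timeout.rego
package terraform.aws.lb

# Deny NLB TCP listeners that leave the idle timeout at the provider default.
# The input shape (resource -> type -> name -> config) depends on how the
# Terraform is parsed before it reaches Conftest.
deny[msg] {
  res := input.resource.aws_lb_listener[name]
  res.config.protocol == "TCP"
  not res.config.tcp_idle_timeout_seconds
  msg := sprintf("LB listener '%v' has no explicit tcp_idle_timeout_seconds; verify idle timeout alignment with PostgreSQL keepalive", [name])
}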

3. GitHub Actions — Lint postgresql.conf in PRs

- name: Validate postgresql.conf keepalive settings
  run: |
    # Fail unless each keepalive setting is present, uncommented, and non-zero
    for setting in tcp_keepalives_idle tcp_keepalives_interval tcp_keepalives_count; do
      grep -qE "^${setting}\s*=\s*[1-9]" postgresql.conf || \
        { echo "::error::${setting} not set"; exit 1; }
    done
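If the grep patterns become too brittle (quoted values, indentation, a later line overriding an earlier one), the same lint can be written as a small Python script. This is a sketch with deliberately narrow rules: only plain positive integers pass, so a value with a unit suffix such as '60s' is reported as a failure even though PostgreSQL would accept it. The function names are illustrative.

```python
import re
from typing import Dict, Iterable, List

REQUIRED = ("tcp_keepalives_idle", "tcp_keepalives_interval", "tcp_keepalives_count")
# name = value, with an optional "=", optional single quotes, trailing comment ignored
_SETTING = re.compile(r"^\s*([A-Za-z_][\w.]*)\s*=?\s*'?([^'#\s]*)'?")


def parse_settings(lines: Iterable[str]) -> Dict[str, str]:
    """Collect active settings; the last assignment of a name wins."""
    settings: Dict[str, str] = {}
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # commented-out setting
        match = _SETTING.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def missing_keepalives(lines: Iterable[str]) -> List[str]:
    """Return required settings that are absent or not a positive integer."""
    settings = parse_settings(lines)
    return [
        name for name in REQUIRED
        if not settings.get(name, "").isdigit() or int(settings[name]) <= 0
    ]
```

In a workflow step, the script would read postgresql.conf, call `missing_keepalives` on its lines, and exit non-zero when the returned list is not empty.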

4. Alerting — Detect the Error Before Users Do

# Promtail pipeline stage (Loki): every matching log line increments a counter
- match:
    selector: '{job="postgresql"} |~ "could not receive data from client.*Connection timed out"'
    stages:
      - metrics:
          pg_client_timeout_total:
            type: Counter
            description: PostgreSQL client receive timeout events
            config:
              match_all: true
              action: inc

Alert when increase(promtail_custom_pg_client_timeout_total[5m]) > 5 (Promtail adds the promtail_custom_ prefix to custom metrics by default), and page on-call before pool exhaustion cascades.
