Does this error cause data loss or transaction rollback?

Yes. Any open transaction on that backend is rolled back when the backend exits, so none of its uncommitted writes persist. If the application retries without idempotency safeguards, it can re-apply work it has already done and create duplicates. A transaction that had already completed PREPARE TRANSACTION survives the disconnect: list it in pg_prepared_xacts and resolve it manually with COMMIT PREPARED or ROLLBACK PREPARED.

Will setting tcp_keepalives_idle in postgresql.conf fix the issue without a server restart?

Yes. The tcp_keepalives_* parameters do not need a restart: after editing postgresql.conf, run SELECT pg_reload_conf() and every new connection gets the values when its socket is set up. Connections that are already sitting idle may not pick up the change until they run another command or reconnect, so recycle long-lived connections after the reload.

How to Fix PostgreSQL 'could not receive data from client: Connection timed out' – Root Cause & Production Fix

Q: What is the root cause of 'FATAL: could not receive data from client: Connection timed out' in PostgreSQL?

The OS returned ETIMEDOUT on the backend's recv() call because an intermediate network device — NAT gateway, load balancer, or firewall — silently dropped the TCP flow after its idle timeout expired. PostgreSQL had no keepalive probes configured to keep the connection alive, so it had no warning before the kernel gave up retransmitting.

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins

TL;DR

What broke: PostgreSQL's backend process waited for the next data packet from the client, the TCP connection went silent past the idle timeout enforced by a NAT gateway, load balancer, or the kernel, and the OS finally returned ETIMEDOUT — PostgreSQL logged FATAL and killed the session, rolling back any open transaction.
How to fix it: Tune tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count inside postgresql.conf (or the client DSN) so keepalive probes fire before the intermediate device drops the flow; simultaneously raise or match the LB idle timeout.

The Incident (What does the error mean?)

Raw server log:

2024-03-15 03:17:42 UTC [28901]: FATAL:  could not receive data from client: Connection timed out
2024-03-15 03:17:42 UTC [28901]: DETAIL:  Client connection lost while waiting for data.

PostgreSQL's backend process called recv() on the client socket. The kernel waited, received no ACK or data, exhausted its retransmit window, and returned ETIMEDOUT. PostgreSQL treats this as unrecoverable: the backend exits immediately, the transaction is rolled back, any advisory locks are released, and the application gets a broken pipe on its next write. Under connection pooling (PgBouncer, RDS Proxy) this also poisons the pooled socket, triggering a pool drain cycle.

The Attack Vector / Blast Radius

This is a silent infrastructure mismatch, not a code bug, which makes it far more dangerous in production:

NAT/Load Balancer idle timeout is shorter than your longest query. AWS NLB default: 350 s. AWS ALB: 60 s. GCP Cloud SQL Proxy: 600 s. If a VACUUM ANALYZE, a bulk INSERT, or a reporting query runs longer, the LB silently drops the flow. PostgreSQL's backend has no idea until it tries to send the result — by then the client is gone.
Cascading pool exhaustion. PgBouncer holds the server-side connection open. The broken client socket is not detected until the next recv(). Meanwhile the pool slot is occupied. Under load, all pool slots fill with zombie connections → new queries queue → application latency spikes → timeouts cascade.
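Data integrity risk. Any transaction that was mid-flight (multi-statement, COPY) is rolled back; a two-phase transaction is rolled back only if the drop happened before PREPARE TRANSACTION completed, otherwise it stays prepared until it is resolved manually. If the application retries without idempotency keys, it may re-submit writes whose commit status it never learned → duplicate rows or constraint violations.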
Invisible in APM. The error surfaces in PostgreSQL logs, not in application error rates, because the TCP drop happens below the application layer. Teams spend hours blaming slow queries before checking pg_log.
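
The data integrity point above comes down to making a retried write safe to apply twice. The following is a minimal sketch, not a definitive implementation: run_once is a hypothetical callable that performs the whole transaction, and ConnectionLost is a hypothetical stand-in for whatever the driver raises when the connection drops (psycopg2 raises OperationalError). The idempotency key is generated once and reused on every attempt, on the assumption that the database deduplicates on it, for example through a unique constraint.

```python
import time
import uuid


class ConnectionLost(Exception):
    """Hypothetical stand-in for the driver's connection-loss error."""


def run_with_retry(run_once, attempts=3, base_delay=0.0):
    """Call run_once(key) until it succeeds or the attempts are exhausted.

    The same idempotency key is passed on every attempt, so a write that
    committed just before the connection dropped can be detected and
    skipped on the database side.
    """
    key = str(uuid.uuid4())
    for attempt in range(1, attempts + 1):
        try:
            return run_once(key)
        except ConnectionLost:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

In real code run_once would open a fresh connection on each attempt, since the previous socket is dead by the time the error surfaces.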

How to Fix It

Layer 1 — PostgreSQL Server: Enable TCP Keepalives

File: /etc/postgresql/15/main/postgresql.conf

- #tcp_keepalives_idle = 0
- #tcp_keepalives_interval = 0
- #tcp_keepalives_count = 0
+ tcp_keepalives_idle = 60        # send first probe after 60 s of silence
+ tcp_keepalives_interval = 10   # retransmit probe every 10 s
+ tcp_keepalives_count = 6       # give up after 6 unanswered probes (60 s idle + 6 x 10 s = 120 s to detect a dead peer)

No restart required — SELECT pg_reload_conf(); applies these live. These map directly to TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT socket options on the backend socket.
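
The arithmetic behind these three values can be checked before rollout. A small sketch, assuming a hypothetical helper name keepalive_plan and an idle timeout you have looked up for your own load balancer or NAT device: the first probe has to go out before the middlebox forgets the flow, and idle + interval x count is the worst-case time to notice a dead peer.

```python
def keepalive_plan(idle, interval, count, lb_idle_timeout):
    """Summarise how a keepalive configuration relates to an LB idle timeout.

    All arguments are in seconds except count, which is a number of probes.
    """
    detect_after = idle + interval * count  # worst-case time to notice a dead peer
    return {
        "detect_dead_peer_after": detect_after,
        # The first probe must be sent before the middlebox drops the idle flow.
        "keeps_flow_alive": idle < lb_idle_timeout,
    }
```

With the values above and a 350 s idle timeout, the flow stays alive and a dead peer is detected after 120 s; with the Linux defaults (7200, 75, 9) the first probe would come far too late.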

Layer 2 — Kernel: Harden System-Wide TCP Keepalive

File: /etc/sysctl.d/99-postgres-keepalive.conf

- # (defaults: idle=7200, interval=75, count=9)
+ net.ipv4.tcp_keepalive_time = 60
+ net.ipv4.tcp_keepalive_intvl = 10
+ net.ipv4.tcp_keepalive_probes = 6

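Apply with sysctl -p /etc/sysctl.d/99-postgres-keepalive.conf. These values are the fallback for any socket that enables SO_KEEPALIVE but does not set its own TCP_KEEPIDLE, TCP_KEEPINTVL, or TCP_KEEPCNT; a socket that never enables SO_KEEPALIVE sends no probes regardless of these settings.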
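
To illustrate how socket-level options relate to the kernel defaults, here is a hedged sketch of what a process does when it overrides them for one socket. The helper name enable_keepalive is hypothetical; the TCP_KEEP* constants are platform-specific (they exist on Linux), so each one is applied only when the Python build exposes it.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Turn on keepalive for one socket and override the kernel defaults."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket values take precedence over the net.ipv4.tcp_keepalive_* sysctls.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```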

Layer 3 — Load Balancer: Match or Exceed Query Duration

Platform	Setting	Recommended Value
AWS NLB	Listener attribute `tcp.idle_timeout.seconds`	3600 s
AWS RDS Proxy	`--idle-client-timeout`	1800 s
GCP TCP LB	Backend service `timeoutSec`	3600 s
HAProxy	`timeout client` / `timeout server`	`1h`

HAProxy example:

 defaults
-    timeout client  1m
-    timeout server  1m
+    timeout client  1h
+    timeout server  1h
+    timeout tunnel  1h

Layer 4 — Application / Connection Pool: Client-Side DSN Options

Python (psycopg2/psycopg3):

- conn = psycopg2.connect("host=db dbname=app user=app password=secret")
+ conn = psycopg2.connect(
+     "host=db dbname=app user=app password=secret"
+     " keepalives=1 keepalives_idle=60 keepalives_interval=10 keepalives_count=6"
+ )
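
When DSNs come from configuration rather than source code, the same libpq parameters can be appended programmatically. A minimal sketch, assuming a keyword/value DSN without quoted values that contain spaces; the helper name with_keepalives is hypothetical, while keepalives, keepalives_idle, keepalives_interval, and keepalives_count are the libpq parameter names used above.

```python
def with_keepalives(dsn, idle=60, interval=10, count=6):
    """Append libpq keepalive parameters to a keyword/value DSN string.

    Parameters already present in the DSN are left untouched.
    """
    wanted = {
        "keepalives": 1,
        "keepalives_idle": idle,
        "keepalives_interval": interval,
        "keepalives_count": count,
    }
    # Collect the keywords the DSN already sets (simple split, no quoting support).
    present = {part.split("=", 1)[0] for part in dsn.split() if "=" in part}
    extra = [f"{key}={value}" for key, value in wanted.items() if key not in present]
    return " ".join([dsn.strip()] + extra)
```

The result can be passed straight to psycopg2.connect().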

JDBC (Java):

- jdbc:postgresql://db:5432/app
+ jdbc:postgresql://db:5432/app?tcpKeepAlive=true&socketTimeout=300

PgBouncer pgbouncer.ini:

- tcp_keepalive = 0
- tcp_keepcnt = 0
- tcp_keepidle = 0
- tcp_keepintvl = 0
+ tcp_keepalive = 1
+ tcp_keepcnt = 6
+ tcp_keepidle = 60
+ tcp_keepintvl = 10
+ server_idle_timeout = 600
+ client_idle_timeout = 0

Enterprise Best Practice

Never rely on a single keepalive layer. Set it at kernel, PostgreSQL, PgBouncer, and the application DSN. Defense-in-depth means even if one layer is reset by a platform upgrade, the others hold.
Set statement_timeout and idle_in_transaction_session_timeout to bound runaway sessions independently of TCP:

+ idle_in_transaction_session_timeout = '10min'
+ statement_timeout = '30min'

Use pg_stat_activity to monitor state_change timestamps on idle in transaction connections before they become zombie sockets.
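
One way to act on that view is sketched below, under stated assumptions: the query uses the pid, state, and state_change columns of pg_stat_activity, while the stale_sessions helper and its 5-minute threshold are hypothetical choices for illustration. The rows are expected in the shape a DB-API cursor returns for the query.

```python
from datetime import datetime, timedelta, timezone

# Sessions that are holding a transaction open without doing any work.
IDLE_IN_TX_QUERY = """
SELECT pid, state, state_change
FROM pg_stat_activity
WHERE state = 'idle in transaction'
"""


def stale_sessions(rows, now, max_age=timedelta(minutes=5)):
    """Return the pids whose state has not changed for longer than max_age.

    rows is an iterable of (pid, state, state_change) tuples.
    """
    return [pid for pid, state, changed in rows
            if state == "idle in transaction" and now - changed > max_age]
```

Pass a timezone-aware now (for example datetime.now(timezone.utc)), because state_change is a timestamp with time zone.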

Prevention in CI/CD

1. Checkov — Flag Missing Keepalive in RDS Parameter Groups

Write a custom Checkov check (CKV_CUSTOM_PG_KEEPALIVE) that asserts tcp_keepalives_idle is set and non-zero in any aws_db_parameter_group Terraform resource.

 resource "aws_db_parameter_group" "postgres" {
   family = "postgres15"
+  parameter {
+    name  = "tcp_keepalives_idle"
+    value = "60"
+  }
+  parameter {
+    name  = "tcp_keepalives_interval"
+    value = "10"
+  }
+  parameter {
+    name  = "tcp_keepalives_count"
+    value = "6"
+  }
 }

2. OPA / Conftest Policy — Enforce LB Idle Timeout

# policy/lb_timeout.rego
package terraform.aws.lb

# Flag NLB TCP listeners that leave the idle timeout at the provider default.
deny[msg] {
  res := input.resource.aws_lb_listener[name]
  res.config.protocol == "TCP"
  not res.config.tcp_idle_timeout_seconds
  msg := sprintf("LB listener '%v' has no explicit tcp_idle_timeout_seconds; verify idle timeout alignment with PostgreSQL keepalive", [name])
}

3. GitHub Actions — Lint postgresql.conf in PRs

- name: Validate postgresql.conf keepalive settings
  run: |
    grep -E '^tcp_keepalives_idle\s*=\s*[1-9]' postgresql.conf || \
      (echo '::error::tcp_keepalives_idle not set' && exit 1)
    grep -E '^tcp_keepalives_interval\s*=\s*[1-9]' postgresql.conf || \
      (echo '::error::tcp_keepalives_interval not set' && exit 1)
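
The grep-based step can also be written as a small script that covers all three settings and ignores commented-out lines. This is a sketch under simplifying assumptions: it handles plain name = value lines (optionally quoted, with or without the equals sign) and does not follow include directives; the function name missing_keepalive_settings is hypothetical.

```python
import re

REQUIRED = ("tcp_keepalives_idle", "tcp_keepalives_interval", "tcp_keepalives_count")


def missing_keepalive_settings(conf_text):
    """Return the required settings that are absent, commented out, or zero."""
    found = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"(\w+)(?:\s*=\s*|\s+)'?(\d+)", line)
        if match:
            found[match.group(1)] = int(match.group(2))  # last assignment wins
    return [name for name in REQUIRED if found.get(name, 0) <= 0]
```

A CI step would read postgresql.conf, call the function, and fail the job when the returned list is non-empty.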

4. Alerting — Detect the Error Before Users Do

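# Promtail pipeline stage (Loki) that exposes a Prometheus counter
- match:
    selector: '{job="postgresql"}'
    stages:
      - regex:
          # promtail requires named capture groups; the group feeds the counter below
          expression: '(?P<client_timeout>FATAL.*could not receive data from client.*Connection timed out)'
      - metrics:
          pg_client_timeout_total:
            type: Counter
            description: PostgreSQL client receive timeout events
            source: client_timeout
            config:
              action: inc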

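Alert when pg_client_timeout_total increases by more than 5 within a 5-minute window (for example, increase(pg_client_timeout_total[5m]) > 5) so on-call is paged before pool exhaustion cascades. Promtail prefixes custom metric names with promtail_custom_ by default, so adjust the metric name in the expression accordingly.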
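
Before deploying the pipeline, the match expression can be tried against sample log lines offline. A minimal sketch using the same pattern as the rule above; the function name count_client_timeouts is hypothetical.

```python
import re

# Same expression as the log-scrape rule above.
TIMEOUT_PATTERN = re.compile(
    r"FATAL.*could not receive data from client.*Connection timed out"
)


def count_client_timeouts(log_lines):
    """Count log lines that match the client receive-timeout pattern."""
    return sum(1 for line in log_lines if TIMEOUT_PATTERN.search(line))
```

Feeding it the two lines from the raw server log earlier in this article counts one event, because only the FATAL line matches.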