Fixing PostgreSQL 'Connection Reset by Peer' After idle_in_transaction_session_timeout
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: PostgreSQL's
idle_in_transaction_session_timeoutfired, killing a backend session that held an open transaction too long without executing a query. The TCP connection was torn down server-side; the client gotConnection reset by peer. - How to fix it: Audit your connection pool config and application transaction boundaries. Ensure transactions are committed/rolled back promptly. Tune
idle_in_transaction_session_timeoutto a safe non-zero value and fix the code holding transactions open. - Use our Client-Side Sandbox below to paste your
postgresql.conf,pgbouncer.ini, or application DB config and auto-refactor it for safe transaction handling.
The Incident (What Does the Error Mean?)
Raw error output from PostgreSQL logs:
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection with an open transaction
FATAL: terminating connection due to idle-in-transaction timeout
PostgreSQL enforces idle_in_transaction_session_timeout (introduced in PG 9.6). When a session opens a transaction (BEGIN) and then goes silent — no queries, no commit, no rollback — for longer than the configured interval, the server forcibly terminates the backend process. The OS resets the TCP socket. The client library sees Connection reset by peer.
Immediate consequences:
- The open transaction is rolled back entirely. Any uncommitted writes are lost.
- The application connection is dead. If your pool doesn't detect this, the next query on that connection throws a broken pipe or similar error.
- Under high concurrency, this cascades: multiple connections die simultaneously, the pool exhausts, and the application starts refusing requests.
The Attack Vector / Blast Radius
This is a cascading availability failure, not a security exploit — but the blast radius is severe in production:
Connection pool poisoning. PgBouncer, HikariCP, and SQLAlchemy pools cache connections. A server-side reset is not always detected immediately. The pool hands out a dead connection to the next thread. That thread throws an exception. If retry logic is absent or naive, the request fails silently.
Lock amplification. The killed session was likely holding row-level or table-level locks. While PG releases locks on transaction termination, any dependent queries were blocked for the full timeout duration before that. Under load, this queues up hundreds of waiting backends.
Data integrity risk. If the application assumes a transaction committed (e.g., it already returned HTTP 200 to the user), but the timeout killed it mid-flight, you have a phantom write — the user believes the operation succeeded, the DB has no record of it.
Root cause is almost always application-layer: a background job that opens a transaction, calls an external HTTP API mid-transaction, and waits. The HTTP call stalls. The transaction sits idle. PG kills it.
How to Fix It
Basic Fix — Set a Sane Timeout and Fix the Immediate Config
If idle_in_transaction_session_timeout is set to 0 (disabled), you have no protection and sessions leak indefinitely. If it's set too aggressively low (e.g., 1s), legitimate slow operations get killed.
postgresql.conf:
- idle_in_transaction_session_timeout = 0
+ idle_in_transaction_session_timeout = 30000 # 30 seconds — adjust per workload
+ idle_session_timeout = 600000 # 10 min — kill truly idle sessions too
Reload without restart: SELECT pg_reload_conf();
Enterprise Best Practice — Fix the Application Transaction Boundaries
The real fix is in the application. Never call external services, sleep, or do I/O inside an open transaction.
Python (SQLAlchemy) — Bad Pattern:
- with session.begin():
- record = session.query(Order).get(order_id)
- response = requests.post("https://payment-api.internal/charge", json={...}) # BLOCKS HERE
- record.status = "paid"
Fixed Pattern:
+ # Do all external I/O BEFORE opening the transaction
+ response = requests.post("https://payment-api.internal/charge", json={...}, timeout=10)
+ response.raise_for_status()
+
+ with session.begin():
+ record = session.query(Order).get(order_id)
+ record.status = "paid"
+ # Transaction is open for milliseconds, not seconds
PgBouncer — Ensure statement-level or transaction-level pooling is configured to handle server-side resets:
- pool_mode = session
+ pool_mode = transaction
- server_idle_timeout = 600
+ server_idle_timeout = 60
+ server_reset_query = DISCARD ALL
+ tcp_keepalive = 1
+ tcp_keepidle = 10
+ tcp_keepintvl = 10
HikariCP (Java) — Add connection validation:
- hikari.connectionTimeout=30000
- hikari.idleTimeout=600000
+ hikari.connectionTimeout=30000
+ hikari.idleTimeout=60000 # Must be < idle_in_transaction_session_timeout
+ hikari.keepaliveTime=30000
+ hikari.connectionTestQuery=SELECT 1
+ hikari.validationTimeout=5000
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Lint postgresql.conf in Terraform/Ansible:
- # No timeout set — silent misconfiguration
- resource "aws_db_parameter_group" "pg" {
- parameter { name = "idle_in_transaction_session_timeout" value = "0" }
- }
+ resource "aws_db_parameter_group" "pg" {
+ parameter { name = "idle_in_transaction_session_timeout" value = "30000" apply_method = "immediate" }
+ parameter { name = "idle_session_timeout" value = "600000" apply_method = "immediate" }
+ }
2. Checkov policy — flag disabled timeouts:
Add a custom Checkov check or use conftest with OPA:
deny[msg] {
input.resource.aws_db_parameter_group[_].parameter[p]
p.name == "idle_in_transaction_session_timeout"
p.value == "0"
msg := "idle_in_transaction_session_timeout must not be 0 (disabled). Set to 30000ms or lower."
}
3. Integration test gate: Add a test that opens a transaction, sleeps past your configured timeout, and asserts the pool correctly recovers (reconnects) rather than serving a dead connection. Run this in your staging pipeline on every deploy.
4. Observability: Alert on pg_stat_activity where state = 'idle in transaction' and now() - state_change > interval '10 seconds'. This fires before PG kills the session and gives you time to identify the offending query pattern.
SELECT pid, usename, application_name, state, now() - state_change AS idle_duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - state_change > interval '10 seconds'
ORDER BY idle_duration DESC;
Set this as a Datadog, Prometheus, or CloudWatch custom metric. Page on it.