Why does the SSL connection close only on long-running queries and not short ones?

Short queries complete before any network device's idle timeout fires. Long-running queries that are CPU-bound on the server produce zero network traffic for extended periods. Stateful firewalls, AWS ALBs, and NAT Gateways interpret this silence as an abandoned connection and send a TCP RST, tearing down the SSL session. The query was healthy — the network killed it.

Does setting tcp_keepalives_idle in postgresql.conf override the OS sysctl settings?

Yes, per-connection. PostgreSQL uses the libpq setsockopt() calls to override the OS defaults on each individual socket it manages. However, setting both the OS sysctl values and the postgresql.conf parameters is the correct approach — the sysctl values cover any connections or tools that bypass libpq's socket options.

We use AWS RDS — we can't edit postgresql.conf directly. How do we apply tcp_keepalives settings?

For RDS, set tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count in your RDS Parameter Group and apply it to your instance. For the load balancer timeout issue, increase the ALB idle_timeout to exceed your longest query runtime. For RDS Proxy, review the proxy's connection borrow timeout and ensure your VPC Security Groups and Network ACLs are not enforcing their own idle timeouts shorter than your query duration.

How to Fix PostgreSQL 'SSL Connection Closed Unexpectedly' on Long-Running Queries

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–30 mins

TL;DR

What broke: PostgreSQL dropped an active SSL session mid-query because a network appliance (ALB, NAT Gateway, firewall) or the server itself killed the idle-looking TCP connection before the query returned results.
How to fix it: Tune TCP keepalive settings at the OS and PostgreSQL level, increase or disable SSL renegotiation, and configure your load balancer idle timeout to exceed your longest expected query runtime.
Fast path: Use our Client-Side Sandbox below to auto-refactor your postgresql.conf or connection string — paste your config and get patched output instantly.

The Incident (What Does the Error Mean?)

FATAL:  SSL connection has been closed unexpectedly
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

This is not a query bug. The query was running fine. The TCP socket underneath the SSL session was silently torn down — by a firewall idle-timeout, an AWS ALB 60-second hard timeout, a NAT Gateway 350-second UDP/TCP timeout, or OpenSSL's built-in SSL renegotiation kicking in on long-lived connections. PostgreSQL sees the socket die and throws FATAL. Your transaction is rolled back. Any work done is lost.

The Attack Vector / Blast Radius

This is a silent data integrity risk disguised as a network error.

Long ETL jobs, pg_dump over SSL, bulk INSERT ... SELECT — all silently aborted. If your application doesn't check for this specific error and retry with proper transaction handling, you get partial writes with no alert.
Connection poolers (PgBouncer, RDS Proxy) make this worse. The pooler may hold the dead connection in its pool and hand it to the next client, causing a cascade of SSL connection closed unexpectedly errors across unrelated application threads.
AWS RDS / Aurora specific: The RDS proxy and the underlying ALB both enforce a hard 60-second idle timeout by default. A query that runs for 90 seconds with no intermediate network traffic (no rows streamed back yet) looks "idle" to the load balancer. It tears down the TCP connection. PostgreSQL never knew it was coming.
On-prem / GCP / Azure: Stateful firewalls with TCP idle timeouts of 30–300 seconds are the usual culprit. The query is CPU-bound on the server, producing no network traffic, so the firewall ages out the flow entry and sends a RST.

How to Fix It

Basic Fix — TCP Keepalive at the OS Level

On the PostgreSQL server host, reduce the TCP keepalive probe interval so the kernel sends keepalive packets before any firewall or load balancer times out the connection.

# /etc/sysctl.conf
- # net.ipv4.tcp_keepalive_time = 7200
- # net.ipv4.tcp_keepalive_intvl = 75
- # net.ipv4.tcp_keepalive_probes = 9
+ net.ipv4.tcp_keepalive_time = 60
+ net.ipv4.tcp_keepalive_intvl = 10
+ net.ipv4.tcp_keepalive_probes = 6

Apply without reboot:

sysctl -p

This makes the kernel send a keepalive probe after 60 seconds of silence, every 10 seconds, for 6 attempts — totaling 120 seconds before the kernel declares the connection dead. This keeps the TCP flow alive through most firewalls.

Enterprise Best Practice — PostgreSQL + Application Layer

1. Enable keepalives in postgresql.conf:

# postgresql.conf
- #tcp_keepalives_idle = 0
- #tcp_keepalives_interval = 0
- #tcp_keepalives_count = 0
+ tcp_keepalives_idle = 60
+ tcp_keepalives_interval = 10
+ tcp_keepalives_count = 6

2. Disable SSL renegotiation (PostgreSQL < 14 only — this was the primary culprit in older versions):

# postgresql.conf (PostgreSQL 9.x–13.x)
- #ssl_renegotiation_limit = 512MB
+ ssl_renegotiation_limit = 0

PostgreSQL 14+ removed ssl_renegotiation_limit entirely. If you're still on 12 or 13, set it to 0 to disable forced renegotiation that can interrupt long transfers.

3. Pass keepalive parameters in your connection string (application side):

# Python psycopg2 / libpq connection string
- postgresql://user:pass@host:5432/db
+ postgresql://user:pass@host:5432/db?keepalives=1&keepalives_idle=60&keepalives_interval=10&keepalives_count=6

# JDBC (Java)
- jdbc:postgresql://host:5432/db
+ jdbc:postgresql://host:5432/db?tcpKeepAlive=true&socketTimeout=0

4. AWS RDS / Aurora — fix the load balancer timeout:

If you're behind an Application Load Balancer, the default idle timeout is 60 seconds. For long-running queries this is fatal.

# Terraform — aws_lb resource
 resource "aws_lb" "main" {
   name = "app-lb"
-  idle_timeout = 60
+  idle_timeout = 4000
 }

For RDS Proxy, set ConnectionBorrowTimeout and ensure your proxy is not using MaxConnectionsPercent so aggressively that connections are recycled mid-query.

💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.

Prevention in CI/CD

1. Checkov — flag missing keepalive on RDS Proxy / ALB Terraform:

Write a custom Checkov check or use terraform-compliance to assert idle_timeout on aws_lb resources is never below your P99 query duration SLA.

# terraform-compliance feature file
Scenario: ALB idle timeout must support long-running queries
  Given I have aws_lb defined
  Then it must contain idle_timeout
  And its value must be greater than 300

2. pg_stat_activity alerting — catch long queries before they hit the timeout:

-- Alert if any query has been running > 5 minutes
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes';

Pipe this into your Prometheus postgres_exporter custom query config and alert before the SSL timeout window closes.

3. statement_timeout as a safety net — never let runaway queries accumulate:

# postgresql.conf
- #statement_timeout = 0
+ statement_timeout = '30min'

Set this at the role level for application users, not globally, to avoid killing legitimate long-running maintenance jobs:

ALTER ROLE app_user SET statement_timeout = '30min';
ALTER ROLE etl_user SET statement_timeout = '4h';

4. Integration test — simulate the timeout in your pipeline:

Use tc (traffic control) in your CI environment to simulate a 60-second TCP idle drop and assert your application retries correctly with exponential backoff.