Initializing Enclave...

How to Fix PostgreSQL 'Out of File Descriptors' Error: Tuning ulimit and max_files_per_process

Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15 mins

TL;DR

  • What broke: PostgreSQL exhausted the OS file descriptor limit. Every new connection, WAL segment open, and table file access requires an FD. When the process hits its ceiling, it cannot open any file — queries die, connections are refused, the cluster goes read-unavailable.
  • How to fix it: Raise LimitNOFILE in the systemd service unit to ≥65536, set fs.file-max at the kernel level, and tune max_files_per_process in postgresql.conf to match.
  • Shortcut: Use our Client-Side Sandbox above — paste your postgresql.conf or systemctl cat postgresql output and get auto-refactored config with zero data leaving your browser.

The Incident (What Does the Error Mean?)

Raw error from postgresql.log:

FATAL:  out of file descriptors: could not open file "base/16384/1259": %m
LOG:   server process (PID 31892) was terminated by signal 6: Aborted
DETAIL: Failed process was running: SELECT * FROM pg_class;

Or from pg_log during high-concurrency:

ERROR:  could not open relation with OID 12345: out of file descriptors
WARNING: out of file descriptors: %m; release and retry

Immediate consequence: PostgreSQL cannot open heap files, WAL segments, or socket files. Every in-flight transaction on that backend dies. If the postmaster itself hits the limit, accept() on new connections fails — the entire cluster becomes unreachable. Replication standby lag spikes immediately as WAL receiver loses its file handle.


The Attack Vector / Blast Radius

This is a cascading resource exhaustion failure, not a single-point error.

Why it happens faster than you expect:

Each PostgreSQL backend holds open FDs for:

  • Its client socket
  • Every relation (table/index) it touches in a query
  • WAL segment files
  • Shared memory control files
  • pg_stat temp files under heavy ANALYZE

With max_connections = 200 and a moderately complex OLAP query touching 40 relations, you can consume 8,000+ FDs before a single application thread complains. Default ulimit -n on most Linux distros is 1024. Default systemd LimitNOFILE for the postgres unit is often inherited at 1024 unless explicitly overridden — even if /etc/security/limits.conf is set correctly, systemd ignores PAM limits.

Blast radius:

  1. Active queries abort mid-execution → application errors, partial writes, potential data inconsistency if not wrapped in transactions.
  2. Autovacuum workers die → table bloat accumulates → query plans degrade → performance death spiral.
  3. WAL archiver cannot open next segment → replication lag → standbys diverge → failover risk.
  4. pg_dump and pg_basebackup fail silently or corrupt mid-stream.
  5. Connection poolers (PgBouncer) enter error loops, multiplying reconnect storms against the already-degraded postmaster.

How to Fix It

Step 1 — Diagnose Current FD Usage

# Check the postgres process FD limit
ps aux | grep postgres | head -1  # get PID
cat /proc/<PID>/limits | grep 'open files'

# Count currently open FDs
ls /proc/<PID>/fd | wc -l

# Kernel-wide current usage vs max
cat /proc/sys/fs/file-nr
# Output: <allocated>  <unused>  <max>

# Check systemd effective limit
systemctl show postgresql -p LimitNOFILE

Basic Fix — Raise the Limit Now (Zero-Downtime for systemd)

# Apply kernel max immediately (survives until reboot)
sudo sysctl -w fs.file-max=2097152

# Persist across reboots
echo 'fs.file-max = 2097152' | sudo tee /etc/sysctl.d/99-postgres-fd.conf
sudo sysctl -p /etc/sysctl.d/99-postgres-fd.conf

# Override the systemd unit (no full restart needed for limit bump on Linux 5.x+)
sudo systemctl edit postgresql

In the override editor:

- # (no LimitNOFILE set — inherits systemd default of 1024)
+ [Service]
+ LimitNOFILE=65536
sudo systemctl daemon-reload
sudo systemctl restart postgresql  # required for LimitNOFILE to take effect

Enterprise Best Practice — Full Stack Tuning

/etc/sysctl.d/99-postgres-fd.conf:

- # fs.file-max not set (kernel default ~100000 on small instances)
+ fs.file-max = 2097152
+ fs.nr_open = 2097152

/etc/security/limits.d/99-postgres.conf (for non-systemd or PAM sessions):

- # no postgres-specific limits
+ postgres   soft   nofile   65536
+ postgres   hard   nofile   1048576

/etc/systemd/system/postgresql.service.d/override.conf:

 [Service]
- # LimitNOFILE absent
+ LimitNOFILE=1048576

postgresql.conf:

- max_files_per_process = 1000   # default
+ max_files_per_process = 65536

- max_connections = 500          # oversized, driving FD exhaustion
+ max_connections = 100          # pair with PgBouncer for pooling

Rule of thumb: LimitNOFILEmax_connections × 10 + max_files_per_process + 1000 (headroom for WAL + autovacuum workers).


💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Ansible/Terraform Enforcement

If you provision PostgreSQL hosts via Terraform + cloud-init or Ansible, gate the deployment:

- # No FD validation in provisioner
+ resource "null_resource" "postgres_fd_check" {
+   provisioner "remote-exec" {
+     inline = [
+       "[ $(cat /proc/sys/fs/file-max) -ge 2097152 ] || (echo 'FAIL: fs.file-max too low' && exit 1)",
+       "[ $(systemctl show postgresql -p LimitNOFILE | cut -d= -f2) -ge 65536 ] || (echo 'FAIL: LimitNOFILE too low' && exit 1)"
+     ]
+   }
+ }

2. Prometheus Alerting Rule

# Alert before you hit the wall, not after
- alert: PostgresFileDescriptorExhaustion
  expr: |
    process_open_fds{job="postgresql"} / process_max_fds{job="postgresql"} > 0.80
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Postgres FD usage above 80% on {{ $labels.instance }}"
    runbook: "https://your-wiki/postgres-fd-runbook"

3. Checkov / Custom OPA Policy (for Ansible playbooks)

# checkov custom check: ensure postgres systemd override sets LimitNOFILE
from checkov.common.models.enums import CheckResult
from checkov.ansible.checks.base_ansible_check import BaseAnsibleCheck

class PostgresLimitNOFILE(BaseAnsibleCheck):
    def __init__(self):
        super().__init__(
            name="Ensure LimitNOFILE >= 65536 in postgres systemd override",
            check_id="CKV_ANSIBLE_POSTGRES_001"
        )
    def check_resource_configuration(self, configuration):
        limit = configuration.get("LimitNOFILE", 0)
        return CheckResult.PASSED if int(limit) >= 65536 else CheckResult.FAILED

4. Load Test Gate in CI

Run pgbench at 2× expected peak connections in your staging pipeline. Fail the pipeline if pg_log emits out of file descriptors during the run:

pgbench -c 150 -j 8 -T 60 mydb 2>&1 | tee /tmp/pgbench.log
grep -i 'file descriptor' /tmp/pgbench.log && echo 'PIPELINE FAIL: FD exhaustion detected' && exit 1

Bottom line: This error is 100% preventable. The default OS and systemd limits are vestigial from 2005-era single-process assumptions. Any PostgreSQL deployment expecting more than ~50 concurrent connections needs these values set on day one, validated in CI, and monitored in production.

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →