How to Fix PostgreSQL 'temp file size exceeded temp_file_limit' Error
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10 mins
TL;DR
- What broke: A query's sort, hash join, or hash aggregate spilled more temp data to disk than
temp_file_limitpermits, causing PostgreSQL to hard-abort the transaction. - How to fix it: Raise
temp_file_limit, tunework_memto reduce spill volume, or rewrite the query to avoid the full sort/hash. - Use our Client-Side Sandbox below to auto-refactor this — paste your
postgresql.conforEXPLAIN ANALYZEoutput and get a patched config instantly.
The Incident (What Does the Error Mean?)
ERROR: temporary file size exceeds temp_file_limit (524288kB)
CONTEXT: SQL function "..."
PostgreSQL enforces temp_file_limit per-session. The moment a single backend's cumulative temp file writes (sort runs, hash batch files, CTE materializations) cross this threshold, the backend throws a fatal error and rolls back the entire transaction. The query is dead. Any application holding that connection gets an exception. Connection pools often retry immediately, hammering the same broken path.
The Attack Vector / Blast Radius
This is not a one-query problem. The cascading failure path:
- Retry storms: ORMs and pgBouncer retry failed transactions. If the data shape hasn't changed, every retry hits the same limit, burning CPU and I/O.
- Disk starvation: If
temp_file_limitis too permissive and multiple sessions spill simultaneously,pg_defaulttablespace fills the OS volume. PostgreSQL crashes withENOSPC. This is worse than the original error. - Lock pile-up: Long-running queries that die mid-execution hold locks until rollback completes. Dependent writes queue behind them.
- Replication lag: Excessive temp I/O on the primary competes with WAL writes, spiking replication lag on standbys.
The real blast radius is disk exhaustion + connection pool saturation, not just one failed query.
How to Fix It
Basic Fix — Raise the Limit (Triage Only)
Identify current value and the offending session:
-- Check current limit
SHOW temp_file_limit;
-- Find sessions currently writing temp files
SELECT pid, usename, application_name, query, temp_files, temp_bytes
FROM pg_stat_activity
JOIN pg_stat_statements USING (queryid) -- if pg_stat_statements enabled
WHERE state = 'active';
-- Or check pg_stat_database for aggregate temp usage
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();
# postgresql.conf
- temp_file_limit = 512MB # Too restrictive for your workload
+ temp_file_limit = 2GB # Raise to unblock; monitor pg_stat_database.temp_bytes
⚠️ This is a triage move only. Without tuning
work_mem, you are just delaying the next breach.
Enterprise Best Practice — Reduce Spill with work_mem + Query Tuning
The root fix is keeping operations in memory so they never spill at all.
# postgresql.conf
# --- Memory: reduce spill volume ---
- work_mem = 4MB # Default; forces sort spill on any non-trivial dataset
+ work_mem = 64MB # Tune per max_connections. Formula: RAM * 0.25 / max_connections
# --- Temp file safety ceiling ---
- temp_file_limit = -1 # DANGEROUS: unlimited, disk can fill completely
+ temp_file_limit = 4GB # Hard ceiling per session; adjust to 2x your largest known spill
# --- Planner: prefer hash joins only when hash fits in work_mem ---
- enable_hashjoin = on # No constraint on hash batch size awareness
+ enable_hashjoin = on # Keep on, but work_mem increase makes single-batch hashes viable
+ enable_sort = on
+ effective_cache_size = 24GB # Tell planner how much OS cache is available; reduces bad plans
For specific roles/queries (preferred over global changes):
-- Grant elevated work_mem only to the ETL role, not every OLTP connection
ALTER ROLE etl_user SET work_mem = '256MB';
ALTER ROLE etl_user SET temp_file_limit = '8GB';
-- Or per-session before a known heavy query
SET work_mem = '128MB';
SET temp_file_limit = '4GB';
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;
Identify the worst offenders before tuning:
-- Requires pg_stat_statements
SELECT query,
calls,
mean_exec_time,
temp_blks_written * 8192 / 1024 / 1024 AS temp_mb_per_exec
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 10;
Fix the top query first. A missing index on a sort key or a cartesian join from a bad predicate will spill regardless of work_mem.
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Alert Before Disk Fills — Prometheus + postgres_exporter
# prometheus-rules.yml
groups:
- name: postgres_temp
rules:
- alert: PostgresTempFilesHigh
expr: rate(pg_stat_database_temp_bytes[5m]) > 50000000 # 50MB/s spill rate
for: 2m
labels:
severity: warning
annotations:
summary: "High temp file write rate on {{ $labels.datname }}"
- alert: PostgresTempDiskNearLimit
expr: pg_stat_database_temp_bytes > 3e9 # 3GB cumulative
for: 1m
labels:
severity: critical
2. Enforce Config Baselines with pgaudit + Checkov (IaC)
If you provision RDS or Cloud SQL via Terraform:
# Checkov will flag temp_file_limit = -1 as a misconfiguration
resource "aws_db_parameter_group" "postgres" {
family = "postgres15"
parameter {
name = "temp_file_limit"
value = "4194304" # 4GB in kB — never leave at -1 in production
}
parameter {
name = "work_mem"
value = "65536" # 64MB
}
}
Run checkov -d . in your pipeline. Add a custom check:
# checkov/custom/postgres_temp_limit.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
class PostgresTempFileLimitCheck(BaseResourceCheck):
def __init__(self):
super().__init__(
name="Ensure temp_file_limit is not -1 (unlimited)",
id="CKV_CUSTOM_PG_001",
supported_resources=["aws_db_parameter_group"]
)
def scan_resource_conf(self, conf):
for param in conf.get("parameter", []):
if param.get("name") == "temp_file_limit":
return param.get("value") != "-1"
return True
3. Query Regression Gate in CI
Use pgTAP or a custom script to assert no plan uses excessive sorts:
# In your CI pipeline, after migrations
psql $DATABASE_URL -c "
EXPLAIN (FORMAT JSON, ANALYZE false) $YOUR_CRITICAL_QUERY
" | jq '.[0]["Plan"] | .. | objects | select(."Node Type" == "Sort") | ."Sort Key"'
# Fail the build if unexpected full-table sorts appear