Initializing Enclave...

How to Fix PostgreSQL 'temp file size exceeded temp_file_limit' Error

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 10 mins

TL;DR

  • What broke: A query's sort, hash join, or hash aggregate spilled more temp data to disk than temp_file_limit permits, causing PostgreSQL to hard-abort the transaction.
  • How to fix it: Raise temp_file_limit, tune work_mem to reduce spill volume, or rewrite the query to avoid the full sort/hash.
  • Use our Client-Side Sandbox below to auto-refactor this — paste your postgresql.conf or EXPLAIN ANALYZE output and get a patched config instantly.

The Incident (What Does the Error Mean?)

ERROR:  temporary file size exceeds temp_file_limit (524288kB)
CONTEXT:  SQL function "..."

PostgreSQL enforces temp_file_limit per-session. The moment a single backend's cumulative temp file writes (sort runs, hash batch files, CTE materializations) cross this threshold, the backend throws a fatal error and rolls back the entire transaction. The query is dead. Any application holding that connection gets an exception. Connection pools often retry immediately, hammering the same broken path.


The Attack Vector / Blast Radius

This is not a one-query problem. The cascading failure path:

  1. Retry storms: ORMs and pgBouncer retry failed transactions. If the data shape hasn't changed, every retry hits the same limit, burning CPU and I/O.
  2. Disk starvation: If temp_file_limit is too permissive and multiple sessions spill simultaneously, pg_default tablespace fills the OS volume. PostgreSQL crashes with ENOSPC. This is worse than the original error.
  3. Lock pile-up: Long-running queries that die mid-execution hold locks until rollback completes. Dependent writes queue behind them.
  4. Replication lag: Excessive temp I/O on the primary competes with WAL writes, spiking replication lag on standbys.

The real blast radius is disk exhaustion + connection pool saturation, not just one failed query.


How to Fix It

Basic Fix — Raise the Limit (Triage Only)

Identify current value and the offending session:

-- Check current limit
SHOW temp_file_limit;

-- Find sessions currently writing temp files
SELECT pid, usename, application_name, query, temp_files, temp_bytes
FROM pg_stat_activity
JOIN pg_stat_statements USING (queryid)  -- if pg_stat_statements enabled
WHERE state = 'active';

-- Or check pg_stat_database for aggregate temp usage
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();
# postgresql.conf

- temp_file_limit = 512MB   # Too restrictive for your workload
+ temp_file_limit = 2GB     # Raise to unblock; monitor pg_stat_database.temp_bytes

⚠️ This is a triage move only. Without tuning work_mem, you are just delaying the next breach.


Enterprise Best Practice — Reduce Spill with work_mem + Query Tuning

The root fix is keeping operations in memory so they never spill at all.

# postgresql.conf

# --- Memory: reduce spill volume ---
- work_mem = 4MB            # Default; forces sort spill on any non-trivial dataset
+ work_mem = 64MB           # Tune per max_connections. Formula: RAM * 0.25 / max_connections

# --- Temp file safety ceiling ---
- temp_file_limit = -1      # DANGEROUS: unlimited, disk can fill completely
+ temp_file_limit = 4GB     # Hard ceiling per session; adjust to 2x your largest known spill

# --- Planner: prefer hash joins only when hash fits in work_mem ---
- enable_hashjoin = on      # No constraint on hash batch size awareness
+ enable_hashjoin = on      # Keep on, but work_mem increase makes single-batch hashes viable
+ enable_sort = on
+ effective_cache_size = 24GB  # Tell planner how much OS cache is available; reduces bad plans

For specific roles/queries (preferred over global changes):

-- Grant elevated work_mem only to the ETL role, not every OLTP connection
ALTER ROLE etl_user SET work_mem = '256MB';
ALTER ROLE etl_user SET temp_file_limit = '8GB';

-- Or per-session before a known heavy query
SET work_mem = '128MB';
SET temp_file_limit = '4GB';
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;

Identify the worst offenders before tuning:

-- Requires pg_stat_statements
SELECT query,
       calls,
       mean_exec_time,
       temp_blks_written * 8192 / 1024 / 1024 AS temp_mb_per_exec
FROM pg_stat_statements
ORDER BY temp_blks_written DESC
LIMIT 10;

Fix the top query first. A missing index on a sort key or a cartesian join from a bad predicate will spill regardless of work_mem.


💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.


Prevention in CI/CD

1. Alert Before Disk Fills — Prometheus + postgres_exporter

# prometheus-rules.yml
groups:
  - name: postgres_temp
    rules:
      - alert: PostgresTempFilesHigh
        expr: rate(pg_stat_database_temp_bytes[5m]) > 50000000  # 50MB/s spill rate
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High temp file write rate on {{ $labels.datname }}"

      - alert: PostgresTempDiskNearLimit
        expr: pg_stat_database_temp_bytes > 3e9  # 3GB cumulative
        for: 1m
        labels:
          severity: critical

2. Enforce Config Baselines with pgaudit + Checkov (IaC)

If you provision RDS or Cloud SQL via Terraform:

# Checkov will flag temp_file_limit = -1 as a misconfiguration
resource "aws_db_parameter_group" "postgres" {
  family = "postgres15"

  parameter {
    name  = "temp_file_limit"
    value = "4194304"  # 4GB in kB — never leave at -1 in production
  }

  parameter {
    name  = "work_mem"
    value = "65536"  # 64MB
  }
}

Run checkov -d . in your pipeline. Add a custom check:

# checkov/custom/postgres_temp_limit.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck

class PostgresTempFileLimitCheck(BaseResourceCheck):
    def __init__(self):
        super().__init__(
            name="Ensure temp_file_limit is not -1 (unlimited)",
            id="CKV_CUSTOM_PG_001",
            supported_resources=["aws_db_parameter_group"]
        )

    def scan_resource_conf(self, conf):
        for param in conf.get("parameter", []):
            if param.get("name") == "temp_file_limit":
                return param.get("value") != "-1"
        return True

3. Query Regression Gate in CI

Use pgTAP or a custom script to assert no plan uses excessive sorts:

# In your CI pipeline, after migrations
psql $DATABASE_URL -c "
  EXPLAIN (FORMAT JSON, ANALYZE false) $YOUR_CRITICAL_QUERY
" | jq '.[0]["Plan"] | .. | objects | select(."Node Type" == "Sort") | ."Sort Key"'
# Fail the build if unexpected full-table sorts appear

Related Diagnostics

"Part of the Performance Utility Matrix."

View all 219 Performance Tools →