How to Fix PostgreSQL 'Out of Memory: Failed on Request of Size X' in Memory Context
Threat/Impact Level: CRITICAL | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: A PostgreSQL backend process attempted to allocate a chunk larger than its assigned memory context ceiling — query dies with
ERROR: out of memory, and under connection pooling this cascades into a failure storm. - How to fix it: Tune
work_mem,max_stack_depth, and rewrite the offending query to avoid unbounded sort/hash operations or leaking PL/pgSQL SPI contexts. - Use the Client-Side Sandbox above to paste your
postgresql.confand failing query — it auto-generates the refactored config and rewritten SQL without sending your data anywhere.
The Incident (What Does the Error Mean?)
Raw error from PostgreSQL logs:
ERROR: out of memory
DETAIL: Failed on request of size 1073741824 in memory context "ExecutorState".
or in PL/pgSQL:
ERROR: out of memory
DETAIL: Failed on request of size 524288000 in memory context "SPI Proc".
PostgreSQL memory contexts are per-backend, hierarchical allocators. Each query spawns child contexts (ExecutorState, SortState, HashBatch, SPI Proc, etc.) under the top-level TopMemoryContext. When a single palloc() call inside a context cannot be satisfied — either because the OS refused the allocation or the context's own ceiling is hit — Postgres throws this error and aborts the transaction immediately. The backend process itself survives, but the query is dead. Under PgBouncer or high-concurrency workloads, dozens of backends hitting this simultaneously will saturate your connection pool and cause a cascading outage.
The Attack Vector / Blast Radius
This is not a soft degradation. The failure chain is:
- Single large query (unbounded sort, hash join on a fat table,
IN (...)with 500k values, or a PL/pgSQL loop that never frees SPI context) requests a multi-hundred-MB palloc insideExecutorState. - OS-level OOM (
vm.overcommit_memorybehavior) either refuses the allocation outright or the kernel OOM killer fires and shoots the postmaster — taking down all active connections on that instance. - Connection pool exhaustion: PgBouncer retries dead connections; application threads pile up waiting; upstream services time out. A single bad query on a shared RDS or Cloud SQL instance can take down unrelated microservices sharing the same Postgres endpoint.
- Replication lag spike: If the primary is OOM-killed, standby promotion takes 30–120 seconds. Any writes during that window are lost if synchronous_commit is off.
The highest-risk patterns that trigger this:
work_memset globally too high × highmax_connections(e.g.,work_mem=512MB× 200 connections = 100 GB theoretical peak)SELECT * FROM large_table ORDER BY unindexed_col— full sort inExecutorState- Recursive CTEs without
LIMITor cycle detection - PL/pgSQL functions with
PERFORMloops that accumulate SPI sub-contexts withoutCOMMITor explicit context resets array_agg()orstring_agg()over millions of rows
How to Fix It (The Solution)
Basic Fix — Tune work_mem and Add a Query Safeguard
The most common root cause is work_mem being set too high globally or a query needing more than available. The fix is to scope work_mem per-session for heavy queries and add a statement-level memory guard.
# postgresql.conf
- work_mem = 512MB # Catastrophic at high concurrency
+ work_mem = 16MB # Safe global default
- max_connections = 200
+ max_connections = 100 # Reduce peak memory ceiling
+ # Add statement timeout as a circuit breaker
+ statement_timeout = '120s'
For a specific heavy analytical query, override per-session instead of globally:
- SET work_mem = '1GB'; -- set globally in app connection string
- SELECT * FROM orders ORDER BY created_at;
+ -- Scope only to this transaction
+ BEGIN;
+ SET LOCAL work_mem = '256MB';
+ SELECT * FROM orders ORDER BY created_at LIMIT 10000;
+ COMMIT;
Enterprise Best Practice — PL/pgSQL Context Leak Fix + Connection-Level Guardrails
If the context is SPI Proc, you have a leaking PL/pgSQL function. The fix is explicit subtransaction management and chunked processing:
-- Leaking function pattern
CREATE OR REPLACE FUNCTION process_all_events() RETURNS void AS $$
DECLARE
rec RECORD;
BEGIN
- FOR rec IN SELECT * FROM events LOOP -- loads all rows into SPI context
- PERFORM heavy_transform(rec);
- END LOOP;
END;
$$ LANGUAGE plpgsql;
+ -- Fixed: chunk with LIMIT/OFFSET and commit per batch to release SPI context
CREATE OR REPLACE FUNCTION process_all_events() RETURNS void AS $$
DECLARE
+ batch_size INT := 1000;
+ offset_val INT := 0;
+ rows_processed INT;
BEGIN
+ LOOP
+ WITH batch AS (
+ SELECT * FROM events ORDER BY id LIMIT batch_size OFFSET offset_val
+ )
+ SELECT COUNT(*) INTO rows_processed FROM (
+ SELECT heavy_transform(rec) FROM batch
+ ) t;
+ EXIT WHEN rows_processed < batch_size;
+ offset_val := offset_val + batch_size;
+ PERFORM pg_sleep(0); -- yield, allows context GC
+ END LOOP;
END;
$$ LANGUAGE plpgsql;
For RDS/Cloud SQL, enforce memory guardrails at the parameter group level:
# RDS Parameter Group (or Aurora cluster parameters)
- work_mem = 0 # 0 = use engine default (dangerously high on large instances)
+ work_mem = 16384 # 16MB in kB
+ temp_file_limit = 10485760 # 10GB max temp spill per session — prevents unbounded disk use too
+ huge_pages = try # Reduce TLB pressure on large work_mem workloads
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Enforce work_mem caps in Terraform/Pulumi IaC:
# terraform/rds_parameter_group.tf
resource "aws_db_parameter_group" "postgres" {
family = "postgres15"
- # No work_mem override — inherits instance default
+ parameter {
+ name = "work_mem"
+ value = "16384" # 16MB — enforce in code review
+ }
+ parameter {
+ name = "temp_file_limit"
+ value = "10485760"
+ }
}
2. Checkov policy to block high work_mem in IaC PRs:
# checkov custom check: check_work_mem.py
from checkov.common.models.enums import CheckResult
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
class WorkMemCheck(BaseResourceCheck):
def __init__(self):
super().__init__("Ensure work_mem <= 64MB", "CKV_CUSTOM_PG_001",
["aws_db_parameter_group"])
def scan_resource_conf(self, conf):
for param in conf.get("parameter", []):
if param.get("name") == ["work_mem"]:
if int(param["value"][0]) > 65536: # 64MB in kB
return CheckResult.FAILED
return CheckResult.PASSED
3. pg_stat_statements monitoring alert (Datadog / Prometheus):
-- Identify top memory-pressure queries before they OOM
SELECT query, calls, mean_exec_time, stddev_exec_time,
rows / calls AS avg_rows
FROM pg_stat_statements
WHERE rows / NULLIF(calls, 0) > 100000 -- queries returning 100k+ rows avg
ORDER BY mean_exec_time DESC
LIMIT 20;
Wire this query into your monitoring pipeline as a scheduled job. Alert when avg_rows > 50000 on any single statement — that's your early warning before the OOM fires in production.
4. Load test gate in CI: Run pgbench or a custom k6 Postgres script against a staging clone with production-scale data before merging schema changes or new analytical queries. Fail the pipeline if any query allocates > 256MB per EXPLAIN (ANALYZE, BUFFERS).