How to Fix PostgreSQL 'Lock Wait Timeout Exceeded' Error: Root Cause & Production Playbook
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–45 mins
TL;DR
- What broke: A PostgreSQL statement hit
lock_timeoutwaiting for another transaction to release a conflicting lock — common during long-running migrations, bulk updates, or connection pool pile-ups. - How to fix it: Identify the blocking PID via
pg_locks+pg_stat_activity, kill or wait out the blocker, then harden your queries withlock_timeout,SKIP LOCKED, or non-blocking DDL patterns. - Use our Client-Side Sandbox above to paste your failing migration or query and auto-generate the refactored, lock-safe version.
The Incident (What Does the Error Mean?)
Raw error output:
ERROR: canceling statement due to lock timeout
CONTEXT: while updating tuple (0,42) in relation "orders"
SQL state: 55P03
PostgreSQL enforces a lock_timeout — when a statement cannot acquire its required lock within the configured interval, the engine does not queue it indefinitely. It cancels the statement and throws 55P03. The transaction that held the lock is untouched. Your statement is dead. In a high-concurrency OLTP system or during a schema migration on a live table, this cascades: connection pool threads pile up retrying, your application's DB pool exhausts, and you get a full service outage within seconds.
The Attack Vector / Blast Radius
This is a cascading queue collapse, not a single-query failure.
- Long-running transaction (e.g., a report query, a
BEGINleft open by a crashed worker, or anALTER TABLE) holds anAccessExclusiveLockorRowExclusiveLockon a hot table. - Every subsequent
UPDATE,SELECT FOR UPDATE, or DDL against that table queues behind the blocker. - With a short
lock_timeout, those waiters die fast — but the application retries them, re-queuing more connections. - With no
lock_timeout(default0= wait forever), your connection pool fills entirely. New requests get "connection pool exhausted" errors. The database is effectively down for writes. - If the blocker is an
ALTER TABLE(e.g.,ADD COLUMN,ADD INDEX), it holdsAccessExclusiveLock— every single read and write on the table blocks. A 10-second migration on a 50M-row table during peak traffic is a P0 incident.
Blast radius: Write unavailability → connection pool exhaustion → application-level 500s → potential primary failover if health checks trip.
How to Fix It
Step 1: Identify the Blocker Right Now
Run this immediately on the primary:
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
blocking.state AS blocking_state,
now() - blocking.query_start AS blocking_duration
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
If blocking_duration is in minutes and blocking_state is idle in transaction — that process leaked an open transaction. Terminate it:
SELECT pg_terminate_backend(<blocking_pid>);
Basic Fix: Set lock_timeout and statement_timeout Explicitly
Never rely on the server default (0 = infinite wait). Set at the session or transaction level:
- -- No timeout configured; statement waits forever
- UPDATE orders SET status = 'processed' WHERE id = $1;
+ -- Fail fast; let the application retry with backoff
+ SET LOCAL lock_timeout = '2s';
+ SET LOCAL statement_timeout = '10s';
+ UPDATE orders SET status = 'processed' WHERE id = $1;
Set globally in postgresql.conf or via your RDS/Aurora parameter group as a baseline:
- lock_timeout = 0
- statement_timeout = 0
+ lock_timeout = '5s'
+ statement_timeout = '30s'
Enterprise Best Practice: Non-Blocking Patterns
Pattern 1 — Queue workers with SKIP LOCKED (job queues, order processing):
- -- Causes thundering herd: all workers fight for the same rows
- SELECT * FROM job_queue
- WHERE status = 'pending'
- ORDER BY created_at
- LIMIT 10
- FOR UPDATE;
+ -- Each worker atomically claims non-contended rows only
+ SELECT * FROM job_queue
+ WHERE status = 'pending'
+ ORDER BY created_at
+ LIMIT 10
+ FOR UPDATE SKIP LOCKED;
Pattern 2 — Zero-downtime DDL migrations (the #1 cause of AccessExclusiveLock incidents):
- -- Blocks ALL reads and writes for the duration of the index build
- ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
- CREATE INDEX idx_orders_status ON orders(status);
+ -- Non-blocking index creation; table remains fully live
+ ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ; -- fast, metadata-only in PG 11+
+ CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);
+ -- Note: CONCURRENTLY cannot run inside a transaction block
Pattern 3 — Advisory locks for application-level mutual exclusion:
- -- Two cron jobs updating the same resource, causing row lock contention
- UPDATE account_balances SET balance = balance - $1 WHERE account_id = $2;
+ -- Acquire advisory lock scoped to the account ID before touching the row
+ SELECT pg_advisory_xact_lock($2); -- blocks at app level, not row level
+ UPDATE account_balances SET balance = balance - $1 WHERE account_id = $2;
+ -- Advisory lock auto-releases at transaction end
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Migration safety gate — squawk linter in your pipeline:
# .github/workflows/db-migrate.yml
- name: Lint migration for lock hazards
run: |
pip install squawk-cli
squawk migrations/$(date +%Y%m%d)_*.sql
# Fails CI on: ADD COLUMN with DEFAULT (PG<11), CREATE INDEX without CONCURRENTLY,
# ALTER TABLE rewrites, DROP COLUMN, etc.
2. Enforce lock_timeout at the connection pool level (PgBouncer / application DSN):
- DATABASE_URL=postgresql://user:pass@host/db
+ DATABASE_URL=postgresql://user:pass@host/db?options=-c%20lock_timeout%3D3s%20-c%20statement_timeout%3D20s
3. Alerting — fire before users feel it:
-- Prometheus query (postgres_exporter) or scheduled alert
SELECT count(*) FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
AND now() - query_start > interval '5 seconds';
-- Alert threshold: > 3 sessions waiting on locks for > 5s = PagerDuty page
4. Checkov / Terraform guard for RDS parameter groups:
- # No lock_timeout enforced at infrastructure level
- resource "aws_db_parameter_group" "pg" {
- family = "postgres15"
- }
+ resource "aws_db_parameter_group" "pg" {
+ family = "postgres15"
+ parameter {
+ name = "lock_timeout"
+ value = "5000" # milliseconds
+ }
+ parameter {
+ name = "statement_timeout"
+ value = "30000"
+ }
+ }
5. Load test migrations in a production-clone environment (using pg_dump + RDS snapshot restore) before running against primary. A 200ms migration on staging against 10K rows becomes a 45-second lock on production with 80M rows.