How to Fix PostgreSQL Autovacuum Blocking DDL: Resolving 'canceling autovacuum task' Errors
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15 mins
TL;DR
- What broke: Autovacuum held a
ShareUpdateExclusiveLockon a table; your DDL (ALTER TABLE,CREATE INDEX, etc.) needed a conflictingAccessExclusiveLock— PostgreSQL canceled autovacuum to let DDL proceed, but on large tables this repeats indefinitely, blocking your migration. - How to fix it: Tune
autovacuum_vacuum_cost_delayto make autovacuum finish faster, uselock_timeouton your DDL session, and schedule DDL during low-traffic windows withstatement_timeoutguards. - Shortcut: Use our Client-Side Sandbox above to paste your
postgresql.confor failing migration SQL — it will auto-generate the corrected config and DDL with proper lock timeout guards.
The Incident (What does the error mean?)
Raw log output from postgresql.log:
2024-01-15 03:42:17 UTC [12345]: LOG: automatic vacuum of table "prod_db.public.orders": index scans: 1
2024-01-15 03:42:31 UTC [12346]: ERROR: canceling autovacuum task
2024-01-15 03:42:31 UTC [12346]: CONTEXT: while vacuuming relation "public.orders"
2024-01-15 03:42:31 UTC [99871]: LOG: process 99871 acquired AccessExclusiveLock on relation 16432 after 32847 ms
What just happened: your ALTER TABLE orders ADD COLUMN ... sat in lock wait for 32 seconds, PostgreSQL's lock manager detected the autovacuum conflict, killed the autovacuum worker, and granted your DDL the lock. On a 500M-row table, autovacuum restarts from scratch every time this happens. If your deployment pipeline fires DDL repeatedly, autovacuum never completes — dead tuple bloat accumulates, query plans degrade, and you march toward transaction ID wraparound (forced shutdown at ~2.1B XIDs).
The Attack Vector / Blast Radius
This is not a one-time hiccup. The cascading failure path:
- DDL migration runs → autovacuum canceled → dead tuples accumulate on
orders. - Table bloat grows → sequential scans read more 8KB pages → query latency increases 3-10x.
pg_stat_user_tables.n_dead_tupspikes → autovacuum re-triggers constantly → I/O contention on shared buffers.age(relfrozenxid)on the table climbs. At 200M XIDs from wraparound limit, PostgreSQL enters autovacuum emergency mode (ignoresautovacuum_vacuum_cost_delay). At the limit: forced read-only shutdown of the entire cluster.- In RDS/Aurora environments: AWS will force-restart your instance with zero warning once wraparound is imminent.
Check your current wraparound risk right now:
SELECT relname,
age(relfrozenxid) AS xid_age,
2100000000 - age(relfrozenxid) AS xids_remaining
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
If xid_age exceeds 1.5 billion on any table, treat this as a P0 incident.
How to Fix It (The Solution)
Basic Fix — Add lock_timeout to Your DDL Session
Stop your DDL from holding an indefinite lock wait that starves autovacuum. Set a timeout, catch the failure, retry with backoff.
-- Migration script (Flyway/Liquibase/raw psql)
- ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
+ SET lock_timeout = '2s';
+ SET statement_timeout = '30s';
+ ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
If the DDL fails with ERROR: canceling statement due to lock timeout, your migration runner retries after a backoff interval instead of holding the lock and starving autovacuum repeatedly.
Enterprise Best Practice — Zero-Downtime DDL + Autovacuum Tuning
Step 1: Make autovacuum finish faster on hot tables (postgresql.conf or per-table storage params):
# postgresql.conf — global tuning
- autovacuum_vacuum_cost_delay = 20ms # default: too slow for large tables
- autovacuum_vacuum_scale_factor = 0.2 # triggers at 20% dead tuples
+ autovacuum_vacuum_cost_delay = 2ms # faster vacuum passes
+ autovacuum_vacuum_scale_factor = 0.01 # trigger earlier on large tables
+ autovacuum_vacuum_cost_limit = 800 # allow more I/O budget per round
Step 2: Per-table override for your critical tables (no restart required):
- -- No per-table autovacuum config (inherits slow global defaults)
+ ALTER TABLE orders SET (
+ autovacuum_vacuum_scale_factor = 0.01,
+ autovacuum_vacuum_cost_delay = 2,
+ autovacuum_analyze_scale_factor = 0.005
+ );
Step 3: Use pg_repack or CREATE INDEX CONCURRENTLY for heavy DDL to avoid AccessExclusiveLock entirely:
- CREATE INDEX idx_orders_user_id ON orders(user_id);
+ CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
-- CONCURRENTLY uses ShareUpdateExclusiveLock — compatible with autovacuum.
-- Cannot run inside a transaction block. Run standalone.
Step 4: For ALTER TABLE that requires AccessExclusiveLock, use the retry loop pattern:
DO $$
DECLARE
retries INT := 0;
BEGIN
LOOP
BEGIN
SET LOCAL lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
EXIT; -- success
EXCEPTION WHEN lock_not_available THEN
retries := retries + 1;
IF retries > 10 THEN RAISE; END IF;
PERFORM pg_sleep(5 * retries); -- exponential-ish backoff
END;
END LOOP;
END;
$$;
💡 Tired of pasting proprietary configs into ChatGPT? Generic AI tools log your company's ARNs, DB strings, and private keys. StackEngine is a zero-backend, pure Client-Side WASM utility. Drop your failing config into the sandbox above. We redact your secrets locally in the browser and auto-generate the refactored code using your own API key.
Prevention in CI/CD
1. Block dangerous DDL patterns in migration linters.
Use squawk — a PostgreSQL migration linter — in your CI pipeline:
# .github/workflows/db-lint.yml
- name: Lint migrations
run: |
squawk migrations/*.sql \
--exclude=prefer-text-field \
--pg-version=14.0
# Squawk flags: missing CONCURRENTLY, missing lock_timeout, adding NOT NULL without default
2. Assert lock_timeout is set before every migration run in your Flyway/Liquibase beforeMigrate callback:
-- flyway/callbacks/beforeMigrate.sql
SET lock_timeout = '5s';
SET statement_timeout = '120s';
SET idle_in_transaction_session_timeout = '30s';
3. Monitor wraparound headroom in Prometheus/Datadog:
-- Expose as a custom metric
SELECT max(age(relfrozenxid)) AS max_xid_age
FROM pg_class
WHERE relkind = 'r';
-- Alert threshold: > 1,000,000,000 = WARNING. > 1,500,000,000 = CRITICAL.
4. Terraform RDS module — enforce autovacuum parameters at provisioning time:
resource "aws_db_parameter_group" "postgres" {
family = "postgres14"
+ parameter {
+ name = "autovacuum_vacuum_cost_delay"
+ value = "2"
+ }
+ parameter {
+ name = "autovacuum_vacuum_scale_factor"
+ value = "0.01"
+ }
+ parameter {
+ name = "autovacuum_vacuum_cost_limit"
+ value = "800"
+ }
}