How to Fix Nginx 504 Gateway Timeout: upstream timed out (110) with PHP-FPM
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause
TL;DR
- What broke: Nginx gave up waiting for PHP-FPM to respond. Either fastcgi_read_timeout is too short, PHP-FPM worker pools are exhausted (pm.max_children too low), or a slow PHP script/DB query is blocking all workers.
- How to fix it: Increase fastcgi_read_timeout in Nginx, tune pm.max_children / pm.max_spare_servers in the FPM pool, and profile the slow PHP script to eliminate the root blocking call.
- Shortcut: Use our Client-Side Sandbox below to auto-refactor your nginx.conf and www.conf: paste both configs and get a corrected diff without sending your hostnames or DB strings to any cloud backend.
The Incident (What Does the Error Mean?)
This is what hits your Nginx error log:
2024/07/15 03:42:17 [error] 18#18: *4821 upstream timed out (110: Connection timed out)
while connecting to upstream, client: 203.0.113.45, server: app.example.com,
request: "POST /api/checkout HTTP/1.1",
upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock",
host: "app.example.com"
Immediate consequence: Every request hitting that upstream returns a blank 504 to the end user. If this is a checkout, API, or auth endpoint, you are losing revenue and sessions in real time. Nginx did not crash — it is healthy. PHP-FPM is the bottleneck. The socket or TCP port exists, but no worker picked up the connection within fastcgi_connect_timeout, or a worker accepted it but did not respond within fastcgi_read_timeout.
Two distinct failure modes produce this identical error:
| Mode | Nginx log phrase | Cause |
|---|---|---|
| Connect timeout | while connecting to upstream | All FPM workers busy; queue full or socket backlog exceeded |
| Read timeout | while reading response header from upstream | Worker accepted request but PHP script ran past fastcgi_read_timeout |
Distinguish them before touching any config.
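To make that distinction mechanical, the sketch below tallies the two phrases from the table across an error log. It assumes each log entry sits on a single line (the excerpt above is wrapped for display). `classify_timeout` and `tally` are illustrative names, not part of Nginx.

```python
from collections import Counter
from typing import Iterable, Optional

# Phase phrases Nginx appends to an "upstream timed out (110: ...)" entry.
PHASES = {
    "while connecting to upstream": "connect",
    "while reading response header from upstream": "read",
}


def classify_timeout(line: str) -> Optional[str]:
    """Return 'connect', 'read', 'other', or None for non-timeout lines."""
    if "upstream timed out (110" not in line:
        return None
    for phrase, mode in PHASES.items():
        if phrase in line:
            return mode
    return "other"  # a 110 timeout logged in some other phase


def tally(lines: Iterable[str]) -> Counter:
    """Count timeout entries per failure mode, skipping unrelated lines."""
    modes = (classify_timeout(line) for line in lines)
    return Counter(mode for mode in modes if mode is not None)
```

Passing an open file handle for your error log to `tally` gives a quick count per mode. A log dominated by `connect` points at the pool, while one dominated by `read` points at slow scripts.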
The Attack Vector / Blast Radius
This is a cascading saturation failure, not an isolated slow request.
- Worker pool exhaustion: One slow query (e.g., a missing DB index, a synchronous third-party API call with no timeout) holds a PHP-FPM worker for 30–120 seconds. With pm.max_children = 5 (the Debian/Ubuntu default), five concurrent slow requests drain the entire pool. Request #6 hits the socket, gets EAGAIN or a full listen backlog, and Nginx times out at fastcgi_connect_timeout (default: 60s). The 504 storm begins.
- OOM-triggered worker recycling: If pm.max_requests is unset and a memory leak exists in your application, workers balloon in RSS. The kernel OOM killer terminates FPM workers mid-request. Nginx sees the upstream connection drop for those requests (usually logged as a reset or prematurely closed connection rather than 110), and while the master respawns workers, the requests queued behind them can still run into the same 110 timeout.
- Unix socket backlog overflow: The default listen.backlog = 511 in FPM is often overridden to -1 (OS default, typically 128 on Linux) by misconfigured pool files. Under burst traffic, the kernel drops new connection attempts silently. Nginx gets ETIMEDOUT (110) instead of ECONNREFUSED (111), making diagnosis harder.
- Security implication: A single unauthenticated endpoint that triggers an expensive operation (report generation, image processing, recursive DB query) becomes a low-effort application-layer DoS vector. An attacker sending 10 concurrent POST requests to /export/csv can pin all FPM workers and take down the entire vhost, including your login page.
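The saturation arithmetic behind the first point follows Little's law: busy workers ≈ arrival rate × the time each request holds a worker. A minimal sketch, assuming every request holds exactly one worker for the same average duration; the function names are illustrative.

```python
import math


def workers_needed(requests_per_second: float, avg_seconds: float) -> int:
    """Little's law: concurrent workers = arrival rate * average hold time."""
    return math.ceil(requests_per_second * avg_seconds)


def max_sustainable_rps(max_children: int, avg_seconds: float) -> float:
    """Highest arrival rate the pool absorbs before requests start queueing."""
    return max_children / avg_seconds


# Default pool of 5 workers, each blocked for 30 seconds per request:
print(round(max_sustainable_rps(5, 30), 3))  # 0.167 requests per second
# Workers required to serve 2 such requests per second without queueing:
print(workers_needed(2, 30))                 # 60
```

Under that assumption, a pool of 5 saturates at roughly one slow request every six seconds. This is why a handful of concurrent calls to one expensive endpoint is enough to stall the vhost.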
How to Fix It (The Solution)
Step 0: Confirm the failure mode before changing timeouts
# Check live FPM worker status (requires pm.status_path = /status in the pool
# and an Nginx location that passes /status to FPM)
curl -s 'http://127.0.0.1/status?full' | grep -E 'idle|active|max children reached'
# Count running FPM processes (master + workers)
watch -n1 'pgrep -c php-fpm'
# Tail FPM slow log (must be enabled; see the pool config below)
tail -f /var/log/php8.2-fpm-slow.log
# Check the listen queue: TCP listener on port 9000, or the Unix socket
ss -lnt | grep ':9000'
ss -lx | grep fpm
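If you would rather script the status check than eyeball grep output, the plain-text status page is a series of `key: value` lines. The sketch below parses it and reports how much of the currently spawned pool is busy. The sample text is a hypothetical page, and `parse_fpm_status` and `busy_ratio` are illustrative helpers.

```python
from typing import Dict, Union


def parse_fpm_status(text: str) -> Dict[str, Union[int, str]]:
    """Parse the plain-text FPM status page into a dict, ints where possible."""
    status: Dict[str, Union[int, str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        status[key.strip()] = int(value) if value.isdigit() else value
    return status


def busy_ratio(status) -> float:
    """Share of spawned workers that are currently handling a request."""
    total = status["active processes"] + status["idle processes"]
    return status["active processes"] / total if total else 0.0


# Hypothetical status output for a pool that has hit its ceiling.
SAMPLE = """pool:                 www
process manager:      dynamic
idle processes:       0
active processes:     5
total processes:      5
max children reached: 3
"""

status = parse_fpm_status(SAMPLE)
print(busy_ratio(status), status["max children reached"])  # 1.0 3
```

A ratio near 1.0 together with a non-zero "max children reached" value suggests the connect-timeout mode rather than a single slow script.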
Basic Fix — Nginx Timeout Directives
Only do this after confirming the PHP script legitimately needs more time (e.g., a batch job). Do NOT blindly raise timeouts to mask a broken pool config.
File: /etc/nginx/conf.d/app.conf or inside your location ~ \.php$ block
location ~ \.php$ {
include fastcgi_params;
fastcgi_pass unix:/run/php/php8.2-fpm.sock;
fastcgi_index index.php;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
- fastcgi_connect_timeout 60s;
- fastcgi_send_timeout 60s;
- fastcgi_read_timeout 60s;
+ fastcgi_connect_timeout 10s;
+ fastcgi_send_timeout 120s;
+ fastcgi_read_timeout 120s;
+ fastcgi_buffering on;
+ fastcgi_buffer_size 128k;
+ fastcgi_buffers 4 256k;
+ fastcgi_busy_buffers_size 256k;
}
Note: fastcgi_connect_timeout should be short (10s). If FPM can't accept a connection in 10 seconds, raising this to 120s just delays the 504; it does not fix the pool exhaustion. fastcgi_read_timeout is where legitimate slow scripts need headroom.
Enterprise Best Practice — PHP-FPM Pool Tuning
File: /etc/php/8.2/fpm/pool.d/www.conf
[www]
user = www-data
group = www-data
listen = /run/php/php8.2-fpm.sock
listen.owner = www-data
listen.group = www-data
-; pm = dynamic
-; pm.max_children = 5
-; pm.start_servers = 2
-; pm.min_spare_servers = 1
-; pm.max_spare_servers = 3
-; pm.max_requests = 0
+pm = dynamic
+; Formula: pm.max_children = (Total RAM - OS/other overhead) / avg PHP worker RSS
+; Example: (4096MB - 512MB) / 30MB per worker ≈ 119. Start conservative.
+pm.max_children = 50
+pm.start_servers = 10
+pm.min_spare_servers = 5
+pm.max_spare_servers = 20
+; Recycle workers after N requests to prevent memory leaks from accumulating
+pm.max_requests = 500
-; request_slowlog_timeout = 0
-; slowlog = /var/log/php8.2-fpm-slow.log
+; CRITICAL: Enable slow log to find the actual blocking script
+request_slowlog_timeout = 5s
+slowlog = /var/log/php8.2-fpm-slow.log
+request_terminate_timeout = 60s
-; listen.backlog = 511
+listen.backlog = 1024
+; Expose status endpoint for monitoring
+pm.status_path = /status
+ping.path = /ping
Calculate your actual pm.max_children safely:
# Check average RSS of current FPM workers in MB
ps --no-headers -o rss -C php-fpm8.2 | awk '{sum+=$1} END {printf "Avg RSS: %.0f MB\n", sum/NR/1024}'
# Check available RAM
free -m | awk '/^Mem:/{print "Available for FPM: " $7 " MB"}'
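Plugging those two measurements into the formula from the pool file comment can be sketched as follows. The `safety_factor` parameter is an assumption added here to leave headroom for traffic bursts and RSS growth; it is not part of the original formula.

```python
def estimate_max_children(total_mb: int, reserved_mb: int, worker_rss_mb: float,
                          safety_factor: float = 1.0) -> int:
    """(total RAM - reserved) / average worker RSS, scaled by safety_factor."""
    if worker_rss_mb <= 0:
        raise ValueError("worker_rss_mb must be positive")
    usable_mb = max(total_mb - reserved_mb, 0) * safety_factor
    return int(usable_mb // worker_rss_mb)


# The 4 GB example from the pool file: 512 MB reserved, 30 MB per worker.
print(estimate_max_children(4096, 512, 30))       # 119
# Same host, but only committing half of the usable memory to FPM.
print(estimate_max_children(4096, 512, 30, 0.5))  # 59
```

Treat the result as an upper bound. Starting below it and raising the value while watching memory is the safer path.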
File: /etc/php/8.2/fpm/php.ini (FPM pool override)
-max_execution_time = 30
-memory_limit = 128M
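+; Keep this <= request_terminate_timeout in www.conf.
+; On Linux it does not count time spent waiting on DB or network I/O, so request_terminate_timeout is the hard cap.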
+max_execution_time = 55
+memory_limit = 256M
+; Always set outbound timeouts to prevent workers blocking on external calls
+default_socket_timeout = 10
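The three timeouts set so far form a chain, and a small check can catch an inverted order before deploy. The first rule comes from the comment above. The second rule (FPM should terminate a runaway request before Nginx stops waiting, so the worker is freed) is an assumption of this sketch. All values are in seconds, and `check_timeout_chain` is an illustrative name.

```python
from typing import List


def check_timeout_chain(max_execution_time: int,
                        request_terminate_timeout: int,
                        fastcgi_read_timeout: int) -> List[str]:
    """Return ordering problems; an empty list means the chain is consistent."""
    problems = []
    if max_execution_time > request_terminate_timeout:
        problems.append("max_execution_time exceeds request_terminate_timeout")
    if request_terminate_timeout > fastcgi_read_timeout:
        problems.append("request_terminate_timeout exceeds fastcgi_read_timeout")
    return problems


# Values used in this guide: 55s (php.ini), 60s (www.conf), 120s (Nginx).
print(check_timeout_chain(55, 60, 120))  # []
```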
Apply and verify:
nginx -t && systemctl reload nginx
php-fpm8.2 --test && systemctl restart php8.2-fpm
# Verify worker count came up
curl -s 'http://127.0.0.1/status?json' | python3 -m json.tool
Prevention in CI/CD
1. Lint FPM pool configs in your pipeline
# .github/workflows/config-lint.yml
- name: Validate PHP-FPM pool config
  run: |
    php-fpm8.2 --test -y ./deploy/php-fpm/www.conf
    # Fail if pm.max_children is below threshold (commented-out lines are ignored)
    MAX=$(awk -F= '/^[[:space:]]*pm\.max_children[[:space:]]*=/ {gsub(/[[:space:]]/, "", $2); print $2}' ./deploy/php-fpm/www.conf | tail -n1)
    [ "${MAX:-0}" -ge 20 ] || { echo "pm.max_children too low: ${MAX:-unset}"; exit 1; }
2. Checkov / OPA policy for Nginx timeout hygiene
# checkov custom check: ensure fastcgi_read_timeout is set and >= 30s
import re
from checkov.common.models.enums import CheckResult

def check_fastcgi_timeout(config_content: str) -> CheckResult:
    # Reads the leading integer only, so the value is assumed to be in seconds ("120" or "120s")
    match = re.search(r'fastcgi_read_timeout\s+(\d+)', config_content)
    if not match or int(match.group(1)) < 30:
        return CheckResult.FAILED
    return CheckResult.PASSED
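The check above reads only the leading integer, so a value such as `2m` would be treated as 2 seconds and fail. A hedged sketch of a unit-aware parser follows, using the time suffixes Nginx documents (ms, s, m, h, d, w, M, y). `nginx_time_to_seconds` and `read_timeout_seconds` are illustrative helpers, not Checkov or Nginx APIs.

```python
import re
from typing import Optional

# Seconds per Nginx time unit; a bare number means seconds.
UNIT_SECONDS = {"ms": 0.001, "": 1, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "M": 2592000, "y": 31536000}
_TOKEN = re.compile(r"(\d+)\s*(ms|[smhdwMy])?")


def nginx_time_to_seconds(value: str) -> float:
    """Convert an Nginx time value such as '120', '2m' or '1h 30m' to seconds."""
    value = value.strip()
    if not re.fullmatch(r"(?:\d+\s*(?:ms|[smhdwMy])?\s*)+", value):
        raise ValueError(f"not an Nginx time value: {value!r}")
    return sum(int(n) * UNIT_SECONDS[unit] for n, unit in _TOKEN.findall(value))


def read_timeout_seconds(config: str) -> Optional[float]:
    """Return fastcgi_read_timeout in seconds, or None if it is not set."""
    match = re.search(r"^\s*fastcgi_read_timeout\s+([^;]+);", config, re.MULTILINE)
    return nginx_time_to_seconds(match.group(1)) if match else None
```

A policy check can then compare `read_timeout_seconds(config)` against 30 and treat `None` as a failure.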
3. Prometheus + Alertmanager — alert before pool exhaustion
# prometheus alert: fire at 80% worker saturation, not at 100%
- alert: PHPFPMWorkerSaturation
  expr: |
    (phpfpm_active_processes / phpfpm_max_children) > 0.80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "PHP-FPM pool {{ $labels.pool }} at {{ $value | humanizePercentage }} capacity"
    runbook: "https://wiki.internal/runbooks/php-fpm-504"
- alert: PHPFPMMaxChildrenReached
  # The counter is cumulative, so alert on recent growth rather than on any non-zero value
  expr: increase(phpfpm_max_children_reached_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
Use php-fpm_exporter to expose /status to Prometheus. The max children reached counter is the single most important signal: it increments every time the process manager needs another worker but has already hit pm.max_children, which is the direct precursor to a 504 storm.
4. Load test before deploy
# Use k6 to reproduce the saturation condition in staging
k6 run --vus 60 --duration 30s - <<'EOF'
import http from 'k6/http';

export default function () {
  http.post('https://staging.example.com/api/checkout', JSON.stringify({test: true}), {
    headers: { 'Content-Type': 'application/json' },
    timeout: '10s',
  });
}
EOF
# If p99 latency spikes and error rate > 1%, your pool config is undersized for prod traffic.
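That pass/fail rule can be applied to exported latency samples with a few lines of code. This is a sketch only: the p99 budget is a hypothetical threshold you would choose yourself, the 1% error rate comes from the comment above, and `p99` and `pool_undersized` are illustrative names.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def pool_undersized(latencies_ms: Sequence[float], failed: int,
                    p99_budget_ms: float, max_error_rate: float = 0.01) -> bool:
    """True if p99 exceeds the budget or the error rate exceeds the threshold."""
    error_rate = failed / len(latencies_ms)
    return p99(latencies_ms) > p99_budget_ms or error_rate > max_error_rate
```

Here `latencies_ms` holds one entry per request and `failed` counts the requests that returned an error or timed out.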
5. Terraform / Ansible: enforce config-as-code for FPM pool values
# terraform variable validation: prevent deploying with default pool sizes
variable "fpm_max_children" {
  type = number
  validation {
    condition     = var.fpm_max_children >= 20
    error_message = "pm.max_children must be >= 20. Default of 5 causes 504s under any real load."
  }
}