How to Fix Nginx 504 Gateway Timeout: upstream timed out (110) with PHP-FPM

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on root cause


TL;DR

  • What broke: Nginx gave up waiting for PHP-FPM to respond. Either fastcgi_read_timeout is too short, PHP-FPM worker pools are exhausted (pm.max_children too low), or a slow PHP script/DB query is blocking all workers.
  • How to fix it: Increase fastcgi_read_timeout in Nginx, tune pm.max_children / pm.max_spare_servers in the FPM pool, and profile the slow PHP script to eliminate the root blocking call.

The Incident (What Does the Error Mean?)

This is what hits your Nginx error log:

2024/07/15 03:42:17 [error] 18#18: *4821 upstream timed out (110: Connection timed out)
  while connecting to upstream, client: 203.0.113.45, server: app.example.com,
  request: "POST /api/checkout HTTP/1.1",
  upstream: "fastcgi://unix:/run/php/php8.2-fpm.sock",
  host: "app.example.com"

Immediate consequence: Every request hitting that upstream returns a blank 504 to the end user. If this is a checkout, API, or auth endpoint, you are losing revenue and sessions in real time. Nginx did not crash — it is healthy. PHP-FPM is the bottleneck. The socket or TCP port exists, but no worker picked up the connection within fastcgi_connect_timeout, or a worker accepted it but did not respond within fastcgi_read_timeout.

Two distinct failure modes produce this identical error:

Mode             | Nginx log phrase                             | Cause
Connect timeout  | while connecting to upstream                 | All FPM workers busy; queue full or socket backlog exceeded
Read timeout     | while reading response header from upstream  | Worker accepted request but PHP script ran past fastcgi_read_timeout

Distinguish them before touching any config.
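
To make that distinction quickly on a busy log, a small script can count 110 timeouts by the phrase Nginx appends. This is a minimal sketch, not a definitive tool: the function names are hypothetical, and it assumes each log entry sits on a single line, as Nginx writes it (the sample above is wrapped only for readability).

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Phrases Nginx appends to "upstream timed out (110: ...)" entries.
CONNECT_PHRASE = "while connecting to upstream"
READ_PHRASE = "while reading response header from upstream"

TIMEOUT_RE = re.compile(r"upstream timed out \(110: [^)]*\)")


def classify_timeout(line: str) -> Optional[str]:
    """Return 'connect', 'read', or 'other'; None if the line is not a 110 timeout."""
    if not TIMEOUT_RE.search(line):
        return None
    if CONNECT_PHRASE in line:
        return "connect"
    if READ_PHRASE in line:
        return "read"
    return "other"


def summarize(lines: Iterable[str]) -> Counter:
    """Count timeout entries by failure mode, skipping unrelated log lines."""
    counts: Counter = Counter()
    for line in lines:
        mode = classify_timeout(line)
        if mode is not None:
            counts[mode] += 1
    return counts
```

Feeding it an open file handle, for example `summarize(open("/var/log/nginx/error.log", errors="replace"))`, gives a per-mode count. Mostly "connect" points at the pool or backlog; mostly "read" points at slow scripts.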


The Attack Vector / Blast Radius

This is a cascading saturation failure, not an isolated slow request.

  1. Worker pool exhaustion: One slow query (e.g., a missing DB index, a synchronous third-party API call with no timeout) holds a PHP-FPM worker for 30–120 seconds. With pm.max_children = 5 (the Debian/Ubuntu default), five concurrent slow requests drain the entire pool. Request #6 waits in the socket's listen backlog with no worker to serve it; once the backlog is full, new connections cannot even be queued. Nginx gives up at fastcgi_read_timeout or fastcgi_connect_timeout (both default to 60s), and the 504 storm begins.

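  2. OOM-triggered worker recycling: If pm.max_requests is unset and a memory leak exists in your application, workers balloon in RSS. The kernel OOM killer terminates FPM workers mid-request. Nginx usually logs the killed request as a reset or prematurely closed upstream connection (a 502), while the shrunken pool pushes the requests queued behind it into the same 110 timeout.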

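  3. Socket backlog overflow: FPM's listen.backlog defaults to 511 on Linux, but the kernel caps the effective value at net.core.somaxconn (128 on kernels before 5.4), and misconfigured pool files often override it to -1, which falls back to the OS limit. Under burst traffic on a TCP upstream, the kernel silently drops new connection attempts, so Nginx gets ETIMEDOUT (110) instead of ECONNREFUSED (111), making diagnosis harder. On a Unix socket, a full backlog typically surfaces as EAGAIN (11: Resource temporarily unavailable) and a 502 instead.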

  4. Security implication: A single unauthenticated endpoint that triggers an expensive operation (report generation, image processing, recursive DB query) becomes a low-effort application-layer DoS vector. An attacker sending 10 concurrent POST requests to /export/csv can pin all FPM workers and take down the entire vhost — including your login page.
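
The arithmetic behind that saturation can be sketched with Little's law: the average number of busy workers equals the arrival rate multiplied by the time each request holds a worker. The function names and the traffic figures in the usage note are illustrative assumptions, not measurements.

```python
import math
from typing import Optional


def workers_needed(arrival_rate_rps: float, service_time_s: float) -> int:
    """Little's law: busy workers = arrival rate x time each request holds a worker."""
    return math.ceil(arrival_rate_rps * service_time_s)


def seconds_until_backlog_full(
    arrival_rate_rps: float,
    service_time_s: float,
    max_children: int,
    backlog: int,
) -> Optional[float]:
    """Rough time for the listen backlog to fill once every worker is busy.

    Returns None when the pool completes requests at least as fast as they arrive.
    """
    drain_rate = max_children / service_time_s  # requests/s the pool can finish
    surplus = arrival_rate_rps - drain_rate     # requests/s piling up in the queue
    if surplus <= 0:
        return None
    return backlog / surplus
```

For example, `workers_needed(2, 30)` returns 60: two requests per second against a 30-second endpoint need 60 workers, twelve times the default pool of 5. With that pool and a backlog of 511, `seconds_until_backlog_full(2, 30, 5, 511)` comes out at roughly 279 seconds, although requests at the back of the queue will have hit the Nginx timeout long before then.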


How to Fix It (The Solution)

Step 0: Confirm the failure mode before changing timeouts

# Check live FPM worker status (requires pm.status_path = /status in the pool
# and an Nginx location that passes /status to FPM)
curl -s 'http://127.0.0.1/status?full' | grep -E 'idle|active|max children reached'

# Count FPM processes (master + workers) to see whether the pool is at its ceiling
watch -n1 'pgrep -c php-fpm'

# Tail FPM slow log (must be enabled; see fix below)
tail -f /var/log/php8.2-fpm-slow.log

# Check listen queue depth: Recv-Q is the current queue, Send-Q the backlog limit
ss -lnt | grep 9000   # TCP listener
ss -lnx | grep fpm    # Unix socket listener
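
If the status page is reachable, its JSON form (`/status?json`) is easier to interpret programmatically than grep output. The following is a sketch under assumptions: the function name is hypothetical, and the field names are those the FPM status page emits in JSON mode, which are worth verifying against your PHP version.

```python
import json


def diagnose_pool(status_json: str) -> list:
    """Turn one FPM status snapshot (JSON text) into human-readable findings."""
    status = json.loads(status_json)
    findings = []

    active = status["active processes"]
    total = status["total processes"]
    if total > 0 and active == total:
        findings.append(f"all {total} workers are busy")

    # Connections accepted by the kernel but not yet picked up by a worker.
    queued = status.get("listen queue", 0)
    if queued > 0:
        findings.append(f"{queued} requests waiting in the listen queue")

    # Cumulative since FPM start: how often the pool hit pm.max_children.
    reached = status.get("max children reached", 0)
    if reached > 0:
        findings.append(f"pm.max_children limit hit {reached} times since start")

    # Cumulative count of requests that exceeded request_slowlog_timeout.
    slow = status.get("slow requests", 0)
    if slow > 0:
        findings.append(f"{slow} requests exceeded request_slowlog_timeout")

    return findings
```

Pass it the body returned by `curl -s 'http://127.0.0.1/status?json'`. An empty list means the snapshot shows no pressure; a busy pool plus a non-zero listen queue points at pool exhaustion rather than a single slow script.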

Basic Fix — Nginx Timeout Directives

Only do this after confirming the PHP script legitimately needs more time (e.g., a batch job). Do NOT blindly raise timeouts to mask a broken pool config.

File: /etc/nginx/conf.d/app.conf or inside your location ~ \.php$ block

 location ~ \.php$ {
     include        fastcgi_params;
     fastcgi_pass   unix:/run/php/php8.2-fpm.sock;
     fastcgi_index  index.php;
     fastcgi_param  SCRIPT_FILENAME $document_root$fastcgi_script_name;

-    fastcgi_connect_timeout  60s;
-    fastcgi_send_timeout     60s;
-    fastcgi_read_timeout     60s;
+    fastcgi_connect_timeout  10s;
+    fastcgi_send_timeout     120s;
+    fastcgi_read_timeout     120s;
+    fastcgi_buffering        on;
+    fastcgi_buffer_size      128k;
+    fastcgi_buffers          4 256k;
+    fastcgi_busy_buffers_size 256k;
 }

Note: fastcgi_connect_timeout should be short (10s). If FPM can't accept a connection in 10 seconds, raising this to 120s just delays the 504 — it does not fix the pool exhaustion. fastcgi_read_timeout is where legitimate slow scripts need headroom.


Enterprise Best Practice — PHP-FPM Pool Tuning

File: /etc/php/8.2/fpm/pool.d/www.conf

 [www]
 user = www-data
 group = www-data
 listen = /run/php/php8.2-fpm.sock
 listen.owner = www-data
 listen.group = www-data

-; pm = dynamic
-; pm.max_children = 5
-; pm.start_servers = 2
-; pm.min_spare_servers = 1
-; pm.max_spare_servers = 3
-; pm.max_requests = 0
+pm = dynamic
+; Formula: pm.max_children = (Total RAM - OS/other overhead) / avg PHP worker RSS
+; Example: (4096MB - 512MB) / 30MB per worker ≈ 119. Start conservative.
+pm.max_children = 50
+pm.start_servers = 10
+pm.min_spare_servers = 5
+pm.max_spare_servers = 20
+; Recycle workers after N requests to prevent memory leaks from accumulating
+pm.max_requests = 500

-; request_slowlog_timeout = 0
-; slowlog = /var/log/php8.2-fpm-slow.log
+; CRITICAL: Enable slow log to find the actual blocking script
+request_slowlog_timeout = 5s
+slowlog = /var/log/php8.2-fpm-slow.log
+request_terminate_timeout = 60s

-; listen.backlog = 511
+listen.backlog = 1024

+; Expose status endpoint for monitoring
+pm.status_path = /status
+ping.path = /ping

Calculate your actual pm.max_children safely:

# Check average RSS of current FPM workers in MB
ps --no-headers -o rss -C php-fpm8.2 | awk '{sum+=$1} END {if (NR) printf "Avg RSS: %.0f MB\n", sum/NR/1024; else print "No php-fpm8.2 processes found"}'

# Check available RAM
free -m | awk '/^Mem:/{print "Available for FPM: " $7 " MB"}'
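
With those two numbers in hand, the sizing formula from the pool file can be applied directly. This is a minimal sketch: the function name and the `headroom` safety factor are assumptions added for illustration, and the 0.8 default is a conservative starting point rather than a recommendation from PHP-FPM itself.

```python
def fpm_max_children(
    total_ram_mb: float,
    reserved_mb: float,
    avg_worker_rss_mb: float,
    headroom: float = 0.8,
) -> int:
    """(total RAM - reserved) / average worker RSS, scaled by a safety factor.

    headroom < 1.0 leaves slack for workers that grow beyond the average.
    """
    if avg_worker_rss_mb <= 0:
        raise ValueError("avg_worker_rss_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0:
        raise ValueError("reserved_mb leaves no memory for FPM")
    return max(1, int(usable_mb * headroom // avg_worker_rss_mb))
```

With the example figures (4096 MB total, 512 MB reserved, 30 MB per worker), `fpm_max_children(4096, 512, 30, headroom=1.0)` gives 119, and the default headroom of 0.8 gives 95. Starting at 50, as the pool file above does, stays well inside either bound.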

File: /etc/php/8.2/fpm/php.ini (FPM pool override)

-max_execution_time = 30
-memory_limit = 128M
+; Match request_terminate_timeout in www.conf — must be <= that value
+max_execution_time = 55
+memory_limit = 256M
+; Always set outbound timeouts to prevent workers blocking on external calls
+default_socket_timeout = 10
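
Because these timeouts live in three different files, it is easy for them to drift out of order. The sketch below checks the chain in one place. The class and function names are hypothetical. The first rule comes from the comment above; the second (FPM should terminate a request before Nginx stops waiting for it) and the third (a connect timeout above 10s only delays the 504) are assumptions based on this guide's recommendations rather than hard requirements.

```python
from dataclasses import dataclass


@dataclass
class Timeouts:
    """All values in seconds."""
    max_execution_time: int         # php.ini
    request_terminate_timeout: int  # FPM pool (www.conf)
    fastcgi_read_timeout: int       # Nginx
    fastcgi_connect_timeout: int    # Nginx


def check_timeout_chain(t: Timeouts) -> list:
    """Return a list of ordering problems; an empty list means the chain is consistent."""
    problems = []
    if t.max_execution_time > t.request_terminate_timeout:
        problems.append("max_execution_time exceeds request_terminate_timeout")
    # Assumption: FPM should kill a runaway request before Nginx gives up on it.
    if t.request_terminate_timeout > t.fastcgi_read_timeout:
        problems.append("request_terminate_timeout exceeds fastcgi_read_timeout")
    # Assumption: a long connect timeout only delays the 504.
    if t.fastcgi_connect_timeout > 10:
        problems.append("fastcgi_connect_timeout is longer than 10s")
    return problems
```

The values used in this guide, `Timeouts(55, 60, 120, 10)`, pass with no problems reported.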

Apply and verify:

nginx -t && systemctl reload nginx
php-fpm8.2 --test && systemctl restart php8.2-fpm

# Verify worker count came up
curl -s 'http://127.0.0.1/status?json' | python3 -m json.tool



Prevention in CI/CD

1. Lint FPM pool configs in your pipeline

# .github/workflows/config-lint.yml
- name: Validate PHP-FPM pool config
  run: |
    php-fpm8.2 --test -y ./deploy/php-fpm/www.conf
    # Fail if pm.max_children is below threshold (commented-out lines are ignored)
    MAX=$(grep -E '^[[:space:]]*pm\.max_children[[:space:]]*=' ./deploy/php-fpm/www.conf | tail -n1 | awk -F= '{print $2}' | tr -d ' ')
    [ "${MAX:-0}" -ge 20 ] || { echo "pm.max_children too low: ${MAX:-unset}"; exit 1; }

2. Checkov / OPA policy for Nginx timeout hygiene

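# Standalone helper that reuses Checkov's CheckResult enum; it is not a
# registered Checkov check. Ensures fastcgi_read_timeout is set and >= 30s.
import re
from checkov.common.models.enums import CheckResult

UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}

def check_fastcgi_timeout(config_content: str) -> CheckResult:
    # Accept plain seconds or a single s/m/h suffix (120, 120s, 2m); skip commented-out lines
    match = re.search(r'^\s*fastcgi_read_timeout\s+(\d+)([smh]?)\s*;', config_content, re.MULTILINE)
    if not match or int(match.group(1)) * UNIT_SECONDS[match.group(2)] < 30:
        return CheckResult.FAILED
    return CheckResult.PASSED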

3. Prometheus + Alertmanager — alert before pool exhaustion

# prometheus alert: fire at 80% worker saturation, not at 100%
- alert: PHPFPMWorkerSaturation
  expr: |
    (phpfpm_active_processes / phpfpm_max_children) > 0.80
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "PHP-FPM pool {{ $labels.pool }} at {{ $value | humanizePercentage }} capacity"
    runbook: "https://wiki.internal/runbooks/php-fpm-504"

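- alert: PHPFPMMaxChildrenReached
  # The counter never decreases, so alert on recent growth rather than on its absolute value
  expr: increase(phpfpm_max_children_reached_total[5m]) > 0
  for: 1m
  labels:
    severity: critical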

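Use php-fpm_exporter to expose /status to Prometheus; metric names differ between exporters, so match the expressions above to what yours actually emits. The max children reached counter is the single most important signal: it increments each time the pool hits pm.max_children and cannot spawn another worker, which is the direct precursor to a 504 storm.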
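
Without Prometheus, the same two signals can be derived from a pair of status snapshots taken a minute or so apart. This is a rough sketch with hypothetical function names; it mirrors the idea of the alert rules (saturation ratio and counter growth) and is not a replacement for them.

```python
def counter_increase(previous: int, current: int) -> int:
    """Growth of a cumulative counter between two samples.

    If the counter went backwards, FPM restarted and it began again from zero,
    so the current value is the growth since the restart.
    """
    if current < previous:
        return current
    return current - previous


def saturation_ratio(active_processes: int, max_children: int) -> float:
    """Share of the configured pool that is currently busy."""
    if max_children <= 0:
        raise ValueError("max_children must be positive")
    return active_processes / max_children


def should_alert(active: int, max_children: int, reached_prev: int, reached_now: int) -> bool:
    """True when the pool is above 80% busy or hit its ceiling since the last sample."""
    return (
        saturation_ratio(active, max_children) > 0.80
        or counter_increase(reached_prev, reached_now) > 0
    )
```

Here `active` and the two `reached` values would come from the "active processes" and "max children reached" fields of the status page, and `max_children` from the pool file.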

4. Load test before deploy

# Use k6 to reproduce the saturation condition in staging
k6 run --vus 60 --duration 30s - <<'EOF'
import http from 'k6/http';
export default function () {
  http.post('https://staging.example.com/api/checkout', JSON.stringify({test: true}), {
    headers: { 'Content-Type': 'application/json' },
    timeout: '10s',
  });
}
EOF
# If p99 latency spikes and error rate > 1%, your pool config is undersized for prod traffic.

5. Terraform / Ansible: enforce config-as-code for FPM pool values

# terraform variable validation — prevent deploying with default pool sizes
variable "fpm_max_children" {
  type = number
  validation {
    condition     = var.fpm_max_children >= 20
    error_message = "pm.max_children must be >= 20. Default of 5 causes 504s under any real load."
  }
}
