How to Fix Nginx 'connect() failed (111: Connection refused) while connecting to upstream' Caused by PHP-FPM Pool Exhaustion
Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins
TL;DR
- What broke: PHP-FPM exhausted every worker process in the pool. Nginx attempted to proxy to localhost:9000, got ECONNREFUSED, and started returning 502 Bad Gateway site-wide.
- How to fix it: Tune pm.max_children based on available RAM, switch to pm = dynamic or pm = ondemand, and set request_terminate_timeout to prevent zombie workers from holding slots.
- Fast path: Use our Client-Side Sandbox above to paste your www.conf and auto-refactor the pool config without sending your server details anywhere.
The Incident (What Does the Error Mean?)
Your Nginx error log is printing this:
2024/01/15 03:42:17 [error] 18#18: *58231 connect() failed (111: Connection refused)
while connecting to upstream,
client: 203.0.113.47, server: example.com,
request: "POST /api/checkout HTTP/1.1",
upstream: "fastcgi://127.0.0.1:9000",
host: "example.com"
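Error 111 (ECONNREFUSED) means the connection attempt was actively rejected rather than timing out. The usual cause is that nothing is listening on the target address, for example because PHP-FPM has crashed or is bound to a different port or socket. Under heavy load it can also appear when the listen backlog overflows, but only if the kernel is configured to reset overflowing connections (net.ipv4.tcp_abort_on_overflow = 1 on Linux); by default Linux silently drops them, which shows up as upstream timeouts instead.
In the exhaustion scenario, PHP-FPM's process manager has hit pm.max_children. Every worker slot is occupied by a running (or hung) PHP request, so nothing is calling accept() on the listening socket and new connections pile up in the backlog until it overflows. Nginx has nowhere to send the request and returns a 502 to the user.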
Immediate consequence: 100% of PHP-dependent traffic returns HTTP 502. Static assets served directly by Nginx still work, which is why your homepage may load but every dynamic request fails.
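To tell a refused connection apart from a slow or hung backend, a small probe can help. The following is a minimal sketch, assuming PHP-FPM listens on TCP at 127.0.0.1:9000 as in the log above; the function name probe_upstream and the 2-second timeout are illustrative choices, not part of any existing tool.

```python
import errno
import socket


def probe_upstream(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connect attempt roughly the way Nginx would experience it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "accepting"
    except ConnectionRefusedError:
        return "refused"  # ECONNREFUSED, errno 111 on Linux
    except socket.timeout:
        return "timeout"  # no answer at all within the timeout
    except OSError as exc:
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
```

Calling probe_upstream("127.0.0.1", 9000) on the affected host returns "refused" when the kernel rejects the connection, and "timeout" when connection attempts are silently dropped.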
The Attack Vector / Blast Radius
Pool exhaustion is a cascading failure, not a single-point event:
- Slow upstream dependency (database, Redis, external API) causes PHP workers to block waiting for I/O. Response times climb from 200ms to 8s+.
- New requests queue behind the blocked workers. pm.max_children is hit. The FPM listen backlog (listen.backlog) fills.
- Nginx upstream timeout (fastcgi_read_timeout) hasn't fired yet, so Nginx keeps each stalled connection open, consuming connection slots (worker_connections) on the Nginx side too.
- The kernel starts refusing new TCP connections to port 9000 / the UNIX socket. This is the 111: Connection refused you see.
- Nginx logs the error, returns 502. If you have no circuit breaker, every retry from the client hammers the already-exhausted pool harder.
- OOM risk: If you attempt to fix this in-flight by restarting PHP-FPM under load, the master process spawns pm.start_servers children simultaneously, potentially spiking RAM and triggering the OOM killer on the PHP-FPM master itself.
Secondary blast radius: Any health check endpoint backed by PHP will also fail, causing your load balancer to mark the instance unhealthy and potentially drain it from the pool — turning a performance incident into a full availability incident.
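The cascade follows from simple arithmetic: the number of busy workers is roughly the request rate multiplied by how long each request holds a worker (Little's law). A minimal sketch, assuming a steady 20 requests per second, which is an illustrative figure and not taken from the incident:

```python
def workers_needed(requests_per_second: int, latency_ms: int) -> int:
    """Average number of workers busy at once: arrival rate x time in system, rounded up."""
    return -(-requests_per_second * latency_ms // 1000)  # ceiling division on integers


RPS = 20  # assumed traffic level, for illustration only
for latency_ms in (200, 8000):
    print(f"{latency_ms} ms per request -> {workers_needed(RPS, latency_ms)} busy workers")
```

At 200 ms the assumed traffic keeps 4 workers busy; at 8 s the same traffic needs 160, well past a pool of 5 or even 50.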
How to Fix It (The Solution)
Step 1 — Diagnose the live pool state first
Before changing anything, confirm exhaustion is the actual cause:
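# Check live pool status (requires pm.status_path = /fpm-status in www.conf
# and an Nginx location that passes /fpm-status to PHP-FPM)
curl 'http://127.0.0.1/fpm-status?full'
# Or count FPM processes directly (master + workers; the pattern also matches php-fpm8.2)
pgrep -c php-fpm
# Check the listen queue: for a listening socket, Recv-Q is the current backlog and Send-Q its limit
ss -tlnp | grep 9000
# Tail the FPM slow log for hung workers
tail -f /var/log/php-fpm/www-slow.log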
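If active processes equals pm.max_children, or the status page shows a non-zero listen queue or a climbing max children reached counter, you are pool-exhausted. Proceed.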
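The same check can be automated against the status page's JSON output (?json). This is a sketch under assumptions: the sample payload is made up for illustration, max_children has to be supplied from your pool config because the status page does not report it, and the helper name is hypothetical.

```python
import json


def pool_is_exhausted(status: dict, max_children: int) -> bool:
    """True when every worker is busy or connections are already queuing."""
    all_busy = status["active processes"] >= max_children
    queuing = status["listen queue"] > 0
    return all_busy or queuing


# Made-up sample shaped like FPM's `?json` status output.
sample = json.loads(
    '{"pool": "www", "listen queue": 12, "idle processes": 0,'
    ' "active processes": 5, "total processes": 5, "max children reached": 3}'
)
print(pool_is_exhausted(sample, max_children=5))  # True
```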
Basic Fix — Increase pm.max_children correctly
The formula: pm.max_children = floor(Available RAM for PHP / Average RAM per worker)
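# Find average worker RAM in MB (run under load, not idle; ps reports RSS in KiB)
# The process name is php-fpm8.2 on Debian/Ubuntu; use php-fpm where the binary has no version suffix
ps --no-headers -o rss -C php-fpm8.2 | awk '{sum+=$1} END {if (NR) print sum/NR/1024 " MB per worker"}'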
If each worker uses ~80 MB and you have 4 GB available for PHP: 4096 / 80 = 51.2, which floors to 51 max_children (the config below uses 50).
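The calculation can be sketched in a few lines. The RSS samples below are invented to match the 80 MB example; in practice they would come from the ps command above, which reports RSS in KiB. The function name is illustrative.

```python
def max_children(available_mb: int, worker_rss_kib: list[int]) -> int:
    """floor(available RAM / average worker RSS), per the formula above."""
    if not worker_rss_kib:
        raise ValueError("no worker samples; is PHP-FPM running?")
    avg_mb = sum(worker_rss_kib) / len(worker_rss_kib) / 1024
    return int(available_mb // avg_mb)


samples_kib = [81920, 79872, 83968]  # hypothetical workers averaging 80 MB
print(max_children(4096, samples_kib))  # 51
```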
; /etc/php/8.2/fpm/pool.d/www.conf
- pm = static
- pm.max_children = 5
+ pm = dynamic
+ pm.max_children = 50
+ pm.start_servers = 10
+ pm.min_spare_servers = 5
+ pm.max_spare_servers = 20
+ pm.max_spawn_rate = 4
- request_terminate_timeout = 0
+ request_terminate_timeout = 60s
- pm.max_requests = 0
+ pm.max_requests = 500
Reload, do not restart (avoids dropping in-flight requests):
systemctl reload php8.2-fpm
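Before reloading, it is worth confirming that the pm = dynamic values are mutually consistent, since FPM is expected to reject or warn about combinations such as a start_servers value outside the spare-server range. A minimal sketch of that sanity check, assuming the settings have already been parsed into a dict of integers (the function name and dict layout are illustrative):

```python
def check_dynamic_pool(cfg: dict) -> list[str]:
    """Return a list of problems with a pm = dynamic pool; empty means consistent."""
    problems = []
    lo, hi = cfg["pm.min_spare_servers"], cfg["pm.max_spare_servers"]
    if lo < 1:
        problems.append("pm.min_spare_servers must be at least 1")
    if hi < lo:
        problems.append("pm.max_spare_servers is below pm.min_spare_servers")
    if hi > cfg["pm.max_children"]:
        problems.append("pm.max_spare_servers exceeds pm.max_children")
    if not lo <= cfg["pm.start_servers"] <= hi:
        problems.append("pm.start_servers is outside the min/max spare range")
    return problems


# Values from the basic fix above.
basic_fix = {
    "pm.max_children": 50,
    "pm.start_servers": 10,
    "pm.min_spare_servers": 5,
    "pm.max_spare_servers": 20,
}
print(check_dynamic_pool(basic_fix))  # []
```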
Enterprise Best Practice — Full hardened pool config
For high-traffic production: separate pools per application, UNIX sockets instead of TCP (lower syscall overhead, no TCP handshake), and strict resource limits.
; /etc/php/8.2/fpm/pool.d/www.conf
[www]
user = www-data
group = www-data
- listen = 127.0.0.1:9000
+ listen = /run/php/php8.2-fpm-www.sock
+ listen.owner = www-data
+ listen.group = www-data
+ listen.mode = 0660
+ ; Effective backlog is capped by the kernel's net.core.somaxconn
+ listen.backlog = 65535
- pm = static
- pm.max_children = 5
+ pm = dynamic
+ pm.max_children = 50
+ pm.start_servers = 10
+ pm.min_spare_servers = 5
+ pm.max_spare_servers = 20
+ pm.max_spawn_rate = 4
+ ; Only takes effect with pm = ondemand; ignored under pm = dynamic
+ pm.process_idle_timeout = 30s
- request_terminate_timeout = 0
+ request_terminate_timeout = 60s
+ request_slowlog_timeout = 5s
+ slowlog = /var/log/php-fpm/$pool-slow.log
- pm.max_requests = 0
+ pm.max_requests = 500
+ pm.status_path = /fpm-status
+ ping.path = /fpm-ping
+ ping.response = pong
+ ; Send worker stdout/stderr to the FPM error log (FPM logs a WARNING on its own when pm.max_children is reached)
+ catch_workers_output = yes
+ decorate_workers_output = no
+ ; Per-worker memory ceiling — kills runaway scripts before OOM killer does
+ php_admin_value[memory_limit] = 256M
+ php_admin_value[max_execution_time] = 60
Update Nginx to use the UNIX socket:
# /etc/nginx/conf.d/app.conf
location ~ \.php$ {
- fastcgi_pass 127.0.0.1:9000;
+ fastcgi_pass unix:/run/php/php8.2-fpm-www.sock;
fastcgi_index index.php;
include fastcgi_params;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
+ fastcgi_read_timeout 65;
+ fastcgi_connect_timeout 5;
+ fastcgi_send_timeout 65;
}
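The timeouts in the pool and Nginx configs are meant to interlock: FPM's request_terminate_timeout (60s) sits just under Nginx's fastcgi_read_timeout (65), so a stuck request is killed by FPM before Nginx stops waiting. A small sketch of that relationship, with hypothetical helper names, assuming FPM's documented duration suffixes (s, m, h, d):

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def fpm_seconds(value: str) -> int:
    """Convert an FPM duration such as '60s' or '2m' to seconds (bare numbers are seconds)."""
    value = value.strip()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def fpm_fires_first(request_terminate_timeout: str, fastcgi_read_timeout_s: int) -> bool:
    """True when FPM terminates a stuck request before Nginx stops waiting for it."""
    fpm = fpm_seconds(request_terminate_timeout)
    return 0 < fpm < fastcgi_read_timeout_s


print(fpm_fires_first("60s", 65))  # True
print(fpm_fires_first("0", 65))    # False: 0 disables the FPM timeout
```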
Prevention in CI/CD
1. Load test before deploy — catch exhaustion in staging
# k6 smoke test — run in CI pipeline before promoting to prod
k6 run --vus 60 --duration 30s load-test.js
# Fail the pipeline if p95 latency > 2s or error rate > 1%
# (define these as thresholds in load-test.js; k6 exits non-zero when a threshold fails)
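k6 would normally express these limits as thresholds inside the test script. To make the pass/fail rule itself concrete, here is a minimal sketch of the same gate applied to raw results; the function name, the nearest-rank percentile method, and the input shape are assumptions for illustration.

```python
def passes_gate(latencies_ms: list[float], failed: int,
                p95_limit_ms: float = 2000, max_error_rate: float = 0.01) -> bool:
    """Apply the two criteria above (p95 latency and error rate) to one load-test run."""
    if not latencies_ms:
        return False  # no samples means the test did not really run
    ordered = sorted(latencies_ms)
    rank = -(-95 * len(ordered) // 100)  # nearest-rank p95, ceiling on integers
    p95 = ordered[rank - 1]
    error_rate = failed / len(ordered)
    return p95 <= p95_limit_ms and error_rate <= max_error_rate
```

A run with 100 requests where 6 of them take 5 s fails the gate even with zero errors, because the 95th-ranked sample is already above 2 s.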
2. Ansible / Terraform — enforce pm.max_children is calculated, not hardcoded
# terraform/modules/php_fpm/variables.tf
variable "worker_ram_mb" { default = 80 }
variable "available_ram_mb" {}

locals {
  max_children = floor(var.available_ram_mb / var.worker_ram_mb)
}

# Fail fast if someone sets max_children below a safe floor
resource "null_resource" "validate_pool" {
  triggers = { max_children = local.max_children }

  provisioner "local-exec" {
    command = "[ ${local.max_children} -ge 10 ] || (echo 'pm.max_children too low' && exit 1)"
  }
}
3. Checkov / OPA policy — reject static pools in IaC
# checkov custom check: no static PHP-FPM pools in prod
# NOTE: the base-class import path and constructor arguments are kept as in the
# original snippet; verify them against your installed Checkov version.
from checkov.common.models.enums import CheckResult
from checkov.ansible.checks.task.base_ansible_check import BaseAnsibleCheck


class CheckFPMPoolMode(BaseAnsibleCheck):
    def __init__(self):
        super().__init__(
            name="Ensure PHP-FPM pool uses dynamic or ondemand pm",
            id="CKV_ANSIBLE_PHPFPM_1",
            supported_tasks=["ini_file", "template"],
        )

    def get_resource_id(self, conf):
        return conf.get("name", "unknown")

    def scan_resource_conf(self, conf):
        # Coarse check: fails on any occurrence of "static" in the task,
        # so paths such as /var/www/static will also trip it.
        if "static" in str(conf):
            return CheckResult.FAILED
        return CheckResult.PASSED
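Independent of any policy framework, the rule itself can be sketched as a plain function over the pool file's text. This version only matches an active pm = static line, so unrelated uses of the word "static" do not trip it. The function name is hypothetical and the sketch assumes the rendered www.conf is available as a string.

```python
import re

# Matches an uncommented "pm = static" line, optionally followed by an inline comment.
STATIC_PM = re.compile(r"^\s*pm\s*=\s*static\s*(?:[;#].*)?$", re.MULTILINE)


def uses_static_pm(pool_conf_text: str) -> bool:
    """True when the pool config sets pm = static on an active (uncommented) line."""
    return STATIC_PM.search(pool_conf_text) is not None
```

For example, a file containing only a commented-out "; pm = static" followed by "pm = dynamic" is accepted, while "[www]" followed by "pm = static" is flagged.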
4. Prometheus alerting — fire before users see 502s
# prometheus/alerts/phpfpm.yml
groups:
  - name: phpfpm_pool
    rules:
      # Caveat: in FPM's status output "max active processes" is the peak since startup,
      # not pm.max_children. Confirm what your exporter's phpfpm_max_active_processes means.
      - alert: PHPFPMPoolNearExhaustion
        expr: |
          phpfpm_active_processes / phpfpm_max_active_processes > 0.80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PHP-FPM pool {{ $labels.pool }} at {{ $value | humanizePercentage }} capacity"
          runbook: "https://wiki.internal/runbooks/phpfpm-exhaustion"
      - alert: PHPFPMPoolExhausted
        expr: |
          phpfpm_active_processes >= phpfpm_max_active_processes
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PHP-FPM pool EXHAUSTED: 502s imminent"
5. request_terminate_timeout is your last line of defense
Never leave it at 0. A single slow DB query or infinite loop can then hold a worker slot indefinitely, until the script exits on its own or FPM is restarted. Set it slightly below your fastcgi_read_timeout in Nginx (for example 60s against fastcgi_read_timeout 65, as in the configs above) so FPM kills the worker before Nginx gives up and logs the error.