How to Fix Nginx 'connect() failed (111: Connection refused) while connecting to upstream' Caused by PHP-FPM Pool Exhaustion

Threat/Impact Level: HIGH | Exploitability/Downtime Risk: HIGH | Time to Fix: 15–30 mins


TL;DR

  • What broke: PHP-FPM exhausted every worker process in the pool. Nginx tried to pass requests to 127.0.0.1:9000, got ECONNREFUSED, and started returning 502 Bad Gateway site-wide.
  • How to fix it: Size pm.max_children from available RAM, use pm = dynamic or pm = ondemand, and set request_terminate_timeout so hung workers cannot hold slots indefinitely.

The Incident (What Does the Error Mean?)

Your Nginx error log is printing this:

2024/01/15 03:42:17 [error] 18#18: *58231 connect() failed (111: Connection refused)
  while connecting to upstream,
  client: 203.0.113.47, server: example.com,
  request: "POST /api/checkout HTTP/1.1",
  upstream: "fastcgi://127.0.0.1:9000",
  host: "example.com"

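Error 111 (ECONNREFUSED) means the connection attempt was actively rejected rather than timing out. Either nothing is listening on the address (PHP-FPM is down or mid-restart), or the listen backlog has overflowed and the kernel is rejecting new connections instead of queueing them. On Linux the overflow case depends on net.ipv4.tcp_abort_on_overflow; with the default setting the kernel silently drops the handshake, which shows up as a connect timeout instead.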

PHP-FPM's process manager has hit pm.max_children. Every worker slot is occupied by a running (or hung) PHP request. The socket/port has no capacity to accept the next connection. Nginx has nowhere to send the request and immediately 502s the user.

Immediate consequence: 100% of PHP-dependent traffic returns HTTP 502. Static assets served directly by Nginx still work, which is why your homepage may load but every dynamic request fails.


The Attack Vector / Blast Radius

Pool exhaustion is a cascading failure, not a single-point event:

  1. Slow upstream dependency (database, Redis, external API) causes PHP workers to block waiting for I/O. Response times climb from 200ms to 8s+.
  2. New requests queue behind the blocked workers. pm.max_children is hit. The FPM listen backlog (listen.backlog) fills.
  3. Nginx's upstream timeout (fastcgi_read_timeout) hasn't fired yet, so Nginx keeps each client connection open, tying up entries in its worker_connections budget as well.
  4. The kernel starts refusing new TCP connections to port 9000 / the UNIX socket — this is the 111: Connection refused you see.
  5. Nginx logs the error, returns 502. If you have no circuit breaker, every retry from the client hammers the already-exhausted pool harder.
  6. OOM risk: If you attempt to fix this in-flight by restarting PHP-FPM under load, the master process spawns pm.start_servers children simultaneously, potentially spiking RAM and triggering the OOM killer on the PHP-FPM master itself.
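
The first two steps of the cascade follow from Little's law: the average number of busy workers equals the request rate multiplied by the average response time. The sketch below works through that arithmetic. The 25 requests per second figure is a hypothetical load chosen for illustration; the 200ms and 8s latencies and the pool sizes come from the text.

```python
import math


def workers_needed(requests_per_second: float, avg_response_seconds: float) -> int:
    """Little's law: average concurrency = arrival rate x time in system."""
    concurrency = requests_per_second * avg_response_seconds
    # Round away float noise before taking the ceiling.
    return math.ceil(round(concurrency, 9))


def pool_is_exhausted(requests_per_second: float,
                      avg_response_seconds: float,
                      max_children: int) -> bool:
    """True when steady-state demand exceeds the pool's worker limit."""
    return workers_needed(requests_per_second, avg_response_seconds) > max_children


if __name__ == "__main__":
    rate = 25.0  # hypothetical requests per second
    for latency in (0.2, 1.0, 8.0):
        need = workers_needed(rate, latency)
        print(f"{latency:>4}s avg response -> {need} busy workers")
```

At the same traffic level, a healthy 200ms response time needs 5 workers, while an 8s response time needs 200. No reasonable pm.max_children absorbs a 40x latency increase, which is why the slow dependency has to be fixed or timed out as well.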

Secondary blast radius: Any health check endpoint backed by PHP will also fail, causing your load balancer to mark the instance unhealthy and potentially drain it from the pool — turning a performance incident into a full availability incident.


How to Fix It (The Solution)

Step 1 — Diagnose the live pool state first

Before changing anything, confirm exhaustion is the actual cause:

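# Check live pool status (requires pm.status_path = /fpm-status in www.conf
# and an Nginx location that passes /fpm-status to FPM)
curl 'http://127.0.0.1/fpm-status?full'

# Or count worker processes directly (excludes the master and the grep itself)
pgrep -c -f 'php-fpm: pool'

# Check the listen queue: for a LISTEN socket, Recv-Q is the current accept
# queue depth and Send-Q is the backlog limit
ss -tlnp | grep 9000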

# Tail the FPM slow log for hung workers
tail -f /var/log/php-fpm/www-slow.log

If active processes == max_children, you are pool-exhausted. Proceed.
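
The same comparison can be scripted against the JSON form of the status page (FPM serves it when `?json` is appended to the status URL). This is a minimal sketch: the sample payload and its numbers are invented for illustration, and `max_children` is passed in by hand because the status page does not report the configured limit.

```python
import json

# Illustrative sample of `curl 'http://127.0.0.1/fpm-status?json'` output.
# The values are made up for this example.
SAMPLE_STATUS = """
{
  "pool": "www",
  "process manager": "dynamic",
  "listen queue": 12,
  "idle processes": 0,
  "active processes": 50,
  "total processes": 50,
  "max children reached": 3
}
"""


def pool_verdict(status_json: str, max_children: int) -> str:
    """Classify pool health from the FPM status payload."""
    status = json.loads(status_json)
    active = status["active processes"]
    queued = status["listen queue"]
    if active >= max_children:
        return f"exhausted: {active}/{max_children} workers busy, {queued} queued"
    if active >= 0.8 * max_children:
        return f"near exhaustion: {active}/{max_children} workers busy"
    return f"ok: {active}/{max_children} workers busy"


if __name__ == "__main__":
    print(pool_verdict(SAMPLE_STATUS, max_children=50))
```

A non-zero "listen queue" together with a rising "max children reached" counter is the strongest signal that requests are waiting for a free worker.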


Basic Fix — Increase pm.max_children correctly

The formula: pm.max_children = floor(Available RAM for PHP / Average RAM per worker)

# Find average worker RAM (run under load, not idle).
# Debian/Ubuntu name the binary php-fpm8.2; use -C php-fpm on other distros.
ps --no-headers -o rss -C php-fpm8.2 | awk '{sum+=$1} END {if (NR) print sum/NR/1024 " MB per worker"}'

If each worker uses ~80 MB and you have 4 GB available for PHP: 4096 / 80 = 51.2, so pm.max_children = 51. The config below rounds down to 50 for headroom.
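
The sizing formula is simple enough to script so the number is derived rather than guessed. A minimal sketch follows; the function name and the optional `headroom` parameter are illustrative additions, not part of PHP-FPM.

```python
def max_children(available_ram_mb: float, worker_ram_mb: float, headroom: int = 0) -> int:
    """floor(available RAM / average RAM per worker), minus optional headroom."""
    if worker_ram_mb <= 0:
        raise ValueError("worker_ram_mb must be positive")
    children = int(available_ram_mb // worker_ram_mb) - headroom
    if children < 1:
        raise ValueError("not enough RAM for even one worker")
    return children


if __name__ == "__main__":
    # Figures from the worked example: 4 GB for PHP, ~80 MB per worker.
    print(max_children(4096, 80))              # 51
    print(max_children(4096, 80, headroom=1))  # 50
```

Measure `worker_ram_mb` under realistic load; idle workers report a much smaller footprint and lead to an oversized pool.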

; /etc/php/8.2/fpm/pool.d/www.conf

- pm = static
- pm.max_children = 5
+ pm = dynamic
+ pm.max_children = 50
+ pm.start_servers = 10
+ pm.min_spare_servers = 5
+ pm.max_spare_servers = 20
+ pm.max_spawn_rate = 4

- request_terminate_timeout = 0
+ request_terminate_timeout = 60s

- pm.max_requests = 0
+ pm.max_requests = 500

Reload, do not restart (avoids dropping in-flight requests):

systemctl reload php8.2-fpm
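
PHP-FPM refuses to load a dynamic pool whose spare-server settings are inconsistent, so it is worth checking the numbers before reloading. The sketch below mirrors those ordering rules as I understand them (min_spare_servers <= start_servers <= max_spare_servers <= max_children); the function name and error wording are illustrative, and `php-fpm8.2 -t` remains the authoritative check.

```python
def validate_dynamic_pool(pool: dict) -> list:
    """Return a list of problems with a pm = dynamic pool; empty means consistent."""
    max_children = pool["pm.max_children"]
    start = pool["pm.start_servers"]
    min_spare = pool["pm.min_spare_servers"]
    max_spare = pool["pm.max_spare_servers"]

    errors = []
    if min_spare < 1:
        errors.append("pm.min_spare_servers must be at least 1")
    if max_spare < min_spare:
        errors.append("pm.max_spare_servers must not be less than pm.min_spare_servers")
    if not (min_spare <= start <= max_spare):
        errors.append("pm.start_servers must lie between min and max spare servers")
    if max_spare > max_children:
        errors.append("pm.max_spare_servers must not exceed pm.max_children")
    return errors


if __name__ == "__main__":
    # Values from the pool config above.
    proposed = {
        "pm.max_children": 50,
        "pm.start_servers": 10,
        "pm.min_spare_servers": 5,
        "pm.max_spare_servers": 20,
    }
    print(validate_dynamic_pool(proposed) or "pool settings are consistent")
```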

Enterprise Best Practice — Full hardened pool config

For high-traffic production: separate pools per application, UNIX sockets instead of TCP (lower syscall overhead, no TCP handshake), and strict resource limits.

; /etc/php/8.2/fpm/pool.d/www.conf

  [www]
  user = www-data
  group = www-data

- listen = 127.0.0.1:9000
+ listen = /run/php/php8.2-fpm-www.sock
+ listen.owner = www-data
+ listen.group = www-data
+ listen.mode = 0660
+ ; Effective backlog is capped by the kernel's net.core.somaxconn
+ listen.backlog = 65535

- pm = static
- pm.max_children = 5
+ pm = dynamic
+ pm.max_children = 50
+ pm.start_servers = 10
+ pm.min_spare_servers = 5
+ pm.max_spare_servers = 20
+ pm.max_spawn_rate = 4
+ ; Only takes effect with pm = ondemand; ignored under pm = dynamic
+ pm.process_idle_timeout = 30s

- request_terminate_timeout = 0
+ request_terminate_timeout = 60s
+ request_slowlog_timeout = 5s
+ slowlog = /var/log/php-fpm/$pool-slow.log

- pm.max_requests = 0
+ pm.max_requests = 500

+ pm.status_path = /fpm-status
+ ping.path = /fpm-ping
+ ping.response = pong

+ ; Route worker stdout/stderr into the FPM error log so worker failures are visible
+ catch_workers_output = yes
+ decorate_workers_output = no

+ ; Per-worker memory ceiling — kills runaway scripts before OOM killer does
+ php_admin_value[memory_limit] = 256M
+ php_admin_value[max_execution_time] = 60

Update Nginx to use the UNIX socket:

# /etc/nginx/conf.d/app.conf

  location ~ \.php$ {
-   fastcgi_pass 127.0.0.1:9000;
+   fastcgi_pass unix:/run/php/php8.2-fpm-www.sock;
    fastcgi_index index.php;
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
+   fastcgi_read_timeout 65;
+   fastcgi_connect_timeout 5;
+   fastcgi_send_timeout 65;
  }


Prevention in CI/CD

1. Load test before deploy — catch exhaustion in staging

# k6 smoke test — run in CI pipeline before promoting to prod
k6 run --vus 60 --duration 30s load-test.js
# Thresholds (p95 latency > 2s, error rate > 1%) belong in load-test.js; k6 exits non-zero when one is breached, which fails the pipeline

2. Ansible / Terraform — enforce pm.max_children is calculated, not hardcoded

# terraform/modules/php_fpm/variables.tf
variable "worker_ram_mb" { default = 80 }
variable "available_ram_mb" {}

locals {
  max_children = floor(var.available_ram_mb / var.worker_ram_mb)
}

# Fail fast if the computed max_children falls below a safe floor
resource "null_resource" "validate_pool" {
  triggers = { max_children = local.max_children }
  provisioner "local-exec" {
    command = "[ ${local.max_children} -ge 10 ] || (echo 'pm.max_children too low' && exit 1)"
  }
}

3. Checkov / OPA policy — reject static pools in IaC

# checkov custom check: no static PHP-FPM pools in prod
from checkov.common.models.enums import CheckResult
from checkov.ansible.checks.task.base_ansible_check import BaseAnsibleCheck

class CheckFPMPoolMode(BaseAnsibleCheck):
    def __init__(self):
        super().__init__(
            name="Ensure PHP-FPM pool uses dynamic or ondemand pm",
            id="CKV_ANSIBLE_PHPFPM_1",
            supported_tasks=["ini_file", "template"]
        )
    def get_resource_id(self, conf):
        return conf.get("name", "unknown")
    def scan_resource_conf(self, conf):
        # Crude heuristic: fails any task whose serialized config contains
        # "static", so unrelated values (e.g. a "static" path) also match
        if "static" in str(conf):
            return CheckResult.FAILED
        return CheckResult.PASSED

4. Prometheus alerting — fire before users see 502s

# prometheus/alerts/phpfpm.yml
groups:
  - name: phpfpm_pool
    rules:
      - alert: PHPFPMPoolNearExhaustion
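        # phpfpm_max_active_processes is a high-water mark since FPM started,
        # not the pool limit, so compare against the configured pm.max_children (50 here)
        expr: |
          phpfpm_active_processes / 50 > 0.80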
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "PHP-FPM pool {{ $labels.pool }} at {{ $value | humanizePercentage }} capacity"
          runbook: "https://wiki.internal/runbooks/phpfpm-exhaustion"

      - alert: PHPFPMPoolExhausted
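        # Assumes the exporter exposes FPM's "max children reached" counter under
        # this name (as hipages/php-fpm_exporter does); verify against your exporter
        expr: |
          increase(phpfpm_max_children_reached[1m]) > 0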
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PHP-FPM pool EXHAUSTED — 502s imminent"

5. request_terminate_timeout is your last line of defense

Never leave it at 0. With no limit, a single hung request (a stalled DB query, an infinite loop) can occupy a worker slot until FPM is reloaded or restarted. Set it slightly below your fastcgi_read_timeout in Nginx (60s against 65s in the config above) so FPM kills the worker before Nginx gives up and logs the error.
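
The goal stated here, FPM terminating a request before Nginx stops waiting for it, can be checked mechanically in CI. The sketch below parses an FPM duration (a bare number means seconds; s, m, h, and d suffixes are accepted) and compares it with the Nginx timeout in seconds. Both function names are hypothetical helpers, not part of either tool.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def fpm_seconds(value: str) -> int:
    """Convert an FPM duration such as '60s', '2m', or '0' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd]?)", value.strip())
    if match is None:
        raise ValueError(f"unrecognised FPM duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "s"]


def fpm_kills_first(request_terminate_timeout: str, fastcgi_read_timeout_s: int) -> bool:
    """True when FPM's limit is enabled and shorter than Nginx's read timeout."""
    limit = fpm_seconds(request_terminate_timeout)
    # A limit of 0 disables termination, so Nginx would always give up first.
    return 0 < limit < fastcgi_read_timeout_s


if __name__ == "__main__":
    # Values from the hardened config: 60s in FPM, 65 in Nginx.
    print(fpm_kills_first("60s", 65))
```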
