Apache 2.4's event MPM can require more workers than you'd expect
When we upgraded from Ubuntu 18.04 to Ubuntu 22.04, we moved away from the prefork MPM to the event MPM. A significant reason for this shift is that our primary public web server has wound up with a high traffic level from people downloading things, often relatively large datasets. My intuition was that much of the traffic was from low-rate connections that were mostly idle on the server as they waited for slow remote networks. The event MPM is supposed to have various features to deal with this sort of usage (as well as connections idle in the HTTP 'keep alive' state, waiting to see if there's more traffic).
Soon after we changed over we found that we had to raise various event MPM limits, and since then we've raised them twice more. This includes the limit on the number of workers, Apache's MaxRequestWorkers. Our Apache metrics say that when our MaxRequestWorkers setting was 1,000, we managed to hit that limit with busy workers. We're now up to 2,000 workers on that web server, which on the one hand feels absurd to me but on the other hand, 1,000 clearly wasn't enough.
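To make the worker arithmetic concrete, here is a minimal sketch of what a given MaxRequestWorkers implies for the other event MPM limits. In the event MPM, MaxRequestWorkers can't exceed ServerLimit times ThreadsPerChild, and the stock defaults (ServerLimit 16, ThreadsPerChild 25) top out at 400 workers, so raising the worker limit generally means raising ServerLimit too. The ThreadsPerChild value of 25 used here is an assumption for illustration, not necessarily what this server runs, and the function name is made up.

```python
def min_server_limit(max_request_workers: int, threads_per_child: int = 25) -> int:
    """Smallest ServerLimit such that ServerLimit * ThreadsPerChild
    is at least MaxRequestWorkers.

    threads_per_child defaults to 25, Apache's default ThreadsPerChild;
    the real server's value is an assumption here.
    """
    if max_request_workers <= 0 or threads_per_child <= 0:
        raise ValueError("both limits must be positive")
    # Integer ceiling division, avoiding floating point.
    return -(-max_request_workers // threads_per_child)


if __name__ == "__main__":
    for workers in (1000, 2000):
        print(f"MaxRequestWorkers {workers}: ServerLimit >= {min_server_limit(workers)}")
```

Under that assumption, 2,000 workers means up to 80 child processes of 25 threads each. A larger ThreadsPerChild would cut the process count, at the cost of more threads going away whenever one child is recycled.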
One possible reason for this is that I may have misunderstood how frequently connections are idle or, to quote the event MPM documentation, "where the only remaining thing to do is send the data to the client". I had assumed (without testing) that once a connection was simply writing out data from a file to the client, it fell into this state, but possibly this is only for when Apache is buffering the last remaining data itself. Since the popular requests are multi-megabyte files, they'd spend most of their transfer with Apache still reading from the files. Certainly our captured metrics suggest that we don't see very many connections that Apache's status module reports as asynchronous connections that are writing things.
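One way to check this sort of thing is to pull the machine-readable output of mod_status (the `?auto` form of the status page) and compare busy workers against the asynchronous connection counts. The following is a rough sketch, not the metrics setup actually used here. It assumes the status text contains the `BusyWorkers` and `ConnsAsyncWriting` fields that mod_status reports under the event MPM, and the sample numbers are invented purely for illustration.

```python
def parse_status_auto(text: str) -> dict:
    """Parse 'Key: value' lines from mod_status ?auto output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that aren't in 'Key: value' form
            fields[key.strip()] = value.strip()
    return fields


def worker_summary(text: str, max_request_workers: int) -> dict:
    """Summarize how close we are to the worker limit and how many
    connections are in the async 'writing' state."""
    fields = parse_status_auto(text)
    busy = int(fields.get("BusyWorkers", 0))
    return {
        "busy": busy,
        "busy_fraction": busy / max_request_workers,
        "async_writing": int(fields.get("ConnsAsyncWriting", 0)),
        "async_keepalive": int(fields.get("ConnsAsyncKeepAlive", 0)),
    }


# Invented sample values, NOT real measurements from any server.
SAMPLE = """\
BusyWorkers: 950
IdleWorkers: 50
ConnsTotal: 1100
ConnsAsyncWriting: 12
ConnsAsyncKeepAlive: 130
ConnsAsyncClosing: 8
"""

if __name__ == "__main__":
    print(worker_summary(SAMPLE, max_request_workers=1000))
```

If most large downloads really were being handed off to the event MPM's asynchronous write handling, you would expect the async writing count to be large relative to busy workers. A pattern like the invented one above, with many busy workers and few async writers, is what the other reading would look like.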
For our web server's current usage, these settings are okay. But they're unfortunately dangerous, because we allow people to run CGIs on this server, and the machine is unlikely to do well if we have even 1,000 CGIs running at the same time. In practice not many CGIs get run these days, so we're likely going to get away with it. Still, it makes me nervous and I wish we had a better solution.
(If it does become a problem I can think of some options, although they're generally terrible hacks.)