Our Apache file serving problem on our general purpose web server
One of the servers we run for our department is an old-fashioned general purpose web server that hosts things like people's home pages and the web pages for (some) research groups. In terms of content, we have a mix of static files, old-fashioned CGIs (run through suexec), and reverse proxies to user-run web servers. One of the things people here do with this web server is use it to share research data files and datasets, generally through their personal home page because that's the easy way to go. Some of these files are pretty large.
When you share data, people download it; sometimes a lot of people, because sometimes computer scientists share hot research results. This is no problem from a bandwidth perspective; we (the department and the university) have lots of bandwidth (it's not like the old days) and we'd love to see it used. However, some number of the people asking for this data are on relatively slow connections, and some of these data files are large. When you combine these two, you get very slow downloads and thus client HTTP connections that stick around for quite a long time.
(Since 6am this morning, we've seen 27 requests that took more than an hour to complete, 265 that took more than ten minutes, and over 7,500 that took more than a minute.)
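Counts like these can be pulled out of the access log with a few lines of script. What follows is only a minimal sketch, and it assumes something that may not match your setup: that the LogFormat ends with Apache's %D field (time taken to serve the request, in microseconds), so the duration is the last whitespace-separated field on each line. The function name and bucket labels are made up for illustration.

```python
from collections import Counter

# Cumulative thresholds in seconds: a request that took over an hour
# is also counted as taking over ten minutes and over one minute.
THRESHOLDS = (("over_1h", 3600), ("over_10m", 600), ("over_1m", 60))


def count_slow_requests(lines):
    """Count requests slower than each threshold.

    Assumes (hypothetically) that the last field of every log line is
    the %D value, i.e. the request duration in microseconds.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            seconds = int(fields[-1]) / 1_000_000
        except ValueError:
            # Not a number (for example "-"); skip the line.
            continue
        for name, limit in THRESHOLDS:
            if seconds > limit:
                counts[name] += 1
    return counts
```

To run it over a real log, pass an open file object, for example count_slow_requests(open(path)) with whatever path your access log lives at.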
For historical reasons we're using the 'prefork' Apache MPM, and perhaps you now see the problem. Each low-bandwidth client that's downloading a big file occupies a whole worker process for what is a very long time (by web server standards). We feel we can only configure so many worker processes, mostly because each of them eats a certain amount of the machine's finite memory, and we've repeatedly had all our worker processes eaten up by these slow clients, locking out all other requests for other URLs for a while. The clients come and go, for reasons we're not certain of; perhaps someone is posting a link somewhere, or maybe a classroom of people are being directed to download some sample data or the like. It's honestly kind of mysterious to us.
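To put rough numbers on how quickly slow clients eat worker processes, here is a back-of-the-envelope sketch. It uses Little's law (average slots in use equals arrival rate times how long each request holds a slot), which is my framing rather than anything Apache-specific. The file size, client speed, and arrival rate in the comments are made-up illustrative values, not measurements from our server.

```python
def download_seconds(size_bytes, client_bits_per_second):
    """Time for one client to pull a file, ignoring protocol overhead."""
    return size_bytes * 8 / client_bits_per_second


def workers_occupied(arrivals_per_hour, seconds_per_request):
    """Little's law: average number of prefork workers held at once."""
    return arrivals_per_hour / 3600 * seconds_per_request


# Hypothetical example: a 1 GB dataset fetched over a 1 Mbit/s link
# holds a worker for 8000 seconds, a bit over two hours.
slow_download = download_seconds(1_000_000_000, 1_000_000)

# Hypothetical example: 30 such clients arriving per hour would keep
# roughly 67 workers busy on average, just pushing out one file.
busy_workers = workers_occupied(30, slow_download)
```

The point of the arithmetic is that the worker count you need scales with how slow the clients are, not with how much bandwidth the server has.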
(In theory we could also worry about how many worker processes we allow because each worker process could someday be a CGI that's running at the same time as other CGIs, and if we run too many CGIs at once the web server explodes. In practice we've already configured so many worker processes in an attempt to keep some request slots open during these 'slow clients, popular file' situations that our web server would likely explode if even half of the current worker processes were running CGIs at once.)
Right now we're resorting to using mod_qos to try to limit access to currently popular things, but this isn't ideal for several reasons. What we really want is a hybrid web serving model, where just pushing files out to clients is done with a lightweight, highly scalable method that's basically free but Apache can continue to handle CGIs in something like the traditional model. Ideally we could even turn down the 'CGI workers' count, now that they don't have to also be 'file workers'.
Changing web servers away from Apache isn't an option and neither is splitting the static files off to another server entirely. Based on my reading so far, trying to switch to the event MPM looks like our most promising option; in fact in theory the event MPM sounds very close to our ideal setup. I'm not certain how it interacts with CGIs, though; the Apache documentation suggests that we might need or want to switch to mod_cgid, and that's going to require testing (the documentation claims it's basically a drop-in replacement, but I'm not sure I trust that).
(Setting suitable configuration parameters for a thread-based MPM is going to be a new and somewhat exciting area for us, too. It seems ThreadsPerChild is the important tuning knob, but I have no idea what the tradeoffs are. Perhaps we should take the default Ubuntu 16.04 settings for everything except AsyncRequestWorkerFactor, which we might want to tune up if we expect lots of waiting connections.)
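As a rough sketch of what AsyncRequestWorkerFactor buys you, here is the connection-limit arithmetic as I understand it from the Apache event MPM documentation: each child process accepts up to ThreadsPerChild plus AsyncRequestWorkerFactor times its number of idle workers, and this is multiplied across ServerLimit processes. The function names are mine, and the numbers in the usage note are illustrative rather than our settings; treat this as a reading of the documentation, not a tested model of Apache's behavior.

```python
def max_connections(threads_per_child, server_limit,
                    async_factor, idle_workers_per_child):
    """Connections the event MPM will accept in total, assuming every
    child process has the same number of idle worker threads."""
    per_child = threads_per_child + async_factor * idle_workers_per_child
    return per_child * server_limit


def max_connections_all_idle(max_request_workers, async_factor):
    """Upper bound when every worker thread is idle, assuming
    MaxRequestWorkers equals ThreadsPerChild times ServerLimit."""
    return (async_factor + 1) * max_request_workers
```

For instance, with ThreadsPerChild 10, ServerLimit 4, a factor of 2, and 4 idle workers per child, max_connections(10, 4, 2, 4) comes to 72, while the all-idle ceiling for 40 workers is max_connections_all_idle(40, 2), or 120. Raising the factor raises both numbers without adding threads, which is presumably the appeal if most connections are just waiting.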