2016-04-26
How 'there are no technical solutions to social problems' is wrong
One of the things that you will hear echoing around the Internet is the saying that there are no technical solutions to social problems. This is sometimes called 'Ranum's Law', where it's generally phrased as 'you can't fix people problems with software' (cf). Years ago you probably could have found me nodding along sagely to this and wholeheartedly agreeing with it. However, I've changed; these days, I disagree with the spirit of the saying.
It is certainly true that you cannot outright solve social problems with technology (well, almost all of the time). Technology is not that magical, and the social is more powerful than the technical barring very unusual situations. Social problems are also generally wicked problems, which are extremely difficult to tackle. This is an important thing to realize, because social problems matter and computing has a great tendency to either ignore them outright or assume that our technology will magically solve them for us.
However, the way that this saying is often used is for technologists to wash their hands of the social problems entirely, and this is a complete and utter mistake. It is not true that technical measures are either useless or socially neutral, because the technical is part of the world and so it basically always affects the social. In practice, in reality, technical features often strongly influence social outcomes, and it follows that they can make social problems more or less likely. That social problems matter means that we need to explicitly consider them when building technical things.
(The glaring example of this is all the various forms of spam. Spam is a social problem, but it can be drastically enabled or drastically hindered by all sorts of technical measures and so sensible modern designers aggressively try to design spam out of their technical systems.)
If we ignore the social effects of our technical decisions, we are doing it wrong (and bad things usually ensue). If we try to pretend that our technical decisions do not have social ramifications, we are either in denial or fools. It doesn't matter whether we intended the social ramifications or didn't think about them; in either case, we may rightfully be at least partially blamed for the consequences of our decisions. The world does not care why we did something, all it cares about is what consequences our decisions have. And our decisions very definitely have (social) consequences, even for small and simple decisions like refusing to let people change their login names.
Ranum's Law is not an excuse to live in a rarefied world where all is technical and only technical, because such a rarefied world does not exist. To the extent that we pretend it exists, it is a carefully cultivated illusion. We are certainly not fooling other people with the illusion; we may or may not be fooling ourselves.
(I feel I have some claim to know what the original spirit of the saying was because I happened to be around in the right places at the right time to hear early versions of it. At the time it was fairly strongly a 'there is no point in even trying' remark.)
Bad slide navigation on the web and understanding why it's bad
As usual, I'll start with my tweet:
If the online copy of your slide presentation is structured in 2D, not just 'go forward', please know that I just closed my browser window.
This is sort of opaque because of the 140 character limitation, so let me unpack it.
People put slide decks for their presentations online using various bits of technology. Most of the time how you navigate through those decks is strictly linear; you have 'next slide' and 'previous slide' in some form. But there's another somewhat popular form I run across every so often, where the navigation down in the bottom right corner offers you a left / right / up / down compass rose. Normally you go through the slide deck by moving right (forward), but some slides have more slides below them, so you have to switch to going down until you reach the bottom of that stack and then go right again.
These days, I close the browser window on those slide presentations. They're simply not worth the hassle of dealing with the navigation.
There are a number of reasons why this navigation is bad on the web (and probably in general) beyond the obvious. To start with, there's generally no warning cue on a slide itself that it's the top of an up/down stack of slides (and not all top-level slides are). Instead I have to pay attention to the presence or absence of a little down arrow all the way over on the side of the display, well away from what I'm paying attention to. It is extremely easy to miss this cue and thus skip a whole series of slides. At best this gives me an extremely abbreviated version of the slide deck until I realize what's happened, back up, and try to find the stacks I missed.
This lack of cues combines terribly with the other attribute of slides, which is that good slides are very low density and thus will be read fast. When a slide has at most a sentence or two, I'm going to be spending only seconds per slide (I read fast) and the whole slide deck is often a stream of information. Except that it's a stream that I can't just go 'next next next' through, because I have to stop to figure out what I do next and keep track of whether I'm going right or down and so on. I'm pretty sure that on some 'sparse slides' presentations this would literally double the amount of time I spend per slide, and worse, it interrupts the context of my reading; one moment I'm absorbing this slide, the next I'm switching contexts to figure out where to navigate to, then I'm back to absorbing the next slide, then etc. I get whiplash. It's not a very pleasant way to read something.
Multi-option HTML navigation works best when it is infrequent and clear. We all hate those articles that have been sliced up into multiple pages with only a couple of paragraphs per page, and it's for good reason; we want to read the information, not navigate from page to page to page. The more complicated and obscure you make the navigation, the worse it is. This sort of slide presentation is an extreme version of multi-page articles with less clear navigation than normal HTML links (which are themselves often obscured these days).
I don't think any of this is particularly novel or even non-obvious, and I sure hope that people doing web design are thinking about these information architecture issues. But people still keep designing what I think of as terribly broken web navigation experiences anyways, these slide decks being one of them. I could speculate about why, but all of the reasons are depressing.
(Yes, including that my tastes here are unusual, because if my tastes are unusual it means that I'm basically doomed to a lot of bad web experiences. Oh well, generally I don't really need to read those slide decks et al, so in a sense people are helpfully saving my time. There's an increasing number of links on Twitter that I don't even bother following because I know I won't be able to read them due to the site they're on.)
Sidebar: Where I suspect this design idea got started
Imagine a slide deck where you've added some optional extra material at various spots. Depending on timing and audience interest, you could include some or all of this material or you could skip over it. This material logically 'hangs off' certain slides (in that between slide A and F there are optional slides C, D, and E 'hanging off' A).
This slide structure makes sense to represent in 2D for presentation purposes. Your main line of presentation (all the stuff that really has to be there) is along the top, then the optional pieces go below the various spots they hang off of. At any time you can move forward to the next main line slide, or start moving through the bit of extra material that's appropriate to the current context (ie, you go down a stack).
Then two (or three) things went wrong. First, the presentation-focused structure was copied literally to the web for general viewing, when it probably should have been shifted into linear form. Second, there were no prominent markers added for 'there is extra material below' (the presenter knows this already, but general readers don't). Finally, people took this 2D structure and put important material 'down' instead of restricting down to purely additional material. Now a reader has to navigate in 2D instead of 1D, and is doing so without cues that should really be there.
2016-04-25
Why you mostly don't want to do in-place Linux version upgrades
I mentioned yesterday that we don't do in-place distribution upgrades, eg to go from Ubuntu 12.04 to 14.04; instead we rebuild starting from scratch. It's my view that in-place upgrades of at least common Linux distributions are often a bad idea for a server fleet even when they're supported. I have three reasons for this, in order of increasing importance.
First, an in-place upgrade generally involves more service downtime or at least instability than a server swap. In-place upgrades take some time (possibly in the hours range), during which things may be at least a little bit unstable as core portions of the system are swapped around (such as core shared libraries, Apache and MySQL/PostgreSQL installs, the mailer, your IMAP server, and so on). A server swap is a few minutes of downtime and you're done.
Second, it's undeniable that an in-place upgrade is a bit more risky than a server replacement. With a server replacement you can build and test the replacement in advance, and you also can revert back to the old version of the server if there are problems with the new one (which we've had to do a few times). For most Linux servers, an in-place OS upgrade is a one-way thing that's hard to test.
(In theory you can test it by rebuilding an exact duplicate of your current server and then running it through an in-place upgrade, but if you're going to go to that much more work why not just build a new server to start with?)
But those are relatively small reasons. The big reason to rebuild from scratch is that an OS version change means that it's time to re-evaluate whether what you were customizing on the old OS still needs to be done, whether you're doing it the right way, and whether you now need additional customizations because of new things on the OS. Or, for that matter, because your own environment has changed and something you were reflexively doing is now pointless or wrong. Sometimes this is an obvious need, such as Ubuntu's shift from Upstart in 14.04 LTS to systemd in 16.04, but often it can be more subtle than that. Do you still need that sysctl setting, that kernel module blacklist, or that bug workaround, or has the new release made it obsolete?
Again, in theory you can look into this (and prepare new configuration files for new versions of software) by building out a test server before you do in-place upgrades of your existing fleet. In practice I think it's much easier to do this well and to have everything properly prepared if you start from scratch with the new version. Starting from scratch gives you a totally clean slate where you can carefully track and verify every change you do to a stock install.
Of course all of this assumes that you have spare servers that you can use for this. You may not for various reasons, and in that case an in-place upgrade can be the best option in practice despite everything I've written. And when it is your best option, it's great if your Linux (or other OS) actively supports in-place upgrades (as Debian and, I believe, Ubuntu do), as opposed to supporting them grudgingly (Fedora) or not at all (RHEL/CentOS).
2016-04-24
Why we have CentOS machines as well as Ubuntu ones
I'll start with the tweets that I ran across semi-recently (via @bridgetkromhout):
@alicegoldfuss: "If you're running Ubuntu and some guy comes in and says 'we should use Redhat'...fuck that guy." - @mipsytipsy #SREcon16
mipsytipsy: alright, ppl keep turning this into an OS war; it is not. supporting multiple things is costly so try to avoid it.
This is absolutely true. But, well, sometimes you wind up with exceptions despite how you may feel.
We're an Ubuntu shop; it's the Linux we run and almost all of our machines are Linux machines. Despite this we still have a few CentOS machines lurking around, so today I thought I'd explain why they persist despite their extra support burden.
The easiest machine to explain is the one machine running CentOS 6. It's running CentOS 6 for the simple reason that that's basically the last remaining supported Linux distribution that Sophos PureMessage officially runs on. If we want to keep running PureMessage in our anti-spam setup (and we do), CentOS 6 is it. We'd rather run this machine on Ubuntu, and we used to, before Sophos's last supported Ubuntu version aged out of support.
Our current generation iSCSI backends run CentOS 7 because of the long support period it gives us. We treat these machines as appliances and freeze them once installed, but we still want at least the possibility of applying security updates if there's a sufficiently big issue (an OpenSSH exposure, for example). Because these machines are so crucial to our environment we want to qualify them once and then never touch them again, and CentOS has a long enough support period to more than cover their expected five year lifespan.
Finally, we have a couple of syslog servers and a console server that run CentOS 7. This is somewhat due to historical reasons, but in general we're happy with this choice; these are machines that are deliberately entirely isolated from our regular management infrastructure and that we want to just sit in a corner and keep working smoothly for as long as possible. Basing them on CentOS 7 gives us a very long support period and means we probably won't touch them again until the hardware is old enough to start worrying us (which will probably take a while).
The common feature here is the really long support period that RHEL and CentOS give us. If all we want is basic garden variety server functionality (possibly because we're running our own code on top, as with the iSCSI backends), we don't really care about using the latest and greatest software versions and it's an advantage to not have to worry about big things like OS upgrades (which for us is actually 'build completely new instance of the server from scratch'; we don't attempt in-place upgrades of that degree and they probably wouldn't really work anyways for reasons out of the scope of this entry).
2016-04-23
Why I think Illumos/OmniOS uses PCI subsystem IDs
As I mentioned yesterday, PCI has both
vendor/device IDs and 'subsystem' vendor/device IDs. Here is what
this looks like (in Linux) for a random device on one of our machines
here (from 'lspci -vnn', more or less):
04:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0086] (rev 05)
Subsystem: Super Micro Computer Inc Device [15d9:0691]
[...]
This is the integrated motherboard SAS controller on a SuperMicro motherboard (part of our fileserver hardware). It's using a standard LSI chipset, as reported in the main PCI vendor and device ID, but the subsystem ID says it's from SuperMicro. Similarly, this is an Intel chipset based motherboard, so there are a lot of devices with standard Intel vendor and device IDs but SuperMicro-specific subsystem vendor and device IDs.
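If you want to see both sets of IDs for everything at once on a Linux machine, you can pull them straight out of sysfs instead of parsing lspci output. Here's a little sketch of that in Python; it assumes the usual /sys/bus/pci layout and attribute names, which is what I'd expect on a modern kernel but haven't checked everywhere:

#!/usr/bin/env python3
# List PCI devices with their main and subsystem vendor:device IDs,
# as read from Linux sysfs (a sketch; assumes the standard layout).
import os

PCI_DEVS = "/sys/bus/pci/devices"

def read_id(devdir, attr):
    # sysfs ID files hold values like '0x1000\n'; strip down to bare hex.
    with open(os.path.join(devdir, attr)) as f:
        val = f.read().strip()
    return val[2:] if val.startswith("0x") else val

for dev in sorted(os.listdir(PCI_DEVS)):
    d = os.path.join(PCI_DEVS, dev)
    main = read_id(d, "vendor") + ":" + read_id(d, "device")
    sub = read_id(d, "subsystem_vendor") + ":" + read_id(d, "subsystem_device")
    print(dev, " main", main, " subsystem", sub)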
As far as I know, most systems use the PCI vendor and device IDs and mostly ignore the subsystem vendor and device IDs. It's not hard to see why; the main IDs tell you more about what the device actually is, and there are fewer of them to keep track of. Illumos is an exception, where much of the PCI information you see reported uses subsystem IDs. I believe that a significant reason for this is that Illumos is often attempting to basically fingerprint devices.
Illumos tries hard to have some degree of constant device naming
(at least for their definition of it), so that say 'e1000g0' is
always the same thing. This requires being able to identify specific
hardware devices as much as possible, so you can tie them to the
visible system-level names you've established. This is the purpose
of /etc/path_to_inst and the systems associated with it; it
fingerprints devices on first contact, assigns them an identifier
(in the form of a driver plus an instance number), and thereafter
tries to keep them exactly the same.
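To make this concrete, the lines in /etc/path_to_inst look something like this (the specific path and instance numbers here are illustrative, not copied from a real system):

"/pci@0,0/pci8086,e04@2/pci8086,115e@0" 0 "e1000g"
"/pci@0,0/pci8086,e04@2/pci8086,115e@0,1" 1 "e1000g"

Each line binds a fingerprinted physical device path to a driver plus an instance number; these two would be e1000g0 and e1000g1.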
(From Illumos's perspective the ideal solution would be for each single PCI device to have a UUID or other unique identifier. But such a thing doesn't exist, at least not in general. So Illumos must fake a unique identifier by using some form of fingerprinting.)
If you want a device fingerprint, the PCI subsystem IDs are generally going to be more specific than the main IDs. A whole lot of very different LSI SAS controllers have 1000:0086 as their PCI vendor and device IDs, after all; that's basically the purpose of having the split. Using the SuperMicro subsystem vendor and device IDs ties it to 'the motherboard SAS controller on this specific type of motherboard', which is much closer to being a unique device identifier.
Note that Illumos's approach more or less explicitly errs on the side of declaring devices to be new. If you shuffle which slots your PCI cards are in, Illumos will declare them all to be new devices and force you to reconfigure things. However, this is much more conservative than erring the other way. Essentially Illumos says 'if I can see that something changed, I'm not going to go ahead and use your existing settings'. Maybe it's a harmless change where you just shuffled card slots, or maybe it's a sign of something more severe. Illumos doesn't know and isn't going to guess; you get to tell it.
(I do wish there were better tools to tell Illumos that certain changes were harmless and expected. It's kind of a pain that eg moving cards between PCI slots can cause such a commotion.)
2016-04-22
What Illumos/OmniOS PCI device names seem to mean
When working on an OmniOS system, under normal circumstances you'll
use friendly device names from /dev and things like dladm (for
network devices). However, Illumos-based systems have an underlying
hardware based naming scheme (exposed in /devices), and under
some circumstances you can wind up
dealing with it. When you do, you'll be confronted with relatively
opaque names like '/pci@0,0/pci8086,e04@2/pci8086,115e@0' and
very little clue what these names actually mean, at least if you're
not already an Illumos/Solaris expert.
So let's take just one bit here: pci8086,e04@2. The pci8086,e04
portion is the PCI subsystem vendor and device code, expressed in
hex. You'll probably see '8086' a lot, because it's the vendor code for
Intel. Then the @2 portion is PCI path information expressed relative
to the parent. This can get complicated, because 'path relative to the
parent' doesn't map well to the kinds of PCI names you get on Linux
from eg 'lspci'. When you see a '@...' portion with a comma, that is
what other systems would label
as 'device.function'. If there is no comma in the '@...' portion, the
function is implicitly 0.
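Since the naming rules are simple enough, here's a little Python sketch that splits one of these paths apart according to what I've described above (this is just my reading of the format, not anything official from Illumos):

# Split an Illumos /devices-style PCI path into its components, using the
# naming rules described above: 'pciVVVV,DDDD' is the PCI subsystem vendor
# and device in hex, and '@device,function' is relative path information
# (with the function implicitly 0 if there's no comma).
def parse_pci_path(path):
    parts = []
    for comp in path.strip("/").split("/"):
        name, _, addr = comp.partition("@")
        dev, _, func = addr.partition(",")
        ids = name[3:] if name.startswith("pci") else name
        sub_vendor, _, sub_device = ids.partition(",")
        parts.append({
            "subsystem_vendor": sub_vendor or None,   # eg '8086' (Intel)
            "subsystem_device": sub_device or None,   # eg '115e'
            "device": dev,
            "function": func or "0",
        })
    return parts

print(parse_pci_path("/pci@0,0/pci8086,e04@2/pci8086,115e@0"))

Running it on the example path gives you three components: the bare root, then the 8086,e04 piece at @2, then the 8086,115e piece at @0.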
(Note that the PCI subsystem vendor and device IDs are different
from the main PCI vendor and device IDs. Linux 'lspci -n' shows
only the main vendor and device codes, because those are what's
important for knowing what sort of thing a device is rather than
who exactly made it; you have to use 'lspci -vn' to see the
subsystem stuff. Illumos's PCI names here are inherently framed
as a PCI tree, whereas Linux lspci normally does not show the
tree topology, just flat slot numbering. See 'lspci -t' for the
tree view.)
As far as I can tell, in a modern PCI Express setup the physical
slot you put a card into will determine the first two elements of
the PCI path. '/pci@0,0' is just a (synthetic) PCI root instance,
and then '/pci8086,e04@2' is a specific PCI Express Port. However,
I'm not sure if one PCI Express Port can feed multiple slots and
if it can, I'm not sure how you tell them apart. I'm not quite sure
how things work for plain PCI cards, but for onboard PCI devices
you get PCI paths like '/pci@0,0/pci15d9,714@1a' where the '@1a'
corresponds to what Linux lspci sees as 00:1a.0.
So, suppose that you have a collection of OmniOS servers and you want to know if they have exactly
the same PCI Express cards in exactly the same slots (or, say,
exactly the same Intel 1G based network cards). If you look at
/etc/path_to_inst and see exactly the same PCI paths, you've
got what you want. If you look at the paths and see two systems
with, say:
s1: /pci@0,0/pci8086,e04@2/pci8086,135e@0
s2: /pci@0,0/pci8086,e04@2/pci8086,115e@0
What you have is a situation where the cards are in the same slots
(because the first two elements of the path are the same) but they're
slightly different generations and Intel has changed the PCI subsystem
device code on you (seen in ',135e' versus ',115e'). If you're
transplanting system disks from s2 to s1, this can cause problems
that you'll need to deal with by editing path_to_inst.
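If you have more than a couple of servers to check, it's easy enough to do this comparison mechanically. Here's a sketch in Python that diffs the device paths from two path_to_inst files you've copied to the local machine (the local file names here are just made up for the example):

# Report device paths that appear in one server's /etc/path_to_inst
# but not the other's. Assumes you've copied the files locally under
# the names used below.
def read_paths(fname):
    paths = set()
    with open(fname) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # each line is: "physical path" instance "driver"
            paths.add(line.split('"')[1])
    return paths

s1 = read_paths("s1-path_to_inst")
s2 = read_paths("s2-path_to_inst")
for p in sorted(s1 ^ s2):
    print("s1 only:" if p in s1 else "s2 only:", p)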
I don't know what order Illumos uses when choosing how to assign instances (and thus eg network device names) to hardware when you have multiple instances of the same hardware. On a single card with multiple ports it seems consistent that the port with the lower function is assigned first, eg if you have a dual port card where the ports are pci8086,115e@0 and pci8086,115e@0,1, the @0 port will always be a lower instance than the @0,1 port. How multiple cards are handled is not clear to me and I can't reverse engineer it based on our current hardware.
(While we have multiple Intel 1G dual-port cards in our OmniOS fileservers, they are in PCI Express slots that differ both in the PCI subdevice and in the PCI path information; we have pci8086,e04@2 as the PCI Express Port for the first card and pci8086,e0a@3,2 for the second. I suspect that the PCI path information ('@2' versus '@3,2') determines things here, but I don't know for sure.)
PS: Yes, all of this is confusing (at least to me). Maybe I need to read up on general principles of PCI, PCI Express, and how all the topology stuff works (the PCI bus world is clearly not flat any more, if it ever was).
2016-04-20
A brief review of the HP three button USB optical mouse
The short background is that I'm strongly attached to real three button mice (mice where the middle mouse button is not just a scroll wheel), for good reason. This is a slowly increasing problem primarily because my current three button mice are all PS/2 mice and PS/2 ports are probably going to be somewhat hard to find on future motherboards (and PS/2 to USB converters are finicky beasts).
One of the very few three button USB mice you can find is an HP mouse (model DY651A); it's come up in helpful comments here several times (and see also Peter da Silva). Online commentary on it has been mixed, with some people not very happy with it. Last November I noticed that we could get one for under $20 (Canadian, delivery included), so I had work buy me one; I figured that even if it didn't work for me, having another mouse around for test machines wouldn't be a bad thing. At this point I've used it at work for a few months and I've formed some opinions.
The mouse's good side is straightforward. It's a real three button
USB optical mouse, it works, and it costs under $20 on Amazon.
It's not actually made by HP, of course; it turns out to be a lightly
rebranded Logitech (xinput reports it as 'Logitech USB Optical
Mouse'), which is good because Logitech made a lot of good three
button mice back in the day. There are reports that it's not durable
over the long term but at under $20 a pop, I suggest not caring if
it only lasts a few years. Buy spares in advance if you want to,
just in case it goes out of production on you.
(And if you're coming from a PS/2 ball mouse, modern optical mouse tracking is plain nicer and smoother.)
On the bad side there are two issues. The minor one is that my copy
seems to have become a little bit hair-trigger on the middle mouse
button already, in that every so often I'll click once (eg to do a
single paste in xterm) and X registers two clicks (so I get things
pasted twice in xterm). It's possible that this mouse just needs a
lighter touch in general than I'm used to.
The larger issue for me is that the shape of the mouse is just not
as nice as Logitech's old three button PS/2 mice. It's still a
perfectly usable and reasonably pleasant mouse, it just doesn't
feel as nice as my old PS/2 mouse (to the extent that I can put my
finger on anything specific, I think that the front feels a bit too
steep and maybe too short). My overall feeling after using the HP
mouse for several months is that it's just okay instead of rather
nice the way I'm used to my PS/2 mouse feeling. I could certainly
use the HP mouse; it's just that I'd rather use my PS/2 mouse.
(For reasons beyond the scope of this entry I think it's specifically the shape of the HP mouse, not just that it's different from my PS/2 mouse and I haven't acclimatized to the difference.)
The end result is that I've switched back to my PS/2 mouse at work. Reverting from optical tracking to a mouse ball is a bit of a step backwards but having a mouse that feels fully comfortable under my hand is more than worth it. I currently plan to keep on using my PS/2 mouse for as long as I can still connect it to my machine (and since my work machine is unlikely to be upgraded any time soon, that's probably a good long time).
Overall, if you need a three button USB mouse the HP is cheap and perfectly usable, and you may like its feel more than I do. At $20, I think it's worth a try even if it doesn't work out; if nothing else, you'll wind up with an emergency spare three button mouse (or a mouse for secondary machines).
(And unfortunately it's not like we have a lot of choice here. At least the HP gives us three button people an option.)
How to get Unbound to selectively add or override DNS records
Suppose, not entirely hypothetically, that you're using Unbound and you have a situation where you want to shim some local information into the normal DNS data (either adding records that don't exist naturally or overriding some that do). You don't want to totally overwrite a zone, just add some things. The good news is that Unbound can actually do this, and in a relatively straightforward way (unlike, say, Bind, where if this is possible at all it's not obvious).
You basically have two options, depending on what you want to do with the names you're overriding. I'll illustrate both of these:
local-zone: example.org typetransparent
local-data: "server.example.org A 8.8.8.8"
Here we have added or overridden an A record for server.example.org.
Any other DNS records for server.example.org will be returned
as-is, such as MX records.
local-zone: example.com transparent
local-data: "server.example.com A 9.9.9.9"
We've supplied our own A record for server.example.com, but we've
also effectively deleted all other DNS records for it. If it has
an MX record or a TXT record or what have you, those records will
not be visible. For any names in transparent local-data zones, you
are in complete control of all records returned; either they're in
your local-data stanzas, or they don't exist.
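One way to see the difference is to just ask your Unbound directly with dig (here I'm assuming it's listening on 127.0.0.1):

dig @127.0.0.1 server.example.org MX +short
dig @127.0.0.1 server.example.com MX +short

With the typetransparent zone, the first query will still return whatever real MX records server.example.org has; with the plain transparent zone, the second query will come back empty, because all we supplied for server.example.com was an A record.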
Note that if you just give local-data for something without a
local-zone directive, Unbound silently makes it into such a
transparent local zone.
Transparent local zones have one gotcha, which I will now illustrate:
local-zone: example.net transparent
local-data: "example.net A 7.7.7.7"
Because this is a transparent zone and we haven't listed any NS
records for example.net as part of our local data, people will
not be able to look up any names inside the zone even though we
don't explicitly block or override them. Of course if we did list
some additional names inside example.net as local-data, people would
be able to look them up (and only them). This can be a bit puzzling
until you work out what's going on.
(Since transparent local zones are the default, note that this
happens if you leave out the local-zone or get the name wrong by
mistake or accident.)
As far as I know, there's no way to use a typetransparent zone but delete certain record types for some names, which you'd use so you can do things like remove all MX entries for some host names. However, Unbound's idea of 'zones' doesn't have to map to actual DNS zones, so you can do this:
local-zone: example.org typetransparent
local-data: "server.example.org A 8.8.8.8"
# but:
local-zone: www.example.org transparent
local-data: "www.example.org A 8.8.8.8"
By claiming www.example.org as a separate transparent local zone,
we can delete all records for it except the A record that
we supply; this would remove, say, MX entries. Since I just tried
this out, note that a transparent local zone with no data naturally
doesn't blank out anything, so if you want to totally delete a
name's records you need to supply some dummy record (eg a TXT
record).
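So if you want some name to effectively vanish, you wind up writing something like this (the name here is made up, and I believe the quoting is right for unbound.conf, but check it against your version):

local-zone: hidden.example.org transparent
local-data: 'hidden.example.org TXT "blocked"'

The dummy TXT record exists only to make the transparent zone non-empty for that name; once it's there, every other record type for hidden.example.org disappears from the answers Unbound gives out.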
(We've turned out to not need to do this right now, but since I worked out how to do it I want to write it down before I forget.)
2016-04-19
Today's odd spammer behavior for sender addresses
It's not news that spammers like to forge your own addresses into
the MAIL FROMs of the spam that they're trying to send you; I've
seen this here for some time.
On the machine where I have my sinkhole server running, this clearly comes
and goes. Some of the time almost all the senders will be trying a
legitimate MAIL FROM (often what they seem to be trying to mail
to), and other times I won't see any in the logs for weeks. But
recently there's been a new and odd behavior.
Right now, a surprising number of sending attempts are using a MAIL
FROM that is (or was) a real address, but with the first letter
removed. If 'joey@domain' was once a real address, they are trying
a MAIL FROM of 'oey@domain'. They're not just picking on a single
address that is mutilated this way, as I see the pattern with a
number of addresses.
(Some of the time they'll add some letters after the login name too, eg 'joey@domain' will turn into 'oeyn@domain'.)
So far I have no idea what specific spam campaign this is for because all of the senders have been in the Spamhaus XBL (this currently gets my sinkhole server to reject them as boring spam that I already have enough samples of).
What really puzzles me is what the spammers who programmed this are
thinking. It's quite likely that systems will reject bad
local addresses in MAIL FROMs for incoming email, which means
that starting with addresses you think are good and then mutating
them is a great way to get a lot of your spam sending attempts
rejected immediately. Yet spammers are setting up their systems to
deliberately mutate addresses and then use them as the sender
address, and presumably this both works and is worthwhile for some
reason.
(Perhaps they're trying to bash their way through address obfuscation, even when the address isn't obfuscated.)
(I suspect that this is a single spammer that has latched on to my
now-spamtrap addresses, instead of a general thing. Our general
inbound mail gateway gets too much volume for me to pick through
the 'no such local user' MAIL FROM rejections with any confidence
that I'd spot such a pattern.)
2016-04-18
Why your Apache should have mod_status configured somewhere
Recently, our monitoring system alerted us that our central web server wasn't responding. I poked it and indeed, it wasn't responding, but when I looked at the server everything seemed okay and the logs said it was responding to requests (a lot of them, in fact). Then a little bit later monitoring said it was responding again. Then it wasn't responding. Then my attempt to look at a URL from it worked, but only really slowly.
If you're a long-term Apache wrangler, you can probably already guess the cause. You would be correct; what was going on was that our Apache was being hit with so many requests at once that it was running out of worker processes. If it got through enough work in time, it would eventually pick up your request and satisfy it; if it didn't, you timed out. And if you were lucky, maybe you could get a request in during a lull in all the requests and it would be handled right away.
Once we'd identified the overall cause, we needed to know who or what was doing it. Our central web server handles a wide variety of URLs for a lot of people, some of which can get popular from time to time, so there were a lot of options. And nothing stood out in a quick scan of the logs as receiving a wall of requests or the like. Now, I'm sure that we could have done some more careful log analysis to determine the most active URLs and the most active sources over the last hour or half hour or something, but that would have taken time and effort and we still might have missed something. Instead I took the brute force approach: I added mod_status to the server's configuration, on a non-standard URL with access restrictions, and then I looked at it. A high-volume source IP jumped out right away and did indeed turn out to be our problem.
Apache's mod_status has a bad reputation as an information leak and a security issue, and as a result I think that a lot of people don't enable it these days. Our example shows why you might want to reconsider that. Mod_status offers information that's fairly hard to get in any other way and that's very useful (or essential) when you need it, and it's definitely possible to enable it securely. Someday you will want to know who or what is bogging down your server (or at least what it's doing right now), and a live display of current requests is just the thing to tell you.
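As an illustration of what 'enable it securely' can look like, here's a minimal sketch for Apache 2.4; the URL and the allowed network are deliberately made-up examples, and you'd want to adjust both for your own environment:

ExtendedStatus On
<Location "/our-private-status">
    SetHandler server-status
    # only ourselves and our own (example) network get to see this
    Require ip 127.0.0.1 ::1 192.0.2.0/24
</Location>

ExtendedStatus On gets you the per-request detail (the client and URL of each busy worker); depending on your Apache version it may already be the default when mod_status is loaded, but it doesn't hurt to be explicit, and it's exactly the information that made our problem source jump out.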
(This should not be surprising; live status is valuable for pretty much anything. Even when this sort of information can be approximated or reconstructed from logs, it takes extra time and effort.)