Thinking about what we want to be alerted about
Thinking about the broad subject of what we probably want for metrics, alerting, and so on leads pretty much straight to the obvious question of what do we want to be alerted about in the first place. It may seem peculiar to have to ask this, but we've sort of drifted into our current set of alerts and non-alerts over time (probably like many places). So our current alerts are some combination of things that seemed obvious at the time, things we added in reaction to stuff that happened, and things that haven't been annoying enough yet to turn off. This is probably not what we actually want to alert on or what we would alert on if we were starting over from scratch (which we probably are going to).
For many people the modern day answer here is pretty straightforward: you alert if your user-facing service is down or significantly impaired in a way that's visible to people. You may also alert on early warning signs that this is going to happen if you don't do something fairly soon. We're in an unusual environment in that we don't really run services like this, and in many cases there is nothing we can do about impaired user-visible stuff. An obvious case is that if people fill up their filesystem, well, that's how much storage they bought. A less obvious case is that if our IMAP server is wallowing, it's quite possible that there's nothing wrong as such that we can fix; it's just slow because lots of people are using it.
My current thoughts are that we want to be alerted on the following things:
- Actual detected outages, for both hosts and some readily checkable services. Almost all of our hosts are pets, so we definitely care if one goes down and we're going to actively try to bring it back up. I expect this to be our primary source of alerts.
- Indicators that are good signs of actual outages that we can't detect directly, such as a significant rise in machine room temperature, which is a good sign of AC failure or other serious problems.
- Things that very strongly presage actual outages. One example for us is /var/mail getting too full, because if it fills up entirely a lot of email stuff will stop working very well.
(I'm not sure we have very many of the latter two types.)
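As a rough illustration of the 'presages an outage' category, a check for /var/mail getting too full could look something like the sketch below. The function names and the 90% threshold are made up for the example, not something we've settled on; the right threshold depends on how fast the filesystem normally fills.

```python
import shutil


def percent_used(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def too_full(path, threshold_percent=90.0):
    """True if the filesystem holding `path` is at or above the threshold.

    The 90% default is an assumption for illustration only.
    """
    return percent_used(path) >= threshold_percent


if __name__ == "__main__":
    # /var/mail is the example from the text; any mounted path works.
    path = "/var/mail"
    try:
        if too_full(path):
            print(f"ALERT: {path} is {percent_used(path):.1f}% full")
    except OSError as exc:
        # A missing or unreadable path is itself worth knowing about.
        print(f"ALERT: cannot check {path}: {exc}")
```

A check like this only says 'full right now'; it doesn't try to predict when the filesystem will fill up, which would need some history of the usage over time.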
If we use a weak sense of 'alerting' that is more 'informing' than 'poking us to do something', there may also be a use in alerting us about things that are sufficiently crucial but that we probably can't do anything about. If the department's administrative staff fill up all of their disk space, our hands are tied but at least we can know why they're suddenly having problems. This only works if the resulting alerts are infrequent.
(One possible answer here is that we should deal with 'be informed' cases by having some dashboards instead. Then if someone reports problems, we can turn to our dashboards and say 'ah, it looks like <X> happened'.)
Detecting that hosts are down is fairly straightforward; our traditional approach is to check to see if a host pings and if it answers on port 22 with an SSH banner. Detecting when services are down is potentially quite complicated, so I suspect that we want to limit ourselves to simple checks of straightforward things that are definitely indicators of problems rather than spending a lot of effort building end-to-end tests or figuring out excessively clever ways of, say, checking that our DHCP server is actually giving out DHCP leases. Checking whether all of our mail handling machines respond to SMTP connections is crude, but it has the twin virtues that it's easy and if it fails, we definitely know that we have a problem.
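To give a sense of how simple these checks can be, here is a minimal sketch of the 'does it answer with a banner' test for SSH and SMTP. The function names and timeouts are invented for the example. It leaves out the ping check, which needs either raw sockets or running an external ping command, and it assumes that the first thing the server sends is the SSH version string or an SMTP 220 greeting.

```python
import socket


def banner_starts_with(host, port, prefix, timeout=5.0):
    """Return True if host:port accepts a TCP connection and the first
    bytes it sends start with `prefix` (a bytes object).

    Any connection failure or timeout counts as 'down'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            data = b""
            # Keep reading until we have enough bytes to compare, or the
            # other end closes the connection.
            while len(data) < len(prefix):
                chunk = conn.recv(256)
                if not chunk:
                    break
                data += chunk
            return data.startswith(prefix)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def ssh_is_up(host, timeout=5.0):
    # Simplification: SSH servers are allowed to send other lines before
    # the "SSH-" version string, which this check would treat as a failure.
    return banner_starts_with(host, 22, b"SSH-", timeout)


def smtp_is_up(host, timeout=5.0):
    # A healthy SMTP server greets new connections with a 220 reply.
    return banner_starts_with(host, 25, b"220", timeout)
```

This deliberately stops at the banner. It says nothing about whether logins work or mail is actually delivered, but if it fails we definitely have a problem.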
I'm not sure if this is fewer or more alerts than what we've currently wound up with, and in a sense it doesn't matter. What I'm most interested in is having a framework where we can readily answer the question 'should we alert on <X>?', or at least have a general guide for it.
(One implication of our primary source of alerts being detected outages is that status checks are probably the most important thing to have good support for in a future alert system. Another one is that we need to figure out if and how we can detect certain sorts of outages, like an NFS server that has stopped responding as opposed to one that has just gotten really slow.)