Alerting on high-level 'user stories' failing doesn't work in all setups
One of the things I've heard more than once about monitoring and alerts is that you should focus your testing and alerts on whether or not people can do things on your systems. This is sometimes described as alerting on symptoms not causes, or focusing your alerting around monitoring 'user stories' or 'user journeys' to see if they work. Approached from a (unit) testing mindset, you could say that you want to focus on integration or functional tests, such as 'can we send a mail to a test user and have them retrieve it within a reasonable time', instead of the monitoring equivalent of unit tests, like 'do our machines respond to ICMP pings'.
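To make the contrast concrete, here is a minimal sketch (in Python, with placeholder hosts and URLs that are not real services) of the difference between a cause-level check and a symptom-level one:

```python
# Sketch of 'cause' versus 'symptom' checks. The host and URL used
# here are hypothetical placeholders, not anyone's real services.
import subprocess
import urllib.request

def host_pings(host: str) -> bool:
    """Cause-level check: does the machine answer ICMP pings at all?"""
    return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          capture_output=True).returncode == 0

def user_journey_ok(url: str, timeout: float = 5.0) -> bool:
    """Symptom-level check: can a user actually fetch the page?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A host can pass the first check while failing the second (the web server crashed) and vice versa (ICMP is filtered but the service works), which is exactly why the advice says to alert on the second kind.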
At first, I was going to write an entry about the practical challenges of doing very many of these end to end tests and alerts in our rather different environment. But the more I thought about it, the more I think that this 'user journeys' style of alerting is not universally applicable, or at least is very hard to apply in some environments. As an outsider, it seems that 'user journey' alerts work best in an environment where you have relatively few user services, those services don't have single points of failure in their implementation, and perhaps you have significant churn over time in how the services are implemented and operated. This often describes web applications, which also tend to come with convenient broad problem indicators in the form of monitoring for certain HTTP error results that signal internal issues.
Our environment is not like that. Interpreted broadly, we have many 'services', many single points of failure for entire services or parts of each service, and most services have simple and basically fixed implementations. For example, consider the question of whether a user can log in via SSH to 'our environment'. We have a number of machines that the user could potentially log in to, and their login to any particular machine might fail or hang if their home directory's fileserver was down or having problems. The coverage matrix of SSH login hosts times separate fileservers gets pretty large, and correspondingly it's not clear how useful it is to check whether a single Unix login can log in to a single SSH host (even a popular one), especially if we're already checking that all of our hosts are up and responding on their SSH port.
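A back-of-the-envelope sketch (with made-up numbers) shows why that coverage matrix grows: a genuinely complete end to end test needs one login per (login host, fileserver) pair, using a test account homed on that fileserver.

```python
# Illustrating the coverage matrix for end to end SSH checks.
# The host counts here are hypothetical, not our real environment.
from itertools import product

login_hosts = ["login%d" % n for n in range(1, 6)]    # 5 hypothetical login servers
fileservers = ["fs%d" % n for n in range(1, 11)]      # 10 hypothetical fileservers

# Each (login host, fileserver) pair needs its own test login to
# actually exercise that combination.
checks = list(product(login_hosts, fileservers))
print(len(checks))  # 5 * 10 = 50 end to end logins per check interval
```

And that is before you add more login hosts, more fileservers, or other axes such as different authentication paths.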
(We do have some sort of end to end tests, in situations where we actually saw a unique failure in the past and want to guard against it happening again.)
There are failures that an end to end SSH test and alert would guard against that would not be caught by our current alerts; for example, if our password propagation system mangled /etc/passwd and /etc/shadow on the SSH login host. But we've never had this happen in more than a decade of operation, and an end to end SSH test would also have to deal with things like us deliberately locking out access during maintenance. There are an infinite number of stable doors we could be bolting, so I think it makes sense to pick carefully.
On the other hand, the reason that all of this makes sense for us is that we have a simple environment with generally predictable and static connections between components. In an environment where there is constant change in how services are implemented, monitoring the implementation is a lot like unit testing the internal details of your code; at best, you'll spend a lot of time updating the alerts (the unit tests) as the internal implementation details change. You're better off and have less churn if you test at the 'user journey' level, and perhaps you can specifically design your service to make that feasible and attractive.
(Even then, we can be blindsided by unexpected failures. The moderate saving grace is that these situations tend to make our alerts light up like a Christmas tree, so we at least know that something is going on and can figure out what. We've already learned the lesson that we don't want individual alerts when there's a mass problem, so an alert explosion is not as noisy as it sounds; we implement this with Alertmanager inhibitions.)
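For those unfamiliar with Alertmanager inhibitions, here is a sketch of what such a rule can look like; the alert names and the label used for matching are hypothetical, not our actual configuration.

```yaml
# Sketch of an Alertmanager inhibition rule (v0.22+ matcher syntax).
# Alert names and labels here are made up for illustration.
inhibit_rules:
  - source_matchers:
      - alertname = FileserverDown
    target_matchers:
      - alertname = HostNoSSH
    # Only suppress per-host alerts that share the same fileserver
    # label as the broader 'fileserver down' alert.
    equal: ['fileserver']
```

The effect is that when the broad alert fires, the flood of narrower alerts it implies is suppressed rather than paging you dozens of times.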
PS: In a service that theoretically has no single point of failure, using 'user journey' alerts also means that you don't have to predict in advance how much can go wrong in various places across your possibly many components before problems become user visible. This spares you from fun exercises like trying to work out how many database servers being how busy is 'too many' and 'too busy'.
(This doesn't particularly apply to us because our services aren't redundant in quite that way. We may have several login servers, but users have to pick one to SSH to, and if that one is down they will be unhappy.)