Alerting on high level 'user stories' failing doesn't work in all setups

September 1, 2023

One of the things I've heard more than once about monitoring and alerts is that you should focus your testing and alerts on whether or not people can do things on your systems. This is sometimes described as alerting on symptoms not causes, or focusing your alerting around monitoring 'user stories' or 'user journeys' to see if they work. Approached from a (unit) testing mindset, you could say that you want to focus on integration or functional tests, such as 'can we send a mail to a test user and have them retrieve it within a reasonable time', instead of the monitoring equivalent of unit tests, like 'do our machines respond to ICMP pings'.
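As a sketch of what such a symptom-level check might look like, here is the 'send a mail and retrieve it' journey structured as a polling loop. The `send` and `fetch` helpers are hypothetical stand-ins (in a real check they would talk SMTP and IMAP via something like `smtplib` and `imaplib`); they're injected as parameters so the timing logic itself is testable without mail servers:

```python
import time

def mail_roundtrip_ok(send, fetch, timeout=60.0, poll_interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Symptom-level check: send a tagged test message, then poll until
    it is retrievable or the deadline passes.

    send() delivers a uniquely-tagged test message and returns its tag;
    fetch(tag) returns True once that message can be retrieved. Both are
    injected (hypothetical helpers, not from the post) so this can be
    exercised without real mail servers.
    """
    tag = send()
    deadline = clock() + timeout
    while clock() < deadline:
        if fetch(tag):
            return True   # the user-visible journey works
        sleep(poll_interval)
    return False          # alert: users likely can't get their mail

# Example run with fakes standing in for SMTP/IMAP:
polls = []
def fake_send():
    return "msg-123"
def fake_fetch(tag):
    polls.append(tag)
    return len(polls) >= 2   # message "arrives" on the second poll

print(mail_roundtrip_ok(fake_send, fake_fetch,
                        poll_interval=0, sleep=lambda s: None))  # prints True
```

The point of the shape is that it only knows about the user-visible outcome, not about which mail server, spool disk, or network path is involved.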

At first, I was going to write an entry about the practical challenges of doing very much of these end to end tests and alerts in our rather different environment. But the more I thought about it, the more I think that this 'user journeys' style of alerting is not entirely generally applicable, or at the least is very hard to apply in some environments. As an outsider, it seems that 'user journey' alerts work best in an environment where you have relatively few user services and these services don't have single points of failure in their implementation, and perhaps you have significant churn over time in how these services are implemented and operated. This often describes web applications, which also tend to come with convenient broad problem indicators in the form of monitoring for certain HTTP error results that signal internal issues.

Our environment is not like that. Interpreted broadly, we have many 'services', many single points of failure for entire services or parts of each service, and most services have simple and basically fixed implementations. For an example, consider the question of whether a user can log in via SSH to 'our environment'. We have a number of machines that the user could potentially log into, and their login to any particular machine might fail or hang if their home directory's fileserver was down or having problems. The coverage matrix for SSH login hosts times separate fileservers gets pretty large, and correspondingly it's not clear how useful it is to check if a single Unix login can log in to a single SSH host (even a popular host), especially if we're already checking to see that all of our hosts are up and responding on their SSH port.
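To make the combinatorics concrete (the host and fileserver names here are made up for illustration), the end-to-end matrix grows multiplicatively while per-machine liveness checks only grow additively:

```python
from itertools import product

# Hypothetical names; the real environment's hosts and fileservers differ.
login_hosts = ["apps0", "apps1", "apps2", "apps3"]
fileservers = ["fs1", "fs2", "fs3", "fs4", "fs5"]

# One end-to-end 'can a user on fileserver F log in to host H?' check per
# pair, since a login can hang on any combination of the two.
checks = list(product(login_hosts, fileservers))
print(len(checks))  # prints 20: 4 hosts x 5 fileservers

# Versus the monitoring 'unit test' style: one liveness check per machine.
print(len(login_hosts) + len(fileservers))  # prints 9
```

With more hosts and fileservers the gap widens quickly, which is part of why checking every combination end to end stops being attractive.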

(We do have some sort of end to end tests, in situations where we actually saw a unique failure in the past and want to guard against it happening again.)

There are failures that an end to end SSH test and alert would guard against that would not be caught by our current alerts; for example, if our password propagation system mangled /etc/passwd and /etc/shadow on the SSH login host. But we've never had this happen in more than a decade of operation, and an end to end SSH test would also have to deal with things like us deliberately locking out access during maintenance. There are an infinite number of stable doors we could be bolting, so I think it makes sense to pick carefully.
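A sketch of what such an SSH end-to-end check might look like, including the maintenance-lockout complication. The `in_maintenance` predicate is a hypothetical hook (it might read a flag file or a schedule); the actual `ssh` invocation uses only standard OpenSSH options:

```python
import subprocess

def ssh_login_check(host, test_user, in_maintenance):
    """End-to-end SSH check sketch: returns 'skipped', 'ok', or 'failed'.

    in_maintenance(host) is a hypothetical predicate for deliberate
    lockouts, so planned maintenance doesn't page anyone.
    """
    if in_maintenance(host):
        return "skipped"
    # BatchMode forbids password prompts so an auth problem fails fast
    # instead of waiting for input; the outer timeout guards against a
    # login that hangs on an unavailable home-directory fileserver.
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
             f"{test_user}@{host}", "true"],
            timeout=30, capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return "failed"
    return "ok" if result.returncode == 0 else "failed"

# During a declared maintenance window, the check steps aside:
print(ssh_login_check("apps0", "testuser",
                      in_maintenance=lambda h: True))  # prints skipped
```

Even this small sketch shows the hidden cost: the check needs a test account with keys deployed everywhere, knowledge of maintenance windows, and its own timeout tuning, all for a failure mode that may never occur.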

On the other hand, the reason that all of this makes sense for us is that we have a simple environment with generally predictable and static connections between components. In an environment where there is a constant change in how services are implemented, monitoring the implementation is a lot like unit testing the internal details of your code; at the best, you'll spend a lot of time updating the alerts (the unit tests) as the internal implementation details change. You're better off and have less churn if you test at the 'user journey' level, and perhaps you can specifically design your service to make that feasible and attractive.

(Even then, we can be blindsided by unexpected failures. The moderate saving grace is that these situations tend to make our alerts light up like a Christmas tree, so we at least know that something is going on and can figure out what. We've already learned the lesson that we don't want individual alerts when there's a mass problem, so an alert explosion is not as noisy as it sounds; we implement this with Alertmanager inhibitions.)
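For illustration, an Alertmanager inhibition of the general shape involved might look like the following sketch (the alert and label names here are invented, not our actual configuration):

```yaml
# Sketch of an Alertmanager inhibit rule: while a fileserver-down alert
# is firing, suppress the per-host alerts that carry the same
# 'fileserver' label, so a mass problem produces one page, not dozens.
inhibit_rules:
  - source_matchers:
      - alertname = FileserverDown
    target_matchers:
      - alertname =~ "HomeDirHang|LoginSlow"
    equal: ['fileserver']
```

The `equal` list is what scopes the suppression: only target alerts sharing the source alert's `fileserver` label value are inhibited, so an unrelated problem elsewhere still pages.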

PS: In a service that theoretically has no single point of failure, using 'user journey' alerts also means that you don't have to try to predict in advance how much can go wrong in various places in your possibly many components before problems become user visible. This spares you from fun exercises like trying to figure out in advance how many database servers being how busy is 'too many' and 'too busy'.

(This doesn't particularly apply to us because our services aren't redundant in quite that way. We may have several login servers, but users have to pick one to SSH to, and if that one is down they will be unhappy.)


Comments on this page:

By Joseph at 2023-09-02 09:16:35:

So I think of alerting on symptoms as a way to reduce the risk of false positives. In other words this:

https://paulbellamy.com/2017/08/symptoms-not-causes

Alerting on causes is easy, but all too frequently an engineer doesn't put the appropriate level of thinking into whether an alert could trigger when everything is fine.

I think of this as a principle or guideline; pragmatism matters. If I am managing a server, a disk running out of space has always been a bad sign. On the other hand, CPU utilization alerts are a mass producer of false positives.

By Simon at 2023-09-04 23:10:34:

Don't get me wrong, I'm sure you know what works well for you. I'm not arguing you should change your monitoring. So this is just a comment about two details you mentioned:

[...] This often describes web applications [...]

I'm surprised that you take a web app as an example, because for them it's pretty tricky to do a proper end to end test of the user interface. Consider for example a mail service. Testing IMAP/POP/SMTP with some example user is rather easy to implement. On the other hand, testing a webmail client is much more work (assuming you actually test the user interface, not the API the UI is using), especially if you expect significant churn in the user interface, as you mention.

The coverage matrix for SSH login hosts times separate fileservers gets pretty large, [...]

So what? It's the same test just invoked with different variables. It's the same as for simpler "ICMP ping" or "SSH port is open" tests you mention.
