Putting some extra 'obvious' information into our temperature alerts

July 31, 2020

As part of our Prometheus system, we monitor the temperature in our machine rooms and wiring areas and send out alerts if the temperature gets what we consider to be 'too high'. The alert email generated for high temperatures is a slight variation of our general alert message; it has some generic framing, a specific message generated in Prometheus to describe the situation, and a convenient link to the Grafana dashboard for that temperature sensor.

We don't fix the AC systems in our machine rooms ourselves; for the most part, they're considered part of the building's infrastructure and are managed by the university people who look after the buildings. When there's an AC problem, part of what we do is to call those people to notify them of the problem, and there's a standard set of contacts. Probably this is all pretty normal for handling machine room AC.

Last week, we got a temperature alert (fortunately for a transient condition). As I started to deal with the issue, I once again had to remind myself of who we called and what their phone number was. We've had to deal with machine room AC issues often enough recently that I could trace through the logic of who it was, but not so much that I had the phone numbers memorized, so there was a certain amount of going through university websites and scanning some old email from past incidents and so on. As I was doing this, I slapped myself on the forehead, because the AC contact information should have been in the alert email.

This contact information is obvious in one sense, and it doesn't vary from sensor to sensor and alert to alert in the way that the specifics of the situation and even the link to the temperature sensor's dashboard do. But it's completely predictable that we're going to want the information when we get a temperature alert, and for all that it's obvious and standard, I don't generally deal with AC issues often enough to actually remember all of it (my co-workers may have better memories). This makes it a good thing to put in temperature alerts, so once I'd looked up everything (and the temperature had gone down again on its own), I updated the alerts to have a short footer that tells us who to call.
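
As a rough illustration of the idea, here is a minimal Python sketch of attaching a standard footer to a whole class of alerts. This is not our actual alert setup; the 'alert_class' label, the footer wording, the placeholder contacts, and the example URL are all made up for the example.

```python
# Hypothetical sketch: attach a per-class footer to alert emails.
# The 'alert_class' label, the footer wording, and the contact
# placeholders are assumptions for illustration only.

CLASS_FOOTERS = {
    "temperature": (
        "Who to call about machine room AC problems:\n"
        "  <building AC contact name>: <phone number>\n"
        "  <after-hours contact>: <phone number>"
    ),
}


def render_alert_email(labels, message, dashboard_url):
    """Build the email body: specific message, dashboard link, class footer."""
    parts = [message, "Dashboard: " + dashboard_url]
    # The footer depends only on the class of alert, not on the
    # specific sensor or situation, so it is looked up by class label.
    footer = CLASS_FOOTERS.get(labels.get("alert_class", ""))
    if footer:
        parts.append(footer)
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    print(render_alert_email(
        {"alert_class": "temperature", "sensor": "example-room"},
        "Temperature is above the alert threshold.",
        "https://grafana.example.org/d/temperature",
    ))
```

The point of keying the footer on the class of alert is that the contact information only has to be written down once, and every temperature alert picks it up automatically.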

I've read various things on alerting that said alerts should ideally include links to runbooks. However, I always interpreted this as meaning 'runbooks for specific alerts', with the runbooks being for big things, not a little snippet of general information for a whole class of alerts. Of course, in retrospect this is a bit silly.

My moral from this is that I should always try to think through what people getting the alert will immediately want, then consider putting it into the alert (either directly or perhaps on a web page). This is worth thinking about even if it feels like standard and obvious information, because what's obvious now may not be obvious when the alert goes off for the first time in six months.


Comments on this page:

By Todd at 2020-08-01 07:41:38:

On at least one occasion, I've left out information like that from an alert in order to intentionally slow down reporting of the situation because it's often transient. Some of my teammates at the time were junior and would maybe not know to wait a minute.

The standard that my company likes is: "It's 3 AM, you've just been woken up by an alert. The message should be clear about what is happening, why it is serious, and what to do about it. The what-to-do can be a link to the relevant page in the wiki or instructions to call particular people."

When alerts are likely to have false positives, we add in a requirement that the test fail twice in a row before paging us, rather than alerting immediately.

Written on 31 July 2020.

Last modified: Fri Jul 31 23:23:15 2020