The convenience of multi-purpose monitoring (in Prometheus)

March 7, 2022

Recently I mentioned to someone that our TLS certificate expiry alerting is very convenient in that we mostly don't have to do anything specific to monitor TLS certificate expiry. Instead, we get it for free when we start monitoring a service that uses TLS. For example, if we add a check that a new HTTPS website here is up properly, that automatically adds TLS certificate expiry monitoring.

The specific mechanics of this are that the Prometheus Blackbox exporter that lets you do external health checks of services also exposes various metrics for TLS certificate expiry (the exact details are somewhat complicated; see my entries on what the metrics monitor and how they'd have reacted to a CA root expiring). The natural way to set up a TLS certificate expiry alert for almost everything is to use these already gathered Blackbox metrics.
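As a rough illustration of how one probe result can feed two separate alerts, here is a minimal Python sketch. It is not how we actually do it (in Prometheus this logic would normally live in alert rules). The metric names `probe_success` and `probe_ssl_earliest_cert_expiry` are ones the Blackbox exporter exposes; the sample values, the alert names, and the 21-day threshold are all made up for the example.

```python
# Made-up sample of what a single Blackbox HTTPS probe might report.
SAMPLE_PROBE_OUTPUT = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
# TYPE probe_ssl_earliest_cert_expiry gauge
probe_ssl_earliest_cert_expiry 1.6500000e+09
"""


def parse_metrics(text):
    """Parse simple 'name value' lines of the Prometheus text format.

    This is deliberately simplistic: it skips comments and assumes the
    value is the last whitespace-separated token on the line.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(None, 1)
        metrics[name] = float(value)
    return metrics


def alerts_from_probe(metrics, now, warn_days=21):
    """Derive several (hypothetical) alerts from one probe's metrics."""
    alerts = []
    # Purpose 1: is the service up at all?
    if metrics.get("probe_success", 0.0) != 1.0:
        alerts.append("SiteDown")
    # Purpose 2: is the TLS certificate we saw about to expire?
    expiry = metrics.get("probe_ssl_earliest_cert_expiry")
    if expiry is not None and (expiry - now) < warn_days * 86400:
        alerts.append("TLSCertExpiringSoon")
    return alerts
```

The point of the sketch is that `alerts_from_probe()` never needs to be told that a site uses TLS. If the expiry metric is there, the expiry check happens.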

The broad pattern here is what I will call multi-purpose monitoring. One probe, one monitoring mechanism, winds up serving multiple purposes and driving multiple alerts. Every time we add monitoring for one particular purpose, we get all of the other purposes along for free. We don't have to remember to specifically set up monitoring for TLS certificate expiry for all of our websites; it's enough that we monitor that they're up via HTTPS probes.

(This also shows the benefits of collecting and providing as many metrics as possible for a given probe, or at least as many as is reasonable. All of this TLS alerting only happens because the Blackbox exporter automatically exposes all of those TLS metrics whenever it does TLS as part of a check.)

Multi-purpose monitoring is a great multiplier of effort (you add one check and get so much for it), and it's also a good way to make sure you don't overlook monitoring some aspect of something. If we have a service, the odds are relatively high that we'll remember we want to check some aspect of it, even if we don't remember everything we might need to check. If all of that monitoring comes along provided that we remember to add one piece of it, that's a good thing.

Of course there can be inconveniences as well, as illustrated by what would happen if we monitored whether we could reach Facebook's (HTTPS) website. For some time, Facebook has been using TLS certificates that are very close to expiring, which means that by default we would generate alerts about them. We only really want TLS certificate expiry alerts for websites (or more broadly endpoints) that we control or at least can influence, not for everything that we merely want to know we can (still) talk to.

(We deal with this by marking some hosts as 'external' and not triggering some sorts of alerts for such hosts. But we had to realize the potential issue and go out of our way to deal with it.)
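The idea of marking hosts as 'external' can be sketched roughly like this in Python. This is only an illustration under assumptions, not our actual configuration; the `Target` class, the `external` flag, and the alert names are all hypothetical stand-ins for whatever labels and alert rules a real setup would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A monitored endpoint (hypothetical structure for this sketch)."""
    name: str
    external: bool = False  # True for things we can't control or influence


# Hypothetical set of alerts that only make sense for hosts we can fix.
OWNER_ONLY_ALERTS = frozenset({"TLSCertExpiringSoon"})


def filter_alerts(target, alerts):
    """Drop owner-only alerts for external targets; keep everything else."""
    if not target.external:
        return list(alerts)
    return [a for a in alerts if a not in OWNER_ONLY_ALERTS]
```

With this, an external target such as `Target("https://www.facebook.com/", external=True)` would still produce a 'site down' style alert but never a certificate expiry one, while our own hosts keep getting both.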

Written on 07 March 2022.

Last modified: Mon Mar 7 23:42:35 2022