2024-02-28
Detecting absent Prometheus metrics without knowing their labels
When you have a Prometheus setup, one of the things you sooner or later worry about is important metrics quietly going missing because they're not being reported any more. There can be many reasons for metrics disappearing on you; for example, a network interface you expect to be at 10G speeds may not be there at all any more because it got renamed at some point, so now nothing is checking that the interface under its new name is at 10G.
(This happened to us with one machine's network interface, although I'm not sure exactly how except that it involves the depths of PCIe enumeration.)
The standard Prometheus feature for this is the 'absent()' function, or sometimes absent_over_time().
However, both of these have the problem that because of Prometheus's data model, you need to know at least some unique labels that your metrics are supposed to have. Without labels, all you can detect is the total disappearance of the metric, when nothing at all is reporting it any more. If you want to be alerted when some particular machine stops reporting a metric, you need to list all of the sources that should have the metric (following a pattern we've seen before):
absent(metric{host="a", device="em0"}) or absent(metric{host="b", device="eno1"}) or absent(metric{host="c", device="eth2"})
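As a concrete illustration, an alerting rule built this way might look something like the following sketch; the alert name, metric name, label values, and the 15 minute delay are all made up for the example:

- alert: InterfaceMetricAbsent
  expr: absent(metric{host="a", device="em0"}) or absent(metric{host="b", device="eno1"})
  for: 15m
  annotations:
    summary: "An expected interface metric has not been reported for 15 minutes."

The obvious drawback is exactly the one above: every host and device you care about has to be listed explicitly in the expression and kept up to date by hand.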
Sometimes you don't know all of the label values that your metric will be present with (or it's tedious to list all of them and keep the list up to date), and it's good enough to get a notification if a metric disappears when it was previously there (for a particular set of labels). For example, you might have an assortment of scripts that each put their success results somewhere, and you don't want to have to keep a list of all of the scripts, but you do want to detect when a script stops reporting its metrics. In this case we can use 'offset' to check current metrics against old metrics. The simplest pattern is:
your_metric offset 1h unless your_metric
If the metric was there an hour ago and isn't there now, this will generate the metric as it was an hour ago (with the labels it had then), and you can use that to drive an alert (or at least a notification). If there are labels that might naturally change over time in your_metric, you can exclude them with 'unless ignoring (...)' or use 'unless on (...)' for a very focused result.
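For example, if your_metric carried a hypothetical 'version' label that changes whenever a script is updated, you could ignore it, or instead match only on the labels you actually care about (here, imaginary 'host' and 'script' labels):

your_metric offset 1h unless ignoring (version) your_metric
your_metric offset 1h unless on (host, script) your_metric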
As written this has the drawback that it only looks at what versions of the metric were there exactly an hour ago. We can do better by using an *_over_time() function, for example:
max_over_time( your_metric[4h] ) offset 1h unless your_metric
Now if your metric existed (with some labels) at any point between five hours ago and one hour ago, and doesn't exist now, this expression will give you a result and you can alert on that. Since we're using *_over_time(), you can also leave off the 'offset 1h' and just extend the time range, and then maybe extend the other time range too:
max_over_time( your_metric[12h] ) unless max_over_time( your_metric[20m] )
This expression will give you a result if your_metric has been present (with a given set of labels) at some point in the last 12 hours but has not been present within the last 20 minutes.
(You'd pick the particular *_over_time() function to use depending on what use, if any, you have for the value of the metric in your alert. If you have no particular use for the value (or you expect the value to be a constant), either max or min is efficient for Prometheus to compute.)
All of these clever versions have a drawback, which is that after enough time has gone by they shut off on their own. Once the metric has been missing for at least an hour or five hours or 12 hours or however long, even the first part of the expression has nothing and you get no results and no alert. So this is more of a 'notification' than a persistent 'alert'. That's unfortunately the best you can really do. If you need a persistent alert that will last until you take it out of your alert rules, you need to use absent() and explicitly specify the labels you expect and require.
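If you do go the notification route, the whole thing can still be packaged up as an ordinary alert rule. Here is a sketch with made-up names, where the 'for' delay is kept short precisely because the expression will stop producing results on its own once the metric has been gone for the full 12 hours:

- alert: ScriptMetricDisappeared
  expr: max_over_time( your_metric[12h] ) unless max_over_time( your_metric[20m] )
  for: 5m
  annotations:
    summary: "A previously reported your_metric time series has not been seen in the last 20 minutes."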