Finding metrics that are missing labels in Prometheus (for alert metrics)
One of the things you can abuse metrics for in Prometheus is to
configure different alert levels, alert destinations, and so on for
different labels within the same metric, as I wrote about back in
my entry on using group_* vector matching for database lookups. The
example in that entry used two metrics for filesystems,
our_zfs_avail_gb and our_zfs_minfree_gb, the former showing the
current available space and the latter
describing the alert levels and so on we want. Once we're using
metrics this way, one of the interesting questions we could ask is
what filesystems don't have a space alert set. As it turns out, we
can answer this relatively easily.
The first step is to be precise about what we want. Here, we want
to know what 'fs' labels are missing from our_zfs_minfree_gb. A fs
label is missing if it's not present in our_zfs_minfree_gb but is
present in our_zfs_avail_gb. Since we're talking about
sets of labels, answering this requires some sort of set operation.
If our_zfs_minfree_gb only has unique values for the fs label
(ie, we only ever set one alert per filesystem), then this is
relatively straightforward:
our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb
The our_zfs_avail_gb metric generates our initial set of known fs
labels. Then we use UNLESS to subtract the set of all fs labels that
are present in our_zfs_minfree_gb. We have to use 'ON(fs)' because the
only label we want to match on between the two metrics is the fs
label itself.
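To make the set subtraction concrete, here is a small worked sketch. The
fs values and the extra 'pool' and 'level' labels are made up for
illustration; they aren't from our real metrics.

```
# Hypothetical input series:
our_zfs_avail_gb{fs="/h/281", pool="tank1"}     120
our_zfs_avail_gb{fs="/h/282", pool="tank1"}     45
our_zfs_minfree_gb{fs="/h/281", level="warn"}   10

# Query:
our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb

# Result: only the left side series whose fs has no match on the right.
our_zfs_avail_gb{fs="/h/282", pool="tank1"}     45
```

The result keeps the labels and value of the left side series, so what
you get back is the available space of every filesystem that has no
alert level set.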
However, this only works if our_zfs_minfree_gb has no duplicate fs
labels. If it does (eg if different people can set their own
alerts for the same filesystem), we'd get a 'duplicate series' error
from this expression. The usual fix is to use a one to many match,
but those can't be combined with set operators like 'unless'. Instead
we must get creative. Since all we care about is the labels and not
the values, we can use an aggregation operation to give us a single
series for each label on the right side of the expression:
our_zfs_avail_gb UNLESS ON(fs) count(our_zfs_minfree_gb) by (fs)
As a side effect of what they do, all aggregation operators condense
multiple instances of a label value this way. It's very convenient
if you just want one instance of it; if you care about the resulting
value being one that exists in your underlying metrics you can use
max() or min().
You can obviously invert this operation to determine 'phantom' alerts,
alerts that have fs labels that don't exist in your underlying metric.
That expression is:
count(our_zfs_minfree_gb) by (fs) UNLESS ON(fs) our_zfs_avail_gb
(Here I'm assuming our_zfs_minfree_gb has duplicate fs labels;
if it doesn't, you get a simpler expression.)
Such phantom alerts might come about from typos, filesystems that haven't been created yet but you've pre-set alert levels for, or filesystems that have been removed since alert levels were set for them.
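If you want the phantom alert results to carry a value that actually
exists in your alert metric, one possible variant (a sketch, not
something from the original expression) swaps count() for max():

```
# Sketch: for each fs that has alert levels set but no
# our_zfs_avail_gb series, report the largest minfree level set for it.
max(our_zfs_minfree_gb) by (fs) UNLESS ON(fs) our_zfs_avail_gb
```

With count() the value is how many alert levels exist for the phantom
filesystem; with max() it's the largest of those levels, which may be
more useful when you're trying to work out who set the alert and why.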
This general approach can be applied to any two metrics where some
label ought to be paired up across both. For instance, you could
cross-check that every node_uname_info metric is matched by one
or more custom per-host informational metrics that your own software
is supposed to generate and expose through the node exporter's
textfile collector.
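As a rough sketch of that cross-check, assume the per-host metric is
called our_host_info (a made-up name) and that it shows up with the
same 'instance' label as the node exporter's uname info metric
(node_uname_info in current versions), since both come from the same
scrape:

```
# Sketch: hosts that report uname information but have no
# our_host_info series at all. our_host_info is a hypothetical name.
node_uname_info UNLESS ON(instance) count(our_host_info) by (instance)
```

The count() aggregation is there for the same reason as before, to
collapse one or more informational series per host down to one.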
(This entry was sparked by a prometheus-users mailing list thread that caused me to work out the specifics of how to do this.)