Finding metrics that are missing labels in Prometheus (for alert metrics)
One of the things you can abuse metrics for in Prometheus is to
configure different alert levels, alert destinations, and so on for
different labels within the same metric, as I wrote about back in
my entry on using group_* vector matching for database lookups. The example in that entry used two metrics,
our_zfs_avail_gb and our_zfs_minfree_gb,
the former showing the current available space and the latter
describing the alert levels and so on we want. Once we're using
metrics this way, one of the interesting questions we could ask is
what filesystems don't have a space alert set. As it turns out, we
can answer this relatively easily.
The first step is to be precise about what we want. Here, we want
to know what 'fs' labels are missing from our_zfs_minfree_gb. An
fs label is missing if it's not present in our_zfs_minfree_gb
but is present in our_zfs_avail_gb. Since we're talking about
sets of labels, answering this requires some sort of set operation.
If our_zfs_minfree_gb only has unique values for the fs label
(ie, we only ever set one alert per filesystem), then this is
straightforward:
our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb
The our_zfs_avail_gb metric generates our initial set of known
fs labels. Then we use UNLESS to subtract the set of all fs
labels that are present in our_zfs_minfree_gb. We have to use
'ON(fs)' because the only label we want to match on between the
two metrics is the fs label itself.
However, this only works if
our_zfs_minfree_gb has no duplicate
fs labels. If it does (eg if different people can set their own
alerts for the same filesystem), we'd get a 'duplicate series' error
from this expression. The usual fix is to use a one to many match,
but those can't be combined with set operators like
'unless'. Instead we must get creative. Since all we care
about is the labels and not the values, we can use an aggregation
to give us a single series for each label on the right side of the
expression:
our_zfs_avail_gb UNLESS ON(fs) count(our_zfs_minfree_gb) by (fs)
As a side effect of what they do, all aggregation operators condense
multiple instances of a label value this way. It's very convenient
if you just want one instance of it; if you care about the resulting
value being one that exists in your underlying metrics you can use
an aggregation such as max() or min() instead.
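As a small sketch of that last point, max() and min() collapse duplicate fs labels just as count() does, but the value they report is one that some underlying series actually has. Using the same metric as above, this gives one series per filesystem carrying the largest minimum-free level anyone has set for it:

```promql
max(our_zfs_minfree_gb) by (fs)
```

On the right side of UNLESS the values are thrown away anyway, so there the choice of aggregation makes no difference; it only matters when the aggregated side is the one whose results you get back.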
You can obviously invert this operation to determine 'phantom' alerts,
alerts that have
fs labels that don't exist in your underlying metric.
That expression is:
count(our_zfs_minfree_gb) by (fs) UNLESS ON(fs) our_zfs_avail_gb
(Here I'm assuming our_zfs_minfree_gb has duplicate fs labels;
if it doesn't, you get a simpler expression.)
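If each fs label appears only once in our_zfs_minfree_gb, the simpler form is presumably just the first expression with the two sides swapped. This is a sketch under that uniqueness assumption:

```promql
our_zfs_minfree_gb UNLESS ON(fs) our_zfs_avail_gb
```

A side benefit of skipping the aggregation is that the results keep all of the original labels and values of the alert metric, not just fs, which may make it easier to track down where a phantom alert came from.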
Such phantom alerts might come about from typos, filesystems that haven't been created yet but you've pre-set alert levels for, or filesystems that have been removed since alert levels were set for them.
This general approach can be applied to any two metrics where some
label ought to be paired up across both. For instance, you could
cross-check that every
node_uname_info metric is matched by one
or more custom per-host informational metrics that your own software
is supposed to generate and expose through the node exporter's
textfile collector.
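A sketch of such a cross-check might look like the following. Here our_host_info is a purely hypothetical name standing in for whatever metric your own software writes out, and matching on the instance label assumes that both metrics are scraped from the same node exporter target:

```promql
node_uname_info UNLESS ON(instance) count(our_host_info) by (instance)
```

Any series this returns would be a host that the node exporter is reporting on but that has no custom informational metric. The count() is there for the same reason as before, in case a host exposes more than one our_host_info series.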
(This entry was sparked by a prometheus-users mailing list thread that caused me to work out the specifics of how to do this.)
Written on 17 September 2019.