Finding metrics that are missing labels in Prometheus (for alert metrics)

September 17, 2019

One of the things you can abuse metrics for in Prometheus is to configure different alert levels, alert destinations, and so on for different labels within the same metric, as I wrote about back in my entry on using group_* vector matching for database lookups. The example in that entry used two metrics for filesystems, our_zfs_avail_gb and our_zfs_minfree_gb, the former showing the current available space and the latter describing the alert levels and so on we want. Once we're using metrics this way, one of the interesting questions we could ask is what filesystems don't have a space alert set. As it turns out, we can answer this relatively easily.

The first step is to be precise about what we want. Here, we want to know what 'fs' labels are missing from our_zfs_minfree_gb. A fs label is missing if it's not present in our_zfs_minfree_gb but is present in our_zfs_avail_gb. Since we're talking about sets of labels, answering this requires some sort of set operation.

If our_zfs_minfree_gb only has unique values for the fs label (ie, we only ever set one alert per filesystem), then this is relatively straightforward:

our_zfs_avail_gb UNLESS ON(fs) our_zfs_minfree_gb

The our_zfs_avail_gb metric generates our initial set of known fs labels. Then we use UNLESS to subtract the set of all fs labels that are present in our_zfs_minfree_gb. We have to use 'ON(fs)' because the only label we want to match on between the two metrics is the fs label itself.

However, this only works if our_zfs_minfree_gb has no duplicate fs labels. If it does (eg if different people can set their own alerts for the same filesystem), we'd get a 'duplicate series' error from this expression. The usual fix is to use a one to many match, but those can't be combined with set operators like 'unless'. Instead we must get creative. Since all we care about is the labels and not the values, we can use an aggregation operation to give us a single series for each label on the right side of the expression:

our_zfs_avail_gb UNLESS ON(fs)
   count(our_zfs_minfree_gb) by (fs)

As a side effect of what they do, all aggregation operators condense multiple instances of a label value this way. It's very convenient if you just want one instance of it; if you care about the resulting value being one that exists in your underlying metrics you can use max() or min().

You can obviously invert this operation to determine 'phantom' alerts, alerts that have fs labels that don't exist in your underlying metric. That expression is:

count(our_zfs_minfree_gb) by (fs) UNLESS ON(fs)
   our_zfs_avail_gb

(Here I'm assuimg our_zfs_minfree_gb has duplicate fs labels; if it doesn't, you get a simpler expression.)

Such phantom alerts might come about from typos, filesystems that haven't been created yet but you've pre-set alert levels for, or filesystems that have been removed since alert levels were set for them.

This general approach can be applied to any two metrics where some label ought to be paired up across both. For instance, you could cross-check that every node_info_uname metric is matched by one or more custom per-host informational metrics that your own software is supposed to generate and expose through the node exporter's textfile collector.

(This entry was sparked by a prometheus-users mailing list thread that caused me to work out the specifics of how to do this.)

Written on 17 September 2019.
« The problem of 'triangular' Network Address Translation
Converting a Go pointer to an integer doesn't quite do what it looks like »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Sep 17 00:12:27 2019
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.