A pattern for dealing with missing metrics in Prometheus in simple cases

April 18, 2019

Previously, I mentioned that Prometheus expressions are filters, which is part of Prometheus having a generally set-oriented view of the world. One of the consequences of this view is that you can quite often have expressions that give you a null result when you really want the result to be 0.

For example, let's suppose that you want a Grafana dashboard that includes a box that tells you how many Prometheus alerts are currently firing. When this happens, Prometheus exposes an ALERTS metric for each active alert, so on the surface you would count these up with:

count( ALERTS{alertstate="firing"} )

Then one day you don't have any firing alerts and your dashboard's box says 'N/A' or 'null' instead of the '0' that you want. This happens because 'ALERTS{alertstate="firing"}' matches nothing, so the result is a null set, and count() of a null set is a null result (or, technically, a null set).

The official recommended practice is to not have any metrics and metric label values that come and go; all of your metrics and label sets should be as constant as possible. As you can tell with the official Prometheus ALERTS metric, not even Prometheus itself actually fully follows this, so we need a way to deal with it.

My preferred way of dealing with this is to use 'or vector(0)' to make sure that I'm never dealing with a null set. The easiest thing to use this with is sum():

sum( ALERTS{alertstate="firing"} or vector(0) )

Using sum() has the useful property that the extra vector(0) element has no effect on the result. You can often use sum() instead of count() because many sporadic metrics have the value of '1' when they're present; it's the accepted way of creating what is essentially a boolean 'I am here' metric such as ALERTS.

If you're filtering for a specific value or value range, you can still use sum() instead of count() by using bool on the comparison:

sum( node_load1 > bool 10 or vector(0) )

If you're counting a value within a range, be careful where you put the bool; it needs to go on the last comparison. Eg:

sum( node_load1 > 5 < bool 10 or vector(0) )

If you have to use count() for more complicated reasons, the obvious approach is to subtract 1 from the result.

Unfortunately this approach starts breaking down rapidly when you want to do something more complicated. It's possible to compute a bare average over time using a subquery:

avg_over_time( (sum( ALERTS{alertstate="firing"} or vector(0) ))[6h:] )

(Averages over time of metrics that are 0 or 1, like up, are the classical way of figuring out things like 'what percentage of the time is my service down'.)

However I don't know how to do this if you want something like an average over time by alert name or by hostname. In both cases, even alerts that were present some of the time were not present all of the time, and they can't be filled in with 'vector(0)' because the labels don't match (and can't be made to match). Nor do I know of a good way to get the divisor for a manual averaging. Perhaps you would want to do an unnecessary subquery so you can exactly control the step and thus the divisor. This would be something like:

sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[6h:1m] ) / (6*60)

Experimentation suggests that this provides plausible results, at least. Hopefully it's not too inefficient. In Grafana, you need to write the subquerry as '[$__range:1m]' but the division as '($__range_s / 60)', because the Grafana template variable $__range includes the time units.

(See also Existential issues with metrics.)

Written on 18 April 2019.
« Private browsing mode versus a browser set to keep nothing on exit
One reason ed(1) was a good editor back in the days of V7 Unix »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Apr 18 00:39:58 2019
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.