A pattern for dealing with missing metrics in Prometheus in simple cases
Previously, I mentioned that Prometheus expressions are filters, which is part of Prometheus having a generally set-oriented view of the world. One of the consequences of this view is that you can quite often have expressions that give you a null result when you really want the result to be 0.
For example, let's suppose that you want a Grafana dashboard that
includes a box that tells you how many Prometheus alerts are currently
firing. When alerts fire, Prometheus exposes an ALERTS metric for each
active alert, so on the surface you would count these up with:
count( ALERTS{alertstate="firing"} )
Then one day you don't have any firing alerts and your dashboard's
box says 'N/A' or 'null' instead of the '0' that you want. This
happens because 'ALERTS{alertstate="firing"}' matches nothing, so the
result is a null set, and count() of a null set is a null result (or,
technically, a null set).
The official recommended practice is to not have any metrics and
metric label values that come and go; all of your metrics and label
sets should be as constant as possible. As you can tell from the
official Prometheus ALERTS metric, not even Prometheus itself actually
fully follows this, so we need a way to deal with it.
My preferred way of dealing with this is to use 'or vector(0)' to
make sure that I'm never dealing with a null set. The easiest thing
to use this with is sum():
sum( ALERTS{alertstate="firing"} or vector(0) )
Using sum() has the useful property that the extra vector(0) element
has no effect on the result. You can often use sum() instead of
count() because many sporadic metrics have the value of '1' when
they're present; it's the accepted way of creating what is essentially
a boolean 'I am here' metric such as ALERTS.
If you're filtering for a specific value or value range, you can
still use sum() instead of count() by using bool on the comparison:
sum( node_load1 > bool 10 or vector(0) )
If you're counting a value within a range, be careful where you put
the bool; it needs to go on the last comparison. Eg:
sum( node_load1 > 5 < bool 10 or vector(0) )
If you have to use count() for more complicated reasons, the obvious
approach is to subtract 1 from the result, since the vector(0) element
never shares labels with the real metric and so always adds one extra
element to the count.
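A sketch of that, reusing the alert expression from above:
count( ALERTS{alertstate="firing"} or vector(0) ) - 1
This gives 0 when nothing is firing and the real count otherwise.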
Unfortunately this approach starts breaking down rapidly when you want to do something more complicated. It's possible to compute a bare average over time using a subquery:
avg_over_time( (sum( ALERTS{alertstate="firing"} or vector(0) ))[6h:] )
(Averages over time of metrics that are 0 or 1, like up, are the
classical way of figuring out things like 'what percentage of the
time is my service down'.)
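For instance, the fraction of the past six hours that a scrape target
was down could be computed with something like this (the
job="myservice" label here is just an illustrative stand-in for your
own label matchers):
1 - avg_over_time( up{job="myservice"}[6h] )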
However, I don't know how to do this if you want something like an
average over time by alert name or by hostname. In both cases, alerts
that were present for some of the time were not present for all of it,
and they can't be filled in with 'vector(0)' because the labels don't
match (and can't be made to match). Nor do I know of a good way to get
the divisor for a manual averaging. Perhaps you would want to do an
unnecessary subquery so you can exactly control the step and thus the
divisor. This would be something like:
sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[6h:1m] ) / (6*60)
Experimentation suggests that this provides plausible results, at
least. Hopefully it's not too inefficient. In Grafana, you need to
write the subquery as '[$__range:1m]' but the division as
'($__range_s / 60)', because the Grafana template variable $__range
includes the time units.
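Putting those two substitutions together, the Grafana version of the
query would look something like:
sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[$__range:1m] ) / ($__range_s / 60)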
(See also Existential issues with metrics.)