Testing Prometheus alert conditions through subqueries
Suppose that you're considering adding a new alert condition to your Prometheus setup, for example alerting if a machine is using 75% or more of user CPU over five minutes. You could just add this alert in some testing mode and see if (and how often) it fires, but ideally you'd like to assess the alert condition beforehand to see if it seems like a useful alert. To start with, it'd be nice to know if it would actually fire at all. With Prometheus subqueries, we can actually answer a lot of these questions.
Let's start with the alert condition itself. The simple way to get user CPU usage for a host (as a 0.0 to 1.0 fraction, so 75% is 0.75), assuming that the host label labels, well, hosts, is:
avg( rate(node_cpu_seconds_total{mode="user"}[5m]) ) by (host)
(We definitely don't want to use irate() here.)
Let's call this our <RULE-EXPR>, because shortening it is going to make everything else shorter. At this point we can ask a very basic question of 'what is the highest five-minute user CPU usage we see across our fleet over the past day' and answer it using a subquery:
topk(30, max_over_time( (<RULE-EXPR>)[24h:] ))
(Note that this will probably use a quite fine grained subquery resolution, since the default subquery resolution is your rule evaluation interval. You may not need so fine a resolution at this step, and it's likely to make things go faster if you use a coarser one, especially if you extend the time interval you're checking over to, say, a week.)
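For example, you could extend the range to a week and give the subquery an explicit, coarser one-minute resolution (both numbers here are arbitrary illustrative choices):
topk(30, max_over_time( (<RULE-EXPR>)[1w:1m] ))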
Just looking at the results will help tell us how many machines this would have triggered on (and how close to our boundary they were). But so far all we know is that our hypothetical alert would trigger some of the time. It would be nice to know more details.
To start with, we can see what percentage of the time our alert would be triggering for any particular machine. First, let's write out our alert condition in full:
avg( rate(node_cpu_seconds_total{mode="user"}[5m]) ) by (host) >= bool 0.75
This is just our <RULE-EXPR> with the '>= bool 0.75' condition stuck on the end; call this our '<ALERT-COND>' for short. Since we used bool, our alert condition is either 0 (if it's not true) or 1 (if it is), and we can see the 0.0 to 1.0 average over time with:
avg_over_time( (<ALERT-COND>)[24h:] )
Since we're using the default subquery resolution of the rule evaluation interval, the answer here is accurate to what an actual alerting rule would return. If you're trying to evaluate this over a long time range you might need to approximate things with a coarser subquery resolution. To give credit where it's due, I got this avg_over_time approach from this Robust Perception blog entry.
(You might want to use topk() here, or at least stick a '> 0' on the end to filter out things for which our alert never fired; see the sketch below.)
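For instance, a sketch that combines topk() with a coarser one-minute resolution over a week (the specific numbers are arbitrary choices):
topk(20, avg_over_time( (<ALERT-COND>)[1w:1m] ))
This would show the twenty hosts that spent the largest fraction of the past week over the threshold.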
It would also be nice to know how many different times this alert would fire, not just what percentage of the time it would be firing. This is possible but more intricate. The first thing we need is a condition that is 1 when the alert first fires and 0 otherwise. We can do this with idelta() on our alert condition:
idelta( (<ALERT-COND>)[1m:] ) > bool 0
(Here I'm assuming that one minute is enough time to guarantee at least two separate metric points for the metrics that our alert condition uses, so that we can reliably detect a transition.)
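One rough way to check that assumption is to look at the minimum number of raw samples that any of the underlying series has in a one-minute window; this is a sketch using the same node_exporter metric as above:
min( count_over_time(node_cpu_seconds_total{mode="user"}[1m]) )
If the result is 2 or more, every series had at least two points in the last minute at the moment you ran the query.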
The idelta() approach works because when the alert condition goes from false to true, the last two subquery points will be 0 and 1 and the idelta() will thus be above 0. We can't just look for a non-zero idelta(), because that would also count when the alert stopped firing (going from 1 to 0, making idelta()'s result negative). We're using bool here to make the result either 0 or 1, because our next step is to count how many times this happens:
sum_over_time( (idelta( (<ALERT-COND>)[1m:] ) > bool 0)[24h:] )
(Again you may want to use topk() or at least '> 0'.)
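As a sketch, the topk() version of this (with an arbitrary count of twenty) would be:
topk(20, sum_over_time( (idelta( (<ALERT-COND>)[1m:] ) > bool 0)[24h:] ))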
This is not the only way to write this expression; we could also leave out the bool and use count_over_time:
count_over_time( (idelta( (<ALERT-COND>)[1m:] ) > 0)[24h:] )
This version is probably more efficient, since it should generate fewer metric points in the outer subquery; it drops all the points where idelta() hasn't detected the alert firing. It also automatically drops hosts where the alert hasn't fired at all. It's a bit more tricky, though, since it relies on the side effect of dropping metric points where the condition isn't true and then counting how many remain.
(Unfortunately I don't think there's any way to find out when these nominal alert triggers happened. The information is there inside Prometheus, as we've seen, but there's no way to get it out without scraping it from the API. PromQL's timestamp() function only works on instant vectors, not range vectors.)