Testing Prometheus alert conditions through subqueries

March 11, 2019

Suppose that you're considering adding a new alert condition to your Prometheus setup, for example alerting if a machine is using 75% or more of user CPU over five minutes. You could just add this alert in some testing mode and see if (and how often) it fires, but ideally you'd like to assess the alert condition beforehand to see if it seems like a useful alert. To start with, it'd be nice to know if it would actually fire at all. With Prometheus subqueries, we can actually answer a lot of these questions.

Let's start with the alert condition itself. The simple way to get user CPU usage for a host as a 0.0 to 1.0 fraction (so 75% is 0.75), assuming that the host label labels, well, hosts, is:

avg( rate(node_cpu_seconds_total{mode="user"}[5m]) ) by (host)

(We definitely don't want to use irate() here, since it only looks at the last two metric points instead of averaging over the full five minutes.)

Let's call this our <RULE-EXPR>, because shortening it is going to make everything else shorter. At this point we can ask a very basic question of 'what is the highest five-minute user CPU usage we see across our fleet over the past day' and answer it using a subquery:

topk(30, max_over_time( (<RULE-EXPR>)[24h:] ) )

(Note that this will probably use quite a fine-grained subquery resolution, since the default subquery resolution is your rule evaluation interval. You may not need so fine a resolution at this step, and it's likely to make things go faster if you use a coarser one, especially if you extend the time interval you're checking over to, say, a week.)
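
For example, if we want to look back over a week and are willing to accept a coarser resolution, we might explicitly ask for something like a two-minute resolution (both numbers here are just illustrative choices):

topk(30, max_over_time( (<RULE-EXPR>)[1w:2m] ) )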

Just looking at the results will help tell us how many machines this would have triggered on (and how close to our boundary they were). But so far all we know is that our hypothetical alert would trigger some of the time. It would be nice to know more details.

To start with, we can see what percentage of the time our alert would be triggering for any particular machine. First, let's write out our alert condition:

avg( rate(node_cpu_seconds_total{mode="user"}[5m]) ) by (host) >= bool 0.75

This is just our <RULE-EXPR> with the '>= bool 0.75' condition stuck on the end; call this our '<ALERT-COND>' for short. Since we used bool, our alert condition is either 0 (if it's not true) or 1 (if it is), and we can see the 0.0 to 1.0 average over time with:

avg_over_time( (<ALERT-COND>)[24h:] )
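
Written out in full, with our <ALERT-COND> shorthand substituted back in, this is:

avg_over_time( (avg( rate(node_cpu_seconds_total{mode="user"}[5m]) ) by (host) >= bool 0.75)[24h:] )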

Since we're using the default subquery resolution of the rule evaluation interval, the answer here is accurate to what an actual alerting rule would return. If you're trying to evaluate this over a long time range you might need to approximate things with a coarser subquery resolution. To give credit where it's due, I got this avg_over_time approach from this Robust Perception blog entry.

(You might want to use topk() here, or at least stick a '> 0' on the end to filter out things for which our alert never fired.)
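
For instance, a sketch of a version that shows only the ten hosts with the highest trigger percentage, and only if the alert would have fired at all (the 10 is an arbitrary choice), is:

topk(10, avg_over_time( (<ALERT-COND>)[24h:] ) > 0)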

It would also be nice to know how many different times this alert would fire, not just what percentage of the time it would be firing. This is possible but more intricate. The first thing we need is a condition that is 1 when the alert first fires and 0 otherwise. We can do this with idelta() on our alert condition:

idelta( (<ALERT-COND>)[1m:] ) > bool 0

(Here I'm assuming that one minute is enough time to guarantee at least two separate metric points for the metrics that our alert condition uses, so that we can reliably detect a transition.)
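
If you're not sure your rule evaluation interval is short enough to give you that, one option is to pin down the subquery resolution explicitly; a sketch, assuming that a 15-second resolution is reasonable for your setup, is:

idelta( (<ALERT-COND>)[1m:15s] ) > bool 0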

This works because when the alert condition goes from false to true, the last two subquery points will be 0 and 1 and the idelta() will thus be above 0. We can't just look for non-zero idelta(), because that would also count when the alert stopped firing (going from 1 to 0, making idelta()'s result negative). We're using bool here to make this either 0 or 1, because our next step is to count how many times this happens:

sum_over_time( (idelta( (<ALERT-COND>)[1m:] ) > bool 0)[24h:] )

(Again you may want to use topk() or at least '> 0'.)

This is not the only way to write this expression; we could also leave out the bool and use count_over_time:

count_over_time( (idelta( (<ALERT-COND>)[1m:] ) > 0)[24h:] )

This version is probably more efficient, since it should generate fewer metric points in the outer subquery by dropping all metric points where the idelta() hasn't detected the alert firing. It also automatically drops things where the alert hasn't fired at all. It's a bit more tricky, though, since it's using the side effect of dropping metric points where the condition isn't true and then counting how many remain.

(Unfortunately I don't think there's any way to find out when these nominal alert triggers happened. The information is there inside Prometheus, as we've seen, but there's no way to get it out without scraping it from the API. PromQL's timestamp() function only works on instant vectors, not range vectors.)
