Exploring the start time of Prometheus alerts via ALERTS_FOR_STATE

June 2, 2019

In Prometheus, active alerts are exposed through two metrics, the reasonably documented ALERTS and the under-documented newer metric ALERTS_FOR_STATE. Both metrics have all of the labels of the alert (although not its annotations) plus an 'alertname' label; the ALERTS metric also has an additional 'alertstate' label. The value of the ALERTS metric is always '1', while the value of ALERTS_FOR_STATE is the Unix timestamp of when the alert rule expression started being true; for rules with 'for' delays, this means it is the timestamp of when they became 'pending', not when they became 'firing' (see this rundown of the timeline of an alert).
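As an illustration (with a made-up alert name, labels, and timestamp of my own), the pair of metrics for a single firing alert might look something like this:

ALERTS{alertname="HostDown", alertstate="firing", host="apps0"}   1
ALERTS_FOR_STATE{alertname="HostDown", host="apps0"}   1559440800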

(The ALERTS_FOR_STATE metric is an internal one added in 2.4.0 to persist the state of alerts so that 'for' delays work over Prometheus restarts. See "Persist 'for' State of Alerts" for more details, and also Prometheus issue #422. Because of this, it's not exported from the local Prometheus and may not be useful to you in clustered or federated setups.)

The ALERTS_FOR_STATE metric is quite useful if you want to know the start time of an alert, because this information is otherwise pretty much unavailable through PromQL. The necessary information is sort of in Prometheus's time series database, but PromQL does not provide any functions to extract it. Also, unfortunately there is no good way to see when an alert ends even with ALERTS_FOR_STATE.

(In both cases the core problem is that alerts that are not firing don't exist as metrics at all. There are some things you can do with missing metrics, but there is no good way to see in general when a metric appears or disappears. In some cases you can look at the results of manually evaluating the underlying alert rule expression, but in other cases even this will have a null value when it is not active.)

We can do some nice things with ALERTS_FOR_STATE, though. To start with, we can calculate how long each current alert has been active, which is just the current time minus when it started:

time() - ALERTS_FOR_STATE

If we want to restrict this to alerts that are actually firing at the moment, instead of just being pending, we can write it as:

(time() - ALERTS_FOR_STATE)
  and ignoring(alertstate) ALERTS{alertstate="firing"}

(We must ignore the 'alertstate' label because the ALERTS_FOR_STATE metric doesn't have it.)

You might use this in a dashboard where you want to see which alerts are new and which are old.
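For instance, a dashboard panel that only shows alerts that have been firing for a while could put a threshold on this expression; the one hour cutoff here is an arbitrary number of my own, not anything special:

( (time() - ALERTS_FOR_STATE)
    and ignoring(alertstate) ALERTS{alertstate="firing"} ) > 3600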

A more involved query is one to tell us the longest amount of time that a firing alert has been active over the past time interval. The full version of this is:

max_over_time( ( (time() - ALERTS_FOR_STATE)
                  and ignoring(alertstate)
                         ALERTS{alertstate="firing"}
               )[7d:] )

The core of this is the expression we already saw, evaluated over the past 7 days as a subquery, but until I thought about it, it wasn't clear why this gives us the longest amount of time for any particular alert. What is going on is that while an alert is active, ALERTS_FOR_STATE's value stays constant while time() counts up, because the whole expression is re-evaluated at each step of the subquery. The maximum value of 'time() - ALERTS_FOR_STATE' therefore happens right before the alert ceases to be active and its ALERTS_FOR_STATE metric disappears, and max_over_time captures this maximum value for us.
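If you want to pin down how often the subquery is evaluated instead of relying on Prometheus's default evaluation interval, you can give the subquery an explicit resolution; the five minute step here is just an arbitrary choice on my part:

max_over_time( ( (time() - ALERTS_FOR_STATE)
                  and ignoring(alertstate)
                         ALERTS{alertstate="firing"}
               )[7d:5m] )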

(If the same alert is active several times over the past seven days, we only get the longest single time. There is no good way to see how long each individual incident lasted.)

We can exploit the fact that ALERTS_FOR_STATE has a different value each time an alert activates to count how many different alerts activated over the course of some range. The simplest way to do this is:

changes( ALERTS_FOR_STATE[7d] ) + 1

We have to add one because going from not existing to existing is not counted as a change in value for the purpose of changes(), so an alert that only fired once will be reported as having 0 changes in its ALERTS_FOR_STATE value. I will leave it as an exercise to the reader to extend this to only counting how many times alerts fired, ignoring alerts that only became pending and then went away again (as might happen repeatedly if you have alerts with deliberately long 'for' delays).
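One possible sketch of such an extension (my own guess, not necessarily the only or best approach) is to filter inside a subquery so that we only look at samples taken while the alert was actually firing, which drops alerts that never got past 'pending':

changes( ( ALERTS_FOR_STATE
             and ignoring(alertstate) ALERTS{alertstate="firing"}
         )[7d:] ) + 1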

(This entry was sparked by a recent prometheus-users thread, especially Julien Pivotto's suggestion.)
