Exploring the start time of Prometheus alerts via ALERTS_FOR_STATE
In Prometheus, active alerts are exposed through two metrics, the reasonably documented ALERTS and the under-documented new metric ALERTS_FOR_STATE. Both metrics have all of the labels of the alert (although not its annotations), and also an 'alertname' label; the ALERTS metric also has an additional 'alertstate' label. The value of the ALERTS metric is always '1', while the value of ALERTS_FOR_STATE is the Unix timestamp of when the alert rule expression started being true; for rules with 'for' delays, this means that it is the timestamp when they started being 'pending', not when they became 'firing' (see this rundown of the timeline of an alert).
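As a concrete illustration, with an invented alert name, label, and timestamp, the two metrics for a single active alert might look something like this:
ALERTS{alertname="HostDown", alertstate="firing", host="fred"}  1
ALERTS_FOR_STATE{alertname="HostDown", host="fred"}  1546300800
Note that ALERTS_FOR_STATE has no 'alertstate' label, which will matter later.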
(The ALERTS_FOR_STATE metric is an internal one added in 2.4.0 to persist the state of alerts so that 'for' delays work over Prometheus restarts. See "Persist 'for' State of Alerts" for more details, and also Prometheus issue #422. Because of this, it's not exported from the local Prometheus and may not be useful to you in clustered or federated setups.)
The ALERTS_FOR_STATE metric is quite useful if you want to know the start time of an alert, because this information is otherwise pretty much unavailable through PromQL. The necessary information is sort of in Prometheus's time series database, but PromQL does not provide any functions to extract it. Also, unfortunately there is no good way to see when an alert ends, even with ALERTS_FOR_STATE.
(In both cases the core problem is that alerts that are not firing don't exist as metrics at all. There are some things you can do with missing metrics, but there is no good way to see in general when a metric appears or disappears. In some cases you can look at the results of manually evaluating the underlying alert rule expression, but in other cases even this will have a null value when it is not active.)
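(For illustration, the standard tool for detecting a missing metric is absent(); a hypothetical expression like 'absent(ALERTS{alertname="HostDown"})' will tell you that that alert is not active right now, but it cannot tell you when it stopped being active.)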
We can do some nice things with ALERTS_FOR_STATE, though. To start with, we can calculate how long each current alert has been active, which is just the current time minus when it started:
time() - ALERTS_FOR_STATE
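(This gives a duration in seconds, since both time() and the value of ALERTS_FOR_STATE are Unix timestamps. If you want, say, hours for a dashboard, you can divide the whole expression, for example '(time() - ALERTS_FOR_STATE) / 3600'.)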
If we want to restrict this to alerts that are actually firing at the moment, instead of just being pending, we can write it as:
(time() - ALERTS_FOR_STATE) and ignoring(alertstate) ALERTS{alertstate="firing"}
(We must ignore the 'alertstate' label because the ALERTS_FOR_STATE metric doesn't have it.)
You might use this in a dashboard where you want to see which alerts are new and which are old.
A more involved query is one to tell us the longest amount of time that a firing alert has been active over the past time interval. The full version of this is:
max_over_time( ( (time() - ALERTS_FOR_STATE) and ignoring(alertstate) ALERTS{alertstate="firing"} )[7d:] )
The core of this is the expression we already saw, and we evaluate it over the past 7 days, but until I thought about it, it wasn't clear why this gives us the longest amount of time for any particular alert. What is going on is that while an alert is active, ALERTS_FOR_STATE's value stays constant while time() is counting up, because it is evaluated at each step of the subquery. The maximum value of 'time() - ALERTS_FOR_STATE' happens right before the alert ceases to be active and its ALERTS_FOR_STATE metric disappears. Using max_over_time captures this maximum value for us.
(If the same alert is active several times over the past seven days, we only get the longest single time. There is no good way to see how long each individual incident lasted.)
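(One caution: because this uses a subquery, the inner expression is only evaluated at the subquery's resolution, and writing '[7d:]' with no explicit step uses the global evaluation interval as the step. If you want a specific resolution, you can write it out yourself, for example:
max_over_time( ( (time() - ALERTS_FOR_STATE) and ignoring(alertstate) ALERTS{alertstate="firing"} )[7d:1m] )
The '1m' here is only an illustrative choice, and very short alert activations can still fall between steps.)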
We can exploit the fact that ALERTS_FOR_STATE has a different value each time an alert activates to count how many separate times each alert activated over the course of some range. The simplest way to do this is:
changes( ALERTS_FOR_STATE[7d] ) + 1
We have to add one because going from not existing to existing is not counted as a change in value for the purpose of changes(), so an alert that only activated once will be reported as having 0 changes in its ALERTS_FOR_STATE value. I will leave it as an exercise to the reader to extend this to only counting how many times alerts fired, ignoring alerts that only became pending and then went away again (as might happen repeatedly if you have alerts with deliberately long 'for' delays).
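(One possible starting point, purely as an untested sketch, is to filter ALERTS_FOR_STATE down to the samples where the alert was actually firing before counting changes, reusing the 'and ignoring(alertstate)' trick from earlier inside a subquery:
changes( ( ALERTS_FOR_STATE and ignoring(alertstate) ALERTS{alertstate="firing"} )[7d:] ) + 1
As with any subquery, this only samples at the subquery resolution, so very short firing periods may be missed.)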
(This entry was sparked by a recent prometheus-users thread, especially Julien Pivotto's suggestion.)