== Exploring the start time of Prometheus alerts via ((ALERTS_FOR_STATE))

In [[Prometheus https://prometheus.io/]], active alerts are exposed through two metrics, the reasonably documented ((ALERTS)) and the under-documented newer metric ((ALERTS_FOR_STATE)). Both metrics have all of the labels of the alert (although not its annotations), and also an '_alertname_' label; the ((ALERTS)) metric has an additional '_alertstate_' label. The value of the ((ALERTS)) metric is always '1', while the value of ((ALERTS_FOR_STATE)) is the Unix timestamp of when the alert rule expression started being true; for rules with '_for_' delays, this means that it is the timestamp when they started being '_pending_', not when they became '_firing_' (see [[this rundown of the timeline of an alert PrometheusAlertDelays]]).

(The ((ALERTS_FOR_STATE)) metric is an internal one, added in Prometheus 2.4.0 to persist the state of alerts so that '_for_' delays work over Prometheus restarts. See [["Persist 'for' State of Alerts" https://ganeshvernekar.com/gsoc-2018/persist-for-state/]] for more details, and also [[Prometheus issue #422 https://github.com/prometheus/prometheus/issues/422]]. Because of this, it's not exported from the local Prometheus and may not be useful to you in clustered or federated setups.)

The ((ALERTS_FOR_STATE)) metric is quite useful if you want to know the start time of an alert, because this information is otherwise pretty much unavailable through [[PromQL https://prometheus.io/docs/prometheus/latest/querying/basics/]]. The necessary information is sort of in Prometheus's time series database, but PromQL does not provide any functions to extract it. Also, unfortunately, there is no good way to see when an alert ends, even with ((ALERTS_FOR_STATE)).

(In both cases the core problem is that alerts that are not firing don't exist as metrics at all. [[There are some things you can do with missing metrics PrometheusMissingMetricsPattern]], but there is no good general way to see when a metric appears or disappears. In some cases you can look at the results of manually evaluating the underlying alert rule expression, but in other cases even this will simply return no results when the alert condition is not currently true.)

We can do some nice things with ((ALERTS_FOR_STATE)), though. To start with, we can calculate how long each current alert has been active, which is just the current time minus when it started:

.pn prewrap on

> time() - ALERTS_FOR_STATE

If we want to restrict this to alerts that are actually firing at the moment, instead of just being pending, we can write it as:

> (time() - ALERTS_FOR_STATE)
>    and ignoring(alertstate) ALERTS{alertstate="firing"}

(We must ignore the '_alertstate_' label because the ((ALERTS_FOR_STATE)) metric doesn't have it.)

You might use this in a dashboard where you want to see which alerts are new and which are old.
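For instance, if a dashboard panel only has room for a few entries, one option (just a sketch, with an arbitrary choice of five) is to wrap the expression in ((topk)) to show only the longest-running firing alerts:

> topk(5,
>    (time() - ALERTS_FOR_STATE)
>       and ignoring(alertstate) ALERTS{alertstate="firing"}
> )

As before, these durations are measured from when each alert first became pending, not from when it started firing.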
A more involved query is one to tell us the longest amount of time that a firing alert has been active over some past time interval. The full version of this is:

> max_over_time( (
>     (time() - ALERTS_FOR_STATE)
>        and ignoring(alertstate)
>     ALERTS{alertstate="firing"}
>   )[7d:] )

The core of this is the expression we already saw, evaluated over the past seven days as a subquery, but until I thought about it, it wasn't clear to me why this gives us the longest amount of time that any particular alert has been active. What is going on is that while an alert is active, ((ALERTS_FOR_STATE))'s value stays constant while _time()_ counts up, because _time()_ is evaluated at each step of the subquery. The maximum value of '((time() - ALERTS_FOR_STATE))' therefore happens right before the alert ceases to be active and its ((ALERTS_FOR_STATE)) metric disappears, and ((max_over_time)) captures this maximum value for us.

(If the same alert is active several times over the past seven days, we only get the longest single time. There is no good way to see how long each individual incident lasted.)

We can exploit the fact that ((ALERTS_FOR_STATE)) has a different value each time an alert activates to count how many times each alert activated over the course of some time range. The simplest way to do this is:

> changes( ALERTS_FOR_STATE[7d] ) + 1

We have to add one because going from not existing to existing is not counted as a change in value for the purposes of [[_changes()_ https://prometheus.io/docs/prometheus/latest/querying/functions/#changes]], so an alert that only activated once will be reported as having 0 changes in its ((ALERTS_FOR_STATE)) value.

I will leave it as an exercise for the reader to extend this to only counting how many times alerts fired, ignoring alerts that only became pending and then went away again (as might happen repeatedly if you have alerts with deliberately long '_for_' delays).

(This entry was sparked by [[a recent prometheus-users thread https://groups.google.com/forum/#!topic/prometheus-users/XDJ7rzdnReg]], especially Julien Pivotto's suggestion.)
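PS: ((ALERTS_FOR_STATE)) has one time series for each distinct set of alert labels, so the count above is per label set. If you instead want a single activation count for each alert rule, one option is to sum the per-series counts by alert name, something like:

> sum by (alertname) ( changes( ALERTS_FOR_STATE[7d] ) + 1 )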