2023-08-03
Prometheus scrape failures can cause alerts to be 'resolved'
In Prometheus, alerts are created by
alerting rules.
The heart of an alerting rule is a PromQL
query expression that describes what to alert on, like
'node_load1 > 10'. In casual discussion, I will talk about when
the alert expression becomes true or goes false again, but this is
not really what is going on.
Prometheus expressions act as filters,
or to put it another way, we're doing set operations. What an alert
rule expression really yields is some number of individual time
series, each with its own set of labels. Each time series will
become an alert (with its labels taken from the time series), either a pending alert or a
firing alert depending on whether or not the alert rule has a
'for:' (see my entry on delays and timings for alerts). As long as
the alert rule expression
continues to produce that time series, the associated alert will
stay active. When the alert rule expression stops producing that
time series, the associated alert goes away (unless you've set
'keep_firing_for' on it).
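
To make this concrete, here's a minimal sketch of what such an alert
rule looks like in a Prometheus rule file (the alert name, threshold,
and 'for:' duration here are invented for the example):

  groups:
    - name: example
      rules:
        - alert: HighLoad
          # Every time series this expression returns becomes its own
          # alert, labeled with that time series' labels (plus anything
          # added under 'labels:').
          expr: node_load1 > 10
          # With a 'for:', each alert starts out pending and only
          # becomes firing if its time series keeps being returned for
          # this long.
          for: 5m
          labels:
            severity: warning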
Normally, the reason that the alert rule expression stops producing
the time series is that the condition has stopped being true; for
example, for 'node_load1 > 10', the load average on the host
of the alert has gone below 10. However, another reason that this
can happen is that the underlying metric goes away; suddenly there
is no 'node_load1' for the host of the alert. One reason this
can happen is if Prometheus failed to scrape the metrics source,
for example if the source was busy crashing and restarting at the
time of the scrape. A scrape
failure will immediately mark all time series from the scrape target
as stale, making them disappear as of that point (which stops
Prometheus's normal look back in time for the most recent version
of a time series).
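
As a hypothetical illustration of the effect (the host name and port
here are made up):

  # Normally an instant query looks back up to five minutes (the
  # default lookback delta) for the most recent node_load1 sample:
  node_load1{instance="somehost:9100"}

  # Right after a failed scrape of somehost:9100, the same query
  # returns nothing; the staleness marker from the failed scrape cuts
  # off that lookback even though recent samples are still in the TSDB.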
This means that if you have a scrape failure for a host's host agent, any alerts
that come from host agent metrics will be 'resolved' (possibly
notifying you about this) and then come back once the host agent
can be scraped again. Alerts from rules without a 'for:' will
reappear immediately; alerts from rules with a 'for:' will come
back after their usual time.
It's possible that you can use the 'keep_firing_for' alert
rule property to work around brief
scrape interruptions with low impact. Delaying an alert resolving
by 30 seconds or so (based on 15 second scrape intervals) isn't all
that bad and in many situations you'll never really notice the
delay. Longer scrape failures (for example for a machine rebooting)
are probably going to have a visible impact on when your alerts are
really resolved.
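
A sketch of this, reusing the invented rule from earlier with a
'keep_firing_for' picked to cover a couple of 15 second scrape
intervals, might look like:

  - alert: HighLoad
    expr: node_load1 > 10
    for: 5m
    # Hold the alert in the firing state for this long after its time
    # series stops being returned, so one or two failed scrapes don't
    # 'resolve' it. (This needs a Prometheus recent enough to have
    # keep_firing_for.)
    keep_firing_for: 30s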
(The question I have about 'keep_firing_for' is how it interacts
with alert rules that have a 'for:'.)
While I think it can be done, writing alert rules that keep firing
across a scrape failure of their metrics source looks like it will
be sufficiently complex that it's not worth doing.
(I believe it would be even more complicated than the examples in my
entry on "deadbands" for alerts; you'd want an extra condition for
'the alert is firing and the scrape source we care about is not up'.
But even if this works, it could lead to stuck alerts, so you'd want
a timeout. It's messy.)
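
For what it's worth, a hypothetical sketch of that extra condition,
again using the invented 'HighLoad' rule and assuming the host agent
is scraped under a 'node' job, might look something like this:

  - alert: HighLoad
    expr: |
      node_load1 > 10
        or
      (
        # Latch the alert: it's already firing and its scrape target is
        # currently down. The 'max without (...)' strips the extra
        # labels that ALERTS carries so the result roughly matches the
        # original alert's labels.
        max without (alertname, alertstate)
          (ALERTS{alertname="HighLoad", alertstate="firing"})
          and on (instance)
        up{job="node"} == 0
      )
    for: 5m

The latched branch has a value of 1 instead of the real load average,
and nothing here limits how long a permanently dead target can keep
the alert firing, which is the stuck alert problem mentioned above.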