
2023-08-03

Prometheus scrape failures can cause alerts to be 'resolved'

In Prometheus, alerts are created by alerting rules. The heart of an alerting rule is a PromQL query expression that describes what to alert on, like 'node_load1 > 10'. In casual discussion, I will talk about when the alert expression becomes true or goes false again, but this is not really what is going on.
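
As a concrete illustration (with a hypothetical rule name, group name, and threshold), an alerting rule built around such an expression looks like this in a Prometheus rule file:

    groups:
      - name: example
        rules:
          - alert: HighLoad
            # The PromQL expression is the heart of the rule.
            expr: node_load1 > 10
            # Only fire after the expression has kept producing this
            # time series for five straight minutes.
            for: 5m
            annotations:
              summary: "Load average on {{ $labels.instance }} is over 10"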

Prometheus expressions act as filters, or to put it another way, we're doing set operations. What an alert rule expression really yields is some number of individual time series, each with its own set of labels. Each time series will become an alert (with its labels taken from the time series), either a pending alert or a firing alert depending on whether the alert rule has a 'for:' and whether that 'for:' delay has elapsed yet (see my entry on delays and timings for alerts). As long as the alert rule expression continues to produce that time series, the associated alert will stay active. When the alert rule expression stops producing that time series, the associated alert goes away (unless you've set 'keep_firing_for' on it).
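
To make that concrete: if two hosts are over the threshold (made-up hosts and values here), 'node_load1 > 10' yields two time series and therefore two separate alerts, one per label set:

    node_load1{instance="apps1:9100", job="node"}   14.2
    node_load1{instance="db2:9100", job="node"}     11.7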

Normally, the reason that the alert rule expression stops producing the time series is that the condition has stopped being true; for example, for 'node_load1 > 10', the load average on the host of the alert has dropped back to 10 or below. However, another reason this can happen is that the underlying metric goes away; suddenly there is no 'node_load1' for the host of the alert. One reason this can happen is if Prometheus failed to scrape the metrics source, for example because the source was busy crashing and restarting at the time of the scrape. A scrape failure immediately marks all time series from the scrape target as stale, making them disappear as of that point (which stops Prometheus's normal five-minute look back in time for the most recent sample of a time series).
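
A failed scrape also sets the target's synthetic 'up' metric to 0 even as the target's own metrics go stale, so a query like the following (assuming a hypothetical 'node' job name for your host agents) shows which hosts' metrics have just vanished this way:

    up{job="node"} == 0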

This means that if you have a scrape failure for the host agent for a host, any alerts that come from host agent metrics will be 'resolved' (possibly notifying you about this) and then come back once the host agent can be scraped again. Alert rules without a 'for:' will reappear immediately; alert rules with a 'for:' will only come back after their usual 'for:' delay has passed again.

It's possible that you can use the 'keep_firing_for' alert rule property to work around brief scrape interruptions at a low cost. Delaying an alert's resolution by 30 seconds or so (based on 15-second scrape intervals) isn't all that bad, and in many situations you'll never really notice the delay. Longer scrape failures (for example, a machine rebooting) are probably going to have a visible impact on when your alerts are really resolved.
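
In rule form this is one extra field on the hypothetical rule from earlier; a 30 second 'keep_firing_for' should ride out a single missed scrape at a 15 second interval with a bit of margin:

    - alert: HighLoad
      expr: node_load1 > 10
      for: 5m
      # Keep the alert firing for 30 seconds after the expression
      # stops producing its time series, instead of resolving it
      # immediately.
      keep_firing_for: 30s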

(The question I have about 'keep_firing_for' is how it interacts with alert rules that have a 'for:'.)

While I think it can be done, writing alert rules that keep firing while their metrics source is failing to scrape looks sufficiently complex that it's not worth doing.

(I believe it would be even more complicated than the examples in my entry on "deadbands" for alerts; you'd want an extra condition for 'the alert is firing and the scrape source we care about is not up'. But even if this works, it could lead to stuck alerts, so you'd want a timeout. It's messy.)
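
For concreteness, here is an untested sketch of roughly what that extra condition might look like for the hypothetical 'HighLoad' rule, using Prometheus's built-in ALERTS metric; even getting the labels to line up hints at how messy it gets:

        node_load1 > 10
      or
        (
            # 'the alert is already firing ...'
            # (drop the ALERTS-only labels so this branch's labels
            # roughly match the normal node_load1 branch)
            max without (alertname, alertstate)
              (ALERTS{alertname="HighLoad", alertstate="firing"})
          and on (instance)
            # '... and the scrape source we care about is not up'
            (up{job="node"} == 0)
        )

The self-reference through ALERTS is what keeps the alert alive through the scrape failure, and it's also exactly what can leave the alert stuck if the target never comes back.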

sysadmin/PrometheusAlertsAndScrapeFailures written at 23:27:22;

