The Prometheus scrape interval mistake people keep making

March 30, 2024

Prometheus gathers metrics by periodically scraping metrics exporters, so it has a concept of a scrape interval: how frequently it should scrape a given metrics source (a 'target'). Prometheus also has recording rules and alerting rules, both of which have to be evaluated periodically, and so these have an evaluation interval. Every so often, someone shows up on the Prometheus mailing list to say, more or less, 'I have a source of metrics that only updates every half hour, so I set my scrape interval to half an hour and everything went mysteriously wrong'.
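Both intervals are set in prometheus.yml; a minimal sketch (the job name and target here are made up for illustration):

```yaml
global:
  scrape_interval: 15s        # default for all scrape jobs
  evaluation_interval: 15s    # how often recording/alerting rules run

scrape_configs:
  - job_name: 'node'
    scrape_interval: 1m       # per-job override of the global default
    static_configs:
      - targets: ['localhost:9100']
```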

The reason everything goes wrong if you set a long scrape interval (or a long rule evaluation interval) is that Prometheus has an under-documented idea that metric samples go stale after a while. To put it another way, when you make a Prometheus query, it only looks back a limited distance to find 'the current value of a metric'. This period is five minutes by default, and changing it is not at all obvious. If you scrape a metric too infrequently, the most recent sample will routinely go stale and stop being visible to your queries and alerts. If you scrape something only every half an hour, the metrics from each scrape will be good for five minutes and then stale (and invisible) for the next 25 (more or less). This is unlikely to be what you want.

(Because recording rules and alerting rules create metrics, their evaluation intervals are also subject to this issue. This is pretty clear with recording rules, since their whole purpose is to create new metrics, but isn't as obvious with alerting rules.)
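The same logic applies to a rules file, where each group can set its own evaluation interval; a hedged sketch with invented rule names (both rules here produce fresh samples only as often as the group's interval):

```yaml
groups:
  - name: example
    interval: 1m    # per-group evaluation interval; too long and the
                    # resulting samples go stale between evaluations
    rules:
      - record: job:node_cpu_seconds:rate5m
        expr: sum by (job) (rate(node_cpu_seconds_total[5m]))
      - alert: HostDown
        expr: up == 0
        for: 5m
```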

Unfortunately, Prometheus does nothing to stop you from configuring this by accident or ignorance, and people routinely do. You can set a scrape interval of ten minutes, or a half an hour, or an hour, and get not so much as a vague warning from Prometheus when it checks your configuration and starts up. Nor is there so much as a caution about this in the configuration documentation, at least currently.

(The usual safe recommendation is that your scrape interval be no longer than about two minutes; with a two-minute interval, a single missed scrape leaves at most a four-minute gap between samples, which still fits inside the default five-minute staleness window.)
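In configuration terms, that safe upper bound looks like this (the job name is invented):

```yaml
  - job_name: 'slow-exporter'
    scrape_interval: 2m   # at or below this, one missed scrape still
                          # leaves the metric inside the 5m lookback
```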

If you have metrics that both change infrequently and are expensive to generate, the usual recommendation is that you generate them under your own control and then publish them somewhere, for example in Pushgateway or as text files that are collected through the Prometheus host agent's 'textfile' collector. If the metrics merely change infrequently but are cheap to collect, Prometheus is quite efficient about storing unchanged samples, so you might as well scrape frequently.
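A minimal sketch of the textfile approach, run from cron or at the end of the expensive job. The metric name and value are invented; in production, TEXTFILE_DIR would be the directory the host agent is pointed at with '--collector.textfile.directory' (here it's a scratch directory so the sketch is self-contained):

```shell
# Scratch directory standing in for the host agent's textfile directory.
TEXTFILE_DIR="$(mktemp -d)"

# Write to a temporary file and rename it into place, so the host agent
# never reads a half-written file.
tmp="$(mktemp "${TEXTFILE_DIR}/backup_status.prom.XXXXXX")"
cat >"$tmp" <<'EOF'
# HELP backup_last_success_timestamp_seconds Unix time of last successful backup.
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds 1711843200
EOF
mv "$tmp" "${TEXTFILE_DIR}/backup_status.prom"

cat "${TEXTFILE_DIR}/backup_status.prom"
```

Because the host agent re-reads these files on every scrape, Prometheus can keep scraping it every minute or two and the sample never goes stale, no matter how rarely the file itself is regenerated.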

PS: The way you change this staleness interval is the Prometheus command line switch '--query.lookback-delta', although making it larger will likely have various effects that increase resource usage. I also suspect that Prometheus is not tested very much with larger settings for this, especially ones substantially longer than the default.
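(The flag name is real; the paths in this invocation sketch are illustrative:

```
/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --query.lookback-delta=10m
```

Note that this is a server-wide setting; you can't raise it for just one slow-scraping job.)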
