When you do and don't get stuck query results from a down Prometheus

August 4, 2021

A while back I wrote about Prometheus and the case of the stuck metrics, where after my home desktop crashed, metrics from its Prometheus instance would be 'stuck' at constant values for a while. This happens because Prometheus will look backward a certain amount of time in order to find the most recent sample in a time series. When my machine crashed, Prometheus obviously stopped writing samples, so for the next roughly five minutes the most recent sample that was returned by queries was the last pre-crash one.
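To make the lookback concrete, here is a minimal Python sketch of how I understand an instant query to pick its sample. This is a toy model, not Prometheus's actual code: the function name `instant_value`, the sample data, and the exact handling of the window boundary are my assumptions, and the 300 seconds is the default five minute lookback.

```python
from bisect import bisect_right

LOOKBACK_SECONDS = 300  # assumed default lookback of five minutes


def instant_value(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the newest sample value in (t - lookback, t], or None.

    samples is a list of (timestamp, value) pairs sorted by timestamp.
    Treating the old edge of the window as exclusive is an assumption
    of this sketch.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, t) - 1  # index of newest sample at or before t
    if i < 0:
        return None
    ts, value = samples[i]
    if ts <= t - lookback:
        return None  # the newest sample is too old to be used
    return value


# Hypothetical series scraped every 15 seconds; the last sample before
# the crash is at t=600.
samples = [(ts, ts / 15) for ts in range(0, 601, 15)]

print(instant_value(samples, 660))  # one minute after the crash: still 40.0
print(instant_value(samples, 960))  # six minutes after the crash: None
```

Every query step from t=600 until just under t=900 returns the same pre-crash value, which is the 'stuck' result.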

The first thing to know is that this (currently) happens even if Prometheus is shut down in an orderly way, not just if Prometheus (or the entire host) crashes. Unlike when a target fails to scrape (where all of its time series are marked stale on the spot), shutting down Prometheus currently doesn't insert stale markers for metrics (arguably this is sensible behavior, and it speeds up shutdowns). This means that things are at least consistent; the behavior you see on crashes or freezes is the same behavior that you see if you are merely rebooting your Prometheus host or restarting Prometheus itself (for example to upgrade it).

The second thing to know is that not all PromQL queries behave this way. As I've observed myself, some queries (and thus Grafana graphs) will give a constant result, but others will disappear entirely. The difference in the queries is whether or not they use range vectors, because range vectors don't look back outside their time range. If you say you want a '[60s]' range vector at a particular point in time, Prometheus gives you exactly what you asked for and no more; you get any samples within those past sixty seconds and that's it.

Thus, if you ask for just 'some_metric' at a query step where Prometheus was down, it will look back up to five minutes for the most recent sample of that metric, with the result that your query 'sticks' for about five minutes, giving the same result every time. However, if you ask for 'max_over_time(some_metric[60s])' at the same query step and there isn't a sample within 60 seconds, you get nothing even if there is a sample within the wider five minute window; your query gets "empty query result" and the line in your Grafana graph disappears. If your expression involves a rate() or irate() of a range vector, you need at least two samples from each time series within the time range, so those queries will disappear a bit sooner than other range vector queries with the same range (roughly one scrape interval sooner).
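The three cases can be put side by side in a small Python simulation. This is only a sketch under stated assumptions: the helper names (`window`, `instant`, `max_over_time`, `rate_has_result`), the 15 second scrape interval, and the sample data are all made up for illustration, and treating the old edge of each window as exclusive is my simplification rather than a statement about Prometheus internals.

```python
LOOKBACK = 300  # assumed default lookback for plain selectors, in seconds
SCRAPE = 15     # hypothetical scrape interval, in seconds


def window(samples, t, width):
    """Return the samples with timestamps in (t - width, t]."""
    return [(ts, v) for ts, v in samples if t - width < ts <= t]


def instant(samples, t):
    """Model plain 'some_metric': newest sample within the lookback, else None."""
    found = window(samples, t, LOOKBACK)
    return found[-1][1] if found else None


def max_over_time(samples, t, width):
    """Model 'max_over_time(some_metric[width])': None if the range is empty."""
    found = window(samples, t, width)
    return max(v for _, v in found) if found else None


def rate_has_result(samples, t, width):
    """A rate() over the range needs at least two samples in it."""
    return len(window(samples, t, width)) >= 2


# The last successful scrape is at t=600; nothing is written after that.
samples = [(ts, float(ts)) for ts in range(0, 601, SCRAPE)]

for t in (640, 650, 670, 910):
    print(t, instant(samples, t), max_over_time(samples, t, 60),
          rate_has_result(samples, t, 60))
```

In this model, at t=640 all three queries still have a result. At t=650 the rate has gone (only one sample is left in the 60 second range) while max_over_time still works, at t=670 only the plain query is still stuck at the old value, and at t=910 everything is empty.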

All of this makes sense now that I've thought about it carefully (and done some testing to confirm it), but the difference in behavior between simple queries and queries with range vectors is a little bit surprising.
