When you do and don't get stuck query results from a down Prometheus
A while back I wrote about Prometheus and the case of the stuck metrics, where after my home desktop crashed, metrics from its Prometheus instance would be 'stuck' at constant values for a while. This happens because Prometheus will look backward up to a certain amount of time (five minutes by default) in order to find the most recent sample in a time series. When my machine crashed, Prometheus obviously stopped writing samples, so for the next roughly five minutes the most recent sample that was returned by queries was the last pre-crash one.
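(As a side note, you can watch this lookback in action with a couple of small PromQL expressions. Here 'some_metric' is a placeholder for whatever metric you care about, and I'm assuming the default five minute lookback window, which I believe can be changed with Prometheus's --query.lookback-delta option:)

    # The timestamp of the sample Prometheus actually used to answer
    # an instant query for 'some_metric' (a placeholder metric name).
    timestamp(some_metric)

    # How old that sample is, in seconds, at the query's evaluation
    # time. After a crash this climbs toward 300 seconds and then the
    # whole expression goes away, because by default the lookback
    # window is five minutes.
    time() - timestamp(some_metric)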
The first thing to know is that this (currently) happens if Prometheus is shut down in an orderly way, not just if Prometheus (or the entire host) crashes. Unlike what it does when a target fails to scrape (where all time series are marked stale on the spot), shutting down Prometheus currently doesn't insert stale markers for metrics (and arguably this is sensible behavior, plus it speeds up shutdowns). This means that things are at least regular; the behavior you see on crashes or freezes is the same behavior that you see if you are merely rebooting your Prometheus host or restarting Prometheus itself (for example to upgrade it).
The second thing to know is that not all PromQL queries behave this way. As I've observed myself, some queries (and thus Grafana graphs) will give a constant result, but others will disappear entirely. The difference in the queries is whether or not they use range vectors, because range vectors don't look back outside their time range. If you say you want a '[60s]' range vector at a particular point in time, Prometheus gives you exactly what you asked for and no more; you get any samples within those past sixty seconds and that's it.
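To put the difference concretely (again with 'some_metric' as a placeholder), the two sorts of selectors behave like this:

    # An instant vector selector: Prometheus looks back up to the
    # lookback delta (five minutes by default) for the newest sample,
    # so this keeps returning the last pre-crash value for a while.
    some_metric

    # A range vector selector: this contains only the samples whose
    # timestamps fall within the past 60 seconds, with no looking back
    # beyond that. To graph it, you have to run it through a function
    # such as max_over_time() or rate().
    some_metric[60s]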
Thus, if you ask for just 'some_metric' at a query step where Prometheus was down, it will look back up to five minutes for the most recent samples of that metric, with the result that your query 'sticks' for about five minutes, giving the same result every time. However, if you ask for 'max_over_time(some_metric[60s])' at the same query step and there isn't a sample within 60 seconds, you get nothing even if there is a sample within the wider five-minute window; your query gets "empty query result" and the line in your Grafana graph disappears. If your expression involves a rate() or irate() of a range vector, you need at least two samples from each time series within the time range, so your queries will disappear a bit faster (how much faster depends on the scrape interval).
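In other words, over the same stretch where Prometheus wrote no samples (and assuming, say, a 15-second scrape interval), the three sorts of queries come apart like this:

    # Stays 'stuck' at the last pre-crash value for about five minutes.
    some_metric

    # Goes to an empty result as soon as the newest sample is more than
    # 60 seconds old, even though a sample still exists within the five
    # minute lookback window.
    max_over_time(some_metric[60s])

    # Needs at least two samples in the 60-second range, so with a
    # 15-second scrape interval it goes empty roughly one scrape
    # interval sooner than max_over_time() does.
    rate(some_metric[60s])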
All of this makes sense now that I've thought about it carefully (and done some testing to confirm it), but the difference in behavior between simple queries and queries with range vectors is a little bit surprising.