How Prometheus's query steps (aka query resolution) work

October 13, 2018

Prometheus and the combination of Prometheus and Grafana have many dark corners and barely explained things that you seem to be expected to just understand. One of them is what is variously called query resolution or query steps (in, for example, the Grafana documentation for using Prometheus). Here is what I think I understand about this area, having poked at a number of things and scrutinized the documentation carefully.

In general, when you write a simple Prometheus PromQL query, it is evaluated at some point in time (normally the current instant, unless you use an offset modifier). This includes queries with range vector selectors; the range vector selector chooses how far back to go from the current instant. This is the experience you will get in Prometheus's expression browser console. However, something different happens when you want to graph something, either directly in Prometheus's expression browser or through Grafana, because in order to graph things we need multiple points spread over time, and that means we have to somehow pick which points.

In a Prometheus graphing query, there is a range of time you're covering and then there is the query step. How Prometheus appears to work is that your expression is repeatedly evaluated at instants throughout the time range, starting at the first instant of the time range and then moving forward by the query step until things end. The query step or query resolution (plus the absolute time range) determines how many points you will get back. The HTTP API documentation for range queries makes this more or less explicit in its example; in a query against a 30-second range with a query step of 15 seconds, there are three data points returned, one at the start time, one in the middle, and one at the end time.
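As a way of checking my understanding of the stepping described above, here is a minimal Python sketch of it. The function name and the concrete start time are made up for illustration; this is not Prometheus's actual code, just the arithmetic as I understand it.

```python
def evaluation_instants(start, end, step):
    """Return the instants (in seconds) at which the expression is evaluated.

    Evaluation starts at the first instant of the range and moves forward
    by the query step for as long as the instant is still inside the range.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    instants = []
    t = start
    while t <= end:
        instants.append(t)
        t += step
    return instants


# A 30-second range with a 15-second query step, mirroring the shape of
# the example in the HTTP API documentation. The start time is arbitrary.
start = 1_000_000_000
points = evaluation_instants(start, start + 30, 15)
print(len(points), points)
```

Under these assumptions the 30-second range produces three instants: the start, the middle, and the end.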

A range query's query step is completely independent from any range durations specified in the PromQL expression it evaluates. If you have 'rate(http_requests_total[5m])', you can evaluate this at a query step of 15 seconds and Prometheus doesn't care either way. What happens is that every 15 seconds, you look back 5 minutes and take the rate between then and now. It is rather likely that this rate won't change much on a 15 second basis, so you'll probably get a smooth result. On the other hand, if you use a very large query step with this query, you may see your graphs go very jagged and spiky because you're sampling very infrequently. You may also get surprisingly jagged and staircased results if you have very small query steps (as far as I can tell, this happens when the step is smaller than how often the metric updates, so that several evaluations in a row see exactly the same samples).
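To make the relationship between the range duration and the query step concrete, here is a small Python sketch. The function names are my own invention; it only does arithmetic on the lookback windows and does not talk to Prometheus.

```python
def lookback_window(instant, range_seconds):
    """The (start, end) span a range vector selector covers at one instant."""
    return (instant - range_seconds, instant)


def overlap_fraction(range_seconds, step_seconds):
    """Fraction of a lookback window shared with the previous evaluation's."""
    shared = max(0, range_seconds - step_seconds)
    return shared / range_seconds


def unseen_seconds(range_seconds, step_seconds):
    """Seconds between two evaluations that neither lookback window covers."""
    return max(0, step_seconds - range_seconds)


# rate(http_requests_total[5m]) evaluated every 15 seconds: consecutive
# windows are almost identical, which is why the graph tends to be smooth.
print(overlap_fraction(300, 15))

# The same expression evaluated once an hour: the windows do not overlap
# at all, and most of each hour is never looked at.
print(overlap_fraction(300, 3600), unseen_seconds(300, 3600))
```

With a 5 minute range and a 15 second step, each window shares 95% of its span with the previous one. With a one hour step, 55 minutes out of every hour fall outside any window, which is one way to see where the spikiness comes from.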

The Prometheus expression browser's graph view will tell you the natural query step in the top right hand corner (it calls this the query resolution), and it will also let you manually set the query step without changing anything else about the query. This is convenient for getting a handle on what happens to a graph of your data as you change the resolution of a given expression. In Grafana, you have to look at the URL you can see in the editor's query inspector; you're looking for the '&step=<something>' at the end. In Grafana, the minimum step is (or can be) limited in various ways, both for the entire query (in the data source 'Options') and in the individual metrics queries ('Min step', which Grafana grumbles about in the Grafana documentation for using Prometheus).
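If you want to experiment with the step outside of either UI, the range query endpoint of the HTTP API takes 'query', 'start', 'end' and 'step' parameters. Here is a minimal Python sketch that builds such a URL and pulls the step back out of one (for example, a URL copied from Grafana's query inspector). The base URL and the timestamps are placeholders, and the helper names are mine.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def range_query_url(base, expr, start, end, step):
    """Build a URL for Prometheus's /api/v1/query_range endpoint."""
    params = {"query": expr, "start": start, "end": end, "step": step}
    return f"{base}/api/v1/query_range?{urlencode(params)}"


def step_from_url(url):
    """Return the value of the 'step' parameter in a query URL, or None."""
    values = parse_qs(urlsplit(url).query).get("step")
    return values[0] if values else None


# Placeholder server and time range; adjust these for a real setup.
url = range_query_url(
    "http://localhost:9090",
    "rate(http_requests_total[5m])",
    1539400000,
    1539403600,
    15,
)
print(url)
print(step_from_url(url))
```

Changing only the 'step' argument while keeping the expression and time range fixed is the same experiment as changing the resolution in the expression browser's graph view.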

This unfortunately means that there is no universal range duration that works across all time ranges for Prometheus graphs. Instead the range duration you want is quite dependent on both the query resolution and how frequently your data updates; roughly speaking, I think you want the maximum of the query resolution and something slightly over your metric's minimum update period. Unfortunately I don't believe you can express this in Grafana. This leaves you deciding in advance on the primary usage of your graphs, especially in Grafana; you want to decide if you are mostly going to look at large time ranges with large query steps or small time ranges with fine-grained query steps.
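Outside of Grafana, the rule of thumb above is easy enough to write down. Here is a rough Python sketch of it; the function names are made up, and the 'margin' of 1.25 is purely my assumption for what "slightly over" the update period might mean, not a recommended value.

```python
import math


def pick_range_seconds(step_seconds, update_period_seconds, margin=1.25):
    """Pick a range duration: the larger of the query step and a bit more
    than the metric's update period.

    The margin is an illustrative guess, not something Prometheus defines.
    """
    if step_seconds <= 0 or update_period_seconds <= 0:
        raise ValueError("step and update period must be positive")
    floor_for_data = math.ceil(update_period_seconds * margin)
    return max(step_seconds, floor_for_data)


def rate_expression(metric, range_seconds):
    """Format a rate() expression using a range duration in seconds."""
    return f"rate({metric}[{range_seconds}s])"


# Fine-grained graph: a 15 second step on a metric updated every 60 seconds.
print(rate_expression("http_requests_total", pick_range_seconds(15, 60)))

# Large time range: a one hour step on the same metric.
print(rate_expression("http_requests_total", pick_range_seconds(3600, 60)))
```

In the first case the update period dominates and the range comes out at 75 seconds; in the second the query step dominates and the range is the full hour. A script that generates its own range queries can do this per query, which is the part I don't think Grafana can express.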

(You can get very close to generating the maximum of two times here, but then you run aground on a combination of the limitations of Grafana's dashboard variables and what manipulations you can do in PromQL.)

(This is one of those entries that I write partly for myself in the future, where I am unfortunately probably going to need this.)

Written on 13 October 2018.

Last modified: Sat Oct 13 02:15:48 2018