Wandering Thoughts archives


Getting a CPU utilization breakdown in Prometheus's query language, PromQL

A certain amount of Prometheus's query language is reasonably obvious, but once you start getting into the details and the clever tricks you wind up needing to wrap your mind around how PromQL wants you to think about its world. Today I want to tackle one apparently obvious thing, which is getting a graph (or numbers) of CPU utilization.

Prometheus's host agent (its 'node exporter') gives us per-CPU, per mode usage stats as a running counter of seconds in that mode (which is basically what the Linux kernel gives us). A given data point of this looks like:

node_cpu_seconds_total{cpu="1", instance="comps1:9100", job="node", mode="user"} 3632.28

Suppose that we want to know how our machine's entire CPU state breaks down over a time period. Our starting point is the rate over non-idle CPU modes:

irate(node_cpu_seconds_total {mode!="idle"} [1m])

(I'm adding some spaces here to make things wrap better here on Wandering Thoughts; in practice, it's conventional to leave them all out.)

Unfortunately this gives us the rate of individual CPUs (expressed as time in that mode per second, because rate() gives us a per second rate). No problem, let's sum that over everything but the CPUs:

sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)

If you do this on a busy system with multiple CPUs, you will soon observe that the numbers add up to more than 1 second. This is because we're summing over multiple CPUs; if each of them is in user mode for all of the time, the summed rate of user mode is however many CPUs we have. In order to turn this into a percentage, we need to divide by how many CPUs the machine has. We could hardcode this, but we may have different numbers of CPUs on different machines. So how do we count how many CPUs we have in a machine?

As a stand-alone expression, counting CPUs is (sort of):

count(node_cpu_seconds_total) without (cpu)

Let's break this down, since I breezed over 'without (cpu)' before. This takes our per-CPU, per-host node_cpu_seconds_total Prometheus metric, and counts up how many things there are in each distinct set of labels when you ignore the cpu label. This doesn't give us a CPU count number; instead it gives us a CPU count per CPU mode:

{instance="comps1:9100", job="node", mode="user"} 32

Fortunately this is what we want in the full expression:

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)

Our right side is a vector, and when you divide by vectors in PromQL, you divide by matching elements (ie, the same set of labels). On the left we have labels and values like this:

{instance="comps1:9100", job="node", mode="user"} 2.9826666666675776

And on the right we have a matching set of labels, as we saw, that gives us the number '32'. So it all works out.

In general, when you're doing this sort of cross-metric operation you need to make it so that the labels come out the same on each side. If you try too hard to turn your CPU count into a pure number, well, it can work if you get the magic right but you probably want to go at it the PromQL way and match the labels the way we have.

(I'm writing this down today because while it all seems obvious and clear to me now, that's because I've spent much of the last week immersed in Prometheus and Grafana. Once we get our entire system set up, it's quite likely that I'll not deal with Prometheus for months at a time and thus will have forgotten all of this 'obvious' stuff by the next time I have to touch something here.)

PS: The choice of irate() versus rate() is a complicated subject that requires an entry of its own. The short version is that if you are looking at statistics over a short time range with a small query step, you probably want to use irate() with a range selector that is normally a couple of times your basic sampling interval.

sysadmin/PrometheusCPUStats written at 21:47:03; Add Comment

How Prometheus's query steps (aka query resolution) work

Prometheus and the combination of Prometheus and Grafana have many dark corners and barely explained things that you seem to be expected to just understand. One of them is what is variously called query resolution or query steps (in, for example, the Grafana documentation for using Prometheus). Here is what I think I understand about this area, having poked at a number of things and scrutinized the documentation carefully.

In general, when you write a simple Prometheus PromQL query, it is evaluated at some point in time (normally the current instant, unless you use an offset modifier). This includes queries with range vector selectors; the range vector selector chooses how far back to go from the current instant. This is the experience you will get in Prometheus's expression browser console. However, something different happens when you want to graph something, either directly in Prometheus's expression browser or through Grafana, because in order to graph things we need multiple points spread over time, and that means we have to somehow pick which points.

In a Prometheus graphing query, there is a range of time you're covering and then there is the query step. How Prometheus appears to work is that your expression is repeatedly evaluated at instants throughout the time range, starting at the first instant of the time range and then moving forward by the query step until things end. The query step or query resolution (plus the absolute time range) determines how many points you will get back. The HTTP API documentation for range queries makes this more or less explicit in its example; in a query against a 30-second range with a query step of 15 seconds, there are three data points returned, one at the start time, one in the middle, and one at the end time.

A range query's query step is completely independent from any range durations specified in the PromQL expression it evaluates. If you have 'rate(http_requests_total[5m])', you can evaluate this at a query step of 15 seconds and Prometheus doesn't care either way. What happens is that every 15 seconds, you look back 5 minutes and take the rate between then and now. It is rather likely that this rate won't change much on a 15 second basis, so you'll probably get a smooth result. On the other hand, if you use a very large query step with this query, you may see your graphs go very jagged and spiky because you're sampling very infrequently. You may also get surprisingly jagged and staircased results if you have very small query steps.

The Prometheus expression browser's graph view will tell you the natural query step in the top right hand corner (it calls this the query resolution), and it will also let you manually set the query step without changing anything else about the query. This is convenient for getting a hang on what happens to a graph of your data as you change the resolution of a given expression. In Grafana, you have to look at the URL you can see in the editor's query inspector; you're looking for the '&step=<something>' at the end. In Grafana, the minimum step is (or can be) limited in various ways, both for the entire query (in the data source 'Options') and in the individual metrics queries ('Min step', which Grafana grumbles about in the Grafana documentation for using Prometheus).

This unfortunately means that there is no universal range duration that works across all time ranges for Prometheus graphs. Instead the range duration you want is quite dependent on both the query resolution and how frequently your data updates; roughly speaking, I think you want the maximum of the query resolution and something slightly over your metric's minimum update period. Unfortunately I don't believe you can express this in Grafana. This leaves you deciding in advance on the primary usage of your graphs, especially in Grafana; you want to decided if you are mostly going to look at large time ranges with large query steps or small time ranges with fine grained query steps.

(You can get very close to generating the maximum of two times here, but then you run aground on a combination of the limitations of Grafana's dashboard variables and what manipulations you can do in PromQL.)

(This is one of those entries that I write partly for myself in the future, where I am unfortunately probably going to need this.)

sysadmin/PrometheusQuerySteps written at 02:15:48; Add Comment

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.