Getting CPU utilization breakdowns efficiently in Prometheus
I wrote before about getting a CPU utilization breakdown in Prometheus, where I detailed building up a query that would give us a correct 0.0 to 1.0 CPU utilization breakdown. The eventual query is:
(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)
(As far as using irate() here goes, see rate() versus irate().)
This is a beautiful and correct query, but as it turns out you may not want to actually use it. The problem is that in practice, it's also an expensive query when evaluated over a sufficient range, especially if you're using some version of it for multiple machines in the same graph or Grafana dashboard. In some reasonably common cases, I saw Prometheus query durations of over a second for our setup. Once I realized how slow this was, I decided to try to do better.
The obvious way to speed up this query is to precompute the number that's essentially a constant, namely the number of CPUs (the thing we're dividing by). To make my life simpler, I opted to compute this so that we get a separate metric for each mode, so we don't have to use group_left in the actual query. The recording rule we use is:
- record: instance_mode:node_cpus:count
  expr: count(node_cpu_seconds_total) without (cpu)
(The name of this recording rule metric is probably questionable, but I don't understand the best practices suggestions here.)
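With this in place, the division in the utilization query can use the precomputed metric directly; since the recording rule keeps the mode label, the label sets on both sides match and no group_left is needed. A sketch of the resulting query (my reconstruction of its shape, not necessarily the exact version we run) is:
(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / instance_mode:node_cpus:count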
This cuts out a significant amount of the query cost (anywhere from one half to two thirds or so in some of my tests), but I was still left with some relatively expensive versions of this query (for instance, one of our dashboards wants to display the amount of non-idle CPU utilization across all of our machines). To do better, I decided to try to pre-compute the sum() of the CPU modes across all CPUs, with this recording rule:
- record: instance_mode:node_cpu_seconds_total:sum
  expr: sum(node_cpu_seconds_total) without (cpu)
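A dashboard query can then take the rate of the pre-computed sum instead of re-aggregating all of the per-CPU series. As a hedged sketch (with a 1m range chosen purely as an example), the per-host non-idle version becomes something like:
rate(instance_mode:node_cpu_seconds_total:sum {mode!="idle"} [1m]) / instance_mode:node_cpus:count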
In theory this should provide basically the same result with a clear saving in Prometheus query evaluation time. In practice this mostly works, but occasionally there are anomalies that I don't understand, where a rate() or irate() of this will exceed 100% (ie, will return a result greater than the number of CPUs in the machine). These excessive results are infrequent and you do save a significant amount of Prometheus query time, which means there's a tradeoff to be made here: do you live with the possibility of rare weird readings in order to get efficient general trends and overviews, or do you go for complete correctness even at the cost of higher CPU usage (and graphs that take a bit of time to refresh or generate)?
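(One rough way to look for these anomalous points, sketched here with a 1m rate purely as an example, is to compare the rate of the summed counter directly against the CPU count, since both recording rules keep the mode label:
rate(instance_mode:node_cpu_seconds_total:sum [1m]) > instance_mode:node_cpus:count
Anything this returns is a moment where a single mode apparently used more CPU seconds per second than the machine has CPUs.)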
(If you know that you want a particular resolution of rate() a lot, you can pre-compute that (or pre-compute an irate()). But you have to know the resolution, or know that you want irate(), and you may not, especially if you're using Grafana and its magic $__interval template variable.)
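To illustrate, such a pre-computed rate rule might look something like the following; the rule name and the 1m range here are hypothetical choices for the example, not rules we actually have:
- record: instance_mode:node_cpu_seconds_total:rate1m
  expr: sum(rate(node_cpu_seconds_total [1m])) without (cpu)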
I've been going back and forth on this question since I discovered this issue. Right now my answer is that I'm defaulting to correct results even at more CPU cost unless the CPU cost becomes a real, clear problem. But we have the luxury that our dashboards aren't likely to be used very much.
Sidebar: Why I think the sum() in this recording rule is okay
The documentation for both rate() and irate() tells you to always take the rate() or irate() before sum()'ing, in order to detect counter resets. However, in this case all of our counters are tied together; all CPU usage counters for a host will reset at the same time, when the host reboots, and so rate() should still see that reset even over a sum().
(And the anomalies I've seen have been over time ranges where the hosts involved haven't been rebooting.)
I have two wild theories for why I'm seeing problems with this recording rule. First, it could be that the recording rule is summing over a non-coherent set of metric points, where the node_cpu_seconds_total values for some CPUs come from one Prometheus scrape and others come from another scrape (although one would hope that metrics from a single scrape appear all at once, atomically). Second, perhaps the recording rule is being evaluated twice against the same metric points from the same scrape, because it is just out of synchronization with a slow scrape of a particular node_exporter. This would result in a flat result for one point of the recording rule and then a doubled result for another one, where the computed result actually covers more time than we expect.
(Figuring out which it is is probably possible through dedicated extraction and processing of raw metric points from the Prometheus API, but I lack the patience and the interest to do this at the moment. My current guess is the second theory, partly based on some experimentation with changes().)
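(As a sketch of one rough check for the second theory, you could compare how many samples the recording rule has in a window with how many times its value actually changed; if the rule is sometimes re-evaluated against stale scrape data, some consecutive samples will be identical and changes() will fall short of the sample count. With an arbitrary mode and window as an example:
changes(instance_mode:node_cpu_seconds_total:sum {mode="user"} [10m]) + 1 < count_over_time(instance_mode:node_cpu_seconds_total:sum {mode="user"} [10m])
This is only a sketch of the idea, not a description of my actual experimentation.)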