Getting CPU utilization breakdowns efficiently in Prometheus

November 9, 2018

I wrote before about getting a CPU utilization breakdown in Prometheus, where I detailed building up a query that would give us a correct 0.0 to 1.0 CPU utilization breakdown. The eventual query is:

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)

(As far as using irate() here goes, see rate() versus irate().)

This is a beautiful and correct query, but as it turns out you may not want to actually use it. The problem is that in practice, it's also an expensive query when evaluated over a sufficient range, especially if you're using some version of it for multiple machines in the same graph or Grafana dashboard. In some reasonably common cases, I saw Prometheus query durations of over a second for our setup. Once I realized how slow this was, I decided to try to do better.

The obvious way to speed up this query is to precompute the number that's essentially a constant, namely the number of CPUs (the thing we're dividing by). To make my life simpler, I opted to compute this so that we get a separate metric for each mode, so we don't have to use group_left in the actual query. The recording rule we use is:

- record: instance_mode:node_cpus:count
  expr: count(node_cpu_seconds_total) without (cpu)

(The name of this recording rule metric is probably questionable, but I don't understand the best practices suggestions here.)

This cuts out a significant amount of the query cost (anywhere from one half to two thirds or so in some of my tests), but I was still left with some relatively expensive versions of this query (for instance, one of our dashboards wants to display the amount of non-idle CPU utilization across all of our machines). To do better, I decided to try to pre-compute the sum() of the CPU modes across all CPUs, with this recording rule:

- record: instance_mode:node_cpu_seconds_total:sum
  expr: sum(node_cpu_seconds_total) without (cpu)

In theory this should provide basically the same result with a clear saving in Prometheus query evaluation time. In practice this mostly works but occasionally there are some anomalies that I don't understand, where a rate() or irate() of this will exceed 100% (ie, will return a result greater than the number of CPUs in the machine). These excessive results are infrequent and you do save a significant amount of Prometheus query time, which means that there's a tradeoff to be made here; do you live with the possibility of rare weird readings in order to get efficient general trends and overviews, or do you go for complete correctness even at the sake of higher CPU costs (and graphs that take a bit of time to refresh or generate themselves)?

(If you know that you want a particular resolution of rate() a lot, you can pre-compute that (or pre-compute an irate()). But you have to know the resolution, or know that you want irate(), and you may not, especially if you're using Grafana and its magic $__interval template variable.)

I've been going back and forth on this question since I discovered this issue. Right now my answer is that I'm defaulting to correct results even at more CPU cost unless the CPU cost becomes a real, clear problem. But we have the luxury that our dashboards aren't likely to be used very much.

Sidebar: Why I think the sum() in this recording rule is okay

The documentation for both rate() and irate() tells you to always take the rate() or irate() before sum()'ing, in order to detect counter resets. However, in this case all of our counters are tied together; all CPU usage counters for a host will reset at the same time, when the host reboots, and so rate() should still see that reset even over a sum().

(And the anomalies I've seen have been over time ranges where the hosts involved haven't been rebooting.)

I have two wild theories for why I'm seeing problems with this recording rule. First, it could be that the recording rule is summing over a non-coherent set of metric points, where the node_cpu_seconds_total values for some CPUs come from one Prometheus scrape and others come from some other scrape (although one would hope that metrics from a single scrape appear all at once, atomically). Second, perhaps the recording rule is being evaluated twice against the same metric points from the same scrape, because it is just out of synchronization with a slow scrape of a particular node_exporter. This would result in a flat result for one point of the recording rule and then a doubled result for another one, where the computed result actually covers more time than we expect.

(Figuring out which it is is probably possible through dedicated extraction and processing of raw metric points from the Prometheus API, but I lack the patience or the interest to do this at the moment. My guess is currently the second theory, partly based on some experimentation with changes().)

Written on 09 November 2018.
« The future of our homedir-based mail server system design
Character by character TTY input in Unix, then and now »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Nov 9 23:40:14 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.