The Prometheus host agent's CPU utilization metrics can be a bit weird

March 28, 2020

Among other metrics, the Prometheus host agent collects CPU time statistics on most platforms (including OpenBSD, although it's not listed in the README). This is the familiar division into 'user time', 'system time', 'idle time', and so on, exposed on a per-CPU basis on all of the supported platforms (all of which appear to get this from the kernel on a per-CPU basis). We use this in our Grafana dashboards in two forms. In one form we graph a simple summary of non-idle time, which is produced by subtracting the rate() of idle time from 1, so we can see which hosts have elevated CPU usage; in the other we use a stacked graph of all non-idle time, so we can see where a specific host is spending its CPU time. Recently, the summary graph showed that one of our OpenBSD L2TP servers was quite busy but our detailed graph for its CPU time wasn't showing all that much; this led me to discover that currently (as of 1.0.0-rc.0), the Prometheus host agent doesn't support OpenBSD's 'spinning' CPU time category.
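To make the mismatch between the two graphs concrete, here is a small Python sketch with made-up counter values for one CPU over a 60-second window. The `simple_rate` helper is a hypothetical simplification (Prometheus's real rate() also handles counter resets and extrapolation), and the numbers assume that 15 seconds of CPU time went to a mode the agent doesn't export.

```python
def simple_rate(first, last):
    """Per-second increase between two (timestamp, counter_value) samples.

    A simplified stand-in for rate(); it ignores counter resets and
    extrapolation.
    """
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)

# Hypothetical samples for one CPU: mode -> (first sample, last sample).
# Idle grew by 30s, user by 10s, system by 5s; the remaining 15s of the
# minute went to a mode that is not exported at all.
window = {
    "idle": ((0.0, 1000.0), (60.0, 1030.0)),
    "user": ((0.0, 500.0), (60.0, 510.0)),
    "system": ((0.0, 200.0), (60.0, 205.0)),
}

# Summary graph: 1 minus the idle rate.
summary = 1.0 - simple_rate(*window["idle"])

# Stacked graph: the sum of every exported non-idle mode.
stacked = sum(simple_rate(*s) for mode, s in window.items() if mode != "idle")

print(f"summary graph: {summary:.2f}, stacked graph: {stacked:.2f}")
```

With these assumed numbers the summary graph reports the CPU as half busy while the stacked graph only accounts for a quarter of it, which is the shape of what we saw.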

However, the discovery of this discrepancy and its cause made me wonder about an assumption we've implicitly been making in these graphs (and in general), which is that all of the CPU times really do sum up to 100%. Specifically, we sort of assume that a sum of the rate() of every CPU mode for a specific CPU should be 1 under normal circumstances:

sum( rate( node_cpu_seconds_total[1m] ) ) without (mode)
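In plain terms, this query adds up the per-mode rates for each CPU and drops the mode label. A rough Python sketch of that grouping, using hypothetical rates and ignoring the other labels real series carry (such as the instance), looks like this:

```python
from collections import defaultdict

def sum_without_mode(rates):
    """Sum per-second rates across modes, keeping only the CPU label.

    `rates` maps (cpu, mode) -> rate; the result maps cpu -> summed rate.
    """
    totals = defaultdict(float)
    for (cpu, _mode), value in rates.items():
        totals[cpu] += value
    return dict(totals)

# Hypothetical per-mode rates for two CPUs on one host.
rates = {
    ("0", "user"): 0.20, ("0", "system"): 0.05, ("0", "idle"): 0.75,
    ("1", "user"): 0.10, ("1", "system"): 0.02, ("1", "idle"): 0.85,
}

per_cpu = sum_without_mode(rates)
for cpu, total in sorted(per_cpu.items()):
    print(f"cpu {cpu}: {total:.2f}")
```

In this made-up data CPU 0 adds up to exactly 1 while CPU 1 only reaches 0.97, so 3% of its time isn't accounted for in any mode.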

The great thing about a metrics system with a flexible query language is that we don't have to wonder about this; we can look at our data and find out, using Prometheus subqueries. We can look at this for both individual CPUs and the host overall; often, the host overall is more meaningful, because that's what we put in graphs. The simple way to explore this is to look at max_over_time() or min_over_time() of this sum for your systems over some suitable time interval. The more complicated way is to start looking at the standard deviation, the variance, and other statistical measures (although at that point you might want to consider trying to visualize a histogram of this data to look at the distribution too).

(You can also simply graph the data and look at how noisy it is.)
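As an illustration of what those measures tell you, here is a Python sketch that summarizes a hypothetical series of per-window sums for one CPU. The values are invented; the dictionary keys loosely mirror PromQL's min_over_time(), max_over_time(), avg_over_time(), stddev_over_time() and stdvar_over_time(), and I use the population forms of the standard deviation and variance on the assumption that this matches what Prometheus computes.

```python
import statistics

def summarize(sums):
    """Summary statistics over a series of per-window CPU time sums."""
    return {
        "min": min(sums),
        "max": max(sums),
        "avg": statistics.fmean(sums),
        "stddev": statistics.pstdev(sums),    # population standard deviation
        "stdvar": statistics.pvariance(sums), # population variance
    }

# Hypothetical sums of all CPU modes, one value per evaluation step.
# Most are close to 1, with a single dip where time went missing.
sums = [1.0, 0.99, 1.01, 0.6, 1.0]

stats = summarize(sums)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

The minimum and maximum are what expose a single bad interval; the average over the whole period hides most of it.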

Now that I've looked at this data for our systems, I can say that while CPU times usually sum up to very close to 100%, they don't always do so. Over a day, most servers have an average sum just under 100%, but there are a decent number of servers (and individual CPUs) where it's under 99%. Individual CPUs can average out as low as 97%. If I look at the maximums and minimums, it's clear that there are real bursts of significant inaccuracy, both high and low; over the past day, one CPU on one server saw a total sum of 23.7 seconds in a one-minute rate(), and some dipped as low as 0.6 seconds (which is 40% of that CPU's utilization just sort of vanishing for that measurement).

Some of these are undoubtedly due to scheduling anomalies with the host agent, where the accumulated CPU time data it reports is not really collected at the time that Prometheus thinks it is, and things either undershoot or overshoot. But I'm not sure that Linux and other Unixes really guarantee that these numbers always add up right even at the best of times. There are always things that can go on inside the kernel, and on multiprocessor systems (which is almost all of them today) there's always a tradeoff between how accurate the accounting is and how much locking and synchronization it costs.
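The timing effect is easy to sketch in Python. This assumes an idealized CPU whose modes always account for exactly one second per real second, a 15-second scrape interval, and samples that get stamped with the time Prometheus started the scrape; the times themselves are hypothetical, with the agent reading the kernel two seconds late on one scrape.

```python
def apparent_sums(stamped_times, actual_read_times):
    """Apparent sum of all CPU mode rates between consecutive scrapes.

    The counters advance by the time between the agent's actual reads of
    the kernel, but the rate is computed over the stamped scrape times.
    """
    sums = []
    for i in range(1, len(stamped_times)):
        accounted = actual_read_times[i] - actual_read_times[i - 1]
        elapsed = stamped_times[i] - stamped_times[i - 1]
        sums.append(accounted / elapsed)
    return sums

# When Prometheus thinks each sample was taken.
stamped = [0.0, 15.0, 30.0, 45.0]
# When the agent actually read the kernel's counters; the second scrape
# was delayed by two seconds (perhaps the agent wasn't scheduled promptly).
actual = [0.0, 17.0, 30.0, 45.0]

result = apparent_sums(stamped, actual)
print([round(s, 3) for s in result])
```

One late read produces an overshoot in one interval and a matching undershoot in the next, even though the kernel's accounting was perfect throughout.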

On a large scale basis this probably doesn't matter. But if I'm looking at data from a system on a very fine timescale because I'm trying to look into a brief anomaly, I probably want to remember that this sort of thing is possible. At that level, those nice CPU utilization graphs may not be quite as trustworthy as they look.

(These issues aren't unique to Prometheus; they're going to happen in anything that collects CPU utilization from a Unix kernel. It's just that Prometheus and other metrics systems immortalize the data for us, so that we can go back and look at it and spot these sorts of anomalies.)
