The Prometheus host agent's CPU utilization metrics can be a bit weird
Among other metrics, the Prometheus host agent collects CPU time
statistics on most platforms (including OpenBSD, although it's not
listed in the README). This is the familiar division into 'user
time', 'system time', 'idle time', and so on, exposed on a per-CPU
basis on all of the supported platforms (all of which appear to get
this information from their kernels on a per-CPU basis). We use this
in our Grafana dashboards, in two forms.
In one form we graph a simple summary of non-idle time, which is
produced by subtracting the rate() of idle time from 1, so we can
see which hosts have elevated CPU usage; in the other we use a stacked
graph of all non-idle time, so we can see what a specific host is
spending its CPU time on. Recently, the summary graph showed that
one of our OpenBSD L2TP servers was quite busy but our detailed
graph for its CPU time wasn't showing all that much; this led me
to discover that currently (as of 1.0.0-rc.0), the Prometheus host
agent doesn't support OpenBSD's 'spinning' CPU time category.
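As a rough sketch, the two graph forms described above might be written as the following pair of PromQL queries. The one-minute range, the averaging across CPUs with avg without (cpu) to get a per-host figure, and the instance label value are all assumptions for illustration, not necessarily the exact dashboard queries.

```promql
# Summary graph: fraction of non-idle CPU time per host (0 to 1).
1 - avg without (cpu) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

# Stacked graph: one series per non-idle mode for a single host.
# The instance value here is a hypothetical placeholder.
avg without (cpu) (rate(node_cpu_seconds_total{mode!="idle", instance="l2tp1:9100"}[1m]))
```

A CPU mode that the host agent doesn't report contributes nothing to the second query but still takes time away from idle in the first, which is how the two graphs can disagree.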
However, the discovery of this discrepancy and its cause made me wonder about an assumption we've implicitly been making in these graphs (and in general), which is that all of the CPU times really do sum up to 100%. Specifically, we sort of assume that a sum of the rate() of every CPU mode for a specific CPU should be 1 under normal circumstances:
sum( rate( node_cpu_seconds_total[1m] ) ) without (mode)
The great thing about a metrics system with a flexible query language is that we don't have to wonder about this; we can look at our data and find out, using Prometheus subqueries. We can look at this for both individual CPUs and the host overall; often, the host overall is more meaningful, because that's what we put in graphs. The simple way to explore this is to look at max_over_time() or min_over_time() of this sum for your systems over some suitable time interval. The more complicated way is to start looking at the standard deviation, standard variance, and other statistical measures (although at that point you might want to consider trying to visualize a histogram of this data to look at the distribution too).
(You can also simply graph the data and look at how noisy it is.)
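To make the subquery approach above concrete, here is a sketch of what such queries might look like. The one-day range and the one-minute subquery resolution are assumed values, and each expression is a separate query.

```promql
# Per-CPU: highest and lowest sum of all mode rates seen over the past day.
max_over_time( (sum without (mode) (rate(node_cpu_seconds_total[1m])))[1d:1m] )
min_over_time( (sum without (mode) (rate(node_cpu_seconds_total[1m])))[1d:1m] )

# Host overall: average the per-CPU sums across CPUs first.
avg_over_time( (avg without (cpu) (sum without (mode) (rate(node_cpu_seconds_total[1m]))))[1d:1m] )

# Spread of the per-CPU sum over the same period.
stddev_over_time( (sum without (mode) (rate(node_cpu_seconds_total[1m])))[1d:1m] )
```

Results close to 1 mean that the reported modes account for essentially all of that CPU's time.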
Now that I've looked at this data for our systems, I can say that while
CPU times usually sum up to very close to 100%, they don't always do
so. Over a day, most servers have an average sum just under 100%, but
there are a decent number of servers (and individual CPUs) where it's
under 99%. Individual CPUs can average out as low as 97%. If I look
at the maximums and minimums, it's clear that there are real bursts
of significant inaccuracies both high and low; over the past day, one
CPU on one server saw a total sum of 23.7 seconds in a one-minute
rate(), and some dipped as low as 0.6 second (which is 40% of that
CPU's utilization just sort of vanishing for that measurement).
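One way to see how often this happens, rather than just the extremes, is to count the subquery steps where the sum falls outside some tolerance. This is only a sketch; the 0.9 and 1.1 thresholds and the one-day range are arbitrary assumptions.

```promql
# Number of one-minute steps in the past day where a CPU's modes summed to under 90%.
count_over_time( ((sum without (mode) (rate(node_cpu_seconds_total[1m]))) < 0.9)[1d:1m] )

# The same for steps where the sum overshot by more than 10%.
count_over_time( ((sum without (mode) (rate(node_cpu_seconds_total[1m]))) > 1.1)[1d:1m] )
```

CPUs that never cross a threshold return no series at all from these queries.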
Some of these are undoubtedly due to scheduling anomalies with the host agent, where the accumulated CPU time data it reports is not really collected at the time that Prometheus thinks it is, and things either undershoot or overshoot. But I'm not sure that Linux and other Unixes really guarantee that these numbers always add up right even at the best of times. There are always things that can go on inside the kernel, and on multiprocessor systems (which is almost all of them today) there's always a tradeoff between how accurate you are and how much locking and synchronization that accuracy costs.
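A rough way to check the scheduling theory is to look at how long the scrapes themselves took, using the scrape_duration_seconds metric that Prometheus records for every target. The job label value below is an assumption about how the host agent is scraped, and this only catches slowness that Prometheus itself can see, not timing skew inside the kernel.

```promql
# Slowest scrape of each host agent target over the past day.
max_over_time(scrape_duration_seconds{job="node"}[1d])
```

If the hosts with the wildest CPU time sums are also the ones with unusually slow scrapes, that points toward collection timing rather than kernel accounting.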
On a large scale this probably doesn't matter. But if I'm looking at data from a system on a very fine timescale because I'm trying to look into a brief anomaly, I probably want to remember that this sort of thing is possible. At that level, those nice CPU utilization graphs may not be quite as trustworthy as they look.
(These issues aren't unique to Prometheus; they're going to happen in anything that collects CPU utilization from a Unix kernel. It's just that Prometheus and other metrics systems immortalize the data for us, so that we can go back and look at it and spot these sorts of anomalies.)
Written on 28 March 2020.