2020-02-19
Load average is now generally only a secondary problem indicator
For a long time I've been in the habit of considering a high load
average (or an elevated one) to be a primary indicator of problems.
It was one of the first numbers I looked at on a system to see how
it was, I ran xloads on selected systems to watch it more or less
live, I put it on Grafana dashboards, and we've triggered alerts
on it for a long time (well before our current metrics and alerting
setup existed). But these days
I've been moving away from that, because of things like how our
login server periodically has brief load average spikes and our IMAP server's
elevated load average has no clear cause or impact.
When I started planning this entry, I was going to ask if load average even matters any more. But that's going too far. In a good number of situations, looking at the load average will tell you a fair bit about whether you have a significant problem or perhaps the system is operating as expected but close to its limits. For instance, if a machine has a high CPU usage, it might be a single process that is running a lot (which could be expected), or it could be that you have more running processes than the machine can cope with; the load average will help you tell which is which. But a low load average doesn't mean the machine is fine and a high load average doesn't mean it's in trouble. You need to look for primary problem indicators first, and then use load average to assess how much of a problem you have.
(There are echoes of Brendan Gregg's USE method here. In USE terms, I think that load average is mostly a crude measure of saturation, not necessarily of utilization.)
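To make that 'which is which' check concrete, here is a minimal sketch in Python (the thresholds are illustrative assumptions, not a rule) that compares the one-minute load average from /proc/loadavg against the CPU count. A load average well above the CPU count suggests more runnable work than the machine can service at once; heavy CPU usage with a load average around one points more at a single busy process.

#!/usr/bin/env python3
# Illustrative sketch only: compare the 1-minute load average to the CPU count.
# Note that on Linux the load average also counts processes in uninterruptible
# sleep (eg waiting on disk or NFS), so this is a crude measure at best.
import os

def load_vs_cpus():
    # /proc/loadavg starts with the 1, 5, and 15 minute load averages.
    with open("/proc/loadavg") as f:
        one_min = float(f.read().split()[0])
    ncpus = os.cpu_count() or 1
    if one_min > ncpus:
        print(f"load1 {one_min:.2f} > {ncpus} CPUs: more runnable work than CPUs")
    elif one_min >= 1:
        print(f"load1 {one_min:.2f}: busy but within capacity on {ncpus} CPUs")
    else:
        print(f"load1 {one_min:.2f}: mostly idle")

if __name__ == "__main__":
    load_vs_cpus()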
Despite my shifting view on this, we're probably going to keep using
load average in our alerts and our dashboards. It provides some
information and more importantly it's what we're used to; there's
value in keeping with history, assuming that the current state of
things isn't too noisy (which it isn't; our load average alerts are
tuned to basically never go off). But I'm running fewer xloads
and spending less time actually looking at load average, unless I
want to know about something I know is specifically reflected in it.
How and why we regularly capture information about running processes
In a recent entry, I mentioned
that we periodically capture ps
and top
output on our primary
login server, and in fact we do it on pretty much all of our servers.
There are three parts to this; the history of how we wound up here,
how we do it, and why we've come to do it as a routine thing on our
servers.
We had another monitoring system before our current Prometheus-based
one. One of its handy features
was that when it triggered a load average alert, the alert email
would include 'top' output rather than just have the load average.
Often this led us right to the cause (generally a user running some
CPU-heavy thing), even if it had gone away by the time we could
look at the server. Prometheus can't do this in any reasonable
way, so I did the next best thing by setting up a system to capture
'top' and 'ps' information periodically and save it on the
machine for a while. The process information wouldn't be right there in
the email any more, but at least we could still go look it up later.
Mechanically, this is a cron job and a script that runs every minute
and saves 'top' and 'ps' output to a file called 'procs-<HH>:<MM>'
(eg 'procs-23:10') in a specific local directory for this purpose
(in /var
on the system). Using a file naming scheme based on the
hour and minute the cron job started and overwriting any current
file with that name means that we keep the last 24 hours of data
(under normal circumstances). The files are just plain text files
without any compression, because disk space is large these days and
we don't need anything fancier. On a busy server this amounts to
230 MBytes or so for 24 hours of data; on less active servers it's
often under 100 MBytes.
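As a rough illustration of the mechanics, here is a minimal sketch of such a capture script in Python; the data directory name and the exact 'top' and 'ps' options are my assumptions for the example, not our real script, which differs in detail.

#!/usr/bin/env python3
# Sketch of a per-minute process snapshot capturer (illustrative only).
# Run it from cron every minute, for example:
#   * * * * *  root  /usr/local/sbin/capture-procs
import subprocess
import time
from pathlib import Path

# Assumed data directory; it just needs to live somewhere under /var.
DATADIR = Path("/var/local/procs")

def capture():
    DATADIR.mkdir(parents=True, exist_ok=True)
    # Naming the file after the hour and minute means that tomorrow's run
    # at the same time overwrites today's file, keeping about 24 hours of
    # data without any separate cleanup job.
    out = DATADIR / time.strftime("procs-%H:%M")
    # One batch-mode iteration of 'top', then a full 'ps' listing.
    top = subprocess.run(["top", "-b", "-n", "1"],
                         capture_output=True, text=True).stdout
    ps = subprocess.run(["ps", "auxww"],
                        capture_output=True, text=True).stdout
    out.write_text(top + "\n" + ps)

if __name__ == "__main__":
    capture()

The useful property is the fixed HH:MM file naming; overwriting the same names every day is what gives you a rolling 24 hours of snapshots for free.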
Our initial reason for doing this was to be able to identify users with CPU-consuming processes, so we started out only deploying this on our login servers, our general access compute servers (that anyone can log in to at any time), and a few other machines like our general web server. However, over time it became clear that being able to see what was running (and using CPU and RAM) around some time was useful even on servers that aren't user accessible, so we now install the cron job, script, local data directory, and so on on pretty much all of our machines. We don't necessarily look at the information the system captures all that often, but it's a cheap precaution to have in place.
(We also use Unix process accounting on many machines, but that doesn't
give you the kind of moment in time snapshot that capturing 'top' and
'ps' output does.)