Load average is now generally only a secondary problem indicator
For a long time I've been in the habit of considering a high load
average (or an elevated one) to be a primary indicator of problems.
It was one of the first numbers I looked at on a system to see how
it was, I ran xloads on selected systems to watch it more or less
live, I put it on Grafana dashboards, and we've triggered alerts
on it for a long time (well before our current metrics and alerting
setup existed). But these days
I've been moving away from that, because of things like how our
login server periodically has brief load average spikes and our IMAP server's
elevated load average has no clear cause or impact.
When I started planning this entry, I was going to ask if load average even matters any more. But that's going too far. In a good number of situations, looking at the load average will tell you a fair bit about whether you have a significant problem or whether the system is operating as expected but close to its limits. For instance, if a machine has high CPU usage, it might be a single process that is running a lot (which could be expected), or it could be that you have more running processes than the machine can cope with; the load average will help you tell which is which. But a low load average doesn't mean the machine is fine and a high load average doesn't mean it's in trouble. You need to look for primary problem indicators first, and then use load average to assess how much of a problem you have.
(There are echoes of Brendan Gregg's USE method here. In USE terms, I think that load average is mostly a crude measure of saturation, not necessarily of utilization.)
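As a rough illustration of the CPU usage example above, here is a minimal Python sketch that treats high CPU usage as the primary indicator and only then uses load average to judge how bad things are. The `interpret` function, its 0.8 "busy" threshold, and the one-per-CPU load cutoff are all assumptions made for illustration, not anything we actually run. Also, on Linux the load average counts processes in uninterruptible sleep as well as runnable ones, so this really is a crude reading.

```python
import os


def interpret(cpu_busy, load1, ncpus, busy_threshold=0.8):
    """Use load average as a secondary indicator, after CPU usage.

    cpu_busy is the fraction of total CPU time in use (0.0 to 1.0),
    obtained from whatever metrics system you have. The threshold and
    the 1.0-per-CPU cutoff are illustrative assumptions.
    """
    if cpu_busy < busy_threshold:
        # The primary indicator isn't firing, so load average alone
        # is not treated as a problem here.
        return "cpu not busy"
    if load1 / ncpus > 1.0:
        # More work is queued than the CPUs can run at once.
        return "oversubscribed"
    # Busy, but roughly one runnable task per CPU or fewer.
    return "busy but keeping up"


if __name__ == "__main__":
    load1, _, _ = os.getloadavg()   # 1, 5 and 15 minute load averages
    ncpus = os.cpu_count() or 1
    # cpu_busy is hard-coded here; a real check would pull it from metrics.
    print(interpret(0.95, load1, ncpus))
```

With these assumed cutoffs, a single busy process on a one-CPU machine (load average around 1) comes out as "busy but keeping up", while the same CPU usage with a load average of 8 comes out as "oversubscribed".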
Despite my shifting view on this, we're probably going to keep using
load average in our alerts and our dashboards. It provides some
information and more importantly it's what we're used to; there's
value in keeping with history, assuming that the current state of
things isn't too noisy (which it isn't; our load average alerts are
tuned to basically never go off). But I'm running fewer xloads
and spending less time actually looking at load average, unless I
want to know about something I know is specifically reflected in it.