Load average is now generally only a secondary problem indicator

February 19, 2020

For a long time I've been in the habit of treating a high (or even just elevated) load average as a primary indicator of problems. It was one of the first numbers I looked at on a system to see how it was doing, I ran xloads on selected systems to watch it more or less live, I put it on Grafana dashboards, and we've triggered alerts on it for a long time (since well before our current metrics and alerting setup existed). But these days I've been moving away from that, because of things like how our login server periodically has brief load average spikes and how our IMAP server's elevated load average has no clear cause or impact.

When I started planning this entry, I was going to ask if load average even matters any more. But that's going too far. In a good number of situations, looking at the load average will tell you a fair bit about whether you have a significant problem or whether the system is operating as expected but close to its limits. For instance, if a machine has high CPU usage, it might be that a single process is running a lot (which could be expected), or it could be that you have more running processes than the machine can cope with; the load average will help you tell which is which. But a low load average doesn't mean the machine is fine and a high load average doesn't mean it's in trouble. You need to look for primary problem indicators first, and then use load average to assess how much of a problem you have.

(There are echoes of Brendan Gregg's USE method here. In USE terms, I think that load average is mostly a crude measure of saturation, not necessarily of utilization.)
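As a rough illustration of using load average as a secondary indicator, here is a minimal Python sketch of the CPU usage example above. The function name `interpret_load`, its labels, and the thresholds (80% CPU busy, a load of more than 1.0 per CPU) are all assumptions made up for this illustration, not anything we actually run. The CPU usage figure is assumed to come from your primary metrics.

```python
import os


def interpret_load(load1, cpu_busy_fraction, ncpus, busy_threshold=0.8):
    """Use load average to qualify an already-observed CPU usage level.

    load1: 1-minute load average.
    cpu_busy_fraction: overall CPU utilization from 0.0 to 1.0; this is the
        primary indicator and must come from somewhere else.
    ncpus: number of CPUs on the machine.
    busy_threshold: hypothetical cutoff for "high CPU usage".
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")

    # Primary indicator first: if CPU usage isn't high, this sketch has
    # nothing to say, whatever the load average is.
    if cpu_busy_fraction < busy_threshold:
        return "cpu-not-busy"

    # Secondary indicator: is more work queued up than there are CPUs?
    load_per_cpu = load1 / ncpus
    if load_per_cpu > 1.0:
        return "oversubscribed"
    return "busy-but-coping"


if __name__ == "__main__":
    # os.getloadavg() is Unix-only and returns the 1, 5 and 15 minute values.
    load1, _, _ = os.getloadavg()
    # 0.95 is a placeholder; a real check would measure CPU usage itself.
    print(interpret_load(load1, 0.95, os.cpu_count() or 1))
```

The point of the ordering is that load average never decides on its own whether something is wrong; it only refines a verdict that a primary indicator has already suggested. On Linux the load average also counts tasks in uninterruptible sleep (often waiting on disk IO), so a high value is not purely a measure of CPU demand.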

Despite my shifting view on this, we're probably going to keep using load average in our alerts and our dashboards. It provides some information and more importantly it's what we're used to; there's value in keeping with history, assuming that the current state of things isn't too noisy (which it isn't; our load average alerts are tuned to basically never go off). But I'm running fewer xloads and spending less time actually looking at load average, unless I want to know about something I know is specifically reflected in it.

Written on 19 February 2020.

Last modified: Wed Feb 19 23:37:41 2020