Wandering Thoughts archives

2020-02-19

Load average is now generally only a secondary problem indicator

For a long time I've been in the habit of treating a high (or elevated) load average as a primary indicator of problems. It was one of the first numbers I looked at on a system to assess its health; I ran xloads on selected systems to watch it more or less live, put it on Grafana dashboards, and we've triggered alerts on it for a long time (since well before our current metrics and alerting setup existed). But these days I've been moving away from that, because of things like how our login server periodically has brief load average spikes and how our IMAP server's elevated load average has no clear cause or impact.

When I started planning this entry, I was going to ask if load average even matters any more. But that's going too far. In a good number of situations, looking at the load average will tell you a fair bit about whether you have a significant problem, or whether the system is operating as expected but close to its limits. For instance, if a machine has high CPU usage, it might be a single process that is running a lot (which could be expected), or you might have more running processes than the machine can cope with; the load average will help you tell which is which. But a low load average doesn't mean the machine is fine, and a high load average doesn't mean it's in trouble. You need to look at primary problem indicators first, and then use load average to assess how much of a problem you have.
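As a rough concrete check (a minimal sketch; whether a given load average counts as 'high' depends on the machine and its workload, so comparing it to the CPU count is only a heuristic), you can eyeball this from a Linux shell:

    # Compare the 1-minute load average with the CPU count.
    # A load average well above the CPU count suggests saturation;
    # one well below it means work isn't piling up waiting to run.
    read load1 rest < /proc/loadavg
    echo "1-minute load average: $load1, CPUs: $(nproc)"

If the load average is high, 'top' or 'ps' will then tell you whether it's one busy process or many.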

(There are echoes of Brendan Gregg's USE method here. In USE terms, I think that load average is mostly a crude measure of saturation, not necessarily of utilization.)

Despite my shifting view on this, we're probably going to keep using load average in our alerts and our dashboards. It provides some information, and more importantly it's what we're used to; there's value in continuity with history, assuming that the current state of things isn't too noisy (which it isn't; our load average alerts are tuned to basically never go off). But I'm running fewer xloads and spending less time actually looking at load average, unless I want to know about something that I know is specifically reflected in it.

sysadmin/LoadAverageSecondarySign written at 23:37:41

How and why we regularly capture information about running processes

In a recent entry, I mentioned that we periodically capture ps and top output on our primary login server; in fact, we do it on pretty much all of our servers. There are three parts to this: the history of how we wound up here, how we do it, and why we've come to do it as a routine thing on our servers.

We had another monitoring system before our current Prometheus-based one. One of its handy features was that when it triggered a load average alert, the alert email would include 'top' output rather than just the load average. Often this led us straight to the cause (generally a user running some CPU-heavy thing), even if it had gone away by the time we could look at the server. Prometheus can't do this in any reasonable way, so I did the next best thing: I set up a system to capture 'top' and 'ps' information periodically and save it on the machine for a while. The process information wouldn't be right there in the email any more, but at least we could still go look it up later.

Mechanically, this is a cron job and a script that run every minute and save 'top' and 'ps' output to a file called 'procs-<HH>:<MM>' (e.g. 'procs-23:10') in a local directory set aside for this purpose (in /var on the system). Naming the files after the hour and minute the cron job started, and overwriting any existing file with that name, means that we keep the last 24 hours of data (under normal circumstances). The files are plain text without any compression, because disk space is plentiful these days and we don't need anything fancier. On a busy server this amounts to 230 MBytes or so for 24 hours of data; on less active servers it's often under 100 MBytes.
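As an illustration, a minimal sketch of such a setup might look like this (the paths and the exact 'top' and 'ps' options here are my assumptions, not necessarily what our real script uses):

    # Hypothetical /etc/cron.d/capture-procs entry: run the capture every minute.
    * * * * *  root  /usr/local/sbin/capture-procs

    # Hypothetical /usr/local/sbin/capture-procs script:
    #!/bin/sh
    # Snapshot 'top' and 'ps' output to a per-minute file. Because the
    # file name only has the hour and minute, each file is overwritten a
    # day later, leaving a rolling 24 hours of snapshots.
    dir=/var/log/procs
    out="$dir/procs-$(date +%H:%M)"
    { top -b -n 1; echo; ps auxww; } > "$out"

Overwriting files in place is fine here because these snapshots are a diagnostic aid, not a record that anything else consumes.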

Our initial reason for doing this was to be able to identify users with CPU-consuming processes, so we started out deploying it only on our login servers, our general access compute servers (which anyone can log in to at any time), and a few other machines like our general web server. However, over time it became clear that being able to see what was running (and using CPU and RAM) around a particular time was useful even on servers that aren't user accessible, so we now install the cron job, script, and local data directory on pretty much all of our machines. We don't necessarily look at the captured information all that often, but it's a cheap precaution to have in place.

(We also use Unix process accounting on many machines, but that doesn't give you the kind of moment-in-time snapshot that capturing 'top' and 'ps' output does.)
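For comparison, here's a hedged sketch of what process accounting gives you, assuming the GNU acct tools ('sa' and 'lastcomm') are installed and accounting is enabled ('someuser' is a placeholder):

    # Aggregate per-command resource usage from the accounting file.
    sa | head
    # Recently exited commands for one (hypothetical) user.
    lastcomm --user someuser | head

You get a record of commands after they exit, with CPU time and so on, but not what was running and how busy it was at a given moment.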

sysadmin/OurProcessInfoCapturing written at 00:13:17

