How and why we regularly capture information about running processes
In a recent entry, I mentioned that we periodically capture 'top'
output on our primary login server, and in fact we do it on pretty
much all of our servers.
There are three parts to this: the history of how we wound up here,
how we do it, and why we've come to do it as a routine thing on our
servers.
We had another monitoring system before our current Prometheus
based one. One of its handy features
was that when it triggered a load average alert, the alert email
would include 'top' output rather than just have the load average.
Often this led us right to the cause (generally a user running some
CPU-heavy thing), even if it had gone away by the time we could
look at the server. Prometheus can't do this in any reasonable
way, so I did the next best thing by setting up a system to capture
'top' and 'ps' information periodically and save it on the
machine for a while. The process information wouldn't be right in
the email any more, but at least we could still go look it up later.
Mechanically, this is a cron job and a script that runs every minute
and saves 'top' and 'ps' output to a file called 'procs-<HH>:<MM>'
(eg 'procs-23:10') in a specific local directory for this purpose
(in /var on the system). Using a file naming scheme based on the
hour and minute the cron job started and overwriting any current
file with that name means that we keep the last 24 hours of data
(under normal circumstances). The files are just plain text files
without any compression, because disk space is large these days and
we don't need anything fancier. On a busy server this amounts to
230 MBytes or so for 24 hours of data; on less active servers it's
often under 100 MBytes.
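As a rough illustration of these mechanics, here is a minimal Python sketch of a capture script of this kind. It is not the actual script. The 'top' and 'ps' options (Linux procps syntax), the '====' header lines, and the names 'snapshot_name' and 'capture' are all assumptions made for the example.

```python
#!/usr/bin/env python3
"""Save one snapshot of 'top' and 'ps' output, named after the current minute."""
import subprocess
import sys
from datetime import datetime
from pathlib import Path

# Assumed commands (Linux procps syntax); the options the real script
# uses are not given in the text.
COMMANDS = [
    ["top", "-b", "-n", "1"],
    ["ps", "-eo", "pid,user,pcpu,pmem,etime,args"],
]


def snapshot_name(now):
    """Return the snapshot file name for a time, e.g. 'procs-23:10'."""
    return now.strftime("procs-%H:%M")


def capture(directory, now=None, commands=COMMANDS):
    """Run each command and write the combined output to one file.

    Opening the file with mode "w" truncates any file already saved
    under the same name, which is what limits the directory to the
    last 24 hours of snapshots.
    """
    now = now or datetime.now()
    path = Path(directory) / snapshot_name(now)
    with open(path, "w") as out:
        for cmd in commands:
            out.write("==== " + " ".join(cmd) + "\n")
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                errors="replace",  # process arguments are not always valid UTF-8
                check=False,
            )
            out.write(result.stdout)
    return path


# The data directory is passed as the only command line argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    capture(sys.argv[1])
```

A crontab entry along the lines of '* * * * * /usr/local/sbin/capture-procs /var/local/procs' would run it every minute; both paths there are hypothetical. Because the file name only contains the hour and minute, no separate cleanup job is needed.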
Our initial reason for doing this was to be able to identify users with CPU-consuming processes, so we started out only deploying this on our login servers, our general access compute servers (that anyone can log in to at any time), and a few other machines like our general web server. However, over time it became clear that being able to see what was running (and using CPU and RAM) around some time was useful even on servers that aren't user accessible, so we now install the cron job, script, local data directory, and so on on pretty much all of our machines. We don't necessarily look at the information the system captures all that often, but it's a cheap precaution to have in place.
(We also use Unix process accounting on many machines, but that doesn't
give you the kind of moment in time snapshot that capturing 'ps'
output does.)