How and why we regularly capture information about running processes

February 19, 2020

In a recent entry, I mentioned that we periodically capture 'ps' and 'top' output on our primary login server, and in fact we do it on pretty much all of our servers. There are three parts to this: the history of how we wound up here, how we do it, and why we've come to do it as a routine thing on our servers.

We had another monitoring system before our current Prometheus-based one. One of its handy features was that when it triggered a load average alert, the alert email would include 'top' output rather than just the load average. Often this led us right to the cause (generally a user running some CPU-heavy thing), even if it had gone away by the time we could look at the server. Prometheus can't do this in any reasonable way, so I did the next best thing and set up a system to capture 'top' and 'ps' information periodically and save it on the machine for a while. The process information wouldn't be right there in the email any more, but at least we could still go look it up later.

Mechanically, this is a cron job and a script that runs every minute and saves 'top' and 'ps' output to a file called 'procs-<HH>:<MM>' (e.g. 'procs-23:10') in a local directory set aside for this purpose (in /var on the system). Using a file naming scheme based on the hour and minute the cron job started, and overwriting any current file with that name, means that we keep the last 24 hours of data (under normal circumstances). The files are just plain text without any compression, because disk space is cheap these days and we don't need anything fancier. On a busy server this amounts to 230 MBytes or so for 24 hours of data; on less active servers it's often under 100 MBytes.
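As a rough illustration, a minimal version of the cron entry and script could look something like the following; the specific paths, file names, and command options here are illustrative assumptions, not necessarily our exact setup:

    # Hypothetical /etc/cron.d entry: take a process snapshot every minute.
    * * * * * root /usr/local/sbin/capture-procs

    #!/bin/sh
    # capture-procs: save one snapshot of 'top' and 'ps' output.
    # The directory and naming scheme here are illustrative.
    dir=/var/local/procs
    out="$dir/procs-$(date +%H:%M)"

    # Writing to a fixed HH:MM name overwrites yesterday's snapshot for
    # this minute, so the directory holds roughly the last 24 hours.
    {
        date
        top -b -n 1
        echo
        ps auxww
    } > "$out"

Overwriting files in place this way means there's no separate cleanup job; old snapshots simply age out after a day.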

Our initial reason for doing this was to be able to identify users with CPU-consuming processes, so we started out deploying this only on our login servers, our general access compute servers (the ones anyone can log in to at any time), and a few other machines like our general web server. However, over time it became clear that being able to see what was running (and using CPU and RAM) around a particular time was useful even on servers that aren't user accessible, so the cron job, script, local data directory, and so on now go onto pretty much all of our machines. We don't necessarily look at the captured information all that often, but it's a cheap precaution to have in place.

(We also use Unix process accounting on many machines, but that doesn't give you the kind of moment-in-time snapshot that capturing 'top' and 'ps' output does.)


Comments on this page:

By Arnaud Gomes at 2020-02-19 02:56:26:

We have a more hosting-oriented version of the same at work: we log the output of ps, MySQL full processlist (including things like locking info on some select machines), netstat output and the server status of every Apache and Nginx we run. We don't use these logs often but we find them invaluable the few times a year we do need them.

A side-effect (actually wrapped in a couple of scripts) is that we can compare the output of ps before and after a reboot to make sure everything is properly autostarted.
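(For illustration, one minimal way such a before-and-after comparison might work, with hypothetical snapshot paths, is:)

    # Before the reboot: record what is running (hypothetical paths).
    ps -eo comm --no-headers | sort -u > /var/tmp/ps-before
    # After the reboot: record again and list commands that did not come back.
    ps -eo comm --no-headers | sort -u > /var/tmp/ps-after
    comm -23 /var/tmp/ps-before /var/tmp/ps-after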

I guess I ought to write all of this at length in a blog post. :-)

By erlogan at 2020-02-19 13:38:20:

Have you considered using atop?
