== How and why we regularly capture information about running processes

In [[a recent entry ../linux/LoadAverageMultiuserSpikes]], I mentioned that we periodically capture _ps_ and _top_ output on our primary login server, and in fact we do this on pretty much all of our servers. There are three parts to this: the history of how we wound up here, how we do it, and why we've come to do it as a routine thing on our servers.

We had another monitoring system before [[our current Prometheus based one PrometheusGrafanaSetup-2019]]. One of its handy features was that when it triggered a load average alert, the alert email would include '_top_' output rather than just the load average. Often this led us right to the cause (generally a user running some CPU-heavy thing), even if it had gone away by the time we could look at the server. Prometheus can't do this in any reasonable way, so I did the next best thing by setting up a system to capture '_top_' and '_ps_' information periodically and save it on the machine for a while. The process information wouldn't be right there in the email any more, but at least we could still go look it up later.

Mechanically, this is a cron job and a script that runs every minute and saves '_top_' and '_ps_' output to a file named 'procs-HH:MM' (eg 'procs-23:10') in a specific local directory for this purpose (in _/var_ on the system). Using a file naming scheme based on the hour and minute the cron job started and overwriting any current file with that name means that we keep the last 24 hours of data (under normal circumstances). The files are just plain text files without any compression, because disk space is large these days and we don't need anything fancier. On a busy server this amounts to 230 MBytes or so for 24 hours of data; on less active servers it's often under 100 MBytes.

Our initial reason for doing this was to be able to identify users with CPU-consuming processes, so we started out only deploying this on our login servers, our general access compute servers (which anyone can log in to at any time), and a few other machines like our general web server. However, over time it became clear that being able to see what was running (and using CPU and RAM) around a particular time was useful even on servers that aren't user accessible, so we now install the cron job, script, local data directory, and so on on pretty much all of our machines. We don't necessarily look at the information the system captures all that often, but it's a cheap precaution to have in place.

(We also use Unix process accounting on many machines, but that doesn't give you the kind of moment-in-time snapshot that capturing '_top_' and '_ps_' output does.)
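
(As a concrete illustration, here is a minimal sketch of what this sort of per-minute capture can look like. Everything specific in it is my assumption for the purposes of illustration: the script name and path, the _/var/procinfo_ data directory (the entry above only says it lives somewhere in _/var_), and the exact _top_ and _ps_ flags. First, an _/etc/cron.d_ style entry to run the script every minute:

    # /etc/cron.d/capture-procs (hypothetical): run the snapshot script every minute
    * * * * *  root  /usr/local/sbin/capture-procs

And then the script itself:

    #!/bin/sh
    # capture-procs (hypothetical sketch): save one snapshot of 'top' and
    # 'ps' output to a file named for the current hour and minute. Because
    # the name repeats every 24 hours, each run overwrites the snapshot
    # from a day ago, giving a self-expiring rolling day of data.
    DIR=/var/procinfo         # assumed data directory; the post only says 'in /var'
    OUT="$DIR/procs-$(date +%H:%M)"

    mkdir -p "$DIR"
    {
        date                  # timestamp this snapshot
        top -b -n 1           # one batch-mode pass of top
        echo
        ps auxww              # full wide process listing
    } > "$OUT"

The pleasant property of the 'procs-HH:MM' naming scheme is that there's no separate cleanup job; overwriting the file from 24 hours ago is the expiry mechanism.)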