2024-03-09
Some thoughts on usage data for your systems and services
Some day, you may be called on by decision makers (including yourself) to provide some sort of usage information for things you operate so that you can make decisions about them. I'm not talking about system metrics such as how much CPU is being used (although for some systems that may be part of higher-level usage information, for example for our SLURM cluster); this is more on the level of how much things are being used, by whom, and perhaps for what. In the very old days we might have called this 'accounting data' (and perhaps disdained collecting it unless we were forced to by things like chargeback policies).
In an ideal world, you will already be generating and retaining the sort of usage information that can be used to make decisions about services. But internal services aren't necessarily automatically instrumented the way revenue-generating things are, so you may not have this sort of thing built in from the start. In this case, you'll generally wind up hunting around for creative ways to generate higher-level usage information from the low-level metrics and logs that you do have. When you do this, my first suggestion is to write down how you generated your usage information. This probably won't be the last time you need to generate usage information, and if decision makers (including you in the future) have questions about exactly what your numbers mean, you can go back and look at exactly how you generated them to provide answers.
(Of course, your systems may have changed around by the next time you need to generate usage information, so your old ways don't work or aren't applicable. But at least you'll have something.)
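As a small illustration of keeping the method written down next to the numbers, here is a sketch in Python that derives per-user login counts from log lines. The log format, the regular expression, and the count_logins function are all made up for this example; real logs will look different, and what counts as 'one use' is a decision you'd have to make (and record) yourself.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for this sketch:
#   2024-03-01T10:15:00 imapd login user=alice ip=192.0.2.10
LOGIN_RE = re.compile(r"^(\S+) (\S+) login user=(\S+)")


def count_logins(lines):
    """Count logins per (service, user).

    Method: each 'login' line counts as one use. Lines that do not
    match LOGIN_RE (logouts, failures, anything else) are ignored.
    """
    counts = Counter()
    for line in lines:
        m = LOGIN_RE.match(line)
        if m:
            _timestamp, service, user = m.groups()
            counts[(service, user)] += 1
    return counts


sample = [
    "2024-03-01T10:15:00 imapd login user=alice ip=192.0.2.10",
    "2024-03-01T10:16:02 imapd login user=bob ip=192.0.2.11",
    "2024-03-01T11:00:41 imapd logout user=alice",
    "2024-03-02T09:30:13 imapd login user=alice ip=192.0.2.10",
]

for (service, user), n in sorted(count_logins(sample).items()):
    print(service, user, n)
```

The docstring is the important part here; it is the 'how I generated this' note that you can come back to when someone asks what the numbers mean.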
My second suggestion is to look around today to see if there's data you can easily collect and retain now that will let you provide better usage information in the future. This is obviously related to keeping your logs longer, but it also includes making sure that things make it to your logs (or at least to your retained logs, which may mean setting things to send their log data to syslog instead of keeping their own log files). At this point I will sing the praises of things like 'end of session' summary log records that put all of the information about a session in a single place instead of forcing you to put the information together from multiple log lines.
(An especially good time to do this is right after you've been through the exercise of generating usage data, because you'll be familiar with all of the bits that were troublesome or where you could only provide limited data.)
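To make the 'end of session' summary idea concrete, here is a rough Python sketch of the work that such a record saves you from: stitching one summary per session together out of several separate events. The event layout (dicts with 'session', 'type', 'time', 'user', and 'bytes' keys) and the summarize_sessions function are assumptions invented for this example, not the format of any real service.

```python
def summarize_sessions(events):
    """Yield one summary dict per finished session.

    Each event is a dict with hypothetical keys: 'session', 'type'
    ('start', 'transfer', or 'end') and 'time' (in seconds), plus
    'user' on 'start' events and 'bytes' on 'transfer' events.
    """
    open_sessions = {}
    for ev in events:
        sid = ev["session"]
        if ev["type"] == "start":
            open_sessions[sid] = {
                "session": sid,
                "user": ev["user"],
                "start": ev["time"],
                "bytes": 0,
            }
        elif sid not in open_sessions:
            # We never saw the start (perhaps it was in an older,
            # already-expired log), so we can't build a summary.
            continue
        elif ev["type"] == "transfer":
            open_sessions[sid]["bytes"] += ev["bytes"]
        elif ev["type"] == "end":
            summary = open_sessions.pop(sid)
            summary["duration"] = ev["time"] - summary.pop("start")
            yield summary


events = [
    {"session": "s1", "type": "start", "time": 100, "user": "alice"},
    {"session": "s1", "type": "transfer", "time": 130, "bytes": 2048},
    {"session": "s2", "type": "transfer", "time": 140, "bytes": 999},
    {"session": "s1", "type": "transfer", "time": 150, "bytes": 1024},
    {"session": "s1", "type": "end", "time": 160},
]

summaries = list(summarize_sessions(events))
for s in summaries:
    print(s)
```

Notice that the 's2' event is silently dropped because its start was never seen. That is exactly the sort of gap you run into when reconstructing sessions after the fact, and one that a single summary record written by the service itself would avoid.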
Of course, retaining lots of logs and usage data has privacy implications. This may be a good time to ask around to get advance agreement on what sort of usage information you want to be able to provide and what sort you definitely don't want to have available for people to ask for. This is also another use for arranging to log your own 'end of session' summary records, because if you're doing it yourself you can arrange to include only the usage information you've decided is okay.
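One way to approach the 'only what you've decided is okay' part, sketched in Python: keep an explicit allowlist of fields and build the summary log line from nothing else. The field names, the 'session-summary' prefix, and the summary_line function are hypothetical, picked only for illustration.

```python
import json

# Hypothetical allowlist: the only fields agreed on in advance.
ALLOWED_FIELDS = ("user", "duration", "bytes")


def summary_line(session):
    """Format an end-of-session log line from allowlisted fields only."""
    record = {k: session[k] for k in ALLOWED_FIELDS if k in session}
    return "session-summary " + json.dumps(record, sort_keys=True)


session = {
    "user": "alice",
    "duration": 60,
    "bytes": 3072,
    "client_ip": "192.0.2.10",   # known to the service, never logged
    "mailbox": "INBOX/Private",  # likewise
}

print(summary_line(session))
```

With an allowlist, a field that someone adds to the session data later stays out of the logs until somebody deliberately decides it belongs there, which is the safer default if you're worried about what people might later ask you for.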