Why we should care about usage data for our internal services

March 11, 2024

I recently wrote about some practical-focused thoughts on usage data for your services. But there's a broader issue about usage data for services and having or not having it. My sense is that for a lot of sysadmins, building things to collect usage data feels like accounting work and likely to lead to unpleasant and damaging things, like internal chargebacks (which have create various problems, and also). However, I think we should strongly consider routinely gathering this data anyway, for fundamentally the same reasons as you should collect information on what TLS protocols and ciphers are being used by your people and software.

We periodically face decisions both obvious and subtle about what to do about services and the things they run on. Do we spend the money to buy new hardware, do we spend the time to upgrade the operating system or the version of the third party software, do we need to closely monitor this system or service, does it need to be optimized or be given better hardware, and so on. Conversely, maybe this is now a little-used service that can be scaled down, dropped, or simplified. In general, the big question is do we need to care about this service, and if so how much. High level usage data is what gives you most of the real answers.

(In some environments one fate for narrowly used services is to be made the responsibility of the people or groups who are the service's big users, instead of something that is provided on a larger and higher level.)

Your system and application metrics can provide you some basic information, like whether your systems are using CPU and memory and disk space, and perhaps how that usage is changing over a relatively long time base (if you keep metrics data long enough). But they can't really tell you why that is happening or not happening, or who is using your services, and deriving usage information from things like CPU utilization requires either knowing things about how your systems perform or assuming them (eg, assuming you can estimate service usage from CPU usage because you're sure it uses a visible amount of CPU time). Deliberately collecting actual usage gives you direct answers.

Knowing who is using your services and who is not also gives you the opportunity to talk to both groups about what they like about your current services, what they'd like you to add, what pieces of your service they care about, what they need, and perhaps what's keeping them from using some of your services. If you don't have usage data and don't actually ask people, you're flying relatively blind on all of these questions.

Of course collecting usage data has its traps. One of them is that what usage data you collect is often driven by what sort of usage you think matters, and in turn this can be driven by how you expect people to use your services and what you think they care about. Or to put it another way, you're measuring what you assume matters and you're assuming what you don't measure doesn't matter. You may be wrong about that, which is one reason why talking to people periodically is useful.

PS: In theory, gathering usage data is separate from the question of whether you should pay attention to it, where the answer may well be that you should ignore that shiny new data. In practice, well, people are bad at staying away from shiny things. Perhaps it's not a bad thing to have your usage data require some effort to assemble.

(This is partly written to persuade myself of this, because maybe we want to routinely collect and track more usage data than we currently do.)

Written on 11 March 2024.
« Scheduling latency, IO latency, and their role in Linux responsiveness
What do we count as 'manual' management of TLS certificates »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Mar 11 22:47:02 2024
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.