Having metrics for something attracts your attention to it

May 17, 2023

For reasons beyond the scope of this entry, we didn't collect any metrics from our Ubuntu 18.04 ZFS fileservers (trying to do so early on led to kernel panics). When we upgraded all of them to Ubuntu 22.04, we changed this, putting various host agents on them and collecting a horde of metrics that go into our Prometheus metrics system, some of which automatically appear on our dashboards. One of the results of this is that we've started noticing things about what's happening on our fileservers. For example, at various times, we've noticed significant NFS read volume, significant NFS RPC counts, visible load averages, and specific moments when the ZFS ARC has shrunk. Noticing these things has led us to investigate some of them and pushed me to put together tools to make this easier.

What we haven't seen is any indication that these things we're now noticing are causing issues on our NFS clients (i.e., our normal Ubuntu servers), or that they're at all unusual. Right now, my best guess is that everything we're seeing has been quietly going on for some time. Every so often for years, people have run jobs on our SLURM cluster that repeatedly read a lot of data over NFS, other people have run things that scan directories a lot, and I know our ZFS ARC size has been bouncing around for a long time. What has changed isn't the fileservers' behavior but our visibility into it: metrics attract attention, at least when they're new.

This isn't necessarily a bad thing, as long as we don't over-react. Before we had these metrics, we probably had very little idea of what a normal operating state looked like for our fileservers, so if we'd had to look at them during a problem, we'd have struggled to tell what was normal from what was exceptional. Now we're learning more, and in a while the various things these metrics are telling us probably won't be surprising news (to a certain extent that's already happening).

This is in theory not a new idea for me, but it's one thing to know it intellectually and another thing to experience it as new metrics appear and I start digging into them and what they expose. It's at least been a while since I went through this experience, and this time around is a useful reminder.

(This is related to the idea that having metrics for something can be dangerous and also that dashboards can be overly attractive. Have I maybe spent a bit too much time fiddling with ZFS ARC metrics when our ARC sizes don't really matter because our ARC hit rates are high? Possibly.)

PS: Technically what attracts attention is being able to readily see those metrics, not the metrics themselves. We collect huge piles of metrics that draw no attention at all because they go straight into the Prometheus database and never get visualized on any dashboards. But that's a detail, so let's pretend that we collect metrics because we're going to use them instead of because they're there by default.


Last modified: Wed May 17 18:08:37 2023