Exploring casual questions with our new metrics system
On Mastodon a while back, I said:
What's surprised me about having our new metrics system is how handy it is to be able to answer casual questions, like 'does this machine's CPU get used much', and how often I poke around with that sort of little question or small issue I'm curious about.
(I was expecting us to only really care about metrics when trying to diagnose problems or crises.)
Putting together some sort of metrics and performance statistics system for our servers has been one of my intentions for many years now (see, for example, this 2014 entry, or this one on our then goals). Over all of that time, I have assumed that what mattered for this hypothetical system and what we'd use it for was being able to answer questions about problems (often serious ones), partly to make sure we actually understood our problem, or do things like check for changes in things we think are harmless. Recently we actually put together such a system, based around Prometheus and Grafana, and my experience with it so far has been rather different than I expected.
Over and over again, I've turned to our metrics system to answer relatively casual or small questions where it's simply useful to have the answers, not important or critical. Sometimes it's because we have questions such as how used a compute machine's CPU or memory is; sometimes it's let me confirm an explanation for a little mystery. Some of the time I don't even have a real question, I'm just curious about what's going on with a machine or a service. For instance, I've looked into what our Amanda servers are doing during backups and turned up interesting patterns in disk IO, as well as confirming and firming up some vague theories we had about how they performed and what their speed limits were.
(And just looking at systems has turned up interesting information, simply because I was curious or trying to put together a useful dashboard.)
The common element in all of this is that having a metrics system now makes asking questions and getting answers a pretty easy process. It took a lot of work to get to this point, but now that I've reached it I can plug PromQL queries into Prometheus or look at the dashboards I've built up and pull out a lot with low effort. Since it only takes a little effort to look, I wind up looking a fair bit, even for casual curiosities that we would never have bothered exploring before.
I didn't see this coming at all, not over all of the time that I've been circling around system performance stats and metrics and so on. Perhaps this is to be expected; our focus from the start has been on looking for problems and dealing with them, and when people talk about metrics systems it's mostly about how their system let them see or figure out something important about their environment.
(This focus is natural, since 'it solved our big problem' is a very good argument for why you want a metric system and why investing the time to set one up was a smart decision.)
PS: This is of course yet another example of how reducing friction increases use and visibility and so on. When it is easy to do something, you often wind up doing it more often, as I've seen over and over again.