The problem of paying too much attention to our dashboards
On Mastodon, I said:
Our Grafana dashboards are quite shiny, at least to me (since I built them), but I really should start resisting the compulsive urge to take a look at them all the time just to see what's going on and look at the pretty zigzagging lines.
I have a bad habit of looking at shiny things that I've put together, and dashboards are extremely shiny (even if some of them are almost all text). There are two problems with this, the obvious and the somewhat subtle.
The obvious problem is that, well, I'm spending my time staring somewhat mindlessly at pretty pictures. It's interesting to watch lines wiggle around or to look at collections of numbers, but it's generally not informative. It's especially not informative for our systems because our systems spend almost all of their time working fine, which means that there is no actual relevant information to be had from all of these dashboards. In terms of what I spend (some) time on, I would be better off if we had one dashboard with one box that said 'all is fine'.
This is a general issue with dashboards for healthy environments; if things are fine, your dashboards are probably telling you nothing or at least nothing that is of general interest and importance.
(Your dashboards may be telling you details and at some point you may want access to those details, like how many email messages you typically have in your mailer queues, but they are not generally important.)
The more subtle problem is the general problem of metrics, which is a variant of Goodhart's law. Once you have a metric and you pay attention to the metric, you start to focus on the metric. If you have a dashboard of metrics, it's natural to pay attention to the metrics and to exceptions in the metrics, whether or not they actually matter. It may or may not matter that a machine has an unusually high load average, but if it's visible, you're probably going to focus on it and maybe dig into it. Perhaps there is a problem, but often there isn't, especially if you're surfacing a lot of things on your dashboards because they could be useful.
(One of the things behind this is that all measures have some amount of noise and natural variation, but as human beings we have a very strong drive to uncover patterns and meaning in what we see. If you think you see some exceptional pattern, it may or may not be real but you can easily spend a bunch of time trying to find out and testing theories.)
My overall conclusion from my own experiences with our new dashboards and metrics system is that if you have good alerts, you (or at least I) would be better off only looking at dashboards if there is some indication of actual problems, or if you have specific questions you'd like to answer. In practice, trawling through our dashboards for 'is there anything interesting' is a great way to spend some time and divert myself down any number of alleyways, most of them not useful ones.
(In a way the worst times are the times when looking at our dashboards actually is useful, because that just encourages me to do it more.)
PS: This is not the first time I've seen the effects of something like this; I wrote about an earlier occasion way back in Metrics considered dangerous.