Wandering Thoughts archives

2020-08-20

What you're looking for with a Grafana dashboard affects its settings

Recently I wrote about how we chose our time intervals in dashboards, where the answer is that we mostly use $__interval because for our purposes this is the best option. But this raises the question of what our purpose with our dashboards actually is. Put another way, why do we not care about seeing brief spikes in our dashboards?

Broadly speaking, I think that dashboards can be there to look for signs of obvious issues, to look for signs of subtle issues, or to diagnose problems in detail (when you already know there's an issue and you're trying to understand what's going on). Pretty much all of our dashboards are for some combination of the first and the last, and we don't normally go looking for subtle issues.

(The flipside of looking for signs of obvious issues is reassuring you that there are no obvious issues right now. From a cynical perspective, this may be the purpose of a lot of overview dashboards.)

When you're looking for obvious issues, broad overviews are generally fine. If you have periodic very short usage spikes but nothing is noticeable on a larger scale, you almost certainly don't have an obvious issue. Similarly, showing very short usage spikes on a broad overview graph isn't necessarily useful unless you believe that these spikes are the sign of a larger issue. As a result, you might as well use $__interval even though it makes short-term spikes disappear when you're looking at longer time periods.
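To make the disappearing spikes concrete, here is a small Python sketch. It uses a made-up per-second usage series with a single ten-second spike, and it assumes the panel averages over each interval-sized window (roughly what an average or rate over $__interval would give you). The function name and all of the numbers are hypothetical, not taken from our actual dashboards.

```python
def window_averages(samples, step):
    """Average consecutive, non-overlapping windows of `step` samples."""
    averages = []
    for start in range(0, len(samples), step):
        window = samples[start:start + step]
        averages.append(sum(window) / len(window))
    return averages

# Hypothetical data: one hour of per-second usage at 5%, with a
# ten-second spike to 100% starting 20 minutes in.
usage = [5.0] * 3600
for i in range(1200, 1210):
    usage[i] = 100.0

# A 15-second interval stands in for a narrow time range, a
# 600-second interval for a wide one (both are assumed values).
fine = window_averages(usage, 15)
coarse = window_averages(usage, 600)

peak_fine = max(fine)      # the spike still dominates its window
peak_coarse = max(coarse)  # the spike is averaged almost down to baseline
print(f"15s windows peak at {peak_fine:.1f}%, 600s windows at {peak_coarse:.1f}%")
```

With these made-up numbers the spike still stands out clearly at the fine interval, while at the coarse interval it is only slightly above the 5% baseline. That is the sense in which short spikes vanish from a broad overview.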

When you're trying to diagnose problems in detail, you already know something is going on and you're probably looking at fine time scales around specific times of interest. At fine time scales, a properly set up Grafana dashboard will show you all of the information available, including fine-grained spikes, because it's using a very short $__interval since it covers only a small time range. This is certainly my experience with our dashboards, where I often wind up looking at only five- or ten-minute time windows in order to try to really understand what was going on at some point.
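As a rough illustration of why narrow time windows get fine detail: Grafana derives $__interval from the dashboard's time range and the number of data points the panel can show, bounded below by the panel's minimum interval. The sketch below is a simplified approximation and not Grafana's actual code (which, among other things, rounds to tidy values). The 1000-point budget and 15-second minimum interval are assumptions picked for illustration.

```python
def approx_interval(range_seconds, max_data_points=1000, min_interval=15):
    """Roughly range / points, never going below the minimum interval.

    This is a hypothetical helper; the defaults are illustrative only.
    """
    return max(range_seconds / max_data_points, min_interval)

ten_minutes = approx_interval(10 * 60)       # clamped to the minimum interval
one_week = approx_interval(7 * 24 * 3600)    # roughly ten minutes per point

print(f"10 minute window: ~{ten_minutes:.0f}s per point")
print(f"1 week window:    ~{one_week:.0f}s per point")
```

Under these assumptions, a ten-minute window is already at the minimum interval, so every sample the metrics system has can show up, while a week-long view lumps roughly ten minutes of data into each point.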

Looking for subtle issues is an interesting challenge in dashboard design. I suspect it's hard to do without knowing a fair bit about how your environment is supposed to behave (or at least believing that you do). At this point it's not something that I'm doing very much of in our dashboard design (although I've sort of done some of it).

(See also the problem of paying too much attention to our dashboards.)

sysadmin/DashboardsWhatForAndSettings written at 23:57:28

