2020-08-20
What you're looking for with a Grafana dashboard affects its settings
Recently I wrote about how we chose our time intervals in dashboards, where the answer is that we mostly
use $__interval because for our purposes this is the best option. But this raises
the question of what is our purpose with our dashboards. Put another
way, why do we not care about seeing brief spikes in our dashboards?
Broadly speaking, I think that dashboards can be there to look for signs of obvious issues, to look for signs of subtle issues, or to diagnose problems in detail (when you already know there's an issue and you're trying to understand what's going on). Pretty much all of our dashboards are for some combination of the first and the last, and we don't normally go looking for subtle issues.
(The flipside of looking for signs of obvious issues is reassuring you that there are no obvious issues right now. From a cynical perspective, this may be the purpose of a lot of overview dashboards.)
When you're looking for obvious issues, broad overviews are generally
fine. If you have periodic very short usage spikes but nothing else
notices on a larger scale, you almost certainly don't have an
obvious issue. Similarly, showing very short usage spikes on a
broad overview graph isn't necessarily useful unless you believe
that these spikes are the sign of a larger issue. As a result, you
might as well use $__interval even though it makes short term
spikes disappear when you're looking at longer time periods.
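(To illustrate why spikes vanish, here is a minimal Python sketch. It assumes the panel shows an average over each interval, which is only one of the ways a query can use $__interval. The data and the windowed_averages helper are made up for illustration and are not anything from Grafana.)

```python
def windowed_averages(samples, window):
    """Average consecutive, non-overlapping chunks of `window` samples."""
    chunks = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]

# Hypothetical data: one sample per second for an hour, with a baseline
# usage of 1 and a single five-second spike to 100 in the middle.
samples = [1.0] * 3600
for i in range(1800, 1805):
    samples[i] = 100.0

# A short interval (15 seconds) versus a long one (10 minutes).
peak_fine = max(windowed_averages(samples, 15))
peak_coarse = max(windowed_averages(samples, 600))

print(f"peak with 15s windows:  {peak_fine:.3f}")
print(f"peak with 600s windows: {peak_coarse:.3f}")
```

With the short window the spike still stands out clearly above the baseline; with the long window it is averaged down to something that barely differs from the baseline, which is the effect you see on a broad overview graph.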
When you're trying to diagnose problems in detail you already know
something is going on and you're probably looking at fine time
scales around specific times of interest. At fine time scales, a
properly set up Grafana dashboard will show you all of the information
available, including fine grained spikes, because it's using a very
short $__interval since it covers only a small time range. This
is certainly my experience with our dashboards, where I often wind
up looking at only five or ten minute time windows in order to try
to really understand what was going on at some point.
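(As a rough sketch of why a small time range gives you a short $__interval: Grafana derives it from the dashboard's time range and how many data points the panel can show, subject to a minimum interval. The function below is my approximation of that idea, not Grafana's actual code. It ignores the rounding to "nice" values that Grafana does, and the 1000 data points and 15 second minimum are assumed numbers, not our real settings.)

```python
def approx_interval_seconds(range_seconds, max_data_points, min_interval):
    """Approximate $__interval as time range / data points, with a floor.

    This is an illustration only; real Grafana also rounds the result.
    """
    return max(range_seconds / max_data_points, min_interval)

# Assumed panel settings (hypothetical): 1000 data points and a minimum
# interval of 15 seconds.
ten_minutes = approx_interval_seconds(10 * 60, 1000, 15)
one_week = approx_interval_seconds(7 * 24 * 3600, 1000, 15)

print(f"10 minute range: about {ten_minutes} seconds per point")
print(f"1 week range:    about {one_week} seconds per point")
```

Under these assumptions a ten minute window bottoms out at the minimum interval, so every data point that exists gets shown, while a week-long view covers roughly ten minutes per point.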
Looking for subtle issues is an interesting challenge in dashboard design. I suspect it's hard to do without knowing a fair bit about how your environment is supposed to behave (or at least believing that you do). At this point it's not something that I'm doing very much of in our dashboard design (although I've sort of done some of it).
(See also the problem of paying too much attention to our dashboards.)