How we choose our time intervals in our Grafana dashboards

August 7, 2020

In a comment on my entry on our Prometheus and Grafana setup, trallnag asked a good question:

Would you mind sharing your concrete approach to setting the time intervals for functions like rate() and increase()?

This is a good question, because trallnag goes on to cover why this is an issue you may want to think about:

I tend to switch between using $__interval, completely fixed values like 5m or a Grafana interval variable with multiple interval to choose from. None are perfect and all fail in certain circumstances, ranging from missing spikes with $__interval to under or oversampling with custom intervals.

The very simple answer is that so far I've universally used $__interval, which is Grafana's templating variable for 'whatever the step is on this graph given the time scale you're currently covering'. Using $__interval means that your graph is (theoretically) continuous but without oversampling; every moment in time is used for one and only one graph point.

The more complete answer is that we use $__interval but often tell Grafana that the query has a minimum interval, one that is usually slightly larger than how often we generate the metric. When you use rate(), increase(), and their kin, you need to make sure that your interval always has at least two metric points; otherwise they give you no value and your graphs look funny. Since $__interval shrinks as you zoom in to shorter time ranges, we have to set a minimum interval to keep it from dropping below that.
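
As a rough illustration of the above, here is what such a panel query might look like. The metric name and the device label are assumptions picked for the example (they are not from our dashboards), and the minimum itself is set in the panel's 'Min interval' query option in Grafana, not in the query text.

```promql
# Hypothetical panel query; the metric and label are illustrative only.
# Grafana replaces $__interval with the graph's step for the current
# time range, never going below the panel's "Min interval" setting.
# The assumption is that Min interval is large enough that the range
# always covers at least two samples of the metric.
rate(node_network_receive_bytes_total{device="eno1"}[$__interval])
```

Zoomed far out, $__interval becomes large and each point averages over many samples; zoomed far in, the minimum interval is what keeps rate() from returning nothing.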

In a few graphs I've experimented with combining rate() and irate() with an or clause:

rate( ...[$__interval] ) or
   irate( ...[4m] )

The idea here is that if the interval is too short to get two metric points, the rate() will generate nothing and we fall through to irate(), which will give us the rate across the two most recent metric points (see rate() versus irate()). Unfortunately, this is both annoying to write (since you have to repeat your metric condition) and inefficient (since Prometheus will always evaluate both the rate() and the irate()), so I've mostly abandoned it.
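
Written out in full with a made-up selector (the metric name and label here are assumptions, used only to make the repetition visible), the pattern looks something like this:

```promql
# Hypothetical selector, repeated in both halves of the expression.
# If [$__interval] holds fewer than two samples, rate() returns nothing
# for that series and the 'or' falls through to the irate() result.
rate(node_network_receive_bytes_total{device="eno1"}[$__interval])
or
irate(node_network_receive_bytes_total{device="eno1"}[4m])
```

Both functions drop the metric name and keep the same labels, so 'or' only takes a series from the irate() side when the rate() side has no value for it.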

The high level answer is that we use $__interval because I don't have a reason to make things more complicated. Our Grafana dashboards are for overviews (even detailed overviews), not narrow troubleshooting, and I feel that for this a continuous graph is generally the most useful. It's certainly the easiest to make work at both small and large timescales (including ones like 'the last week'). We're also in the position where we don't care specifically about the rate of anything over a fixed interval (eg, 'error rate in the last 5 minutes should be under ...'), and probably don't care about momentary spikes, especially when we're using a large time range with a dashboard.

(Over a small time range, a continuous graph of rate() will show you all of the spikes and dips. Or you can go into Grafana's 'Explore' and switch to irate() over a fixed, large enough interval.)

If we wanted to always see short spikes (or dips) even on dashboards covering larger time ranges, we'd have to use the more complicated approach I covered in using Prometheus subqueries to look for spikes in rates. There's no clever choice of interval in Grafana that will get you out of this for all time ranges and situations, and Prometheus currently has no way to find these spikes or dips short of writing out the subquery. Going down this road also requires figuring out if you care about spikes, dips, or both, and, if it's both, how to represent them on a dashboard graph without overloading it (and yourself).
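
A minimal sketch of what the subquery approach can look like, as two separate panel queries. The metric name, the label, and the inner 1m range are all assumptions for illustration; the inner range needs to be long enough to always cover at least two samples.

```promql
# Query A (spikes): the highest 1m rate seen within each graph step.
# The subquery re-evaluates rate() repeatedly across the $__interval
# window, and max_over_time() keeps the largest result.
max_over_time(
  rate(node_network_receive_bytes_total{device="eno1"}[1m])[$__interval:]
)

# Query B (dips): the same idea, keeping the lowest 1m rate instead.
min_over_time(
  rate(node_network_receive_bytes_total{device="eno1"}[1m])[$__interval:]
)
```

With nothing after the colon, the subquery steps at Prometheus's default evaluation interval. Each of these costs more than a plain rate(), since the inner expression is evaluated many times per graph point.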

(Also, the metrics we generally graph with rate() are things that we expect to periodically have short term spikes (often to saturation, for things like CPU usage and network bandwidth). A dashboard calling out that these spikes happened would likely be too noisy to be useful.)

PS: This issue starts exposing a broader issue of what your Grafana dashboards are for, but that's another entry.

Written on 07 August 2020.

Last modified: Fri Aug 7 22:06:10 2020