How we choose our time intervals in our Grafana dashboards
In a comment on my entry on our Prometheus and Grafana setup, trallnag asked a good question:
Would you mind sharing your concrete approach to setting the time intervals for functions like rate() and increase()?
This is a good question, because trallnag goes on to cover why this is an issue you may want to think about:
I tend to switch between using $__interval, completely fixed values like 5m or a Grafana interval variable with multiple interval to choose from. None are perfect and all fail in certain circumstances, ranging from missing spikes with $__interval to under or oversampling with custom intervals.
The very simple answer is that so far I've universally used $__interval, which is Grafana's templating variable for 'whatever the step is on this graph given the time scale you're currently covering'. Using $__interval means that your graph is (theoretically) continuous but without oversampling; every moment in time is used for one and only one graph point.
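As a minimal sketch of what this looks like in a panel query, assuming a node_exporter style counter (the metric name and the device label are assumptions for illustration only):

```promql
# Per-second receive rate, with the rate() window tied to the graph's step.
# Grafana replaces $__interval with a concrete duration (for example 30s or 5m)
# before the query is sent to Prometheus.
rate(node_network_receive_bytes_total{device="eth0"}[$__interval])
```

Zooming the dashboard out makes Grafana substitute a larger duration, so each graph point still covers its own slice of time.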
The more complete answer is that we use $__interval but often tell Grafana that there is a minimum interval for the query that is usually slightly larger than how often we generate the metric. When you use rate(), increase(), and their kin, you need to make sure that your interval always has at least two metric points, otherwise they give you no value and your graphs look funny. Since we're using variable intervals, we have to set the minimum interval.
In a few graphs I've experimented with combining rate() and irate() in an expression like this:
rate( ...[$__interval] ) or irate( ...[4m] )
The idea here is that if the interval is too short to get two metric points, rate() will generate nothing and we fall through to irate(), which will give us the rate across the two most recent metric points.
Unfortunately, this is both annoying to write (since you have to repeat your metric condition) and inefficient (since Prometheus will always evaluate both the rate() and the irate()), so I've mostly abandoned it.
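Written out in full, the combination might look like the following sketch. The metric and its mode label are assumptions borrowed from node_exporter purely for illustration; the 4m range is the one from the expression above.

```promql
# Prefer rate() over the graph's interval. Where that window holds fewer than
# two samples, rate() returns nothing for the series and 'or' fills in the
# irate() result instead. Note that the selector has to be written out twice.
  rate(node_cpu_seconds_total{mode="idle"}[$__interval])
or
  irate(node_cpu_seconds_total{mode="idle"}[4m])
```

The fallback works because both functions drop the metric name and keep the same labels, so 'or' only adds the irate() series that have no matching rate() result.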
The high level answer is that we use $__interval because I don't have a reason to make things more complicated. Our Grafana dashboards are for overviews (even detailed overviews), not narrow troubleshooting, and I feel that for this a continuous graph is generally the most useful. It's certainly the easiest to make work at both small and large timescales (including ones like 'the last week'). We're also in the position where we don't care specifically about the rate of anything over a fixed interval (eg, 'error rate in the last 5 minutes should be under ...'), and probably don't care about momentary spikes, especially when we're using a large time range with a dashboard.
(Over a small time range, a continuous graph of rate() will show you all of the spikes and dips. Or you can go into Grafana's 'Explore' and use irate() over a fixed, large enough interval.)
If we wanted to always see short spikes (or dips) even on dashboards covering larger time ranges, we'd have to use the more complicated approach I covered in using Prometheus subqueries to look for spikes in rates. There's no clever choice of interval in Grafana that will get you out of this for all time ranges and situations, and Prometheus currently has no way to find these spikes or dips short of writing out the subquery. Going down this road also requires figuring out if you care about spikes, dips, or both, and if it's both how to represent them on a dashboard graph without overloading it (and yourself).
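As a rough sketch of the subquery approach (not necessarily the exact query from that entry), the following assumes a hypothetical node_exporter style counter and a scrape interval comfortably under the 1m inner range:

```promql
# Highest 1-minute rate seen anywhere inside each graph interval.
# The inner rate() is re-evaluated repeatedly across the subquery range; the
# empty resolution after the colon means Prometheus uses its default
# evaluation interval. max_over_time() then keeps only the peak.
max_over_time(
  rate(node_network_receive_bytes_total{device="eth0"}[1m])[$__interval:]
)
```

Swapping in min_over_time() would look for dips instead, which is part of why you have to decide up front which of the two you care about.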
(Also, the metrics we generally graph with rate() are things that we expect to periodically have short term spikes (often to saturation, for things like CPU usage and network bandwidth). A dashboard calling out that these spikes happened would likely be too noisy to be useful.)
PS: This issue starts exposing a broader issue of what your Grafana dashboards are for, but that's another entry.