2021-09-30
Moving averages (and rates) for metrics in Prometheus (and Grafana)
Today, I wanted to check if our external mail gateway (which uses a TLS certificate from Let's Encrypt) was seeing a drop in incoming traffic, since various problems have been cropping up as a result of the expiry of a Let's Encrypt related TLS root certificate. We extract metrics from our Exim logs using mtail, so the raw information was available in metrics, but we get a sufficiently modest amount of external email that our natural external mail arrival rate is relatively variable on a minute to minute basis.
When I do metrics graphs of rates or averages (either in Grafana
for dashboards or in Prometheus to explore things), I normally set
the time interval for things like rate() to the step interval. In
Grafana I use $__interval, while
in Prometheus I will see what it reports the step interval as and
then copy that into the time interval. This means that my graphs
are (theoretically) continuous but without oversampling. However,
this doesn't work well with metrics like our email arrival rate.
If you look at a relatively moderate time range like a few hours (such
as '8 am to now'), you have a small time interval and a spiky graph
that's hard to see long term trends in; if you look at a long time
range (a few days), moderate duration trends can disappear into the
mush of long time intervals.
So I had what felt like a clever idea: why not, for once, use a time interval that was significantly longer than the step interval (here 30 minutes or an hour)? I felt this would smooth out short term variability and reveal longer term trends more clearly. When I tried it out, it gave me a more or less readable graph that suggested no particularly visible drops in incoming email volume.
I had just used a moving average, which is a well known statistical technique to "smooth out short-term fluctuations and highlight longer-term trends or cycles", to quote from Wikipedia. Specifically I had a simple moving average, in the form that is sometimes called a trailing average (since it only looks backward from your starting point, instead of looking on either side of it).
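To make the idea concrete, here is a minimal Python sketch of a simple trailing moving average. It illustrates the statistical technique only; it is not how Prometheus computes anything internally. The function name, the window of 4 and the per-minute counts are all made up for the example.

```python
from collections import deque


def trailing_average(values, window):
    """Simple trailing moving average.

    Each output is the mean of the current value and up to
    window - 1 values before it. The first few outputs use fewer
    values because there is nothing earlier to look back at.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # old values fall off the left side
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical per-minute message counts, invented for illustration.
per_minute = [4, 0, 9, 1, 6, 2, 8, 0]
smoothed = trailing_average(per_minute, window=4)
print(smoothed)
```

The raw counts swing between 0 and 9, while the values averaged over a full window of four stay between 3.5 and 4.5, which is the smoothing effect described above.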
A trailing average is easy to do in Prometheus graphs as you're
exploring things; just set your time interval either to some
reasonable value for your purpose or to a value sufficiently larger
than your step interval. If your time interval starts approaching
your step interval (or becomes smaller than it), you've probably
zoomed out to too large a time range. This applies both to explicit
averages over time and to rate(), which is effectively a per second
average over the time range.
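As a rough illustration of why rate() over a longer time interval behaves like a trailing average, here is a simplified Python sketch. It is not Prometheus's actual algorithm: the real rate() also handles counter resets and extrapolates to the edges of the range, both of which this ignores. The function name and the sample data are hypothetical.

```python
def simple_rate(samples, eval_time, window_seconds):
    """Per-second average increase of a counter over a trailing window.

    samples is a list of (timestamp_seconds, counter_value) pairs sorted
    by time. Only samples with eval_time - window_seconds <= t <= eval_time
    are used. Returns None if fewer than two samples fall in the window.
    Simplification: no counter reset handling and no extrapolation.
    """
    start = eval_time - window_seconds
    in_window = [(t, v) for t, v in samples if start <= t <= eval_time]
    if len(in_window) < 2:
        return None
    (t_first, v_first), (t_last, v_last) = in_window[0], in_window[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical counter of received messages, scraped every 60 seconds.
samples = [(0, 0), (60, 10), (120, 10), (180, 40), (240, 40), (300, 45)]

# A short window follows every burst and lull.
spiky_high = simple_rate(samples, eval_time=180, window_seconds=60)
spiky_low = simple_rate(samples, eval_time=240, window_seconds=60)

# A long window averages the bursts away.
smooth = simple_rate(samples, eval_time=300, window_seconds=300)
print(spiky_high, spiky_low, smooth)
```

With the one minute window the result jumps from 0.5 messages per second to 0 a minute later, while the five minute window reports a steady 0.15.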
I'm not sure how to do good moving averages in Grafana for dashboards
that you want to work over broad ranges in time. If you set a fixed
time interval, it will be too small if people expand the dashboard's
time range far enough. Grafana's special $__rate_interval
doesn't expand anywhere near wide enough, and as far as I know you
can't do math on Grafana variables like $__interval or get a
minimum of it and something else. My overall conclusion is that if
I use moving averages in a Grafana graph, I'll probably have to say
that it only works well for time ranges under some value (and I'll
have to poke around to find out what that value is, since the step
interval Grafana will use depends on several factors).
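One way to get a first guess at that value is a back-of-the-envelope calculation. The Python sketch below assumes that the step interval is roughly the dashboard time range divided by the panel's maximum number of data points, with a floor at a minimum interval. That is only an approximation of what Grafana does (among other things, it ignores rounding to "nice" values). The 1000 data points, the 15 second minimum and the requirement that the window be at least four steps wide are all assumptions made for the example, and both function names are hypothetical.

```python
def approx_step_seconds(range_seconds, max_data_points, min_interval_seconds=15):
    """Rough step estimate: time range divided by the number of points,
    never below the minimum interval. Assumed model, not Grafana's code."""
    return max(range_seconds / max_data_points, min_interval_seconds)


def max_useful_range_seconds(window_seconds, max_data_points, ratio=4):
    """Largest time range for which a fixed averaging window is still
    at least `ratio` times the approximate step interval."""
    return window_seconds / ratio * max_data_points


# A fixed 30 minute window on a panel assumed to draw 1000 points.
limit = max_useful_range_seconds(window_seconds=30 * 60, max_data_points=1000)
print(limit / 86400, "days")
print(approx_step_seconds(limit, 1000), "second step at that range")
```

Under these assumptions a 30 minute window stays comfortably wider than the step up to a time range of a bit over five days. The real limit still has to be checked against what the dashboard actually does.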
(In general moving averages now feel like something I should pay more attention to and make more use of, although I don't want to get too enthusiastic right now and add too many moving average graphs to our dashboards.)