Moving averages (and rates) for metrics in Prometheus (and Grafana)

September 30, 2021

Today, I wanted to check if our external mail gateway (which uses a TLS certificate from Let's Encrypt) was seeing a drop in incoming traffic, since various problems are cropping up as a result of a Let's Encrypt related TLS root certificate expiry. We extract metrics from our Exim logs using mtail, so the raw information was available in metrics, but we get a sufficiently modest amount of external email that our natural external mail arrival rate is quite variable from minute to minute.

When I do metrics graphs of rates or averages (either in Grafana for dashboards or in Prometheus to explore things), I normally set the time interval for things like rate() to the step interval. In Grafana I use $__interval, while in Prometheus I will see what it reports the step interval as and then copy that into the time interval. This means that my graphs are (theoretically) continuous but without oversampling. However, this doesn't work well with metrics like our email arrival rate. If you look at a relatively moderate time range like a few hours (such as '8 am to now'), you have a small time interval and a spiky graph that's hard to see long term trends in; if you look at a long time range (a few days), moderate duration trends can disappear into the mush of long time intervals.

So I had what felt like a clever idea: why not use a time interval that was significantly longer than the step interval for once (here 30 minutes or an hour)? I felt this would smooth out short term variability and reveal longer term trends more clearly. When I tried it out, it gave me a more or less readable graph that suggested no particularly visible drops in incoming email volume.

I had just used a moving average, which is a well known statistical technique to "smooth out short-term fluctuations and highlight longer-term trends or cycles", to quote from Wikipedia. Specifically I had a simple moving average, in the form that is sometimes called a trailing average (since it only looks backward from your starting point, instead of looking on either side of it).
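As a minimal sketch of what a trailing simple moving average does, here is a small Python version. The function name and the sample numbers are made up for illustration; they are not our real mail metrics.

```python
from collections import deque


def trailing_average(values, window):
    """Return the trailing simple moving average of values.

    Each output point averages the current value and up to window - 1
    values before it. It never looks ahead of the current point.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque()   # the values currently inside the window
    total = 0.0        # running sum of the values in 'recent'
    averages = []
    for value in values:
        recent.append(value)
        total += value
        if len(recent) > window:
            total -= recent.popleft()
        averages.append(total / len(recent))
    return averages


# Hypothetical per-minute message counts, invented for this example.
per_minute = [4, 0, 9, 1, 7, 2, 8, 0]
smoothed = trailing_average(per_minute, 4)
```

The first few output points average fewer than four values, because there is not yet a full window of history behind them.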

A trailing average is easy to do in Prometheus graphs as you're exploring things; just set your time interval either to some reasonable value for your purpose or to a value comfortably larger than your step interval. If the step interval starts approaching your time interval (or becomes larger than it), you've probably zoomed out to too large a time range. This applies to both explicit averages over time and to rate(), which is effectively a per second average over the time range.
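To illustrate the idea that a rate is a per second average over a trailing time range, here is a rough Python sketch. It is deliberately simplified and is not how Prometheus implements rate(): it ignores counter resets and does no extrapolation to the edges of the window, both of which the real rate() handles. The function names and the sample data are hypothetical.

```python
def trailing_rate(samples, end_time, window_seconds):
    """Per-second increase of a counter over the window ending at end_time.

    samples is a list of (timestamp_seconds, counter_value) pairs sorted
    by timestamp. Simplified: no counter reset handling, no extrapolation.
    Returns None if fewer than two samples fall inside the window.
    """
    start = end_time - window_seconds
    in_window = [(t, v) for t, v in samples if start <= t <= end_time]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def graph_points(samples, start, end, step_seconds, window_seconds):
    """Evaluate trailing_rate once per step, the way a graph would."""
    points = []
    t = start
    while t <= end:
        points.append((t, trailing_rate(samples, t, window_seconds)))
        t += step_seconds
    return points


# Hypothetical counter scraped every 60 seconds for an hour,
# increasing by 30 per scrape (so 0.5 per second).
samples = [(i * 60, i * 30) for i in range(61)]

# A 30 minute window evaluated every 5 minutes over the second half hour.
points = graph_points(samples, 1800, 3600, 300, 1800)
```

With a 30 minute window and a 5 minute step, each graph point shares most of its window with its neighbours, which is what makes the result a moving average. If the step were larger than the window, some samples would fall between windows and never contribute to any point.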

I'm not sure how to do good moving averages in Grafana for dashboards that you want to work over broad ranges in time. If you set a fixed time interval, it will be too small if people expand the dashboard's time range far enough. Grafana's special $__rate_interval doesn't expand anywhere near wide enough, and as far as I know you can't do math on Grafana variables like $__interval or get a minimum of it and something else. My overall conclusion is that if I use moving averages in a Grafana graph, I'll probably have to say that it only works well for time ranges under some value (and I'll have to poke around to find out what that value is, since the step interval Grafana will use depends on several factors).
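For a rough first guess at that value, here is a back of the envelope Python sketch. It assumes a simplified model in which the step interval is the dashboard time range divided by the number of points in the graph, with a floor at some minimum step. The real step interval Grafana picks depends on more than this, so treat the result as a starting point for poking around and not as an answer. All of the names and default values here are hypothetical.

```python
def approx_step_seconds(range_seconds, max_data_points, min_step_seconds=15):
    """Rough guess at the step: range divided by point count, with a floor.

    This is an assumed model, not Grafana's actual calculation.
    """
    return max(range_seconds / max_data_points, min_step_seconds)


def max_range_for_window(window_seconds, max_data_points,
                         headroom=2.0, min_step_seconds=15):
    """Largest time range where the window is still at least
    headroom times the approximate step, under the model above.

    Returns 0.0 if the window is too small even at the minimum step.
    """
    if window_seconds < headroom * min_step_seconds:
        return 0.0
    return window_seconds * max_data_points / headroom


# Assumed example: a 30 minute window on a graph with about 1000 points,
# wanting the window to be at least twice the step.
limit = max_range_for_window(30 * 60, 1000)
limit_days = limit / 86400
```

Under these assumed numbers, the limit works out to a bit over ten days; a graph with fewer points or a demand for more headroom shrinks it proportionally.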

(In general moving averages now feel like something I should pay more attention to and make more use of, although I don't want to get too enthusiastic right now and add too many moving average graphs to our dashboards.)

Written on 30 September 2021.

Last modified: Thu Sep 30 21:40:20 2021