2022-07-31
Using Prometheus's recent '@ end()' PromQL feature to reduce graph noise
Modern versions of Prometheus support a special '@' time modifier
on PromQL queries.
This lets you evaluate part of a query at a specific, fixed
time, rather than at either the 'now' of an instant query or at
every step of a range query. In addition to
literal times, it can use two special time functions, 'start()'
and 'end()', which evaluate to the start and end of a range query.
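As a rough sketch of the three forms, here are three separate expressions. The metric name thing_bytes is a placeholder, and the literal timestamp and the 1h range are arbitrary values picked for illustration:

```promql
# Evaluate at a literal Unix timestamp (2022-07-31 00:00:00 UTC).
thing_bytes @ 1659225600

# Evaluate at the start of the range query, no matter which step is being computed.
thing_bytes @ start()

# On a range vector selector, the modifier goes after the [range].
rate(thing_bytes[1h] @ end())
```

In an instant query, start() and end() both resolve to the query's evaluation time.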
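The @ modifier was introduced (as a then-experimental feature) in
the Prometheus blog post Introducing the '@' Modifier, which
suggested using it with 'topk()'. The idea here is that this makes
it easy to graph the top N (or bottom N, or some other ranking) of
things. Because the ranking is evaluated once, over the whole range,
the graph shows the same N time series at every step instead of
whichever ones happen to rank highest at each individual step.
In terms of Grafana's $__interval and $__range,
you'd write a PromQL expression structured like: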
rate(thing_bytes[$__interval]) and topk(10, rate(thing_bytes[$__range] @ end() ) )
This is a nice usage trick and well worth remembering in any context where you're limiting a graph to only the most interesting N things.
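The same structure should carry over to other rankings. For example, a sketch of the bottom 10 using bottomk(), where both the 10 and thing_bytes are placeholders as before:

```promql
# Graph only the 10 series with the lowest average rate over the whole range.
rate(thing_bytes[$__interval]) and bottomk(10, rate(thing_bytes[$__range] @ end()))
```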
Another case where I use '@ end()' is when I have a metric with a lot of time series, many of which will have no activity over the time range. For example, you might have systems with network interfaces that are only active some of the time, or software RAID arrays (or disks) that are frequently idle (or that have too low volume to be interesting). Here, you can use the same approach to include in your graph only the time series that are active over the time period, with a PromQL query structured like:
rate(thing_bytes[$__interval]) and ( rate(thing_bytes[$__range] @ end()) > 0 )
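If the criterion is 'too low volume to be interesting' rather than no activity at all, the comparison can use a threshold instead of zero. This is a sketch only; the 1024 bytes per second cutoff is an arbitrary value chosen for illustration and would need tuning for a real metric:

```promql
# Keep only series whose average rate over the whole range exceeds 1024 bytes/sec.
rate(thing_bytes[$__interval]) and ( rate(thing_bytes[$__range] @ end()) > 1024 )
```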
You could always do something to exclude such uninteresting time series in PromQL and Grafana, but in my view this new approach is the best one.
This approach is not flawless. For a start, you have to write the query twice. Another issue is that if you refresh your Grafana dashboard, a new time series may suddenly show up on the panel. In a sense this is fair, because whatever it is now has enough activity to be interesting to you. In practice it can be a bit annoying to have things jump around because the number of time series in a panel keeps changing. This is especially likely to be an issue if you have Grafana dashboards that are set to automatically update themselves every so often.
(It also assumes implicitly that something having no activity over the time range isn't interesting to you. This isn't always the case. But you have to make choices about what information to present on dashboards; you often can't fit in everything.)