Wandering Thoughts archives

2022-07-31

Using Prometheus's recent '@ end()' PromQL feature to reduce graph noise

Modern versions of Prometheus support a special '@' time modifier in PromQL queries. It lets you evaluate part of a query at a specific, fixed time, rather than at the 'now' of an instant query or at every step of a range query. In addition to literal times (Unix timestamps), the modifier accepts two special time functions, 'start()' and 'end()', which evaluate to the start and end of a range query.
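
As a small illustration of the syntax, here is a sketch using the placeholder 'thing_bytes' metric from the examples in this entry. The timestamp and the 5m window are arbitrary values picked for the sketch, and each expression is a separate query:

```promql
# The value of each series at a fixed Unix timestamp.
thing_bytes @ 1609746000

# The per-second rate over the 5 minutes ending at that timestamp.
rate(thing_bytes[5m] @ 1609746000)

# In a range query: the rate over the last 5 minutes of the range,
# returned unchanged at every step.
rate(thing_bytes[5m] @ end())

# The same, but for the 5 minutes ending at the start of the range.
rate(thing_bytes[5m] @ start())
```

In an instant query there is no range, so 'start()' and 'end()' both evaluate to the query's evaluation time.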

The @ modifier was introduced (as a then-experimental feature) in the Prometheus blog post Introducing the '@' Modifier, which suggested using it with 'topk()'. Because the 'topk()' is evaluated once, at the end of the range, it picks the same set of time series for every step of the graph, which makes it easy to graph the top N (or bottom N, or some other ranking) of things. In terms of Grafana's $__interval and $__range, you'd write a PromQL expression structured like:

rate(thing_bytes[$__interval]) and
  topk(10, rate(thing_bytes[$__range] @ end()))

This is a nice usage trick and well worth remembering in any context where you're limiting a graph to only the most interesting N things.
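
The same structure works for other rankings. The 'and' operator keeps only those left-hand series whose labels match a series on the right, and the right-hand side is the same at every step because of '@ end()', so changing the ranking function changes which series are drawn without making them come and go within the graph. A sketch, where the 5m rate window in the second query is an arbitrary assumption:

```promql
# The 10 least active things over the whole dashboard range.
rate(thing_bytes[$__interval]) and
  bottomk(10, rate(thing_bytes[$__range] @ end()))

# The 10 things with the highest peak rate (rather than the highest
# average) over the range, using a subquery evaluated at the end.
rate(thing_bytes[$__interval]) and
  topk(10, max_over_time(rate(thing_bytes[5m])[$__range:] @ end()))
```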

I also use '@ end()' when I have a metric with a lot of time series, many of which have no activity over the graph's time range. For example, you might have systems with network interfaces that are only active some of the time, or software RAID arrays (or disks) that are frequently idle (or that have too low a volume to be interesting). Here, you can use the same approach to include in your graph only the time series that were active over the time period, with a PromQL query structured like:

rate(thing_bytes[$__interval]) and
  ( rate(thing_bytes[$__range] @ end()) > 0 )

It was always possible to exclude such uninteresting time series with some combination of PromQL and Grafana features, but in my view this new approach is the best one.
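
If 'active' should mean more than 'not completely idle', the comparison can use a threshold instead of zero. In this sketch the 1024 (bytes per second, averaged over the whole range) is an arbitrary cutoff chosen for illustration, not a recommendation:

```promql
# Only graph series that averaged more than 1 KiB/sec over the range.
rate(thing_bytes[$__interval]) and
  ( rate(thing_bytes[$__range] @ end()) > 1024 )
```

Since the rate is averaged over the entire range, a short burst of activity in a long range may fall below the cutoff. If that matters, a filter on the peak rate, for example with 'max_over_time()' over a subquery, may be a better fit.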

This approach is not flawless. For a start, you have to write the query twice. Another issue is that when you refresh your Grafana dashboard, a new time series may suddenly show up on the panel (or an existing one may vanish), because the time range has moved. In a sense this is fair, because whatever it is now has enough activity to be interesting to you. In practice it can be a bit annoying to have things jump around because the number of time series in a panel keeps changing. This is especially likely to be an issue if you have Grafana dashboards that are set to refresh automatically every so often.

(It also assumes implicitly that something having no activity over the time range isn't interesting to you. This isn't always the case. But you have to make choices about what information to present on dashboards; you often can't fit in everything.)

sysadmin/PrometheusAtEndQueryUse written at 22:58:05

