Some notes on Grafana annotations sourced from Prometheus metrics

August 22, 2022

Grafana Annotations have long been one of those 'I should look into this sometime' Grafana features that seemed potentially useful but not immediately compelling, and also a bunch of work to set up. Recently I learned (or re-learned) that you can dynamically generate annotations from Prometheus metrics and other data sources, and spent some time experimenting with this, not always successfully. As a result, I have some notes and some opinions. I'll start with the bad news.

Grafana has two sorts of annotations, basic ones that are a single point in time (for example, 'a new configuration was deployed at this time') and region annotations, which cover a span of time (for example, 'an alert was firing'). Unfortunately, you can't currently generate region annotations from Prometheus metrics; if you try, for example by setting an annotation on the Prometheus 'ALERTS' metric (as the Grafana UI for Prometheus based annotations will lead you to try), the results are unpleasant. The only Prometheus based annotations you can use are single point in time ones. Generally this means that you want a Prometheus metric that is the time something happened, such as when a host rebooted (ie, node_boot_time_seconds) or an alert started (the ALERTS_FOR_STATE metric). Because Grafana deals in milliseconds, you need to multiply these 'time in seconds' metrics by 1000. There's a helpful tooltip in the Grafana UI to remind you of this.
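As a hedged sketch of such a point-in-time annotation query, here is what one for alert start times might look like using ALERTS_FOR_STATE, whose value is (as I understand it) the Unix timestamp in seconds when the alert became active; the alertname label here is a made-up example:

```promql
# ALERTS_FOR_STATE's value is a time in seconds, so multiply by
# 1000 to get the milliseconds that Grafana expects.
ALERTS_FOR_STATE{alertname="HostDown"} * 1000
```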

(You may be able to create a time series that's only present at the point where something has happened with creative use of PromQL functions like changes() and resets(), combined with '> 0' and the like to filter out most data points entirely. However, I haven't tried to build any such annotation queries yet.)
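One untested sketch of this idea uses changes() over a short trailing window, filtered with '> 0' so that points only appear around the time something changed (note that this emits points for the whole trailing window, not a single instant, and the metric and job label here are just illustrative):

```promql
# Hypothetical: flag points where a process's start time changed
# within the last five minutes, ie where it restarted. The '> 0'
# filter drops every other data point entirely.
changes(process_start_time_seconds{job="node"}[5m]) > 0
```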

Once I had working annotations for things like host reboots and the start of host alerts, I discovered that having Prometheus based annotations present makes many Grafana panel types noticeably less responsive at first, with issues like laggy mouse cursor tooltips. This does go away after a while if you let your dashboard sit, but if it refreshes (for example if you have it set to refresh every minute), the lag comes back. This unpleasant lag means that in practice, you need to be able to turn every annotation on or off, and so far I'm defaulting all of ours to off.

(I suspect that this lag issue isn't present for native Grafana annotations, which presumably are sent to the frontend as either a single point in time or as a defined, straightforward range.)

In my unscientific experimentation, the most helpful thing to reduce or eliminate the lag is for your annotation query to return as few data points as possible, ideally none. One way to do this for Prometheus metrics that are a time in seconds is to simply make them check to see if they can possibly be in range based on the start time of the Grafana dashboard (which Grafana provides in global variables). This means writing queries like:

( node_boot_time_seconds {host="$host"} * 1000 ) >= ${__from}

If you're looking at a dashboard time range that doesn't have any reboots in it (which is the common case for us), your annotation query will generate no time series points and everything remains pretty snappy.

An unfortunate limitation of annotations and annotation queries right now is that you can't group multiple queries together so that they can all be turned on and off by a single control. Since you do need to be able to turn annotation queries on and off (due to their performance impact), this can lead to a profusion of on/off controls that eat up the limited space at the top of your dashboards. In turn this makes it annoying to have too many annotation queries on one dashboard; you may need to pick and choose which ones are the most important.

One option to deal with this is to create synthetic merged metrics with recording rules. For example, if you have one metric for the start time of ZFS pool scrubs and another for the end time, you could create a recording rule to make a new metric that merges them together using a new label, creating say:

zfs_pool_scan_mark_time_seconds { type="start|end" .... } ...

You can then make one query for this new metric in your dashboards and put a '{{type}}' into the annotation text somewhere to tell people whether this is the start or the end of the pool scrub, and control all of this with one toggle instead of two.
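A sketch of what such recording rules might look like, assuming hypothetical zfs_pool_scan_start_time_seconds and zfs_pool_scan_end_time_seconds source metrics (the label_replace() idiom with an empty source label and empty regex simply attaches a fixed label):

```yaml
groups:
  - name: zfs-scrub-times
    rules:
      # Both rules record into the same metric name; the 'type'
      # label keeps the resulting time series distinct.
      - record: zfs_pool_scan_mark_time_seconds
        expr: label_replace(zfs_pool_scan_start_time_seconds, "type", "start", "", "")
      - record: zfs_pool_scan_mark_time_seconds
        expr: label_replace(zfs_pool_scan_end_time_seconds, "type", "end", "", "")
```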

(You can also do a more awkward version of this in a query by using label_replace() and an or to join the two metrics together. But that gets verbose and hard to keep track of very rapidly; I'd rather consider a recording rule. At least I can put comments on a recording rule, unlike queries in Grafana.)
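For illustration, the verbose query-only version might look something like this, again assuming hypothetical start and end time metrics:

```promql
# Each side gets a fixed 'type' label, both so the annotation text
# can use {{type}} and so the 'or' keeps both sets of series.
  label_replace(zfs_pool_scan_start_time_seconds{host="$host"} * 1000,
                "type", "start", "", "")
or
  label_replace(zfs_pool_scan_end_time_seconds{host="$host"} * 1000,
                "type", "end", "", "")
```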

PS: I'm also forming some opinions on Grafana annotations in general, but those are for a future entry.
