A gotcha with stale metrics and *_over_time() in Prometheus

April 28, 2019

We monitor the available disk space on our fileservers through custom metrics, and we allow people to set alerts at various levels of low disk space, which are evaluated through a Prometheus alert rule. Because several people can each set alerts on the same filesystem, we have to use group_right in the alert expression. A simplified version of this expression is:

cslab_avail_bytes <= on (filesystem)
  group_right(fileserver, pool) cslab_alert_minfree
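
To sketch why group_right is needed: there is only one cslab_avail_bytes series per filesystem on the left side, but there can be several cslab_alert_minfree series for that filesystem on the right (the 'user' label, the label values, and the numbers here are all invented for illustration):

cslab_avail_bytes{filesystem="/h/103", fileserver="fs1", pool="tank"}   5e+10
cslab_alert_minfree{filesystem="/h/103", user="alice"}                  1e+11
cslab_alert_minfree{filesystem="/h/103", user="bob"}                    2e+10

Because the 'many' side is on the right, we need group_right; listing fileserver and pool in it copies those labels from the left side onto the results (and so onto any alerts).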

Every so often we migrate a filesystem between fileservers (and ZFS pools). For a long time now, for a relatively short while after each migration, we have been getting a mysterious error from Prometheus about this alert rule:

err="found duplicate series for the match group {filesystem=\"/h/103\"} on the left hand-side of the operation: [...]; many-to-many matching not allowed: matching labels must be unique on one side

The two duplicate series Prometheus would report were old fileserver and new fileserver versions of the filesystem. What made this mysterious to me was that these two time series never existed at the same time. In every migration, the old time series would end and then a few minutes later the new time series would appear (after the filesystem was migrated, remounted on the machine that generates these metrics, and the metrics generation ran). What it looked like to me was that Prometheus was improperly using a stale metric in alert rule evaluation for some reason.

The explanation ultimately turned out to be very simple. You see, I have simplified the alert expression a little too much, omitting something that I didn't realize made a crucial difference. The real version is more like this:

avg_over_time(cslab_avail_bytes[15m]) <= on (filesystem)
  group_right(fileserver, pool) cslab_alert_minfree

The avg_over_time() and its range expression make all the difference. Even though the old fileserver's time series is stale now, it was non-stale at some point a few minutes in the past, so the range expression dutifully sweeps it up and avg_over_time() produces a result for it. That creates a duplicate series on the left side of the match, and Prometheus quite correctly errors out.
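
As a concrete sketch (the fileserver and pool names here are invented), right after a migration a query like

avg_over_time(cslab_avail_bytes{filesystem="/h/103"}[15m])

can return two series, one from the old fileserver and one from the new:

{filesystem="/h/103", fileserver="oldfs", pool="oldpool"}   ...
{filesystem="/h/103", fileserver="newfs", pool="newpool"}   ...

Both had samples at some point inside the 15-minute range, so both show up, and the match group {filesystem="/h/103"} is no longer unique on the left side.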

This applies to all the *_over_time() functions, of course, and in general to any range vector expression; they will sweep up metrics that were valid at any point over the time range, not just metrics that are valid now (well, at the end of the period). If you want to restrict your range vector or *_over_time() to only metrics that are non-stale at the end of the time period, you need to say so explicitly by using an 'and' operator:

(avg_over_time(cslab_avail_bytes[15m]) and cslab_avail_bytes) [...]
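
Spelled out for our alert rule, the full expression would look something like this (a sketch that simply combines the 'and' with the original expression):

(avg_over_time(cslab_avail_bytes[15m]) and cslab_avail_bytes) <= on (filesystem)
  group_right(fileserver, pool) cslab_alert_minfree

The 'and' keeps only the avg_over_time() results that have a matching, currently non-stale cslab_avail_bytes series, which filters out the old fileserver's version.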

Depending on your particular circumstances, this could have some subtle effects on such 'migrating' metrics. Early on, the new, non-stale metric will have only a few valid data points within the time range, and I believe that the *_over_time() will happily use only these available data points to compute your average or whatever. That could mean that you are actually averaging over a much smaller time period than you think.

For our purposes, the easier and more straightforward fix is to remove the 'avg_over_time()' and use a 'for: 15m' on the alert rule instead. This doesn't have quite the same effect, but the effects it does have are probably easier to explain to people who are setting and getting low disk space alerts.
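
As a sketch, with a made-up rule group and alert name, the rule then looks something like:

groups:
  - name: cslab_diskspace
    rules:
      - alert: CslabLowDiskSpace
        expr: |
          cslab_avail_bytes <= on (filesystem)
            group_right(fileserver, pool) cslab_alert_minfree
        # only fire once the condition has been true at every
        # evaluation for 15 straight minutes
        for: 15m

With 'for:', the instantaneous condition has to hold continuously for 15 minutes before the alert fires, rather than the 15-minute average having to be under the threshold.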

Once I thought about it, it was intuitive that range vectors and things that use them will return values for (currently) stale time series, but it simply hadn't occurred to me until I ran into this. In fact I was so oblivious to the possibility that when I asked the Prometheus mailing list about this, I provided the first, over-simplified version of the alert rule, because I didn't think the avg_over_time() mattered. As we see, it very much does.

PS: I think this means that if you have an alert rule using a range vector (in a *_over_time() or otherwise), and the alert is currently firing, and you deal with it by removing the underlying metric (perhaps it's coming from an outdated source), the alert may continue firing for a while because the range vector will keep picking up the old values of the now-stale metric until they fall outside of its time range. But probably not very many people deal with spurious alerts by stopping monitoring the metric source.

(With us, it can happen if a machine is converted from something we're monitoring into an experimental machine that may have wild stuff done to it. We tend not to monitor experimental machines at all, so our natural reaction is to remove the existing monitoring when we get reminded of the machine's change in purpose.)
