A gotcha with stale metrics and *_over_time() in Prometheus
We monitor the available disk space on our fileservers through custom metrics, and we
allow people to set alerts on various levels of low disk space,
which are evaluated through a Prometheus alert rule. Because several
people are each allowed to set alerts on the same filesystem, we
have to use group_right
in the alert expression. A simplified
version of this expression is:
cslab_avail_bytes <= on (filesystem) group_right(fileserver, pool) cslab_alert_minfree
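(As a sketch of why this is many-to-one matching: several cslab_alert_minfree series can match a single cslab_avail_bytes series for a filesystem. The extra 'person' label and all of the label values here are invented for illustration:

cslab_avail_bytes{filesystem="/h/103", fileserver="fs1", pool="tank1"}
cslab_alert_minfree{filesystem="/h/103", person="alice"}
cslab_alert_minfree{filesystem="/h/103", person="bob"}

The group_right copies the fileserver and pool labels from the left side onto each matching result.)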
Every so often we migrate a filesystem between fileservers (and ZFS pools). For a long time now, for a relatively short while after each migration, we have been getting a mysterious error from Prometheus about this alert rule:
err="found duplicate series for the match group {filesystem=\"/h/103\"} on the left hand-side of the operation: [...]; many-to-many matching not allowed: matching labels must be unique on one side
The two duplicate series Prometheus would report were old fileserver and new fileserver versions of the filesystem. What made this mysterious to me was that these two time series never existed at the same time. In every migration, the old time series would end and then a few minutes later the new time series would appear (after the filesystem was migrated, remounted on the machine that generates these metrics, and the metrics generation ran). What it looked like to me was that Prometheus was improperly using a stale metric in alert rule evaluation for some reason.
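To sketch what those two series looked like (the fileserver and pool names here are made up), they differed only in where the filesystem lived:

cslab_avail_bytes{filesystem="/h/103", fileserver="oldfs", pool="oldpool"}
cslab_avail_bytes{filesystem="/h/103", fileserver="newfs", pool="newpool"}

The first stopped being reported when the filesystem was migrated away, and the second only appeared a few minutes later.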
The explanation ultimately turned out to be very simple. You see, I have simplified the alert expression a little too much, omitting something that I didn't realize made a crucial difference. The real version is more like this:
avg_over_time(cslab_avail_bytes[15m]) <= on (filesystem) group_right(fileserver, pool) cslab_alert_minfree
The avg_over_time and its range expression make all the difference. Even though the old fileserver's time series is stale now, it was non-stale at some point a few minutes in the past, and so the range expression dutifully sweeps it up and avg_over_time produces a result for it, creating a duplicate, and then Prometheus quite correctly errors out.
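One way to see this is to compare the instant vector with the range vector version shortly after a migration. As a sketch of what I believe the query interface would show:

cslab_avail_bytes{filesystem="/h/103"}
  (one result: only the new fileserver's series is non-stale)
avg_over_time(cslab_avail_bytes{filesystem="/h/103"}[15m])
  (two results: both the old and the new fileserver's series had samples within the last 15 minutes)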
This applies to all *_over_time() things, of course, and in
general to any range vector expression; they will sweep up metrics
that were valid at any point over the time range, not just metrics
that are valid now (well, at the end of the period). If you want
to restrict your range vector or *_over_time() to only metrics
that are non-stale at the end of the time period, you need to say
so explicitly, by using an 'and' operator:
(avg_over_time(cslab_avail_bytes[15m]) and cslab_avail_bytes) [...]
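Putting this together with the real alert expression from earlier, I believe the full fixed version would look something like:

(avg_over_time(cslab_avail_bytes[15m]) and cslab_avail_bytes) <= on (filesystem) group_right(fileserver, pool) cslab_alert_minfree

The 'and' here matches on all labels, so it keeps only the avg_over_time() results whose underlying series is still non-stale right now.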
Depending on your particular circumstances, this could have some subtle effects on such 'migrating' metrics. Early on, the new, non-stale metric will have only a few valid data points within the time range, and I believe that the *_over_time() will happily use only these available data points to compute your average or whatever. That could mean that you are actually averaging over a much smaller time period than you think.
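If you want to see how much data is actually in the window for each series, I believe a count_over_time() query along these lines would show it (this is an illustration, not something we actually do):

count_over_time(cslab_avail_bytes[15m])

A series that has only just (re)appeared will report only a few samples here until it has been present for the whole fifteen minutes.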
For our purposes, the easier and more straightforward fix is to
remove the 'avg_over_time()' and use a 'for: 15m' on the alert
rule instead. This doesn't have quite the same effect, but the
effects it does have are probably easier to explain to people who
are setting and getting low disk space alerts.
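As a hedged sketch of what such an alert rule might look like in a rule file (the group name and alert name are invented; this isn't our actual configuration):

groups:
  - name: diskspace
    rules:
      - alert: LowDiskSpace
        expr: cslab_avail_bytes <= on (filesystem) group_right(fileserver, pool) cslab_alert_minfree
        for: 15m

With 'for: 15m', a filesystem has to stay below someone's threshold for fifteen straight minutes before the alert fires, instead of having its average over the past fifteen minutes be below the threshold.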
Once I thought about it, it was intuitive that range vectors and things that use them will return values for (currently) stale time series, but it hadn't occurred to me before I ran into it here. In fact I was so oblivious to the possibility that when I asked the Prometheus mailing list about this, I provided the first, over-simplified version of the alert rule, because I didn't think the avg_over_time() mattered. As we see, it very much does.
PS: I think this means that if you have an alert rule using a range vector (in a *_over_time() or otherwise), and the alert is currently firing, and you deal with it by removing the underlying metric (perhaps it's coming from an outdated source), the alert may continue firing for a while because the range vector will keep picking up the old values of the now-stale metric until they fall outside of its time range. But probably not very many people deal with spurious alerts by stopping monitoring the metric source.
(With us, it can happen if a machine is converted from something we're monitoring into an experimental machine that may have wild stuff done to it. We tend not to monitor experimental machines at all, so our natural reaction is to remove the existing monitoring when we get reminded of the machine's change in purpose.)