A Prometheus gotcha with alerts based on counting things
Suppose, not entirely hypothetically, that you have some backup servers that use swappable HDDs as their backup media and expose that 'media' as mounted filesystems. Because you keep swapping media around, you don't automatically mount these filesystems and when you do manually try to mount them, it's possible to have some missing (if, for example, a HDD didn't get fully inserted and engaged with the hot-swap bay). To deal with this, you'd like to write a Prometheus alert for 'not all of our backup disks are mounted'. At first this looks simple:
count( node_filesystem_size_bytes{ host = "backupserv", mountpoint =~ "/dumps/tapes/slot.*" } ) != <some number>
This will work fine most of the time and then one day it will fail to alert you to the fact that none of the expected filesystems are mounted. The problem is the usual one of PromQL's core nature as a set-based query language (we've seen this before). As long as there's at least one HDD 'tape' filesystem mounted, you can count them, but once there are none, the result of counting them is not 0 but nothing. As a result this alert rule won't produce any results when there are no 'tape' filesystems on your backup server.
Unfortunately there's no particularly good fix, especially if you have multiple identical backup servers and so the real version uses 'host =~ "bserv1|bserv2|..."'. In the single-host case, you can use either absent() or vector() to provide a default value. There's no good solution in the multi-host case, because there's no version of vector() that lets you set labels. If there was, you could at least write:
count( ... ) by (host) or vector(0, "host", "bserv1") or vector(0, "host", "bserv2") ....
(Technically you can set labels via label_replace(). Let's not go there; it's a giant pain for simply adding labels, especially if you want to add more than one.)
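For illustration, here is a rough sketch of what the default-value approach might look like in both cases. This is not what we actually run. The '<some number>' placeholder is the same one as in the first example, and the multi-host version assumes just the two host names bserv1 and bserv2, so a real one would need one label_replace() line for every backup server.

```promql
# Single host: this count() result has no labels, so a label-less
# vector(0) fills in when no series match. The outer parentheses
# matter, because 'or' binds less tightly than '!='.
(
  count( node_filesystem_size_bytes{ host = "backupserv", mountpoint =~ "/dumps/tapes/slot.*" } )
  or vector(0)
) != <some number>

# Multiple hosts: one label_replace() per host to fake a labelled zero.
# With an empty source label and an empty regex, label_replace() simply
# sets 'host' to the given value.
(
  count( node_filesystem_size_bytes{ host =~ "bserv1|bserv2", mountpoint =~ "/dumps/tapes/slot.*" } ) by (host)
  or label_replace(vector(0), "host", "bserv1", "", "")
  or label_replace(vector(0), "host", "bserv2", "", "")
) != <some number>
```

The 'or' only adds a zero for a host when the count() side has no result with that same 'host' label, so hosts that do have mounted 'tape' filesystems keep their real counts.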
In my particular case, our backup servers always have some additional filesystems (like their root filesystem), so I can write a different version of the count() based alert rule:
count( node_filesystem_size_bytes{ host =~ "bserv1|bserv2|...", fstype =~ "ext.*" } ) by (host) != <other number>
In theory this is less elegant because I'm not counting exactly what I care about (the number of 'tape' filesystems that are mounted) but instead something more general and potentially more variable (the number of extN filesystems that are mounted) that contains various assumptions about the systems. In practice the number is just as fixed as the number of 'tape' filesystems, and the broader set of labels will always match something, producing a count of at least one for each host.
(This would change if the standard root filesystem type changed in a future version of Ubuntu, but if that happened, we'd notice.)
PS: This might sound all theoretical and not something a reasonably experienced Prometheus person would actually do. But I'm writing this entry partly because I almost wrote a version of my first example as our alert rule, until I realized what would happen when there were no 'tape' filesystems mounted at all, which is something that happens from time to time for reasons outside the scope of this entry.