Setting alerts is a chance to figure out what you really care about
Our ZFS fileservers divide up their data disks into standard-sized partitions, which are grouped into mirrored pairs in ZFS pools. Our custom spares handling system can use any available unused partition to replace a partition on a failed disk, so we don't need to keep a specific disk spare in case of a disk failure; we just need enough spare partitions. A few years ago we arranged to push a count of how many spare partitions each fileserver had into our metrics system and to trigger an alert if that count dropped too low on any fileserver, where 'too low' is currently 4 partitions (enough to automatically replace any single failed disk). Today that alert duly triggered to inform us that we had '0' spare partitions on a new 22.04 fileserver we had just installed. This was because said fileserver didn't have any data disks at all; no data disks means no spare partitions, and also no used partitions to need sparing out.
My quick solution was to also push the total number of data disk partitions into Prometheus and then only alert on too few spares if we had partitions at all. However, we're in the process of upgrading fileservers from 2 TB SATA SSDs to 4 TB SATA SSDs, which have eight standard-sized partitions instead of four, so soon a mere four spare partitions will be inadequate on some fileservers. This set me thinking about what additional data about partition usage we might want to push into metrics, and what exactly we should be alerting on. The question of what condition (or conditions) we should alert on for remaining spares is really a question of what we care about in this situation.
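As a sketch, that quick solution can be expressed as a Prometheus alert rule that requires both conditions at once. The metric names here (sandisk_spares_available and sandisk_partitions_total) are made up for illustration; our actual metric names differ:

```yaml
groups:
  - name: spare-partitions
    rules:
      - alert: TooFewSparePartitions
        # Only fire if the fileserver has data disk partitions at all;
        # a fileserver with no data disks has nothing that needs sparing.
        expr: sandisk_spares_available < 4 and sandisk_partitions_total > 0
        for: 15m
        annotations:
          summary: "{{ $labels.instance }} has too few spare partitions"
```

The PromQL 'and' operator only yields results where both sides have matching series, which is what makes the partition-count condition gate the spares condition.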
(The question of what we want to alert on also drives what metrics we should collect about the situation. We already collect detailed partition usage information elsewhere, down to the specifics of what a given partition is used for.)
Some of this is a policy question. For example, once we have two or more 4 TB SSDs in a fileserver, should the minimum number of spare partitions rise to eight even if we have at most four partitions in use on any one disk? We could adopt a policy that we always want four spare partitions even if a fileserver only has a few disks and a few partitions in use. Or we could set different thresholds, such as being able to replace the biggest disk on a fileserver, or being able to replace the most heavily used one (which might not have all partitions in use).
(Determining if you can replace your most heavily used disk is a little trickier than it looks, because you can't include spare partitions from that disk. If a fileserver has one 4 TB SSD and the rest 2 TB SSDs, the four spare partitions on the 4 TB SSD kind of don't count; if they were the only spares left, we would nominally have enough spare partitions to replace any other SSD that failed, but we couldn't handle the 4 TB SSD failing.)
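One way to make this check concrete is to count, for each disk, only the spare partitions on the other disks. This is a hypothetical sketch, not our actual spares handling code; it represents each disk as a (used, spare) pair of partition counts:

```python
def can_replace_any_disk(disks):
    """Can every disk's used partitions be spared out if it fails?

    disks: list of (used_partitions, spare_partitions) tuples, one per disk.
    Spares on the failed disk itself don't count; they go away with it.
    """
    for i, (used, _) in enumerate(disks):
        spares_elsewhere = sum(spare for j, (_, spare) in enumerate(disks)
                               if j != i)
        if spares_elsewhere < used:
            return False
    return True

# The example from the text: one 4 TB SSD (4 used, 4 spare) plus two
# fully used 2 TB SSDs. There are four spares in total, but they are all
# on the 4 TB SSD, so that disk itself can't be replaced if it fails.
print(can_replace_any_disk([(4, 4), (4, 0), (4, 0)]))  # False
```

If the same four spares instead live on a disk of their own, every disk can be covered, and the check passes.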
Another question and corner case is whether we want to raise any alert if there are no data disk partitions in use on the fileserver. This won't normally come up, but it could if we build test fileservers, and it would also subsume the 'there are no data disk partitions at all' case.
(Writing this entry is showing me that it's not as simple as I thought to think through all of the potential cases.)
I haven't come up with answers so far, but I suspect the eventual result will be that we'll decide we want four spare partitions if we have only 2 TB SSDs in the fileserver and eight spare partitions the moment we have two or more 4 TB SSDs in the fileserver. Then I'll have to figure out how to generate metrics so that we can set up the necessary alerting rules.
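That likely policy is simple enough to sketch as a function from disk sizes to a required spare count. This is an illustration of the rule, not a final decision; in particular, the case of exactly one 4 TB SSD is left at four spares here, which as discussed above is still an open question:

```python
def required_spares(disk_sizes_tb):
    """Minimum spare partitions for a fileserver, given its disk sizes in TB.

    Hypothetical policy: eight spares once two or more 4 TB SSDs are
    present (a failed 4 TB SSD has eight partitions to spare out),
    otherwise four.
    """
    big_disks = sum(1 for size in disk_sizes_tb if size >= 4)
    return 8 if big_disks >= 2 else 4

print(required_spares([2, 2, 2, 2]))  # 4
print(required_spares([4, 4, 2]))     # 8
```

A fileserver with only 2 TB SSDs needs four spares under this rule, and the threshold jumps to eight the moment a second 4 TB SSD goes in.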
(This alert is only intended as a backup, partly to let us know what the spare situation is after a single disk fails. Because allocating partitions to ZFS is a permanent operation that can't be undone, we always need to check how many free partitions we'll have left before we do it.)
PS: I said 'setting' in the title of this entry, but any time you need to update an alert is also a good time to think again about what you really care about. Here we're updating it after it mis-fired, and it mis-fired because we didn't fully understand the situation and what we were actually alerting on, instead of what we thought we were.
(This is related to how alerts have intentions, although the actual concrete alert may not actually implement those intentions perfectly.)