2020-07-28
Our ZFS spares handling system for ZFS on Linux
When we ran Solaris fileservers and then OmniOS fileservers we ended up building our own system for handling replacing failed disks with spares, which I wrote about years ago in part 1, 2, 3, and 4. When we migrated to our current generation of Linux based ZFS fileservers, much of our local software from OmniOS migrated over almost completely unchanged. This included (and includes) our ZFS spares system, which remains mostly as it was in the Solaris and OmniOS era (both in how it operates and in the actual code involved).
The first important aspect of our spares system is that it is still state driven, not event driven. Rather than trying to hook into ZED to catch and handle events, our spares driver program operates by inspecting the state of all of our pools and attempting to start any disk replacement that's necessary (and possible). We do use ZED to immediately run the spares driver in response to both ZED vdev state change events (which can signal a failed disk) and pool resilvers finishing (because a resilver finishing can let us start more disk replacements). We also run the spares driver periodically from cron as a backup to ZED; even if ZED isn't running or misses events for some reason, we will eventually notice problems.
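To make the state-driven approach concrete, here is a minimal sketch in Python of what such a driver's top level can look like. This is not our actual driver and the helper names are hypothetical; the important property it illustrates is that every run starts from current pool state, so it behaves the same whether ZED, cron, or a human invokes it.

    #!/usr/bin/python3
    # Hypothetical sketch of a state-driven spares driver's top level.
    # It consumes no events; each run inspects current pool state, so
    # running it 'too often' or 'too late' is always harmless.
    import subprocess

    def pools_all_healthy():
        # 'zpool status -x' reports only pools with problems, and prints
        # 'all pools are healthy' when there is nothing to look at.
        out = subprocess.run(["zpool", "status", "-x"],
                             capture_output=True, text=True, check=True)
        return "all pools are healthy" in out.stdout

    def main():
        if pools_all_healthy():
            return
        # Otherwise: inspect the full pool state, work out which devices
        # have failed, and start whatever replacements are both necessary
        # and currently possible (sketched further below).

    if __name__ == "__main__":
        main()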
Our Solaris and OmniOS fileservers used iSCSI, so we had to carefully maintain a list of what iSCSI disks were potential spares for each fileserver (a fileserver couldn't necessarily use any iSCSI disk visible to it). Since our Linux fileservers only have local disks, we could get rid of these lists; the spares driver can now use any free disks it sees and its knowledge of available spares is always up to date.
(As before, these 'disks' are actually fixed size partitions on our SSDs, with four partitions per SSD. We are so immersed in our world that we habitually call these 'disks' even though they aren't.)
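As an illustration of why the spare lists could go away, here is a sketch assuming we already have the set of all partitions on the system and the devices currently in each pool (both hypothetical inputs, say gathered from lsblk and from pool status); the available spares are simply the difference, recomputed on every run.

    # Sketch: with local disks only, 'available spares' is just whatever
    # partitions aren't in some pool right now, so it can never go stale
    # the way a hand-maintained iSCSI spares list could.
    def free_partitions(all_partitions, pool_devices):
        # all_partitions: every 'disk' (partition) on the system
        # pool_devices: {pool name: [devices currently in that pool]}
        in_use = {dev for devs in pool_devices.values() for dev in devs}
        return sorted(set(all_partitions) - in_use)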
As in the iSCSI world, we don't pick replacement disks randomly; instead there is a preference system. Our fileservers have half their disks on SATA and half on SAS, and our regular mirrored pairs use the same partition from matching disks (so the first partition on the first SATA disk is in a mirror vdev with the first partition on the first SAS disk). Spare replacement tries to pick a replacement disk partition on the same type of disk (SATA or SAS) as the dead disk; if it can't find one, it falls back to 'any free partition' (which can happen if we use up almost all of the available space on a fileserver, as has already happened on one).
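The preference logic itself is simple enough to sketch. Assume a hypothetical mapping that classifies each free partition as 'sata' or 'sas', however that is determined:

    # Hypothetical sketch of the replacement preference: a free partition
    # on the same bus type as the dead disk first, any free partition as
    # a last resort.
    def pick_replacement(dead_type, free_parts, type_of):
        # dead_type: 'sata' or 'sas'; type_of maps partition -> bus type
        same = [p for p in free_parts if type_of[p] == dead_type]
        if same:
            return same[0]
        return free_parts[0] if free_parts else None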
In the past, with HDs over iSCSI, we had to carefully limit the number of resilvers that we did at once in order not to overwhelm the system; our normal limit was replacing only one 'disk' (a partition) at a time. Our experience with local SSDs is that this is no longer really a problem, so now we will replace up to four failed partitions at once, which normally means that if an SSD fails we immediately start resilvers for everything that was on it. This has made a certain amount of old load limiting code in the spares driver basically pointless, but we haven't bothered to remove it.
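What remains of the load limiting reduces to a simple cap, something like this sketch (the count of in-progress replacements is a hypothetical input):

    # Sketch of the now mostly vestigial load limit: with iSCSI HDs the
    # cap was effectively 1; with local SSDs it is 4, which usually means
    # 'replace everything that was on a failed SSD at once'.
    MAX_REPLACEMENTS = 4

    def replacements_to_start(failed_parts, in_progress):
        room = max(MAX_REPLACEMENTS - in_progress, 0)
        return failed_parts[:room]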
For inspecting the state of ZFS pools, we continue to rely on our local C program to read out ZFS pool state. It ported from OmniOS to ZFS on Linux with almost no changes, although getting it to compile on Ubuntu 18.04 was a bit of a pain because of how Ubuntu packages ZFS there. It's possible that ZFS on Linux now has official APIs that would provide this information, but our existing code works now, so I haven't had any interest in investigating them.
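Our real state reader is that C program, and it reads far more detailed per-vdev state than this; purely as an illustration of getting pool-level information without it, here is a Python sketch using the scripted output of 'zpool list':

    # Illustrative only: per-pool health via 'zpool list'. The -H flag
    # gives tab-separated scripted output with no header line.
    import subprocess

    def pool_health():
        out = subprocess.run(["zpool", "list", "-H", "-o", "name,health"],
                             capture_output=True, text=True, check=True)
        return dict(line.split("\t") for line in out.stdout.splitlines())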
Digital microwaves show an example of good UI doing what you wanted
Every so often, you encounter a bit of UI that does what you mean so transparently that you don't even notice it's breaking its own 'how it works' rules to do so. You could say that this is the complete reverse of robot logic, as seen yesterday in trying to change your password on Linux. Recently I realized that I'd run into such a 'do what I mean even though it doesn't fit' UI in plain ordinary microwaves, of all places.
Your typical microwave has a 0-9 keypad for entering the cooking time, with the time set as MM:SS. If you want one minute and twenty seconds, you punch in '1 2 0', and the microwave displays '1:20' and counts down from there. This is all perfectly logical and sensible, and forms a clear model of how the microwave behaves.
So what happens if you enter '9 0'? The microwave doesn't reject this as an error because you can't have 90 seconds in the seconds portion of a minutes and seconds time (you can have at most 59). Instead it breaks the model and gives you 90 seconds of cook time. This creates some inconsistencies, of course; if you enter '9 9' you get 99 seconds, but if you enter '1 0 0', you get 60 seconds (because now it's 1:00 cook time). On at least some microwaves this still works even if you enter more than two digits; '1 9 9' is 199 seconds, not an error (or one minute plus 99 seconds).
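One way to express the rule these microwaves seem to follow is 'the last two digits are seconds and may overflow past 59; everything before them is minutes'. Here is that reading as a small Python sketch (note that it gives 159 for '1 9 9', so the microwaves that produce 199 seconds must be following some other rule):

    # Sketch of the 'do what I mean' reading: the last two digits are the
    # seconds and are allowed to exceed 59; the rest is minutes.
    def cook_seconds(keys):
        minutes, seconds = int(keys[:-2] or "0"), int(keys[-2:])
        return minutes * 60 + seconds

    assert cook_seconds("120") == 80    # '1 2 0' is 1:20
    assert cook_seconds("90") == 90     # '9 0' is 90 seconds, not an error
    assert cook_seconds("100") == 60    # '1 0 0' is 1:00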
This behavior was so natural and so obviously correct and what I meant that I spent years not realizing there was anything unusual about it. Only one day, as I was keying in '9 0' yet again and congratulating myself on pressing one fewer digit than '1 3 0', did I stop to ask myself why it even worked. And the answer is that the microwave makers went out of their way to figure out what this input should mean so that it matched user expectations, and then made it so.
(I suspect or at least hope that there were user studies by some early microwave company on what people expected to happen when they keyed in various number sequences that weren't proper MM:SS setups.)