Our ZFS spares handling system for ZFS on Linux
When we ran Solaris fileservers and then OmniOS fileservers, we ended up building our own system for replacing failed disks with spares, which I wrote about years ago in parts 1, 2, 3, and 4. When we migrated to our current generation of Linux-based ZFS fileservers, much of our local software for OmniOS migrated over almost completely unchanged. This included (and includes) our ZFS spares system, which remains mostly unchanged from the Solaris and OmniOS era (both in how it operates and in the actual code involved).
The first important aspect of our spares system is that it is still state driven, not event driven. Rather than trying to hook into ZED to catch and handle events, our spares driver program operates by inspecting the state of all of our pools and attempting to start any disk replacement that's necessary (and possible). We do use ZED to immediately run the spares driver in response to both ZED vdev state change events (which can be a disk failing) and pool resilvers finishing (because a resilver finishing can let us start more disk replacements). We also run the spares driver periodically from cron as a backup to ZED; even if ZED isn't running or misses events for some reason, we will eventually notice problems.
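To illustrate what 'state driven' means here, the following is a minimal sketch in Python and not our actual spares driver. The names (Pool, plan_replacements, run_pass) and the set of vdev states treated as needing replacement are assumptions made for the example. The point is that whatever triggered the run (ZED or cron) is not used to decide anything; every run does the same full scan of pool state.

```python
from dataclasses import dataclass

# Assumption for this sketch: these ZFS vdev states mean "replace this disk".
NEEDS_REPLACEMENT = {"FAULTED", "UNAVAIL", "REMOVED"}

@dataclass
class Pool:
    name: str
    disk_states: dict  # hypothetical: disk name -> vdev state string

def plan_replacements(pools, free_spares):
    """Scan all pools and pair each failed disk with a free spare."""
    spares = list(free_spares)
    plan = []
    for pool in pools:
        for disk, state in pool.disk_states.items():
            if state not in NEEDS_REPLACEMENT:
                continue
            if not spares:
                # Nothing more is possible right now; a later run retries.
                return plan
            plan.append((pool.name, disk, spares.pop(0)))
    return plan

def run_pass(trigger, pools, free_spares):
    """One run of the driver. 'trigger' names what started the run
    (a ZED event or cron) and is deliberately ignored for decisions."""
    return plan_replacements(pools, free_spares)
```

Because the result depends only on the current state, a missed or duplicated ZED event is harmless; the next run from cron would come up with the same plan.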
Our Solaris and OmniOS fileservers used iSCSI, so we had to carefully maintain a list of what iSCSI disks were potential spares for each fileserver (a fileserver couldn't necessarily use any iSCSI disk visible to it). Since our Linux fileservers only have local disks, we could get rid of these lists; the spares driver can now use any free disks it sees and its knowledge of available spares is always up to date.
(As before, these 'disks' are actually fixed size partitions on our SSDs, with four partitions per SSD. We are so immersed in our world that we habitually call these 'disks' even though they aren't.)
As in the iSCSI world, we don't pick replacement disks randomly; instead there is a preference system. Our fileservers have half their disks on SATA and half on SAS, and our regular mirrored pairs use the same partition from matching disks (so the first partition on the first SATA disk is in a mirror vdev with the first partition on the first SAS disk). Spare replacement tries to pick a replacement disk partition on the same type of disk (SATA or SAS) as the dead disk; if it can't find one, it falls back to 'any free partition' (which can happen if we use up almost all of the available space on a fileserver, as has already happened on one of them).
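The preference rule can be sketched roughly as follows. This is an illustration and not our real code; the function name pick_spare, the (name, bus) tuples, and the partition names are all made up for the example.

```python
def pick_spare(dead_bus, free_partitions):
    """Pick a replacement partition for a dead one.

    dead_bus is "SATA" or "SAS"; free_partitions is a list of
    (name, bus) tuples (a hypothetical representation).
    """
    # First preference: a free partition on the same type of disk.
    for name, bus in free_partitions:
        if bus == dead_bus:
            return name
    # Fallback: any free partition at all.
    if free_partitions:
        return free_partitions[0][0]
    # No spares left; the caller has to wait.
    return None
```

For example, with free partitions [("sata2-p3", "SATA"), ("sas5-p1", "SAS")], a dead SAS partition gets "sas5-p1", and if only the SATA partition were free it would get "sata2-p3" instead.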
In the past, with HDs over iSCSI, we had to carefully limit the number of resilvers that we did at once in order to not overwhelm the system; our normal limit was replacing only one 'disk' (a partition) at a time. Our experience with local SSDs is that this is no longer really a problem, so now we will replace up to four failed partitions at once, which normally means that if an SSD fails we immediately start resilvers for everything that was on it. This has made a certain amount of old load limiting code in the spares driver basically pointless, but we haven't bothered to remove it.
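As a rough sketch of how such a limit interacts with a state driven design (again an illustration with made-up names, not our actual load limiting code), each run can only start as many new replacements as the limit leaves room for. This sketch assumes that replacements already in progress count against the limit.

```python
MAX_REPLACEMENTS = 4  # was effectively 1 in the HDs-over-iSCSI days

def replacements_to_start(failed, in_progress, limit=MAX_REPLACEMENTS):
    """Return the failed partitions to start replacing in this run.

    failed is a list of failed partition names not yet being replaced;
    in_progress is how many replacements are currently resilvering.
    """
    budget = max(0, limit - in_progress)
    return failed[:budget]
```

With four partitions per SSD and a limit of four, a single dead SSD fits entirely inside the budget. Anything left over is picked up by a later run, for instance the one that ZED triggers when a resilver finishes.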
For inspecting the state of ZFS pools, we continue to rely on our local C program to read out ZFS pool state. It ported from OmniOS to ZFS on Linux with almost no changes, although getting it to compile on Ubuntu 18.04 was a bit of a pain because of how Ubuntu packages ZFS there. It's possible that ZFS on Linux now has official APIs that would provide this information, but our existing code works, so I haven't had any interest in investigating the current state of any official API for ZFS pool information.