Our ZFS spares handling system (part 3)
In part 1 I mentioned that our spares system pulls the disks to use as spares from files, and that how those files are maintained was beyond the scope of that entry. Well, it's time to talk about that.
From more or less the beginning of our ZFS fileserver system we've had an administrative system that captured
a record of all pools on each physical server and a list of all disks
visible to that server and how they were being used by ZFS. This system
is relatively crude; it's shell scripts that run once a day and then
scp their output to a central location (which is then replicated back
to all fileservers). By combining the information from all of the local
disk usage files, the spares file building system can get a global view
of both what disks exist and how they're used.
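The merging step can be sketched in a few lines. This is only an illustration, not our actual scripts (which are shell): the file format here, lines of "disk state [pool]", is a hypothetical stand-in for whatever the per-server usage files really contain.

```python
import glob

def parse_usage_lines(lines):
    """Parse one server's usage file.

    Assumed (hypothetical) format: one disk per line,
    "<disk> <state> [<pool>]", with '#' comments allowed.
    Returns {disk: (state, pool-or-None)}.
    """
    view = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        disk, state = fields[0], fields[1]
        pool = fields[2] if len(fields) > 2 else None
        view[disk] = (state, pool)
    return view

def load_global_disk_view(usage_dir):
    """Merge every server's usage file into one global disk map."""
    view = {}
    for path in sorted(glob.glob(usage_dir + "/*.usage")):
        with open(path) as f:
            view.update(parse_usage_lines(f))
    return view
```

Because the central location holds one usage file per physical server, a simple merge like this is enough to answer "what disks exist and who is using them" globally.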
(At this point I will pause to note that all through our system we
translate iSCSI disk names from the local Solaris
cNt... names to
symbolic names that have the iSCSI target and logical disk involved.
This is a hugely important step and avoids so many potential problems.)
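Conceptually the translation is just a lookup, once you have a table mapping local device names to their iSCSI target and logical disk; building that table is initiator-specific and not shown here. The names below ("sanhost1", "disk03", and the cNtN example) are made up for illustration.

```python
def symbolic_name(local_name, iscsi_table):
    """Map a local Solaris device name (e.g. 'c5t12d0') to a stable
    'target/logical-disk' name, via a precomputed lookup table.

    iscsi_table: {local_name: (iscsi_target, logical_disk)}
    """
    target, logical = iscsi_table[local_name]
    return "%s/%s" % (target, logical)

table = {"c5t12d0": ("sanhost1", "disk03")}
print(symbolic_name("c5t12d0", table))  # -> sanhost1/disk03
```

The payoff of doing this everywhere is that a disk keeps the same name across reboots and across servers, so the aggregated usage files can be compared directly.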
Although we have disk usage information for physical servers, the spares files are built for our virtual fileservers; each virtual fileserver has its own list of spares, even if two virtual fileservers happen to be using the same physical server at the moment. We do this because each of our iSCSI backends is typically dedicated to a single virtual fileserver and we want to keep things that way even when we have to activate spares. The overall spares handling environment goes to some pains to make this work.
The whole process of building the spares files for the virtual fileservers is controlled by a configuration file with directives. There are two important sorts of directives in the file:
fs8 use backend lincoln
This means that the virtual fileserver fs8 should use as spare disks any unused disks on the iSCSI backend lincoln.
all exclude pool fs2-core-01
This means that all virtual fileservers should avoid (as spares) any logical disks that share a physical disk with a disk used by the fs2-core-01 pool (which happens to be the pool that hosts our /var/mail, which is quite sensitive to the increased IO load of a resilver).
(There are variants of these directives that allow us to be more specific about things, but in practice we don't need them.)
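A minimal parser for the two directive forms shown above might look like the following. This is a sketch, not our real code, and it assumes every directive is a well-formed four-field line.

```python
def parse_spares_config(lines):
    """Parse directives of the two forms described above:

        <fileserver>|all use backend <backend>
        <fileserver>|all exclude pool <pool>

    Returns ({who: [backends]}, {who: [pools]}).
    Assumes well-formed four-field directives; '#' starts a comment.
    """
    uses, excludes = {}, {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        who, verb, what, name = fields
        if verb == "use" and what == "backend":
            uses.setdefault(who, []).append(name)
        elif verb == "exclude" and what == "pool":
            excludes.setdefault(who, []).append(name)
    return uses, excludes

uses, excludes = parse_spares_config(
    ["fs8 use backend lincoln", "all exclude pool fs2-core-01"])
```

The "all" pseudo-fileserver then just gets folded into every real fileserver's exclusions when the spares files are generated.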
The spares-files build process is run on a single machine from cron, normally once a day. This low frequency of automated rebuilds is generally perfectly fine because disk usage information changes only very slowly. If we're replacing a backend there is a series of steps we have to do by hand to get the spares files rebuilt promptly, but that's an exceptional circumstance.
In theory we could have put all of this initial spares selection logic straight into the spares handling program. In practice, I feel that there's a very strong reason to keep them separate (in addition to this making both aspects of the spares problem simpler). Since we want each potential spare disk to only ever be usable by a single virtual fileserver, overall spares selection is inherently a global process. Global processes should be done once and centrally, because this avoids any chance that two systems will ever do them separately and disagree over what the answer should be. If we only ever generate spare disk lists in one place, we have a strong assurance that only serious program bugs will ever cause two fileservers to think that they can use the same disk as a spare. If the fileservers did this themselves, there are all sorts of (de)synchronization issues that could cause such duplication.
(We can also post-process the output files to check that this constraint holds true.)
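That post-processing check is straightforward once the generated spares lists are in hand; a sketch (with a hypothetical in-memory representation of the per-fileserver spares files):

```python
def find_duplicate_spares(spares_lists):
    """Check the global uniqueness constraint: no disk may appear in
    more than one fileserver's spares list.

    spares_lists: {fileserver: [disk, ...]}
    Returns the set of disks claimed by two or more fileservers.
    """
    seen, dups = {}, set()
    for fs, disks in spares_lists.items():
        for disk in disks:
            if disk in seen and seen[disk] != fs:
                dups.add(disk)
            seen[disk] = fs
    return dups
```

An empty result means the build honored the constraint; anything else is a build bug worth failing loudly on.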