Our ZFS spares handling system (part 3)

July 20, 2011

In part 1 I mentioned that our spares system pulls the list of disks to use as spares from files, and said that how those files are maintained was beyond the scope of that entry. Well, it's time to talk about that.

From more or less the beginning of our ZFS fileserver system we've had an administrative system that captured a record of all pools on each physical server and a list of all disks visible to that server and how they were being used by ZFS. This system is relatively crude; it's shell scripts that run once a day and then scp their output to a central location (which is then replicated back to all fileservers). By combining the information from all of the local disk usage files, the spares file building system can get a global view of both what disks exist and how they're used.
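The merging step can be sketched roughly as follows. This is a hypothetical illustration, not the real shell scripts: the file naming (one `<server>.disks` file per physical server) and the line format (`<disk> <pool-or-unused>`) are my assumptions.

```python
import glob
import os


def load_disk_usage(dirname):
    """Merge every per-server '<server>.disks' file in dirname into one
    global view: a dict mapping disk name -> pool name or 'unused'.
    Each line is assumed to be '<disk> <pool|unused>'."""
    usage = {}
    for path in glob.glob(os.path.join(dirname, "*.disks")):
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) != 2:
                    continue
                disk, user = fields
                usage[disk] = user
    return usage
```

Because the per-server files are replicated to one central location, a single pass over the directory yields the global view the spares builder needs.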

(At this point I will pause to note that all through our system we translate iSCSI disk names from the local Solaris cNt... names to symbolic names that have the iSCSI target and logical disk involved. This is a hugely important step and avoids so many potential problems.)
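The point of the translation is that the symbolic `backend:logicaldisk` name is stable, while the same logical disk can show up under different `cNtNdN` names on different physical servers. A minimal sketch, assuming the mapping itself is derived from each server's iSCSI configuration (the table here is purely illustrative):

```python
def symbolic_name(cname, mapping):
    """Translate a local Solaris cNtNdN device name into a stable
    'backend:logicaldisk' name. The mapping is assumed to come from
    the server's own iSCSI configuration."""
    return mapping.get(cname, cname)


# The same logical disk, seen under different local names on two
# servers, translates to one stable symbolic name on both.
server_a = {"c2t3d0": "lincoln:disk03"}
server_b = {"c5t1d0": "lincoln:disk03"}
```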

Although we have disk usage information for physical servers, the spares files are built for our virtual fileservers; each virtual fileserver has its own list of spares, even if two virtual fileservers happen to be using the same physical server at the moment. We do this because each of our iSCSI backends is typically dedicated to a single virtual fileserver and we want to keep things that way even when we have to activate spares. The overall spares handling environment goes to some pains to make this work.

The whole process of building the spares files for the virtual fileservers is controlled by a configuration file with directives. There are two important sorts of directives in the file:

fs8 use backend lincoln

This means that the virtual fileserver fs8 should use as spare disks any unused disks on the iSCSI backend called lincoln.

all exclude pool fs2-core-01

This means that all virtual fileservers should avoid (as spares) any logical disks that share a physical disk with a disk used by the ZFS pool fs2-core-01 (which happens to be the pool that hosts our /var/mail, which is quite sensitive to the increased IO load of a resilver).

(There are variants of these directives that allow us to be more specific about things, but in practice we don't need them.)
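The core of the build process can be sketched like this. It is a simplified stand-in, not the real program: the four-field config format is taken from the examples above, but the function names, the `usage` map (from the merged disk-usage files), and the `shares_physical` hook are my inventions. In particular, the real exclusion check works from knowledge of which logical disks share a physical disk; here that knowledge is abstracted into a callback.

```python
def parse_config(lines):
    """Parse '<fs|all> use backend <name>' and '<fs|all> exclude pool
    <name>' directives into two dicts keyed by fileserver name."""
    uses, excludes = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        who, verb, what, name = fields
        if (verb, what) == ("use", "backend"):
            uses.setdefault(who, []).append(name)
        elif (verb, what) == ("exclude", "pool"):
            excludes.setdefault(who, []).append(name)
    return uses, excludes


def spares_for(fs, uses, excludes, usage, shares_physical=None):
    """Return the spares list for fileserver fs. usage maps
    'backend:disk' -> pool name or 'unused'. shares_physical(disk, pool)
    stands in for the real check that a logical disk shares a physical
    disk with one used by an excluded pool."""
    backends = set(uses.get(fs, []))
    banned_pools = set(excludes.get(fs, []) + excludes.get("all", []))
    spares = []
    for disk, pool in sorted(usage.items()):
        if disk.split(":")[0] not in backends or pool != "unused":
            continue
        if shares_physical and any(shares_physical(disk, p)
                                   for p in banned_pools):
            continue
        spares.append(disk)
    return spares
```

Run once over all fileservers, this naturally honours the one-backend-per-fileserver convention: a disk can only become a spare for the fileserver whose `use backend` directive names its backend.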

The spares-files build process is run on a single machine from cron, normally once a day. This low frequency of automated rebuilds is generally perfectly fine because disk usage information changes only very slowly. If we're replacing a backend there is a series of steps we have to do by hand to get the spares files rebuilt promptly, but that's an exceptional circumstance.

In theory we could have put all of this initial spares selection logic straight into the spares handling program. In practice, I feel that there's a very strong reason to keep the two separate (beyond the fact that the split makes both aspects of the spares problem simpler). Since we want each potential spare disk to only ever be usable by a single virtual fileserver, overall spares selection is inherently a global process. Global processes should be done once and centrally, because this avoids any chance that two systems will ever do them separately and disagree over what the answer should be. If we only ever generate spare disk lists in one place, we have a strong assurance that only serious program bugs will ever cause two fileservers to think that they can use the same disk as a spare. If the fileservers did this themselves, there are all sorts of (de)synchronization issues that could cause such duplication.

(We can also post-process the output files to check that this constraint holds true.)
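That check is cheap to sketch. Assuming (my assumption, not the real format) that we have each fileserver's spares list as a simple list of disk names, finding violations of the one-fileserver-per-disk constraint is an inverted-index walk:

```python
def find_duplicates(spares_lists):
    """spares_lists maps fileserver -> list of spare disks. Return
    {disk: [fileservers]} for any disk claimed by more than one
    fileserver; an empty dict means the constraint holds."""
    claimed_by = {}
    for fs, disks in spares_lists.items():
        for disk in disks:
            claimed_by.setdefault(disk, []).append(fs)
    return {d: fses for d, fses in claimed_by.items() if len(fses) > 1}
```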
