== What I know about how ZFS actually handles spare disks

Like many other RAID-oid systems, ZFS has a notion of spare disks; you can add one or more spare disks to a pool, and ZFS will use them as necessary in order to maintain pool redundancy in the face of disk problems. For details, you can see the _zpool_ manpage.

Well, sort of. Actually, how ZFS handles spare disks is significantly different from how normal RAID systems handle them, and the pleasantly bland and normal description of spares in the _zpool_ manpage elides a significant number of important things. The following is what I have been able to gather about the situation from various sources (since Sun doesn't seem to actually document it).

In a traditional RAID system with spares, spare handling is part of the main RAID code in the kernel, with spares activated automatically when needed. In Solaris this is not the case; the only thing that the kernel ZFS code does is keep track of the list of spares and some state information about them. Activating a spare is handled by user-level code, which issues the equivalent of '_zpool replace <pool> <bad-disk> <spare-disk>_' through a library call. Specifically, activating ZFS spares is the job of the zfs-retire agent of [[fmd, the Solaris fault manager daemon FaultManagerIrritation]].

(Once zfs-retire activates the spare, the ZFS kernel code handles the rest of the process, including marking the spare in use and setting up the special 'this device is replaced with a spare' vdev. This means that you can duplicate a spare activation by doing a '_zpool replace_' by hand if you ever want to.)

In theory, using _fmd_ for this is equivalent to doing it all in the kernel. In practice, your ZFS spare handling is at the mercy of everything working right, and everything doesn't always do so. For one prominent example, it is up to the zfs-retire module to decide which events should cause it to activate a spare, and historically it has not done so for everything that degrades a ZFS vdev.

My primary sources for all of this are [[this Eric Schrock entry http://blogs.sun.com/eschrock/entry/zfs_hot_spares]] and the archives of the zfs-discuss mailing list. Examination of the OpenSolaris codebase has also been useful (although if you are tempted to do this, beware; it does not necessarily correspond with Solaris 10).

=== Sidebar: what is required for spare activation

In order for a spare to be activated, a great many moving parts of your system all have to be working right. I feel like writing them down (at least the ones that I can think of):

* _fmd_ has to be running.
* _fmd_ has to be getting (and generating) the relevant events, which may require various _fmd_ modules to be working correctly.
* the _zfs-retire_ agent has to be working, and to have subscribed to those events.
* _zfs-retire_ has to decide that the event is one that should cause it to activate a spare.
* _zfs-retire_ has to be able to query the kernel (I think) to get the problem pool's configuration in order to find out what spares are available. ([[This can fail ZFSFailmodeProblem]].)
* _zfs-retire_ has to be able to issue the necessary 'replace disk' system call.

A further side note on events: in an ideal world, there would be a 'ZFS vdev has been degraded because of device <X>' event that zfs-retire would listen for. If you think that Solaris lives in this world, I have bad news for you.
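(If you want to look at some of these moving parts on a live system, something like the following is a starting point. This is only a sketch; the pool name 'tank' is invented for the example, and none of this tells you whether zfs-retire will actually decide to act on any particular event:

    # is fmd itself running?
    svcs fmd
    # is the zfs-retire agent loaded into fmd?
    fmadm config | grep zfs-retire
    # what error events has fmd logged?
    fmdump -e
    # what does ZFS itself think about the pool, its devices, and its spares?
    zpool status -v tank

At best this tells you that the machinery is present and awake, not that it will do the right thing when a disk dies.)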
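(And as a concrete illustration of the 'by hand' spare activation mentioned in the main text: assuming an invented pool 'tank' where c1t2d0 has died and c1t9d0 is already configured as one of the pool's spares, the manual equivalent of what zfs-retire does is roughly:

    zpool replace tank c1t2d0 c1t9d0

Because c1t9d0 is one of the pool's spares, the ZFS kernel code sets up the same 'this device is replaced with a spare' vdev that an automatic activation would, and '_zpool status_' will show c1t2d0 and c1t9d0 paired underneath it.)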