An update on faulted ZFS spares
We've recently got some additional pieces of news on the faulted ZFS spares situation.
First, our suspicion as to the cause was correct; Sun has confirmed that there is a race in adding the same spare to multiple pools under some circumstances. The fix for it is apparently in Solaris 10 update 8, and Sun did an 'IDR' for us for our Solaris 10 update 6 systems. (I assume but have not confirmed that just applying the current set of ZFS patches and their prerequisites is good enough.)
Second, Solaris 10 update 8 can properly 'zpool remove' faulted spares from pools, so even if Sun has not completely fixed all of the spares-related races yet you can recover from the situation yourself. Again, it's likely that this fix is in the current set of ZFS patches (and Sun put it in our IDR).
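As a rough illustration of that recovery step, here is a minimal Python sketch that scans 'zpool status' output for spares marked FAULTED and prints the 'zpool remove' command for each one, rather than running anything. The sample output, the pool name 'tank', the device names, and the helper function names are all made up for illustration, and the parsing assumes the usual layout of a 'spares' section with the device name and its state as the first two columns; check it against what your own systems actually print before trusting it.

```python
import re

# Hypothetical sample of 'zpool status' output; pool and device names are invented.
SAMPLE = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
        spares
          c1t3d0    FAULTED   corrupted data
          c1t4d0    AVAIL

errors: No known data errors
"""


def faulted_spares(status_text):
    """Return (pool, device) pairs for spares whose state column is FAULTED."""
    found = []
    pool = None
    in_spares = False
    for line in status_text.splitlines():
        stripped = line.strip()
        m = re.match(r"pool:\s+(\S+)", stripped)
        if m:
            # A new pool section starts; remember its name.
            pool = m.group(1)
            in_spares = False
            continue
        if stripped == "spares":
            in_spares = True
            continue
        if in_spares:
            fields = stripped.split()
            if not fields:
                # Assumption: a blank line ends the spares section.
                in_spares = False
                continue
            if len(fields) >= 2 and fields[1] == "FAULTED":
                found.append((pool, fields[0]))
    return found


def removal_commands(status_text):
    """Build (but do not run) one 'zpool remove' command per faulted spare."""
    return ["zpool remove %s %s" % (p, d) for p, d in faulted_spares(status_text)]


if __name__ == "__main__":
    for cmd in removal_commands(SAMPLE):
        print(cmd)
```

Printing the commands instead of executing them is deliberate; given how risky ZFS operations are on a production fileserver, a human should look at each one before running it.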
(Mind you, since the current set of ZFS patches depends on a kernel rollup patch, installing them is not all that far from a full upgrade to S10U8 as far as we're concerned, because in our NFS fileserver environment kernel and ZFS patches are by far the riskiest ones. They are not always the riskiest, though, and sadly that particular bug is still in S10U8.)
However, the more I have seen of how Sun handles ZFS pool spares in general, the less confidence I have in it working properly when we need it. Right now I consider ZFS's own spare handling to be at most an emergency measure; it's the sort of thing that gets you from the middle of the night to the morning when you read your email, not something that you let sit and handle problems on its own.