2010-06-18
ZFS and multi-pool shared spares
One of the features of ZFS's spares handling is that if you have multiple pools, you can share spare disks between them. This lets you have a single global pool of spares that are used by whichever pool needs them, or more complex schemes if you really want them (you might in, say, a SAN environment).
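Mechanically, sharing a spare is just a matter of adding the same device as a spare to each pool that should be able to use it. Here is a minimal sketch of that (in Python, shelling out to zpool); the pool names and the disk device are hypothetical examples.

    # A minimal sketch of sharing one spare disk between two pools by adding
    # the same device as a hot spare to each of them. The pool names and the
    # disk device are hypothetical examples.
    import subprocess

    SPARE_DISK = "c5t0d0"            # hypothetical shared spare device

    for pool in ("tank", "data"):    # hypothetical pool names
        # 'zpool add <pool> spare <device>' makes the device an available hot
        # spare for that pool; doing this for each pool is what shares the spare.
        subprocess.run(["zpool", "add", pool, "spare", SPARE_DISK], check=True)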
Our experience is that you want to be really cautious around multi-pool shared spares, because they're buggy and they don't necessarily work very well in practice in some failure scenarios. Overall, they seem far more like a first-cut feature than something that is either very useful or thoroughly tested. My strong general impression is that the Solaris engineering effort is almost entirely focused on what they see as the common case, where pools have dedicated spare devices; shared spares are a corner case that gets relatively little attention and development, something where a basic feature was thrown into the code because it looked easy and sort of met a need.
(In fact, our experience has been so negative that we are slowly building our own spare handling system.)
The bugs are the most serious issue. Solaris versions before Solaris 10 update 8 have significant bugs in adding and removing spares such that you can wind up with useless yet 'stuck' spares (we have some painful experience with this) or not be able to remove dead spares from pools. Even Solaris 10 update 8 has not completely fixed the spares problem; we have one system where there is one particular pool that simply will not share spares with other pools.
(If we added a spare to that pool and to any other pool on the system, the spare got a corrupted GUID in one of the pools. All of the other pools on the system could and do share spares with each other.)
Setting aside the general issues with ZFS spare handling, shared spares work acceptably in simple situations. If you have a single failure, ZFS will activate a spare in whichever pool needs it, things will resilver, and you will be fine. The problems come when you have a large enough failure that you need more spares than you have, because ZFS (of course) has no notion of prioritization for which disks in which pools get replaced with spares; with simultaneous failures, it basically picks at random. The result can be an essentially useless allocation of spares, and as an unpleasant bonus the resilver IO load can destroy your system's performance.
(I don't blame ZFS for not handling this case, since how to prioritize spares deployment is a local policy decision, but it does make shared spares less useful in some situations. And it would be nice to have some control over the situation so you could actually implement a local policy; instead ZFS and Solaris have locked everything up inside a series of black boxes.)
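To make the 'local policy' point concrete, here is a rough sketch (in Python, with hypothetical pool names, spare devices, and policy) of the sort of thing a home-grown spare handling system winds up doing: look at pool health in a priority order you pick and hand spares to the most important degraded pools first. This is an illustration of the idea, not our actual system.

    # A rough sketch of a local spare-assignment policy: walk pools in our own
    # priority order and hand out spares from a shared list to degraded pools.
    # The pool names, spare list, and policy are hypothetical illustrations,
    # not our real spare handling system.
    import subprocess

    POOL_PRIORITY = ["homes", "mail", "scratch"]   # hypothetical: most important first
    FREE_SPARES = ["c5t0d0", "c5t1d0"]             # hypothetical spare devices

    def pool_health(pool):
        # 'zpool list -H -o health <pool>' prints just the pool's health,
        # eg ONLINE or DEGRADED.
        out = subprocess.run(["zpool", "list", "-H", "-o", "health", pool],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    for pool in POOL_PRIORITY:
        if not FREE_SPARES:
            break                      # nothing left to hand out
        if pool_health(pool) == "DEGRADED":
            spare = FREE_SPARES.pop(0)
            # Give the spare to this pool only; ZFS may then activate it on its
            # own, or you can 'zpool replace' the failed device with it by hand.
            subprocess.run(["zpool", "add", pool, "spare", spare], check=True)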
2010-06-11
What I know about ZFS and disk write caches
Many people with ZFS experience have probably heard that ZFS is designed to work safely with disks with write caches, and may also have heard that this is done with cache flush commands. The details of what gets done when and under what circumstances are a bit complicated, though.
First, I'm only talking about the situation where you give ZFS whole disks, instead of slices. When you do this, ZFS marks the device as being a whole disk in the pool configuration, and the remaining magic is driven from that marker.
When the disk is activated as part of an active pool (through system boot, pool import, being added to the pool, etc), ZFS will ask the disk for its current write cache settings. If the disk has write cache disabled ('WCD'), ZFS will try to enable it. If the disk reports that it has the write cache enabled ('WCE'), it's just left that way.
(At least according to the OpenSolaris source code, ZFS does not attempt to disable the write cache if it's working on a slice of a disk. If you gave ZFS a slice, it just leaves the write cache state alone.)
In operation, ZFS sends cache flushes if and only if it believes that the disk has its write cache enabled, either because the disk was WCE to start with or because ZFS successfully turned on the write cache. ZFS sends these cache flushes regardless of the 'whole disk' state, so they will get sent even if ZFS is using a slice on a disk that had its write cache turned on to start with.
(For SCSI-like disks, Solaris will notice if the disk rejects SYNCHRONIZE CACHE operations and will quietly stop trying to issue more.)
There is an evil tuning option (zfs_nocacheflush) to make ZFS not issue cache flush commands at all. Even without this option turned on, ZFS will not issue cache flushes at all in at least two cases: if the disk reports that it is WCD and rejects attempts to change this, or if cache flush attempts get rejected by the disk.
(Some SAN RAID arrays apparently have options to reject cache flush commands.)
On the other hand, if your disk reports WCE but also rejects attempts to change its write cache state, ZFS will still send cache flushes. That your disk would reject the MODE SELECT doesn't matter, because it never gets sent (the cache is already on) and besides, ZFS ignores the error in the first place.
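Putting these rules together, here is a toy model in Python of the decision logic as I understand it; it is purely an illustration of the rules above, not actual ZFS or sd driver code.

    # A toy model of the cache flush rules described above; purely an
    # illustration of my understanding, not ZFS or sd driver code.

    def activate_disk(whole_disk, wce, accepts_cache_change):
        """Return (write_cache_on, send_flushes) for a newly activated vdev.

        whole_disk:           ZFS was given the whole disk, not a slice
        wce:                  the disk reports its write cache as enabled
        accepts_cache_change: the disk accepts changing its cache setting
        """
        if whole_disk and not wce:
            # ZFS asks for the write cache to be turned on; this only sticks
            # if the disk accepts the change.
            wce = accepts_cache_change
        # Cache flushes are sent if and only if the write cache is believed
        # to be on, whether it started that way or ZFS turned it on.
        return wce, wce

    def flush_rejected(write_cache_on):
        # If the disk rejects SYNCHRONIZE CACHE, no further flushes are sent.
        return write_cache_on, False

    # whole disk, cache off, accepts change -> cache on, flushes sent
    print(activate_disk(True, False, True))
    # slice, cache on, rejects change -> cache on, flushes still sent
    print(activate_disk(False, True, False))
    # whole disk, cache off, rejects change -> cache off, no flushes
    print(activate_disk(True, False, False))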
Sidebar: a layering technicality
I'm blurring together ZFS itself and the Solaris generic SCSI layer (which supports real SCSI, SATA, iSCSI, and probably other transports). ZFS itself blindly tries to turn on write caches and send flushes; it is the SCSI layer that checks for the disk already being WCE and doesn't send cache flushes to WCD drives or drives that have rejected them.
The ZFS tunable controls the cache flush behavior at the ZFS layer. Modern versions of OpenSolaris also have additional controls at the SCSI layer to selectively not send cache flushes to certain devices; details are covered in the evil tuning guide.
2010-06-02
A ZFS feature wish: rewriting read errors
Today's missing ZFS feature is most easily described by telling you about the problem. Suppose that you have a redundant pool (okay, a redundant vdev) and one of the disks in it develops some bad sectors that can't be read. My current understanding is that this is not a 'replace disk immediately' sign the way that write errors are, and thus can happen on otherwise healthy and usable disks.
(A persistent write error is a 'replace disk immediately' sign because it means that the disk has run out of spare sectors to remap bad sectors to. Modern disks have quite a lot of spare sectors, so seeing an actual error means that the disk has already had quite a lot of silent write errors that it's fixed up for you.)
Now, you'd like to fix the problem. At the hard drive level, the way to do this is to rewrite the sector so that the hard drive recognizes it as bad and spares it out. Because your pool is redundant, ZFS can recreate the data that should be there and thus it could rewrite the bad sector with the correct data; in fact, if you had a checksum error instead of a read error ZFS would already have done this.
(The hard drive itself can't silently spare out bad sectors on read because it cannot recreate the data that should be in them.)
If ZFS supported doing this rewriting itself, it could fix the problem rapidly and with minimal impact on IO load and pool redundancy. Without ZFS support for rewriting on read errors, you have to fix the problem by hand, and the only ZFS-level way to do this (that I know of) is by forcing a full resilver of the device. At a minimum this has a significant IO impact.
(Disclaimer: it's possible that I'm wrong about the danger level of read errors on modern SATA disks. And yes, always immediately replacing disks that report any visible errors may be the cautiously safe approach, but in our environment it has various drawbacks that make us avoid it when possible, including user-visible performance issues as things resilver.)
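To make the idea concrete, here is a conceptual sketch in Python of what 'fix a read error from redundancy' means for a two-way mirror; plain files stand in for the disks, and this is only an illustration of the concept, not something ZFS does today.

    # A conceptual sketch of 'repair a read error from redundancy' for a
    # two-way mirror. Plain files stand in for the two disks; real ZFS would
    # also checksum the data and talk to actual devices. Purely illustrative.
    BLOCK_SIZE = 4096

    def read_block(path, blockno):
        with open(path, "rb") as f:
            f.seek(blockno * BLOCK_SIZE)
            return f.read(BLOCK_SIZE)

    def write_block(path, blockno, data):
        with open(path, "r+b") as f:
            f.seek(blockno * BLOCK_SIZE)
            f.write(data)

    def mirror_read(primary, secondary, blockno):
        try:
            return read_block(primary, blockno)
        except OSError:
            # The read failed; reconstruct the block from the other side of
            # the mirror and rewrite it in place. On a real disk, this rewrite
            # is what lets the drive spare out the bad sector.
            data = read_block(secondary, blockno)
            write_block(primary, blockno, data)
            return data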