2010-06-11
What I know about ZFS and disk write caches
Many people with ZFS experience have probably heard that ZFS is designed to work safely with disks that have write caches, and may also have heard that it does this with cache flush commands. The details of what gets done when, and under what circumstances, are a bit complicated, though.
First, I'm only talking about the situation where you give ZFS whole disks, instead of slices. When you do this, ZFS marks the device as being a whole disk in the pool configuration, and the remaining magic is driven from that marker.
When the disk is brought into use as part of an active pool (through system boot, pool import, being added to the pool, etc), ZFS will ask the disk for its current write cache settings. If the disk has its write cache disabled ('WCD'), ZFS will try to enable it. If the disk reports that it has the write cache enabled ('WCE'), it's just left that way.
(At least according to the OpenSolaris source code, ZFS does not attempt to disable the write cache if it's working on a slice of a disk; it just leaves the write cache state alone.)
In operation, ZFS sends cache flushes if and only if it believes that the disk has its write cache enabled, either because the disk was WCE to start with or because ZFS successfully turned on the write cache. ZFS sends these cache flushes regardless of the 'whole disk' state, so they will get sent even if ZFS is using a slice on a disk that had its write cache turned on to start with.
(For SCSI-like disks, Solaris will notice if the disk rejects SYNCHRONIZE CACHE operations and will quietly stop trying to issue more.)
There is an evil tuning option to make ZFS not issue cache flush commands at all. Even without the tuning option on, ZFS will not issue cache flushes at all in at least two cases: if the disk reports that it is WCD and rejects attempts to change this, or if cache flush attempts get rejected by the disk.
(Some SAN RAID arrays apparently have options to reject cache flush commands.)
On the other hand, if your disk reports WCE but also rejects attempts to change its write cache state, ZFS will still send cache flushes. That your disk would reject the MODE SELECT doesn't matter, because the MODE SELECT never gets sent (the disk already reports WCE); besides, ZFS ignores errors from trying to enable the write cache anyway.
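The cases above can be condensed into one small decision function. This is only a Python sketch of the behaviour as described here, not anything taken from the Solaris source; the function name and all of its parameters are invented for illustration.

```python
def flushes_keep_flowing(reports_wce, accepts_cache_change, accepts_flush,
                         whole_disk=True, flushes_tuned_off=False):
    """Return True if cache flushes keep being sent to the disk.

    All names here are hypothetical; this models the prose, not real code.
    """
    if flushes_tuned_off:
        # The tuning option stops ZFS from issuing flushes at all.
        return False

    believed_wce = reports_wce
    if whole_disk and not believed_wce:
        # ZFS tries to turn the write cache on. A failure is ignored and
        # the disk is simply still considered WCD.
        believed_wce = accepts_cache_change

    # Flushes only go to disks believed to be WCE, and they stop once
    # the disk has rejected one.
    return believed_wce and accepts_flush


# WCD disk that refuses to change its write cache state: no flushes.
print(flushes_keep_flowing(False, False, True))   # False
# WCE disk that refuses to change its write cache state: still flushed.
print(flushes_keep_flowing(True, False, True))    # True
```

Note that `whole_disk` only matters for a disk that starts out WCD; a slice on a disk that was already WCE gets flushes just like a whole disk does.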
Sidebar: a layering technicality
I'm blurring together ZFS itself and the Solaris generic SCSI layer (which supports real SCSI, SATA, iSCSI, and probably other transports). ZFS itself blindly tries to turn on write caches and send flushes; it is the SCSI layer that checks for the disk already being WCE and doesn't send cache flushes to WCD drives or drives that have rejected them.
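To make the split of responsibilities concrete, here is a toy Python model of the two layers. Every class, method, and attribute name in it is made up for this sketch (including `commit()` as a stand-in for whatever makes ZFS want a flush); it mirrors only the division of labour described above, not the actual kernel code.

```python
class Disk:
    """A made-up disk model; every attribute name here is hypothetical."""
    def __init__(self, wce, accepts_mode_select=True, accepts_flush=True):
        self.wce = wce
        self.accepts_mode_select = accepts_mode_select
        self.accepts_flush = accepts_flush
        self.mode_selects_seen = 0
        self.flushes_seen = 0


class ScsiLayer:
    """Stands in for the generic SCSI layer, which does the checking."""
    def __init__(self, disk):
        self.disk = disk
        self.flush_rejected = False

    def enable_write_cache(self):
        if self.disk.wce:
            return True                 # already WCE: no MODE SELECT is sent
        self.disk.mode_selects_seen += 1
        if self.disk.accepts_mode_select:
            self.disk.wce = True
        return self.disk.wce

    def flush_cache(self):
        if not self.disk.wce or self.flush_rejected:
            return                      # WCD or previously rejected: skip
        self.disk.flushes_seen += 1     # the flush command reaches the disk
        if not self.disk.accepts_flush:
            self.flush_rejected = True  # quietly stop issuing more


class ZfsLayer:
    """Stands in for ZFS, which asks blindly and ignores the answers."""
    def __init__(self, scsi, whole_disk, flushes_tuned_off=False):
        self.scsi = scsi
        self.flushes_tuned_off = flushes_tuned_off
        if whole_disk:
            self.scsi.enable_write_cache()   # result deliberately ignored

    def commit(self):
        if not self.flushes_tuned_off:       # the ZFS-level tunable
            self.scsi.flush_cache()


# A WCD disk that refuses to enable its cache: one attempt, no flushes.
disk = Disk(wce=False, accepts_mode_select=False)
zfs = ZfsLayer(ScsiLayer(disk), whole_disk=True)
zfs.commit()
print(disk.mode_selects_seen, disk.flushes_seen)   # 1 0
```

In this model the ZFS layer holds no per-disk state beyond the tunable; everything that decides whether a command actually reaches the disk lives in the SCSI layer.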
The ZFS tunable controls the cache flush behavior at the ZFS layer. Modern versions of OpenSolaris also have additional controls at the SCSI layer to selectively not send cache flushes to certain devices; details are covered in the evil tuning guide.