A confession about our ZFS configuration

November 13, 2011

When we initially set up our ZFS-based fileservers, we didn't know all of the things about disk write caches on our iSCSI backends that we know now. Plus, we had read all of that stuff about 'ZFS deals with life even if your disks aren't really trustworthy' and we believed it. Thus we assumed that everything was fine: our iSCSI backends weren't doing any caching, and even if something had gone wrong and they were, ZFS was supposed to cope with it. We were wrong.

On the ZFS side, ZFS does not insist that you turn disk write caches off for it, but it definitely assumes that it can reliably flush writes to disk one way or another. On the backend side, while the iSCSI software wasn't caching writes itself, it neither turned off the write caches on the disks nor passed cache flush operations through to the disks (cf). This meant that we actually had low level write caching (with no forced flushes) happening without knowing it. When we realized this we immediately started working out how to fix it, both by turning off the disk write caches and with an experimental iSCSI backend patch that passed cache flush operations through to the physical disks. We also tested the performance impact of each option. The answer: both fixes worked, but the resulting performance kind of sucked (and was worse for the 'all write caches off' option).

(In fact the performance drop was a good sign that our fixes really worked, i.e. that they were making the disks write some or all things synchronously.)
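The idea that a performance drop signals real synchronous writes can be turned into a rough check. What follows is only a sketch under stated assumptions, not what we actually ran: it assumes a simple rotating disk where a write that is truly committed to the same location costs roughly one platter rotation, and all of the function names and the `slack` factor are invented for illustration. Filesystems, RAID controllers with battery-backed caches, and SSDs can all make the numbers mean something different.

```python
import os
import tempfile
import time


def max_sync_rate(rpm):
    """Rough ceiling on one-at-a-time committed writes per second to one
    location on a rotating disk: about one per platter rotation."""
    return rpm / 60.0


def measure_fsync_rate(directory, count=200, block=b"x" * 512):
    """Overwrite the same small block `count` times, calling fsync after
    each write, and return the observed writes per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.pwrite(fd, block, 0)   # always rewrite offset 0
            os.fsync(fd)              # ask the OS to push it to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return count / max(elapsed, 1e-9)


def looks_cached(rate, rpm, slack=2.0):
    """Heuristic only: an fsync rate well above the rotational ceiling
    suggests that something below us is acknowledging writes from a cache."""
    return rate > slack * max_sync_rate(rpm)


if __name__ == "__main__":
    rate = measure_fsync_rate(".")
    print("fsync'd writes/sec: %.0f" % rate)
    print("suspiciously fast for a 7200 RPM disk:", looks_cached(rate, 7200))
```

Run in a directory on the storage in question, a result far above the rotational ceiling hints that flushes are not reaching the platters; a result near or below it is consistent with (but does not prove) genuinely synchronous writes.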

We had a big discussion among ourselves and even had a chance to discuss this with some ZFS experts, and our conclusion was that we were going to continue to run with disk write caches enabled. Yes, really. In our specific configuration we would need an unlikely cascade of problems to lose data or an entire pool, and the very low chance of hitting this set of circumstances is not worth the significant write performance degradation we would incur (all of the time) in order to avoid it.

(That qualification is obviously very, very important. You should not even think about doing this unless you have conducted a careful analysis of your own configuration.)

The way you lose data to write ordering issues is that the latest ZFS uberblock is not actually written to disk; the way you lose your pool is that the metadata is not written when the uberblock is. But uberblocks are highly replicated and metadata is somewhat replicated internally. On top of that, all of our pools are at least two-way mirrors, and all of our iSCSI backend disk enclosures are on their own UPS (and only the enclosures, not the iSCSI servers or the ZFS fileservers). If so much as a single copy of the current uberblock or a single copy of the metadata makes it to disk, we survive. For this not to happen, all of the pool's disks would have to lose power essentially simultaneously; either two separate disk enclosures would have to suffer power supply failures at once, or we would have to have a power failure followed by near immediate failure of both UPSes.
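The failure cascade described above can be written down as a tiny model. This is only an illustration of the reasoning, with assumptions that go beyond what is stated here: it supposes that each side of a mirror lives in a different enclosure and that each enclosure has its own power supply and its own UPS. The `Enclosure` class and `pool_survives` function are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Enclosure:
    """Power state of one disk enclosure (hypothetical model)."""
    mains_ok: bool = True   # building power is present
    ups_ok: bool = True     # the enclosure's UPS can carry the load
    psu_ok: bool = True     # the enclosure's own power supply works

    def disks_powered(self):
        # Disks stay up if the power supply works and it is fed by
        # either mains power or the UPS.
        return self.psu_ok and (self.mains_ok or self.ups_ok)


def pool_survives(enclosures):
    """The pool survives if at least one mirror side keeps power long
    enough to write its cached uberblock and metadata to the platters."""
    return any(e.disks_powered() for e in enclosures)


if __name__ == "__main__":
    # Power failure, both UPSes hold: fine.
    print(pool_survives([Enclosure(mains_ok=False), Enclosure(mains_ok=False)]))
    # Power failure followed by immediate failure of both UPSes: at risk.
    print(pool_survives([Enclosure(mains_ok=False, ups_ok=False),
                         Enclosure(mains_ok=False, ups_ok=False)]))
```

In this model, any single failure leaves at least one enclosure powered; only the two double-failure cases from the text (both power supplies at once, or a power failure plus both UPSes) make `pool_survives` return false.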

(The UPSes don't have to run for very long, just long enough for the disks to write their onboard caches to the platters. Since the iSCSI backends are not on UPSes, the moment the power fails there is no further IO coming in to the disks.)

Note that iSCSI backend crashes or power losses cannot cause problems because the iSCSI backends themselves have no write caches; only the disks do. By the time an iSCSI backend acknowledges the write back to a ZFS fileserver (and the fileserver thinks the write has been committed), the write has been issued to the physical disk and in fact the physical disk has claimed it was done.

Last modified: Sun Nov 13 01:54:07 2011