2011-11-13
Thinking about how to test our UPSes
In light of my confession about our handling of ZFS disk sync, one of the vital things in our environment is working UPSes for the iSCSI backend disks. Without UPSes, a building power failure would cause both sides of a mirror to lose power simultaneously, which removes our protection against losing in-flight ZFS uberblock and metadata writes.
Now, one of the problems with UPSes is that their batteries eventually wear out. Often you get no advance warning about this having happened; you only find out when you lose main power and the UPS immediately shuts down. This means that you need to test UPSes every so often to make sure that they still work. For obvious reasons you don't want to do this live, with a production machine depending on the UPS.
(I would argue that you don't really want to do this even if your production machines have separately powered dual power supplies, but it's fuzzy.)
Our iSCSI backend disk shelves don't have dual power supplies, but they do have the next best thing: they're all on automatic transfer switches, with the primary feed coming from line power and the secondary feed coming from the UPS (we did this after running into previous UPS problems). This means that in theory we can actually test all of our UPSes without having to schedule a downtime.
To do the testing we would first put an extra, unused UPS into every rack and test it to ensure that its battery was good. Then for each disk unit we would move the UPS side of its transfer switch over to this new known-good UPS, test the disk unit's normal UPS, and when it passes move the UPS side back to the normal UPS. At least in theory our exposure would be limited to having a power failure in the small interval between unplugging a disk unit's UPS power feed from one UPS and plugging it into another.
(This is me thinking aloud. I don't know if we'll actually do this or if we'll want to schedule a downtime for the testing just in case something goes wrong. Testing with a scheduled downtime is clearly safer, because we can take the fileservers down so that there won't be problems if a backend's disks abruptly lose power due to some problem. Sometimes system administration is about tradeoffs and the balance of risks.)
A confession about our ZFS configuration
When we initially set up our ZFS-based fileservers, we didn't know all of the things about disk write caches on our iSCSI backends that we now know. Plus, we read all of that stuff about 'ZFS deals with life even if your disks aren't really trustworthy' and we believed it. Thus we assumed that everything was fine; our iSCSI backends weren't doing any caching, and even if something had gone wrong and they were, ZFS was supposed to cope with it. We were wrong.
On the ZFS side, ZFS does not insist that you turn disk write caches off for it, but it definitely assumes that it can reliably flush writes to disk one way or another. On the backend side, while the iSCSI software wasn't caching writes itself, it didn't turn off the write caches on the disks or pass cache flush operations through to the disks (cf). This meant that we actually had low-level write caching (with no forced flushes) happening without knowing it. When we realized this we immediately started working out how to fix it, both by turning off the disk write caches and with an experimental iSCSI backend patch that passed cache flush operations through to the physical disks. We also tested what performance impact both options had. The answer: both fixes worked, but the resulting performance kind of sucked (worse for the 'all write caches off' option).
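(As a concrete illustration of the 'turn the disk write caches off' option, here is a minimal sketch of the sort of thing involved. It assumes Linux iSCSI backends with sdparm installed and disks that show up as /dev/sd*; it is not our actual tooling, just the general shape of checking and clearing the SCSI WCE bit.)

    #!/usr/bin/python
    # A minimal sketch, not actual production tooling: report (and optionally
    # turn off) the WCE ('write cache enable') bit on each disk by shelling
    # out to sdparm.  Assumptions: Linux backends, sdparm installed, and the
    # disks showing up as /dev/sd*.
    import glob
    import subprocess
    import sys

    def wce_enabled(disk):
        # 'sdparm --get=WCE <disk>' reads the SCSI caching mode page; its
        # output looks roughly like 'WCE   1  [cha: y, def: 1, sav: 1]'.
        out = subprocess.check_output(["sdparm", "--get=WCE", disk],
                                      universal_newlines=True)
        for line in out.splitlines():
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "WCE":
                return fields[1] == "1"
        return False

    def disable_wce(disk):
        # '--save' asks the disk to remember the setting across power cycles,
        # where the disk supports saved mode pages.
        subprocess.check_call(["sdparm", "--set=WCE=0", "--save", disk])

    if __name__ == "__main__":
        fix = "--disable" in sys.argv[1:]
        for disk in sorted(glob.glob("/dev/sd[a-z]")):
            on = wce_enabled(disk)
            print("%s: write cache %s" % (disk, "ON" if on else "off"))
            if on and fix:
                disable_wce(disk)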
(In fact the performance drop was a good sign that our fixes really worked, i.e. that they were making the disks write some or all things synchronously.)
We had a big discussion among ourselves and even had a chance to discuss this with some ZFS experts, and our conclusion was that we were going to keep running with the disk write caches on. Yes, really. In our specific configuration it would take an unlikely cascade of problems for us to lose data or an entire pool, and the very low chance of hitting that set of circumstances is not worth the significant write performance degradation we would incur (all of the time) in order to avoid it.
(That qualification is obviously very, very important. You should not even think about doing this unless you have conducted a careful analysis of your own configuration.)
The way you lose data to write ordering issues is that the latest ZFS uberblock is not actually written to disk; the way you lose your pool is that the metadata is not written when the uberblock is. But uberblocks are highly replicated and metadata is somewhat replicated internally, all of our pools are at least two-way mirrors, and all of our iSCSI backend disk enclosures are on their own UPS (and only the enclosures, not the iSCSI servers or the ZFS fileservers). If so much as a single copy of the current uberblock or a single copy of the metadata makes it to disk, we survive. For this not to happen, all of the pool's disks would have to lose power essentially simultaneously; either two separate disk enclosures would have to suffer power supply failures at once, or we would have to have a power failure followed by near-immediate failure of both UPSes.
(The UPSes don't have to run for very long, just long enough for the disks to write their onboard caches to the platters. Since the iSCSI backends are not on UPSes, the moment the power fails there is no further IO coming in to the disks.)
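(To put a rough, purely illustrative number on 'not very long', here is the back of the envelope arithmetic; the figures are made-up assumptions, not measurements of our hardware.)

    # Back of the envelope sketch with purely illustrative numbers (not
    # measurements of real hardware): how long a disk needs power in order
    # to drain its onboard write cache to the platters.
    cache_mb = 32           # assumed per-disk onboard cache size, in MB
    drain_mb_per_sec = 20   # assumed pessimistic drain rate for scattered writes

    drain_seconds = float(cache_mb) / drain_mb_per_sec
    print("worst case cache drain: about %.1f seconds per disk" % drain_seconds)
    # Even with these pessimistic assumptions, the UPSes only need to carry
    # the enclosures for a couple of seconds after the power fails.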
Note that iSCSI backend crashes or power losses cannot cause problems because the iSCSI backends themselves have no write caches; only the disks do. By the time an iSCSI backend acknowledges the write back to a ZFS fileserver (and the fileserver thinks the write has been committed), the write has been issued to the physical disk and in fact the physical disk has claimed it was done.