Thinking about how to test our UPSes

November 13, 2011

In light of my confession about our handling of ZFS disk sync, one of the vital things in our environment is working UPSes for the iSCSI backend disks. Without UPSes, a building power failure would cause both sides of a mirror to lose power simultaneously, which removes our protection against losing in-flight ZFS uberblock and metadata writes.

Now, one of the problems with UPSes is that their batteries eventually wear out. Often you get no advance warning about this having happened; you only find out when you lose main power and the UPS immediately shuts down. This means that you need to test UPSes every so often to make sure that they still work. For obvious reasons you don't want to do this live, with a production machine depending on the UPS.

(I would argue that you don't really want to do this even if your production machines have separately powered dual power supplies, but it's fuzzy.)

Our iSCSI backend disk shelves don't have dual power supplies, but they do have the next best thing: they're all on automatic transfer switches, with the primary power feed coming from line power and the secondary power feed coming from the UPS (we did this after running into previous UPS problems). This means that in theory we can actually test all of our UPSes without having to schedule a downtime.

To do the testing we would first put an extra, unused UPS into every rack and test this UPS to ensure that its battery was good. Then for each disk unit's UPS we would move the UPS side of the disk unit's transfer switch to this new known-good UPS, test the disk unit's normal UPS, and when it passes move the disk unit's UPS side back to its normal UPS. At least in theory our exposure would be limited to having a power failure in the small interval between unplugging a disk unit's UPS power feed from one UPS and plugging it into another.
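To make the ordering concrete, here is a rough sketch that turns the procedure above into a per-rack checklist. It's only an illustration: the function name, the UPS and shelf names, and the idea that several disk units might share one UPS are all my assumptions, and it touches no hardware, it just prints the steps in the order you would do them by hand.

```python
from collections import defaultdict


def ups_test_plan(spare_ups, unit_to_ups):
    """Return the ordered manual steps for testing every UPS in one rack.

    spare_ups   -- name of the extra, unused UPS added to the rack (hypothetical)
    unit_to_ups -- maps each disk unit name to the name of its normal UPS
                   (hypothetical names; several units may share one UPS)
    """
    # Group disk units by the UPS that normally feeds their transfer switch.
    by_ups = defaultdict(list)
    for unit, ups in unit_to_ups.items():
        by_ups[ups].append(unit)

    # The spare has to be known-good before anything is moved onto it.
    steps = [f"test spare {spare_ups}; stop if its battery fails"]

    for ups in sorted(by_ups):
        units = sorted(by_ups[ups])
        # Move the UPS side of each transfer switch over to the spare.
        for unit in units:
            steps.append(f"move UPS feed of {unit}: {ups} -> {spare_ups}")
        # Nothing depends on this UPS now, so it can be tested.
        steps.append(f"test {ups}; only continue once it passes")
        # Put everything back so the spare is free for the next UPS.
        for unit in units:
            steps.append(f"move UPS feed of {unit}: {spare_ups} -> {ups}")
    return steps


if __name__ == "__main__":
    rack = {"shelf1": "ups-a", "shelf2": "ups-b"}  # made-up example rack
    for n, step in enumerate(ups_test_plan("spare-ups", rack), 1):
        print(f"{n}. {step}")
```

Every "move UPS feed" line is one of the small windows of exposure: between unplugging and replugging, the disk unit is running on line power alone.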

(This is me thinking aloud. I don't know if we'll actually do this or if we'll want to schedule a downtime for the testing just in case something goes wrong. Testing with a scheduled downtime is clearly safer, because we can take the fileservers down so that there won't be problems if a backend's disks abruptly lose power due to some problem. Sometimes system administration is about tradeoffs and the balance of risks.)

Written on 13 November 2011.

Last modified: Sun Nov 13 23:41:36 2011