A sysadmin learning experience courtesy of some UPS issues

October 30, 2020

Back in the summer I bought a reasonably nice UPS for my home setup (cf), one that's big enough to give my relatively modest setup more than an hour of runtime if the power went out. Last night my area had its first power failure since I'd put the UPS into place. At first everything was great and I was quite happy, but then things went not so well.

First, just a minute later my home machine abruptly went into a system shutdown, eventually powering off. This was despite there being plenty of UPS battery power remaining. Second, the UPS itself counted down ten minutes and then shut all power off (turning it back on ten seconds later), even though power came back on within those ten minutes (the actual power outage lasted only about six minutes, which is typical for my area).

These behaviors were what you would call highly undesirable. They converted what would have otherwise been a non-event into one where I risked data loss from having programs abruptly terminated (including on the other side of SSH sessions), and where my DSL link went down and had to be slowly re-established because power was cut to the DSL modem. Fortunately (sort of), both of these behaviors turned out to be due to the vendor's UPS software, so I was able to fix them.
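(As a rough illustration of why these defaults hurt, here is a minimal Python sketch of the two behaviors described above. It is only a toy model, not the vendor software's actual logic; the `Policy` fields, the `simulate` function, and the exact numbers are assumptions made up for this example, with the 3600 second runtime standing in for 'more than an hour'.)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    """Hypothetical shutdown policy; these are not real vendor setting names."""
    host_shutdown_after_s: Optional[float]  # host shuts down after this long on battery
    ups_off_after_s: Optional[float]        # UPS cuts its output this long after power loss
    countdown_cancels_on_restore: bool      # does returning power cancel the UPS countdown?


def simulate(policy: Policy, outage_s: float, runtime_s: float) -> dict:
    """Report what one outage does to the host and to everything on the UPS."""
    battery_exhausted = outage_s > runtime_s
    # A countdown that is not cancelled fires even if power is already back.
    countdown_fires = policy.ups_off_after_s is not None and (
        outage_s >= policy.ups_off_after_s
        or not policy.countdown_cancels_on_restore
    )
    output_cut = battery_exhausted or countdown_fires
    host_shutdown = (
        policy.host_shutdown_after_s is not None
        and outage_s > policy.host_shutdown_after_s
    )
    return {"host_down": host_shutdown or output_cut, "output_cut": output_cut}


# Roughly what I saw: host shutdown after a minute, UPS power cut after ten
# minutes regardless, during a six minute outage.
observed = simulate(Policy(60, 600, False), outage_s=360, runtime_s=3600)

# An assumed 'ride it out' policy: no timed shutdowns at all.
relaxed = simulate(Policy(None, None, True), outage_s=360, runtime_s=3600)

print(observed)
print(relaxed)
```

With these assumed numbers, the first policy takes down both the host and everything else on the UPS during a six minute outage, while the second rides through it untouched.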

How did I get into this situation, where my UPS had some extremely surprising behavior during a power failure? The simple answer is that I never tested my UPS in its final system configuration. Although I did test how my UPS worked when I first set it up, that was before I installed the vendor's UPS software. When I installed the software, I tacitly assumed both that its default behavior was sensible and that it wouldn't change what the UPS itself did. It turns out that neither was true.

(I say that I tacitly assumed this because I didn't even think about it. It never occurred to me that I should re-test a power loss scenario now that I had the software set up.)

This is a general rule that I keep having to re-learn: you need to test the final system configuration, or as close to it as you can get. Harmless differences and changes in system setup aren't always harmless (for instance, how Linux distribution installers behave can vary between virtual hardware and real hardware). You can often get away with it (which unfortunately encourages doing 'close enough' testing), but every once in a while you'll get burned.

(This elaborates on some tweets of mine.)

Last modified: Fri Oct 30 00:04:45 2020