A hazard of our old version of OmniOS: sometimes powering off doesn't

June 26, 2019

Two weeks ago, I powered down all of our OmniOS fileservers that are now out of production, which is most of them. By that, I mean that I logged in to each of them via SSH and ran 'poweroff'. The machines disappeared from the network and I thought nothing more of it.

This Sunday morning we had a brief power failure. In the aftermath of the power failure, three out of four of the OmniOS fileservers reappeared on the network, which we knew mostly because they sent us some email (there were no bad effects of them coming back). When I noticed them back, I assumed that this had happened because we'd set their BIOSes to 'always power on after a power failure'. This is not too crazy a setting for a production server you want up at all costs because it's a central fileserver, but it's obviously no longer the setting you want once they go out of production.

Today, I logged in to the three that had come back, ran 'poweroff' on them again, and then later went down to the machine room to pull out their power cords. To my surprise, when I looked at the physical machines, they had little green power lights that claimed they were powered on. When I plugged in a roving display and keyboard to check their state, I discovered that all three were still powered on and sitting there displaying an OmniOS console message to the effect that they were powering off. Well, they might have been trying to power off, but they weren't achieving it.

I rather suspect that this is what happened two weeks ago, and why these machines all sprang back to life after the power failure. If OmniOS never actually powered the machines off, even a BIOS setting of 'resume last power state after a power failure' would have powered the machines on again, which would have booted OmniOS back up again. Two weeks ago, I didn't go look at the physical servers or check their power state through their lights out management interface; it never occurred to me that 'poweroff' on OmniOS sometimes might not actually power the machine off, especially when the machines did drop off the network.

(One out of the four OmniOS servers didn't spring back to life after the power failure, and was powered off when I looked at the hardware. Perhaps its BIOS was set very differently, or perhaps OmniOS managed to actually power it off. They're all the same hardware and the same OmniOS version, but the server that probably managed to power off had no active ZFS pools on our iSCSI backends; the other three did.)

At this point, this is only a curiosity. If all goes well, the last OmniOS fileserver will go out of production tomorrow evening. It's being turned off as part of that, which means that I'm going to have to check that it actually powered off (and I'd better add that to the checklist I've written up).
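
A checklist item like this could in principle be automated by asking the lights out management interface for the real power state instead of trusting that the machine dropped off the network. What follows is only a minimal sketch, not what we actually run. It assumes the servers' BMCs speak IPMI over LAN and that 'ipmitool' is installed on the machine doing the check; the function names, the BMC hostname, the user name, and the password file path are all hypothetical.

```python
import subprocess


def parse_power_status(output: str) -> bool:
    """Return True if 'ipmitool chassis power status' output reports power on.

    ipmitool normally prints 'Chassis Power is on' or 'Chassis Power is off'.
    Anything else is treated as an error rather than guessed at.
    """
    text = output.strip().lower()
    if text.endswith("is on"):
        return True
    if text.endswith("is off"):
        return False
    raise ValueError(f"unrecognized power status output: {output!r}")


def chassis_is_on(bmc_host: str, user: str, password_file: str) -> bool:
    """Ask the BMC (not the OS) whether the chassis is powered on.

    Assumes an IPMI-over-LAN capable BMC; the password is read by ipmitool
    from password_file (its -f option) so it stays off the command line.
    """
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
         "-f", password_file, "chassis", "power", "status"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return parse_power_status(result.stdout)
```

After running 'poweroff' and giving the machine a few minutes, something like `chassis_is_on("fs1-bmc.example.org", "admin", "/root/.ipmipass")` should come back false. If it's still true, the machine is in the same state ours were in: off the network, but not actually off.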

Written on 26 June 2019.

Last modified: Wed Jun 26 01:01:17 2019