The challenge of what to set server BIOSes to do on power loss

June 16, 2021

Modern PC BIOSes, including server BIOSes, almost always have a setting for what the machine should do if the power is lost and then comes back. Generally your three options are 'stay powered off', 'turn on', and 'stay in your last state'. Lately I've been realizing that none of them are ideal in our current 'work from home' environment, and the general problem is probably unsolvable without internal remote power control.

In the normal course of events, what we want while working from home is for servers to stay in their last power state. If the power is lost and then comes back, running servers will power back up but servers that we've shut down to take out of service will stay off. If we set servers to 'turn on', we would have to remember to take servers out of service by powering down their outlet on our smart PDU, not just by telling them to halt and power off at the OS level. And of course if we had them set to 'stay powered off', we would have to go in to power them up by hand.

But a power loss is not the only case where we might have to take servers down temporarily. We've had one or two scares with machine room air conditioning, and if we had a serious AC issue we would have to (remotely) turn machines off to reduce the heat load. If we turn machines off remotely from the OS level, the BIOS setting of 'stay in your last state' doesn't give us any straightforward way of turning them back on, even with a smart PDU; if we toggle outlet power at the smart PDU, the server BIOS will say 'well I was powered off before so I will stay powered off'. What we need to recover from this situation is what I called internal remote power control, where we can remotely command the machine to turn on.
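To make the tradeoff concrete, here is a small Python sketch of the three BIOS settings and the situations discussed so far. It is a deliberately simplified model and not anything we actually run; the names `Policy`, `on_after_power_returns`, `SCENARIOS`, and `mismatches` are made up for illustration, and it assumes the BIOS only looks at whether the machine was running at the moment its outlet lost power.

```python
from enum import Enum


class Policy(Enum):
    """The three usual BIOS 'after power loss' settings."""
    STAY_OFF = "stay powered off"
    TURN_ON = "turn on"
    LAST_STATE = "stay in your last state"


def on_after_power_returns(policy: Policy, was_running: bool) -> bool:
    """Is the server running once outlet power comes back?

    was_running: whether the machine was powered on at the moment
    its outlet lost power (assumed to be all the BIOS considers).
    """
    if policy is Policy.TURN_ON:
        return True
    if policy is Policy.STAY_OFF:
        return False
    return was_running  # LAST_STATE


# scenario name -> (was the server running when outlet power dropped?,
#                   do we want it running once power is back?)
SCENARIOS = {
    "power loss, server in service": (True, True),
    "power loss, server halted to take out of service": (False, False),
    "AC emergency: halted from the OS, then PDU outlet toggled": (False, True),
    "AC emergency: PDU outlet switched off while running, then on": (True, True),
}


def mismatches(policy: Policy) -> list[str]:
    """Scenarios where this setting does not give us what we want."""
    return [
        name
        for name, (was_running, wanted) in SCENARIOS.items()
        if on_after_power_returns(policy, was_running) != wanted
    ]


if __name__ == "__main__":
    for policy in Policy:
        print(f"{policy.value!r}: wrong in {mismatches(policy)}")
```

In this model every setting gets at least one scenario wrong, which is the sense in which none of them are ideal. The case that 'stay in your last state' gets wrong is exactly the one where we halt machines from the OS level and then can only toggle outlet power.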

Right now, if we had an AC issue we would probably have to remember to turn machines off through our smart PDUs instead of at the OS level. With our normal BIOS settings, this would let us remotely restart them through the smart PDU afterward, because each machine's last state would be 'powered on' when its outlet power came back. Since this is very different from our normal procedure for powering off machines, I can only hope that we'd remember to do it under the pressure of a serious AC issue.

(Smart PDUs have a few issues. First, not all of our machines are on them, because we don't have enough smart PDUs or enough outlets on them. Second, when you power off a machine this way you're trusting your mapping between PDU ports and actual machines. We think our mapping is trustworthy, but we'd rather not find out the hard way.)

