Wandering Thoughts archives

2024-09-08

I should probably reboot BMCs any time they behave oddly

Today on the Fediverse I said:

It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.

(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)

I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.

We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to each BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.

Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').
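The sequence above (check power status, ask for power on, and only then cold reset the BMC) can be sketched as a small Python wrapper around ipmitool. This is an illustration under assumptions, not what was actually run here: the function names, the wait times, and the host and user in the usage note are all hypothetical, and it assumes the BMC speaks IPMI over LAN ('lanplus') with the password supplied in the IPMI_PASSWORD environment variable (ipmitool's -E option). It deliberately does nothing if the system already reports power on, since resetting the BMC of a running system has potential dangers.

```python
import subprocess
import time


def ipmi_argv(host, user, *command):
    """Build an ipmitool command line for IPMI over LAN.

    -E tells ipmitool to read the password from the IPMI_PASSWORD
    environment variable instead of taking it on the command line.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *command]


def run_ipmi(host, user, *command):
    """Run one ipmitool command and return its standard output."""
    result = subprocess.run(
        ipmi_argv(host, user, *command),
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout.strip()


def power_is_on(status_output):
    """Interpret 'chassis power status' output such as 'Chassis Power is on'."""
    return status_output.strip().lower().endswith("is on")


def recover(host, user, run=run_ipmi, wait=time.sleep, settle=120):
    """Try a plain power on first; cold reset the BMC only if that fails.

    Returns the list of actions taken. 'settle' is a guess at how long the
    BMC needs to come back after a cold reset; tune it for real hardware.
    """
    steps = []
    if power_is_on(run(host, user, "chassis", "power", "status")):
        return steps  # system is up; leave the BMC alone

    run(host, user, "chassis", "power", "on")
    steps.append("power on")
    wait(10)
    if power_is_on(run(host, user, "chassis", "power", "status")):
        return steps

    # The BMC accepted the command but the power stayed off: reset the BMC.
    run(host, user, "mc", "reset", "cold")
    steps.append("bmc cold reset")
    wait(settle)

    # The machine may power on by itself once the BMC is back; if not, ask again.
    if not power_is_on(run(host, user, "chassis", "power", "status")):
        run(host, user, "chassis", "power", "on")
        steps.append("power on")
    return steps
```

With IPMI_PASSWORD set, something like recover("bmc-host.example.org", "admin") would run the whole sequence against a (hypothetical) BMC. The 'run' and 'wait' parameters exist only so the logic can be exercised without real hardware.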

As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.

(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)

sysadmin/BMCsCanNeedRebootingII written at 23:13:58

