I should probably reboot BMCs any time they behave oddly

September 8, 2024

Today on the Fediverse I said:

It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.

(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)

I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.

We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to their BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.

Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').
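(For next time, the sequence I stumbled through could be sketched as a small Python wrapper around ipmitool. This is only a sketch under assumptions, not what I actually ran. The host name, user name, and wait times are made up, the password is assumed to be in the IPMI_PASSWORD environment variable for ipmitool's '-E' option, and a BMC may well need longer than the guessed delay to come back after a cold reset.)

```python
import subprocess
import time


def ipmi_argv(host, user, *args):
    """Build the ipmitool command line for a BMC reached over the network."""
    # -I lanplus selects IPMI v2.0 over LAN; -E reads the password from
    # the IPMI_PASSWORD environment variable instead of the command line.
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E", *args]


def run_ipmi(host, user, *args):
    """Run one ipmitool command and return its standard output."""
    done = subprocess.run(ipmi_argv(host, user, *args), capture_output=True,
                          text=True, timeout=60, check=True)
    return done.stdout


def parse_power_status(output):
    """Map 'Chassis Power is on/off' to True/False, anything else to None."""
    text = output.strip().lower()
    if text.endswith("is on"):
        return True
    if text.endswith("is off"):
        return False
    return None


def power_on_or_reset_bmc(host, user, run=run_ipmi, sleep=time.sleep,
                          settle=30, bmc_boot=180):
    """Try a normal power on; if the power stays off, cold reset the BMC.

    Only sensible when the system is already down. Returns the list of
    actions taken. The settle and bmc_boot delays are guesses.
    """
    actions = []

    def is_on():
        # An unrecognized status (None) is treated the same as 'off'.
        return parse_power_status(run(host, user, "chassis", "power", "status"))

    if is_on():
        return actions
    run(host, user, "chassis", "power", "on")
    actions.append("power on")
    sleep(settle)
    if is_on():
        return actions
    # The BMC accepted the command but nothing happened, so reset the BMC.
    run(host, user, "mc", "reset", "cold")
    actions.append("bmc cold reset")
    sleep(bmc_boot)
    # With a 'last power state' policy the machine may now power on by itself.
    if not is_on():
        run(host, user, "chassis", "power", "on")
        actions.append("power on")
    return actions
```

(Something like power_on_or_reset_bmc("bmc-host.example.org", "admin") would then do the whole dance. The 'run' and 'sleep' parameters are there so the logic can be exercised against a fake BMC without touching real hardware.)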

As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.

(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)


Comments on this page:

By Anonymous at 2024-09-09 08:44:52:

Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.

True. Fortunately, we now have 'OpenBMC' [1], which may (or may not) alleviate this issue somewhat. [2] [3]

[1] https://github.com/openbmc/openbmc

[2] https://www.phoronix.com/news/NVIDIA-OpenBMC-Contributions

[3] https://www.phoronix.com/review/ampereone-a192-32x

By Miksa at 2024-09-11 07:26:41:

We have a Supermicro server that had a similar problem. Whenever the server was rebooted, the BMC also needed a reboot or the server would get stuck in some POST state. This continued for maybe six months, but for some reason the server has been behaving properly for a long time now. I don't know why; I don't think it has received any BMC or BIOS updates since before the problems started.

The first couple of times I ended up pulling the power cords, which took care of the BMC reboot, before I noticed there was an easier option.

By UnemployedAdmin at 2024-09-12 18:31:12:

This is actually quite common in embedded devices. I've done quite a bit of troubleshooting to get to the bottom of these types of issues with on-premises devices we had.

In almost all cases I've seen in the past 10 years, weird, non-deterministic glitchy behavior involving embedded devices in the workplace was the result of capacity issues, with the runner-up being EOL or hardware component failures.

For example, the most common failure involved a DSL or cable modem, or an unmanaged (sometimes low-end managed) legacy networking appliance, that was set by default to log RX/TX errors along with a substantial amount of metadata.

This logging would trigger whenever dirty electricity cyclically caused an issue at a site.

The embedded devices would in most cases continue logging (internally) until capacity was used up. In some rare cases with specific vendors, these errors can propagate across a site network.

For most devices this is only visible by physically connecting to a functioning UART/JTAG port on the problem device to get a console before lockup. It will lock up once capacity is hit.

Almost all embedded devices (with few exceptions) run a *nix variant under the hood, with tmpfs mapped to volatile memory.

Once the logging fills that up, the glitching starts. Once you reset, the tmpfs is cleared and the counter/logging starts over. How often this happens depends on the combination of the site's environment and the vendor's implementation.

You see effectively the same characteristics with Red Hat servers where some greenhorn didn't set up rotating logs, or didn't properly manage capacity.

The lack of electrical isolation from the PSU is problematic for most computing devices. I've come into some sites with real janky setups, like where someone tried to DIY a BMC with an RPI with a standard wall wart (+no interfaced RT clock). SMH.

By UnemployedAdmin at 2024-09-12 19:54:49:

I should clarify that I too tend to use the terms IPMI and BMC somewhat interchangeably (and imprecisely). My previous response was mostly about embedded devices in general (including IPMI/KVMoIP embedded devices), but I include BMCs in that. We also used Dell iDRACs, which had low-level issues as well, often seemingly triggered by power issues.

Written on 08 September 2024.

Last modified: Sun Sep 8 23:13:58 2024