Your server BMCs may need to be rebooted every so often

January 14, 2023

Over on the Fediverse I said:

A sysadmin tip: if your BMC/IPMI is doing weird things, restart (reboot) it. Server BMCs are little computers running ancient versions of Linux with software that's probably terribly written and they stay running forever, which means all sorts of opportunities for slow bugs. Reboot away!

This is brought to you by the BMC with a KVM-over-IP that wouldn't accept '2' entered on the (virtual) keyboard in any way or form. Until I rebooted the BMC.
PS: Our IP addresses have 2s in them.

(This probably isn't the only weird BMC glitch we've experienced, but it's the first one where I tried rebooting the BMC and that fixed it.)

A number of people shared additional stories in the replies, and I especially 'liked' @frederic@chaos.social's:

Same for IPMI hardware sensors: Thought the motherboard was damaged because half the sensors were reported as "n/a". Rebooting magically fixed this. 🙈

This happens for more or less the reasons I mentioned above. BMCs naturally accumulate very large uptimes because they don't normally reboot when your server reboots; if you don't do anything special, your BMC will normally stay up for as long as the server has power. In many places this can amount to years of uptime, and it's rare software that can stand up to that, even if you don't use the BMC much. Server vendors typically don't want you to think about this, and I don't believe 'BMC uptime' is generally exposed anywhere.

(Routinely querying the BMC's sensor readings via IPMI may actually make this worse, since then the BMC's software is active to answer those queries. I should probably make our metrics system notice when a server decreases the number of IPMI metrics it exposes without a reboot.)
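As a rough sketch of that check, here is one way it could look in Python. Everything here is an assumption for illustration: the names (count_readable_sensors, Sample, sensors_dropped) are made up, the parsing assumes the pipe-separated text that 'ipmitool sensor' usually prints with 'na' for a missing reading, and the host's boot time is assumed to come from whatever your metrics system already records.

```python
from dataclasses import dataclass


def count_readable_sensors(sensor_output: str) -> int:
    """Count sensors that report a usable reading.

    Assumes `ipmitool sensor`-style lines: "name | reading | unit | status | ...",
    where an unavailable reading shows up as "na" (we also accept "n/a").
    """
    count = 0
    for line in sensor_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # blank or malformed line
        if fields[1].lower() not in ("na", "n/a", ""):
            count += 1
    return count


@dataclass
class Sample:
    host_boot_time: float   # host boot timestamp, in seconds since the epoch
    readable_sensors: int   # result of count_readable_sensors() at sample time


def sensors_dropped(prev: Sample, cur: Sample, boot_slack: float = 60.0) -> bool:
    """True if the readable sensor count fell while the host stayed up.

    boot_slack (a hypothetical tuning knob) absorbs small jitter in the
    reported boot time between samples.
    """
    host_rebooted = abs(cur.host_boot_time - prev.host_boot_time) > boot_slack
    return not host_rebooted and cur.readable_sensors < prev.readable_sensors
```

A drop that coincides with a host reboot is ignored, on the theory that sensors can legitimately come and go while the server is powering up; in a real setup you would probably also want to wait for a couple of consecutive low samples before alerting.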

Modern BMCs can generally reboot themselves without rebooting their host (the actual server), although you may want to test this to be sure, since apparently some vendors handle it differently.
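If your BMC speaks standard IPMI, one common way to trigger such a reboot is ipmitool's 'mc reset cold' command, either locally on the host or over the network. The following is a minimal Python sketch of wrapping that; the function names and the example hostname are hypothetical, and it assumes ipmitool is installed and (for remote use) that the BMC password is in the IPMI_PASSWORD environment variable, which is what ipmitool's -E option reads. Whether a cold reset leaves the host untouched on your particular hardware is exactly the thing to test first.

```python
import subprocess
from typing import List, Optional


def bmc_reset_command(host: Optional[str] = None,
                      user: Optional[str] = None,
                      kind: str = "cold") -> List[str]:
    """Build the ipmitool command line to reset a BMC.

    With host=None the command talks to the local BMC through the
    system interface (this normally needs root). Otherwise it uses
    IPMI-over-LAN ("lanplus") to reach the BMC at `host`.
    """
    if kind not in ("cold", "warm"):
        raise ValueError("kind must be 'cold' or 'warm'")
    cmd = ["ipmitool"]
    if host is not None:
        cmd += ["-I", "lanplus", "-H", host]
        if user is not None:
            cmd += ["-U", user]
        cmd.append("-E")  # take the password from $IPMI_PASSWORD
    cmd += ["mc", "reset", kind]
    return cmd


def reset_bmc(host: Optional[str] = None, user: Optional[str] = None) -> None:
    """Run the reset; raises CalledProcessError if ipmitool fails."""
    subprocess.run(bmc_reset_command(host, user), check=True)
```

For example, reset_bmc("bmc-host.example.org", "admin") would run 'ipmitool -I lanplus -H bmc-host.example.org -U admin -E mc reset cold'. Many BMCs also offer a 'reboot BMC' option somewhere in their web interface, which does the same job if you can still get to it.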

PS: How I encountered this is that I was reinstalling a server using KVM-over-IP, and I hit the portion of the base Ubuntu 22.04 install where I had to enter the subnet and various associated IP addresses. Our network has a '2' in it, so all of that failed. Helpfully, the KVM-over-IP software had a virtual keyboard, so I could see it wasn't just some browser weirdness intercepting a '2' from my real keyboard; even the virtual keyboard's '2' key wouldn't get through to the Ubuntu 22.04 installer running on the server being reinstalled. Since rebooting the BMC didn't reboot the host, I could verify that rebooting the BMC alone fixed the problem; once the BMC had rebooted, my KVM-over-IP session could enter all digits.

(I'm glad that it occurred to me to reboot the BMC, instead of just grumbling and going down to the machine room to do the install with the physical console.)
