2024-12-21
When power cycling your (x86) server isn't enough to recover it
We have various sorts of servers here, and generally they run without problems unless they experience obvious hardware failures. Rarely, we experience Linux kernel hangs on them, and when this happens, we power cycle the machines, as one does, and the server comes back. Well, almost always. We have two servers (of the same model) where something different has happened, once each.
Each of the servers either crashed in the kernel and started to reboot or hung in the kernel and was power cycled (both were essentially unused at the time). As each server was running through the system firmware ('BIOS'), both of them started printing an apparently endless series of error dumps to their serial consoles (which had been configured in the BIOS as well as in the Linux kernel). These were like the following:
!!!! X64 Exception Type - 12(#MC - Machine-Check) CPU Apic ID - 00000000 !!!!
RIP - 000000006DABA5A5, CS - 0000000000000038, RFLAGS - 0000000000010087
RAX - 0000000000000008, RCX - 0000000000000000, RDX - 0000000000000001
RBX - 000000007FB6A198, RSP - 000000005D29E940, RBP - 000000005DCCF520
RSI - 0000000000000008, RDI - 000000006AB1B1B0
R8 - 000000005DCCF524, R9 - 000000005D29E850, R10 - 000000005D29E8E4
R11 - 000000005D29E980, R12 - 0000000000000008, R13 - 0000000000000001
R14 - 0000000000000028, R15 - 0000000000000000
DS - 0000000000000030, ES - 0000000000000030, FS - 0000000000000030
GS - 0000000000000030, SS - 0000000000000030
CR0 - 0000000080010013, CR2 - 0000000000000000, CR3 - 000000005CE01000
CR4 - 0000000000000668, CR8 - 0000000000000000
DR0 - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3 - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 0000000076E46000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000006AC3D018 0000000000000FFF, TR - 0000000000000000
FXSAVE_STATE - 000000005D29E5A0
!!!! Can't find image information. !!!!
(The last line leaves me with questions about the firmware/BIOS but I'm unlikely to get answers to them. I'm putting the full output here for the usual reason.)
Some of the register values varied between reports; others didn't change after the first one (for example, from the second report onward the RIP appears to have always been 6DAB14D1, which suggests it may be an exception handler).
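If you capture a run of these reports from the serial console, a small script can tell you which registers change between them. The following is only a sketch of how one might do that, not something we ran at the time. The function names are made up for illustration, the second sample report is cut down to two registers, and its CS value is an assumption (only the RIP values come from the reports above).

```python
import re

# Matches "NAME - 16 hex digits" pairs as printed in the firmware dump.
# For GDTR and IDTR, which print two values, only the first is captured.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9_]*)\s+-\s+([0-9A-F]{16})\b")


def parse_dump(text):
    """Return a dict of register name -> integer value for one report."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def varying_registers(dumps):
    """Return the sorted names of registers that differ between reports."""
    parsed = [parse_dump(d) for d in dumps]
    names = set().union(*parsed)
    return sorted(n for n in names if len({p.get(n) for p in parsed}) > 1)


# Abbreviated sample reports; the CS value in the second one is assumed.
first = "RIP - 000000006DABA5A5, CS - 0000000000000038"
later = "RIP - 000000006DAB14D1, CS - 0000000000000038"

print(varying_registers([first, later]))
```

Feeding it the full text of each report works the same way, since the header and trailer lines don't match the pattern.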
In both cases, we turned off power to the machines (well, to the hosts; we were working through the BMC, which stayed powered on), let them sit for a few minutes, and then powered them on again. This returned them to regular, routine, unexciting service, and neither of them has had problems since.
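For what it's worth, the 'power off, wait, power on' sequence can be scripted against a BMC. The sketch below assumes the BMC speaks IPMI over LAN and that ipmitool is installed, neither of which is necessarily how we did it; the host name, user name, and the three minute delay are all placeholders, and the function names are invented for illustration. The password is taken from the IPMI_PASSWORD environment variable through ipmitool's -E option.

```python
import subprocess
import time


def ipmi_cmd(bmc_host, user, *args):
    """Build an ipmitool command line for a remote BMC (password via -E)."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", *args]


def cold_restart(bmc_host, user, off_seconds=180,
                 run=subprocess.run, sleep=time.sleep):
    """Power the host off, leave it off for a while, then power it on.

    off_seconds is an assumed value standing in for 'a few minutes'.
    run and sleep are parameters so the sequence can be dry-run.
    """
    run(ipmi_cmd(bmc_host, user, "chassis", "power", "off"), check=True)
    sleep(off_seconds)
    run(ipmi_cmd(bmc_host, user, "chassis", "power", "on"), check=True)


# Example call (hypothetical BMC host name and user):
#   cold_restart("server1-bmc.example.org", "admin")
```

This is different from 'chassis power cycle', which normally leaves the host off only briefly; the whole point here is the extended time with the host powered down.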
I knew in a theoretical way that there are parts of an x86 system that aren't necessarily completely reset if the power is only interrupted briefly (my understanding is that a certain amount of power lingers until capacitors drain and so on, but this may be wrong and there may be a different mechanism at work). But I usually don't have it demonstrated in front of me this way, where a simple power cycle isn't good enough to restore a system but a cool down period works.
(Since we weren't cutting external power to the entire system, this also left standby power available, which means some things never completely lost power even with the power being 'off' for a couple of minutes.)
PS: Actually there's an alternate explanation, which is that the first power cycle didn't do enough to reset things but a second one would have worked if I'd tried that instead of powering the servers off for a few minutes. I'm not certain I believe this and in any case, powering the servers off for a cool down period was faster than taking a chance on a second power cycle reset.