Wandering Thoughts archives

2024-12-21

When power cycling your (x86) server isn't enough to recover it

We have various sorts of servers here, and generally they run without problems unless they experience obvious hardware failures. Rarely, we experience Linux kernel hangs on them, and when this happens, we power cycle the machines, as one does, and the server comes back. Well, almost always. We have two servers (of the same model), where something different has happened once.

Each of the servers either crashed in the kernel and started to reboot or hung in the kernel and was power cycled (both were essentially unused at the time). As each server was running through the system firmware ('BIOS'), both of them started printing an apparently endless series of error dumps to their serial consoles (which had been configured in the BIOS as well as in the Linux kernel). These were like the following:

!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!
RIP  - 000000006DABA5A5, CS  - 0000000000000038, RFLAGS - 0000000000010087
RAX  - 0000000000000008, RCX - 0000000000000000, RDX - 0000000000000001
RBX  - 000000007FB6A198, RSP - 000000005D29E940, RBP - 000000005DCCF520
RSI  - 0000000000000008, RDI - 000000006AB1B1B0
R8   - 000000005DCCF524, R9  - 000000005D29E850, R10 - 000000005D29E8E4
R11  - 000000005D29E980, R12 - 0000000000000008, R13 - 0000000000000001
R14  - 0000000000000028, R15 - 0000000000000000
DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
GS   - 0000000000000030, SS  - 0000000000000030
CR0  - 0000000080010013, CR2 - 0000000000000000, CR3 - 000000005CE01000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 0000000076E46000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000006AC3D018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000005D29E5A0
!!!! Can't find image information. !!!!

(The last line leaves me with questions about the firmware/BIOS but I'm unlikely to get answers to them. I'm putting the full output here for the usual reason.)

Some of the register values varied between reports, others didn't after the first one (for example, from the second onward the RIP appears to have always been 6DAB14D1, which suggests maybe it's an exception handler).

In both cases, we turned off power to the machines (well, to the hosts; we were working through the BMC, which stayed powered on), let them sit for a few minutes, and then powered them on again. This returned them to regular, routine, unexciting service, where neither of them have had problems since.

I knew in a theoretical way that there are parts of an x86 system that aren't necessarily completely reset if the power is only interrupted briefly (my understanding is that a certain amount of power lingers until capacitors drain and so on, but this may be wrong and there's a different mechanism in action). But I usually don't have it demonstrated in front of me this way, where a simple power cycle isn't good enough to restore a system but a cool down period works.

(Since we weren't cutting external power to the entire system, this also left standby power (also) available, which means some things never completely lost power even with the power being 'off' for a couple of minutes.)

PS: Actually there's an alternate explanation, which is that the first power cycle didn't do enough to reset things but a second one would have worked if I'd tried that instead of powering the servers off for a few minutes. I'm not certain I believe this and in any case, powering the servers off for a cool down period was faster than taking a chance on a second power cycle reset.

tech/ServerWhenPowerCycleNotEnough written at 22:43:17;


Page tools: See As Normal.
Search:
Login: Password:

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.