What Linux kernel "unknown reason" NMI messages mean
Today, my office workstation logged a kernel message (well, a set of them) that I've seen versions of before, and perhaps you have too:
Uhhuh. NMI received for unknown reason 31 on CPU 13.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
While I (still) don't know what caused this and what to do about it (other than reboot the machine in the hopes that it stops happening), this time I looked into the kernel source to at least figure out what the 'reason 31' means and what is generally going on here. I will put the summary up front: the specific reason number is probably meaningless and at least somewhat random. I don't think it tells you anything about the potential causes.
The 'NMI' here is short for Non-maskable interrupt; the OSDev wiki has an x86-focused page on them. In the Linux kernel, NMIs can be generated for various reasons, some of which are specific to a single CPU and some of which are general and may be handled by any CPU. When a kernel driver enables something that may generate NMIs (of either type), it registers an NMI handler for it. Typical sources of (and handlers for) non CPU specific NMIs include watchdog timers and the kernel debugger. NMI handlers are called on every NMI and each is expected to check its NMI source and tell the kernel if the NMI came from it (well, more or less). If no handler speaks up to say it handled the NMI and certain other conditions are true, the kernel will generate this particular 'unknown reason' message.
(Actually, the 'local' NMI handlers are called first. If any of them say they handled an NMI, the kernel assumes the entire NMI was for a per-CPU reason and stops there.)
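As a rough illustration of that ordering, here is a toy Python model. It is not kernel code: the `dispatch_nmi` function, the `Handler` type, and the return strings are all names I made up, and it leaves out the 'certain other conditions' entirely.

```python
from typing import Callable

# A handler returns True if it decides the NMI came from its own source.
Handler = Callable[[], bool]

def dispatch_nmi(local: list[Handler], general: list[Handler]) -> str:
    """Toy model: per-CPU handlers first, then general ones, else 'unknown'."""
    # Build a full list so every handler in the group is called,
    # instead of stopping at the first one that claims the NMI.
    if any([handler() for handler in local]):
        return "handled (local)"
    if any([handler() for handler in general]):
        return "handled (general)"
    return "unknown reason"

# Two hypothetical handlers that both decline the NMI.
print(dispatch_nmi([lambda: False], [lambda: False]))  # unknown reason
```

In this model, an 'unknown reason' result just means that nothing registered was willing to claim the NMI, which is the situation the kernel message is reporting.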
On normal x86 hardware, the reason number in the message comes from reading a specific x86 I/O port, what the OSDev wiki calls 'System Control Port B (0x61)'. This port is actually 8 separate status bits together, and the Linux kernel's reason is reported in hex, not decimal, so the reason here should be decoded from hex to binary, where we will find out that it's 0b110001, with bits 6, 5, and 1 set (numbering the bits from 1, so the highest bit is bit 8).
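If you want to check the decoding for your own reason number, a minimal Python sketch looks like this. The `set_bits` helper is my own name, not anything from the kernel, and it numbers bits from 1 so that the high bit of the port is bit 8.

```python
def set_bits(reason_hex: str) -> list[int]:
    """Return the 1-based positions of the set bits, highest first."""
    value = int(reason_hex, 16)  # the kernel prints the reason in hex
    return [bit for bit in range(8, 0, -1) if value & (1 << (bit - 1))]

reason = "31"
print(bin(int(reason, 16)))  # 0b110001
print(set_bits(reason))      # [6, 5, 1]
```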
When the Linux kernel handles a non CPU specific NMI, it starts out by seeing if either or both of bit 8 (the SERR bit) or bit 7, NMI_REASON_IOCHK, are set. If bit 8 is set and no SERR handler takes the NMI, the kernel will report:
NMI: PCI system error (SERR) for reason ... on CPU ...
If bit 8 is not set and bit 7 is set (and no IOCHK handler takes the NMI), the kernel will report:
NMI: IOCK error (debug interrupt?) for reason ... on CPU ...
(The bit is called IOCHK but the message really does say 'IOCK'.)
If either bit is set, the "unknown reason" kernel message is skipped for this NMI; it's considered handled by the PCI or IOCK handler. So as far as I can tell, the largest "unknown reason" number you'll ever see is 3f (remember, this is hex), because anything larger than that sets at least one of the high two bits and will take the SERR or IOCK path.
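Here is a small Python sketch of that decision, under the assumption that no SERR or IOCHK handler takes the NMI. The constant names `SERR_BIT` and `IOCHK_BIT` and the `classify` function are mine for illustration; the values just follow from 'bit 8' and 'bit 7' when bits are numbered from 1.

```python
SERR_BIT = 0x80   # bit 8, counting from 1
IOCHK_BIT = 0x40  # bit 7, counting from 1

def classify(reason: int) -> str:
    """Say which of the three kernel messages a reason value would lead to."""
    if reason & SERR_BIT:
        return "PCI system error (SERR)"
    if reason & IOCHK_BIT:
        return "IOCK error (debug interrupt?)"
    return "unknown reason"

print(classify(0x31))  # unknown reason

# The largest one-byte value that still takes the 'unknown reason' path.
largest = max(r for r in range(256) if classify(r) == "unknown reason")
print(hex(largest))  # 0x3f
```

The SERR check comes first, so a value with both high bits set (such as c0) is reported as a PCI system error.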
(All of this is in nmi.c.)
In theory the OSDev wiki page has a nice table of what the low six bits in System Control Port B tell you about your unknown NMI. In practice the information seems relatively inscrutable and meaningless. For instance, in the original IBM PC designs, bit 5 toggled back and forth on every DRAM refresh, bit 6 was system timer 2's output pin state, and bits 3 and 4 seemed to reflect whether or not you had enabled parity checks (bit 8) and channel checks (bit 7). What these mean on modern x86 hardware is anyone's guess; they may mean very little. Linux only cares about bits 8 and 7.
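For what it's worth, here is a Python sketch that annotates a reason number with the original IBM PC meanings mentioned above. The `IBM_PC_MEANINGS` table and the `describe` function are my own illustration; I've only filled in the bits described in this entry, and none of it should be taken as what modern hardware actually does.

```python
# 1-based bit number -> original IBM PC meaning, as described in this entry.
IBM_PC_MEANINGS = {
    6: "system timer 2 output pin state",
    5: "toggles on every DRAM refresh",
    4: "channel checks enabled",
    3: "parity checks enabled",
}

def describe(reason: int) -> list[str]:
    """List the set low bits of a reason, highest first, with any known meaning."""
    lines = []
    for bit in range(6, 0, -1):
        if reason & (1 << (bit - 1)):
            meaning = IBM_PC_MEANINGS.get(bit, "not described here")
            lines.append(f"bit {bit}: {meaning}")
    return lines

for line in describe(0x31):
    print(line)
# bit 6: system timer 2 output pin state
# bit 5: toggles on every DRAM refresh
# bit 1: not described here
```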
Based on all of this, I think that the 'unknown reason' number likely says nothing about what caused the NMI to be generated or about what the (interesting) state of the hardware is. An 'unknown reason' NMI came from some source that was not recognized by any handler, which means that either there is no handler registered for its source (for example, hardware is generating unexpected NMIs) or the handler didn't recognize that its hardware caused the NMI. Based on the kernel message about a power saving mode, power saving modes seem to have at one point been a fruitful source of surprise NMIs.
(That kernel message seems to go back quite a way, although it's hard to trace it because code has moved around a lot between files. I think there's a way to do this in git, but I lack the energy to work it out right now.)