2021-10-08
We've migrated from Yubikey 2FA to the university's MFA
We have a sensitive host that absolutely has to be protected with multi-factor authentication. When we first set it up in late 2016, the second factor we chose was touch-required SSH keys held on Yubikeys. Recently, we have been switching this host over to the university's institutional multi-factor authentication. The university's MFA uses Duo, so our sensitive host is set up to use Duo's PAM module.
(Integrating Duo with OpenSSH led me to explore what combinations of authentication methods we could and couldn't support with PAM-based MFA.)
Relying on the institutional MFA has some failure modes that using Yubikeys doesn't, since verifying Yubikey SSH key authentication is entirely contained on the host. However, we decided that we could live with these additional failure modes because using the institutional MFA had some significant advantages for us. I can boil these down to three general areas: availability, usability, and manageability for the people involved.
In terms of availability, institutional MFA has the advantage that people already have to routinely use it for all sorts of things, so they had it set up, working, and ready to hand. Our Yubikeys were only ever used for this host, so if you didn't log in to the host for a while, they could wind up not so available and ready. And in general our Yubikeys were yet another thing for people to manage and keep track of, like an extra and rarely used key.
In terms of usability, the institutional MFA is a lot easier to get going and work with for SSH logins, because all it demands from your SSH login session is that you type extra text. Yubikeys required a USB connection and appropriate software to connect to the Yubikey in either your SSH client or your SSH agent. Not all things that can run an SSH client even have USB, and of the computers that do, often the software was an issue.
(In the future, OpenSSH's new(ish) support for FIDO/U2F may help some of this, but only for things that wind up running OpenSSH 8.2 or better. In practice this means it will be years before Windows, macOS, iOS, and Android SSH clients can all reliably take advantage of it.)
In terms of manageability, the institutional MFA has other people who handle all aspects of enrolling people, managing their devices, recording the necessary data for server side authentication, and so on. With Yubikeys, all of that was on us and it wasn't necessarily a smooth and easy process. In fact it was so friction prone that we would never have scaled it up beyond the small group of people who needed to have access to this sensitive server.
The Yubikey solution was simpler, theoretically more reliable, and potentially more secure (and certainly more under our control) than the institutional MFA system is. But in practice both the ease of using and the ease of managing whatever we used for MFA turned out to matter quite a bit, and Yubikeys weren't really good at either of these. Institutional MFA is good enough, it's officially blessed by the university, and it's much easier for everyone to deal with, so it wins in practice.
(I admit that the Yubikey SSH key generation security issue soured me on really trusting some parts of the theoretical Yubikey advantages and shifted my views on where I should generate keys, as well as making me kind of unhappy with Yubikeys in general.)
What Linux kernel "unknown reason" NMI messages mean
Today, my office workstation logged a kernel message (well, a set of them) that I've seen versions of before, and perhaps you have too:
Uhhuh. NMI received for unknown reason 31 on CPU 13.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
While I (still) don't know what caused this and what to do about it (other than reboot the machine in the hopes that it stops happening), this time I looked into the kernel source to at least figure out what the 'reason 31' means and what is generally going on here. I will put the summary up front: the specific reason number is probably meaningless and at least somewhat random. I don't think it tells you anything about the potential causes.
The 'NMI' here is short for Non-maskable interrupt; the OSDev wiki has an x86-focused page on them. In the Linux kernel, NMIs can be generated for various reasons, some of which are specific to a single CPU and some of which are general and may be handled by any CPU. When a kernel driver enables something that may generate NMIs (of either type), it registers an NMI handler for it. Typical sources of (and handlers for) non CPU specific NMIs include watchdog timers and the kernel debugger. NMI handlers are called on every NMI and each is expected to check its NMI source and tell the kernel if the NMI came from it (well, more or less). If no handler speaks up to say it handled the NMI and certain other conditions are true, the kernel will generate this particular 'unknown reason' message.
(Actually, the 'local' NMI handlers are called first. If any of them say they handled an NMI, the kernel assumes the entire NMI was for a per-CPU reason and stops there.)
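As a rough illustration of that ordering, here is a toy model in Python. It is only a sketch of the description above, not the kernel's code: the function and handler names are made up, each handler is reduced to a callable that returns True if the NMI came from its source, and the 'certain other conditions' are ignored entirely.

```python
def dispatch_nmi(local_handlers, general_handlers):
    """Toy model of the order in which NMI handlers get a say."""
    # Every local (per-CPU) handler is called; there is no short-circuiting.
    local_results = [handler() for handler in local_handlers]
    if any(local_results):
        # The whole NMI is assumed to be for a per-CPU reason; stop here.
        return "handled: per-CPU source"
    # Only then do the non CPU specific handlers get their turn.
    general_results = [handler() for handler in general_handlers]
    if any(general_results):
        return "handled: non CPU specific source"
    # Nobody spoke up for this NMI.
    return "unknown reason"


# Hypothetical handlers; each returns True only if its own source raised the NMI.
def some_per_cpu_handler():
    return False


def watchdog_handler():
    return False


print(dispatch_nmi([some_per_cpu_handler], [watchdog_handler]))  # unknown reason
```

If the watchdog handler returned True instead, the result would be 'handled: non CPU specific source', and the 'unknown reason' message would never be considered.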
On normal x86 hardware, the reason number in the message comes from reading a specific x86 I/O port, what the OSDev wiki calls 'System Control Port B (0x61)'. This port is actually 8 separate status bits together, and the Linux kernel's reason is reported in hex, not decimal, so the reason here should be decoded from hex to binary, where we will find out that it's 0b110001, with bits 6, 5, and 1 set.
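A small Python sketch of that decoding, using this entry's convention of numbering bits from 1 (the lowest bit) to 8 (the highest). The helper name set_bits is made up for illustration.

```python
def set_bits(reason_hex):
    """Return which bits are set in a hex reason string, highest first.

    Bits are numbered 1 (lowest) through 8 (highest), as in the text.
    """
    value = int(reason_hex, 16)
    return [bit for bit in range(8, 0, -1) if value & (1 << (bit - 1))]


reason = "31"  # as printed in the kernel message, which is hex
print(bin(int(reason, 16)))  # 0b110001
print(set_bits(reason))      # [6, 5, 1]
```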
When the Linux kernel handles a non CPU specific NMI in default_do_nmi(), it starts out by seeing if either or both of bit 8, NMI_REASON_SERR, or bit 7, NMI_REASON_IOCHK, are set. If bit 8 is set and no SERR handler takes the NMI, the kernel will report:
NMI: PCI system error (SERR) for reason ... on CPU ...
If bit 8 is not set and bit 7 is set (and no IOCHK handler takes the NMI), the kernel will report:
NMI: IOCK error (debug interrupt?) for reason ... on CPU ...
(The bit is called IOCHK but the message really does say 'IOCK' instead.)
If either bit is set, the "unknown reason" kernel message is skipped for this NMI; it's considered handled by the PCI or IOCK handler. So as far as I can tell, the largest "unknown reason" number you'll ever see is 3f (remember, this is hex), because anything larger than that sets at least one of the high two bits and will take the SERR or IOCK path.
(All of this is in nmi.c.)
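The decision described above can be modeled in a few lines of Python. This is a simplified sketch and not the kernel's code: the mask values simply follow from this entry's bit numbering (bit 8 is 0x80, bit 7 is 0x40), the function name unclaimed_nmi_message is made up, and it assumes that no SERR or IOCHK handler took the NMI.

```python
NMI_REASON_SERR = 0x80   # bit 8 in this entry's numbering
NMI_REASON_IOCHK = 0x40  # bit 7


def unclaimed_nmi_message(reason):
    """Say which message an unclaimed NMI with this port 0x61 value would get."""
    if reason & NMI_REASON_SERR:
        return "SERR"
    if reason & NMI_REASON_IOCHK:
        return "IOCK"
    return "unknown reason"


print(unclaimed_nmi_message(0x31))  # unknown reason
print(unclaimed_nmi_message(0x40))  # IOCK
print(unclaimed_nmi_message(0xb1))  # SERR

# The largest value that can still reach the 'unknown reason' path:
largest = max(r for r in range(256) if unclaimed_nmi_message(r) == "unknown reason")
print(hex(largest))  # 0x3f
```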
In theory the OSDev wiki page has a nice table of what the low six bits in System Control Port B tell you about your unknown NMI. In practice the information seems relatively inscrutable and meaningless. For instance, in the original IBM PC designs, bit 5 toggled back and forth on every DRAM refresh, bit 6 was system timer 2's output pin state, and bits 3 and 4 seemed to reflect whether or not you had enabled parity checks (bit 8) and channel checks (bit 7). What these mean on modern x86 hardware is anyone's guess; they may mean very little. Linux only cares about bits 8 and 7.
Based on all of this, I think that the 'unknown reason' number likely says nothing about what caused the NMI to be generated or about what the (interesting) state of the hardware is. An 'unknown reason' NMI came from some source that was not recognized by any handler, which means that either there is no handler registered for its source (for example, hardware is generating unexpected NMIs) or the handler didn't recognize that its hardware caused the NMI. Based on the kernel message's question about a strange power saving mode, such modes seem to have at one point been a fruitful source of surprise NMIs.
(That kernel message seems to go back quite a way, although it's hard to trace it because code has moved around a lot between files. I think there's a way to do this in git, but I lack the energy to work it out right now.)