2021-10-11
Unknown NMIs and counting hardware CPU events in eBPF programs
I mentioned in a recent entry that my office workstation had started producing alarming kernel messages about non-maskable interrupts (NMIs) happening for an unknown reason:
  Uhhuh. NMI received for unknown reason 31 on CPU 10.
  Do you have a strange power saving mode enabled?
  Dazed and confused, but trying to continue
I've now been able to identify what triggers these NMI messages. On my office machine they can be reliably produced by running the Cloudflare eBPF Prometheus exporter with its ipcstat example, which uses perf events to count CPU instructions and CPU cycles, processes them through an eBPF program, and lets you query the result as Prometheus metrics. The messages only appear every so often, and they don't seem to be particularly correlated with anything (for example, they don't happen every time I scrape metrics from the Cloudflare eBPF exporter). They may require actually obtaining metrics from the exporter, so that it reads them from the kernel eBPF program; I'm not sure yet.
(This isn't triggered just by the Cloudflare eBPF exporter in general, because I've been running it for a long time to get disk IO latency histograms. Taking the ipc eBPF program out of my eBPF exporter configuration stops the messages; running a separate eBPF exporter instance with just that program causes them to start again.)
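As a rough illustration of what the instruction and cycle counts are good for, here is a minimal Python sketch that turns two scrapes of cumulative per-CPU counters into instructions per cycle (IPC). The function name and the dictionary layout are hypothetical and chosen only for this example; they are not the exporter's actual metric names or output format.

```python
from typing import Dict


def ipc_per_cpu(
    prev_instr: Dict[int, int],
    cur_instr: Dict[int, int],
    prev_cycles: Dict[int, int],
    cur_cycles: Dict[int, int],
) -> Dict[int, float]:
    """Compute instructions-per-cycle per CPU between two scrapes.

    Each argument maps a CPU number to a cumulative counter value.
    This layout is an assumption made for the sketch.
    """
    result: Dict[int, float] = {}
    for cpu, cycles_now in cur_cycles.items():
        # Skip CPUs that are missing from any of the four samples.
        if cpu not in prev_cycles or cpu not in prev_instr or cpu not in cur_instr:
            continue
        d_cycles = cycles_now - prev_cycles[cpu]
        d_instr = cur_instr[cpu] - prev_instr[cpu]
        # Skip idle intervals and counter resets instead of dividing by
        # zero or reporting a negative rate.
        if d_cycles <= 0 or d_instr < 0:
            continue
        result[cpu] = d_instr / d_cycles
    return result


if __name__ == "__main__":
    # CPU 0 retired 200 instructions in 100 cycles; CPU 1 had no new cycles.
    print(ipc_per_cpu({0: 100, 1: 50}, {0: 300, 1: 50},
                      {0: 100, 1: 10}, {0: 200, 1: 10}))
```

Working from deltas rather than raw values matters because the counters are cumulative; a Prometheus rate() over both series and a division would express the same idea.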
My office machine is running Fedora 34 with Fedora's 64-bit '5.14.9-200.fc34' kernel, on a machine with an AMD Ryzen 7 1800X. My home machine is running the same Fedora kernel and the same Cloudflare eBPF exporter (with the same eBPF programs), but has an Intel i7-8700K CPU and doesn't get these unknown reason NMIs. Nor have I been able to produce these NMIs so far by running 'perf stat -a' on my office machine. My leading theory is that some combination of obtaining CPU performance counters in an eBPF program and possibly pulling data from it on a regular basis from user level is triggering this on (some) Ryzen CPUs.
(I've experimented with a bpftrace command line that I think is doing much the same as the eBPF exporter's program, but haven't seen anything yet. The problem can go hours without triggering, though.)
BPF programs apparently do run from NMIs for handling perf events such as counting CPU cycles (source), so this theory doesn't seem completely implausible. I don't know if perf events normally trigger NMIs or if there's a different mechanism.
The large scale moral I take from this is that eBPF programs aren't necessarily as non-invasive as they're often presented. In a perfect world this obviously wouldn't happen, but in this world we deal with the hardware and kernel bugs that we have, like it or not. I'll have to take care with any future eBPF usage and pay attention to potential correlations with, for example, new kernel messages.
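One cheap way to watch for this sort of correlation is to count the unknown NMI messages in the kernel log. The following is a minimal Python sketch, assuming the message format quoted earlier; the function name is hypothetical, and the reason code is kept as text because I believe the kernel prints it in hexadecimal.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the first line of the kernel's unknown NMI report, for example:
#   Uhhuh. NMI received for unknown reason 31 on CPU 10.
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-fA-F]+) on CPU (\d+)")


def count_unknown_nmis(lines: Iterable[str]) -> Counter:
    """Count unknown-reason NMI messages, keyed by (reason, cpu)."""
    counts: Counter = Counter()
    for line in lines:
        m = NMI_RE.search(line)
        if m:
            reason: str = m.group(1)  # kept as text, assumed to be hex
            cpu: int = int(m.group(2))
            key: Tuple[str, int] = (reason, cpu)
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Reads kernel log text from standard input.
    for (reason, cpu), n in sorted(count_unknown_nmis(sys.stdin).items()):
        print(f"reason {reason} cpu {cpu}: {n}")
```

Feeding it the output of 'journalctl -k' or 'dmesg' before and after enabling a new eBPF program gives a quick per-CPU tally to compare.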
(For my own future reference when doing Internet searches, most sources seem to just talk about 'BPF' instead of 'eBPF'.)
PS: I don't have test results for kernels before this one because I only recently started running this eBPF program on my office workstation. On my home desktop I've been running it for some time without problems in previous kernel versions.