An infrequent odd kernel panic on our Ubuntu 18.04 fileservers
I have in the past talked about our shiny new Prometheus based metrics
system and some interesting things we've seen due to its metrics,
especially its per-host system metrics (collected by
its host agent). What I haven't mentioned is that we're not running
the host agent on one important group of our machines, namely our
new Linux fileservers. This isn't because
we don't care about metrics from those machines. It's because when
we do run the host agent, we get very infrequent but repeating
kernel panics, or I should say what seems to be a single panic.
The panic we see is this:
    BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
    IP: __atime_needs_update+0x5/0x190
    [...]
    CPU: 7 PID: 10553 Comm: node_exporter Tainted: P O 4.15.0-30-generic #32-Ubuntu
    RIP: 0010:__atime_needs_update+0x5/0x190
    [...]
    Call Trace:
     ? link_path_walk+0x3e4/0x5a0
     ? path_init+0x177/0x2f0
     path_openat+0xe4/0x1770
     [... sometimes bogus frames here ...]
     do_filp_open+0x9b/0x110
     ? __check_object_size+0xaf/0x1b0
     do_sys_open+0x1bb/0x2c0
     ? do_sys_open+0x1bb/0x2c0
     ? _cond_resched+0x19/0x40
     SyS_openat+0x14/0x20
     do_syscall_64+0x73/0x130
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2
The start and the end of this call trace are consistent between panics; the middle sometimes has various peculiar and apparently bogus frames.
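When comparing traces from several panics, it can help to separate the frames the kernel printed plainly from the ones it prefixed with '?', which are addresses it found on the stack but could not confirm as part of the call chain. The following is a rough Python sketch of that split; the names split_frames and FRAME_RE are made up for illustration, and the regular expression only assumes the 'symbol+0xoff/0xsize' frame format seen in the panic above.

```python
import re

# Matches frames such as " ? link_path_walk+0x3e4/0x5a0" or " path_openat+0xe4/0x1770".
# Group 1 is the optional '?' marker, group 2 is the symbol name.
FRAME_RE = re.compile(r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")


def split_frames(trace_lines):
    """Split call trace lines into (plain, questionable) symbol lists."""
    plain, questionable = [], []
    for line in trace_lines:
        m = FRAME_RE.match(line)
        if not m:
            continue  # skip "[...]" markers and anything that is not a frame
        (questionable if m.group(1) else plain).append(m.group(2))
    return plain, questionable


TRACE = """\
 ? link_path_walk+0x3e4/0x5a0
 ? path_init+0x177/0x2f0
 path_openat+0xe4/0x1770
 [... sometimes bogus frames here ...]
 do_filp_open+0x9b/0x110
 ? __check_object_size+0xaf/0x1b0
 do_sys_open+0x1bb/0x2c0
 ? do_sys_open+0x1bb/0x2c0
 ? _cond_resched+0x19/0x40
 SyS_openat+0x14/0x20
 do_syscall_64+0x73/0x130
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
""".splitlines()

plain, questionable = split_frames(TRACE)
print("plain:       ", plain)
print("questionable:", questionable)
```

Running the same split over each saved panic and comparing the plain lists is one way to check which parts of the trace really are the same every time.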
This panic occurs only on our ZFS fileservers, which are a small minority of the servers where we have the host agent running, and generally only after server-months of operation (including intensive pounding of the host agent on test servers). The three obvious things that are different about our ZFS fileservers are that they are our only machines with this particular set of SuperMicro hardware, they are the only machines with ZFS, and they are our only 18.04 NFS servers. However, this panic has happened on a test server with no ZFS pools and no NFS exports.
If I believe the consistent portions of the call trace, this panic
happens while following a symlink during an
openat() system call. I
strace'd the Prometheus host agent and there turn out to
not be very many such things it opens; my notes say
/proc/net, some things under
/sys/class/hwmon, and some things under
/sys/devices/system/cpu/cpu*/cpufreq. Of these, the /proc/net
entries are looked at on all machines and seem unlikely suspects, but the
hwmon stuff is definitely suspect. In fact we have
another machine where trying to look at those entries produces
constant kernel reports about ACPI problems:
    ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170831/exfield-427)
    ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170831/psparse-550)
    ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170831/power_meter-338)
(ACPI is an area I suspect because it's part of the BIOS and so varies from system to system.)
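As a cross-check on strace notes like mine, here is a small Python sketch that lists the symlinks under a directory tree without following them. The find_symlinks and report helpers and the max_depth limit are my own illustration, not anything the host agent does; the three starting points are simply the areas from my notes, with the cpu* glob left out.

```python
import os


def find_symlinks(root, max_depth=2):
    """Return (path, target) pairs for symlinks under root, without following them."""
    found = []
    root = os.path.abspath(root)
    base_depth = root.count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Symlinks to directories show up in dirnames, so check both lists.
        for name in sorted(dirnames + filenames):
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                found.append((full, os.readlink(full)))
        # Stop descending once we are max_depth levels below root.
        if dirpath.count(os.sep) - base_depth >= max_depth:
            dirnames[:] = []
    return found


def report(roots=("/proc/net", "/sys/class/hwmon", "/sys/devices/system/cpu")):
    """Print the symlinks found under each root that exists on this machine."""
    for root in roots:
        if not os.path.isdir(root):
            continue
        for path, target in find_symlinks(root):
            print(f"{path} -> {target}")
```

Calling report() on a fileserver and on an ordinary machine and then diffing the output would show which symlinks exist only on the suspect hardware. This only reads the link targets; it does not open anything through the links.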
However, it doesn't seem to be the
hwmon stuff alone. You can
tell the Prometheus host agent to not try to look at it (with a
command line argument), and while running the host agent in this
mode, we have had a crash on one of our test fileservers. Based
on the cpufreq driver involved being 'acpi-cpufreq', I suspect that
ACPI is involved in the cpufreq area as well on these machines.
(There is some documentation on kernel CPU frequency stuff in user-guide.txt, in the cpu-freq kernel documentation directory.)
Even if motherboard-specific ACPI stuff is what triggers this panic,
the panic itself is worryingly mysterious. The actual panic is
clearly a dereference of a NULL pointer, as __atime_needs_update
attempts to refer to a struct field.
Based on this happening very early on in the function,
this is probably the
path argument, since this is used almost
immediately. However, I can't entirely follow how we get there,
especially with a NULL
path. Some of the context of the call is
relatively clear; the call path probably runs from part of
link_path_walk through an inlined call to a helper function,
to a function that is mysteriously not listed in the trace,
and then to __atime_needs_update.
(I would be more confident of this if I knew how to
disassemble bits of the Ubuntu kernel to verify and map back the
reported raw byte positions in these functions.)
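One piece of arithmetic that needs no disassembly: when code reads a field through a NULL struct pointer, the faulting address is just the field's offset within the struct, so the reported address of 0xc would be consistent with a field 12 bytes into some structure. The Python sketch below illustrates the idea with ctypes. ExampleStruct is an entirely made-up layout, not any real kernel structure, and fields_at is a hypothetical helper; doing this for real would need the actual struct definitions for this specific 4.15.0-30 kernel.

```python
import ctypes


class ExampleStruct(ctypes.Structure):
    # Hypothetical layout for illustration only; NOT a real kernel struct.
    _fields_ = [
        ("mode", ctypes.c_uint16),   # offset 0
        ("opts", ctypes.c_uint16),   # offset 2
        ("uid", ctypes.c_uint32),    # offset 4
        ("gid", ctypes.c_uint32),    # offset 8
        ("flags", ctypes.c_uint32),  # offset 12 (0xc)
    ]


def fields_at(struct_type, fault_addr, base=0):
    """Return the names of fields located at fault_addr, for a struct based at base."""
    offset = fault_addr - base
    return [
        name
        for name, _ctype in struct_type._fields_
        if getattr(struct_type, name).offset == offset
    ]


# With a NULL (0) base pointer, a fault at 0xc points at the field at offset 12.
print(fields_at(ExampleStruct, 0xC))  # -> ['flags']
```

The same lookup run against the real layouts of the candidate structures would narrow down which pointer was NULL, since only a struct with a field at offset 0xc can produce this fault address.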
I admit that this is the kind of situation that makes me yearn for crash dumps. Having a kernel crash dump to poke around in might well give us a better understanding of what's going on, possibly including a better call trace. Unfortunately even if it's theoretically possible to get kernel crash dumps out of Linux with the right setup, it's not standard for installers to actually set that up or offer it as a choice so as a practical matter it's mostly not there.
PS: We haven't tried upgrading the kernel version we're using on the fileservers because stable fileservers are more important to us than host metrics, and we know they're stable on this specific kernel because that's what we did extensive testing on. We might consider upgrading if we could find a specific bug fix for this, but so far I haven't spotted any smoking guns.