An infrequent odd kernel panic on our Ubuntu 18.04 fileservers

May 15, 2019

I have in the past talked about our shiny new Prometheus-based metrics system and some interesting things we've seen due to its metrics, especially its per-host system metrics (collected through node_exporter, its host agent). What I haven't mentioned is that we're not running the host agent on one important group of our machines, namely our new Linux fileservers. This isn't because we don't care about metrics from those machines. It's because when we do run the host agent, we get very infrequent but recurring kernel panics, or rather repeated instances of what seems to be a single panic.

The panic we see is this:

BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
IP: __atime_needs_update+0x5/0x190
CPU: 7 PID: 10553 Comm: node_exporter Tainted: P  O  4.15.0-30-generic #32-Ubuntu
RIP: 0010:__atime_needs_update+0x5/0x190
Call Trace:
 ? link_path_walk+0x3e4/0x5a0
 ? path_init+0x177/0x2f0
[... sometimes bogus frames here ...]
 ? __check_object_size+0xaf/0x1b0
 ? do_sys_open+0x1bb/0x2c0
 ? _cond_resched+0x19/0x40

The start and the end of this call trace are consistent between panics; the middle sometimes has various peculiar and apparently bogus frames.

This panic occurs only on our ZFS fileservers, which are a small minority of the servers where we have the host agent running, and generally only after server-months of operation (including intensive pounding of the host agent on test servers). The three obvious things that are different about our ZFS fileservers are that they are our only machines with this particular set of SuperMicro hardware, they are our only machines with ZFS, and they are our only 18.04 NFS servers. However, this panic has happened on a test server with no ZFS pools and no NFS exports.

If I believe the consistent portions of the call trace, this panic happens while following a symlink during an openat() system call. I have strace'd the Prometheus host agent, and it turns out to open very few such things; my notes say /proc/mounts, /proc/net, some things under /sys/class/hwmon, and some things under /sys/devices/system/cpu/cpu*/cpufreq. Of these, the /proc entries are looked at on all machines and seem unlikely suspects, while the hwmon stuff is definitely suspect. In fact we have another machine where trying to look at those entries produces constant kernel reports about ACPI problems:

ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20170831/exfield-427)
ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20170831/psparse-550)
ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20170831/power_meter-338)

(ACPI is an area I suspect because it's part of the BIOS and so varies from system to system.)
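(As a sketch of what this sort of checking looks like, here is roughly how you can strace a running host agent to see what it opens. The strace flags are standard ones; using pidof and grepping for the suspect sysfs areas are my own assumptions about how you'd go about it:)

```shell
# Follow all threads of the running node_exporter (Go programs are
# heavily multi-threaded) and record only open-family system calls.
strace -f -e trace=open,openat -o /tmp/ne-opens.log -p "$(pidof node_exporter)"

# Afterward, pick out the opens under the suspect sysfs areas.
grep -E '/sys/(class/hwmon|devices/system/cpu)' /tmp/ne-opens.log
```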

However, it doesn't seem to be the hwmon stuff alone. You can tell the Prometheus host agent to not try to look at it (with a command line argument), and while running the host agent in this mode, we have had a crash on one of our test fileservers. Based on /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver being 'acpi-cpufreq', I suspect that ACPI is involved in this area as well on these machines.
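(For the record, here is a sketch of running the host agent this way. I believe current node_exporter versions take '--no-collector.<name>' style flags, but older versions used a different argument scheme, so check your version's --help output:)

```shell
# Turn off the hwmon collector entirely; the exact flag name should
# be verified against your installed node_exporter's --help.
./node_exporter --no-collector.hwmon

# You can also check the frequency scaling driver directly;
# on these machines it reports 'acpi-cpufreq'.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
```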

(There is some documentation on kernel CPU frequency stuff in user-guide.txt, in the cpu-freq kernel documentation directory.)

Even if motherboard-specific ACPI stuff is what triggers this panic, the panic itself is worryingly mysterious. The actual panic is clearly a dereference of a NULL pointer, as __atime_needs_update attempts to refer to a struct field (cf). Since this happens very early in the function, and based on the code involved, the NULL pointer is probably the path argument, which is used almost immediately. However, I can't entirely follow how we get there, especially with a NULL path. Some of the context of the call is relatively clear; the call path probably runs from part of link_path_walk through an inlined call to get_link to a mysteriously unlisted touch_atime and then to __atime_needs_update.

(I would be more confident of this if I knew how to use gdb to disassemble bits of the Ubuntu kernel to verify and map back the reported raw byte positions in these functions.)
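(For what it's worth, here is a sketch of one way this can apparently be done on Ubuntu, assuming the kernel debug symbol packages ('dbgsym' ddebs) are enabled and available for your running kernel; the package name and vmlinux path follow Ubuntu's conventions, but verify them for your release:)

```shell
# Install debug symbols for the running kernel (requires the
# ddebs.ubuntu.com repository to be enabled first).
sudo apt install "linux-image-$(uname -r)-dbgsym"

# The unstripped vmlinux then lives under /usr/lib/debug.
gdb "/usr/lib/debug/boot/vmlinux-$(uname -r)"

# Inside gdb, disassemble the function and map the raw offset
# from the panic (here +0x5) back to source:
#   (gdb) disassemble __atime_needs_update
#   (gdb) list *(__atime_needs_update+0x5)
```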

I admit that this is the kind of situation that makes me yearn for crash dumps. Having a kernel crash dump to poke around in might well give us a better understanding of what's going on, possibly including a better call trace. Unfortunately, even though it's theoretically possible to get kernel crash dumps out of Linux with the right setup, installers don't normally set that up or offer it as a choice, so as a practical matter it's mostly not there.
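(On Ubuntu the pieces do at least exist; here is a sketch of what setting this up looks like, assuming the standard linux-crashdump route. The details, including the crash kernel memory reservation, should be verified for your release:)

```shell
# linux-crashdump pulls in kdump-tools and arranges a crashkernel
# memory reservation (a reboot is needed for it to take effect).
sudo apt install linux-crashdump

# kdump-config reports whether the crash kernel is actually loaded.
kdump-config show

# After a panic, dumps land under /var/crash by default and can be
# examined with the 'crash' utility plus kernel debug symbols.
```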

PS: We haven't tried upgrading the kernel version we're using on the fileservers because stable fileservers are more important to us than host metrics, and we know they're stable on this specific kernel because that's what we did extensive testing on. We might consider upgrading if we could find a specific bug fix for this, but so far I haven't spotted any smoking guns.
