2018-07-19
Sometimes it actually is a Linux kernel bug
For my sins, I use a number of third party 'out of kernel' kernel
modules in my Fedora kernels, especially on my office workstation. I don't use a binary GPU driver, but there's the
latest git tip of ZFS on Linux, VMWare's kernel modules, and an out of tree it87
module in order to support my
motherboard's sensors (for as long as that keeps working).
Usually this works fine. Usually. For the past few days, my office
machine has been panicking during our nightly backups when Amanda
runs a big tar over some of my filesystems.
There's a long standing saying in programming that 'it's never a compiler bug'. I have a similar rule of thumb about kernel panics; given that I use a number of third party modules, especially the VMWare modules, any kernel panics I run into are caused by them.
(I mean, apart from the system lockups, which are AMD's fault, and the amdgpu problems, which were a graphics driver issue. It's a rule of thumb, and it's mostly true about core kernel code. Linux kernel driver code is a little bit more likely to have bugs.)
So I assumed that the cause of my sudden panics was probably ZFS on Linux (an assumption helped along when I accidentally ran my machine without the VMWare modules and it still panicked). After some diagnostic work, I reduced things down to a belief of 'ZoL and the latest Fedora kernels don't like each other', went to report an issue, and found ZoL issue #7723 and thus Fedora #1598462. Which led to my tweet:
Today I learned that Fedora 27 and 28 kernels after 4.17.3 are known to panic under high IO load. Better late than never, but I could have used that knowledge before upgrading the office machine to 4.17.5.
So, yeah. To my surprise, this actually is a (general) Linux kernel bug, not a bug in any of the third party modules. This feels like the equivalent of finding a genuine compiler bug.
(I can be pretty sure that I'm hitting the same bug, because I have a basic netconsole setup, and my panic messages match the bug report's. They also run through ZFS functions, which didn't help my initial suspicions.)
PS: What made this more peculiar is that I've been running the Fedora 27 4.17.5 kernel at home without problems. But then, I don't have good home backups and I don't think I've done anything else recently to stress the home machine's IO. I should revert to 4.17.3 anyway.
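Since the affected range is described as 'after 4.17.3', a quick way to check a machine is to compare its running kernel version against that. What follows is only a rough sketch under some assumptions: the `kernel_after` function is a made-up helper, it relies on GNU `sort -V` for version ordering, and it treats every later kernel as suspect because it knows nothing about which kernel version actually carries a fix.

```shell
#!/bin/sh
# Sketch: report whether a kernel version is later than 4.17.3, the
# version the tweet gives as the cutoff. kernel_after is a hypothetical
# helper, not part of any existing tool.

# kernel_after VERSION BASE: succeed if VERSION sorts strictly after BASE.
kernel_after() {
    ver=${1%%-*}    # drop the Fedora release suffix, e.g. -100.fc27.x86_64
    base=$2
    [ "$ver" != "$base" ] &&
        [ "$(printf '%s\n%s\n' "$ver" "$base" | sort -V | tail -n 1)" = "$ver" ]
}

if kernel_after "$(uname -r)" 4.17.3; then
    echo "running kernel $(uname -r) is after 4.17.3"
else
    echo "running kernel $(uname -r) is 4.17.3 or earlier"
fi
```

Using `sort -V` instead of a plain string comparison matters for versions like 4.17.10, which would otherwise sort before 4.17.3.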
PPS: 4.17.7 kernels are in Fedora Bodhi but apparently not yet in the
updates-testing DNF repo for Fedora 27. It looks like the most
convenient way to get things from Bodhi is with the bodhi client
program. I'm using it like so:
bodhi updates query --packages kernel --releases f27 --status pending
bodhi updates download --builds kernel-4.17.7-100.fc27
(You could leave out the '--status pending', but if you do you
get a big list of past updates that aren't very interesting. If I'm
fetching something from Bodhi it's because I can't get it anywhere
else, so it's probably an update that's so new that it's not even
in the updates-testing repo.)