Sometimes it actually is a Linux kernel bug

July 19, 2018

For my sins, I use a number of third party 'out of kernel' kernel modules in my Fedora kernels, especially on my office workstation. I don't use a binary GPU driver, but there's the latest git tip of ZFS on Linux, VMware's kernel modules, and an out of tree it87 module in order to support my motherboard's sensors (for as long as that keeps working). Usually this works fine. Usually. For the past few days, my office machine has been panicking during our nightly backups, when Amanda runs a big tar over some of my filesystems.

There's a long standing saying in programming that 'it's never a compiler bug'. I have a similar rule of thumb about kernel panics: given that I use a number of third party modules, especially the VMware modules, I assume that any kernel panics I run into are caused by them.

(I mean, apart from the system lockups, which are AMD's fault, and the amdgpu problems, which were a graphics driver issue. It's a rule of thumb, and it's mostly true about core kernel code. Linux kernel driver code is a little bit more likely to have bugs.)

So I assumed that the cause of my sudden panics was probably ZFS on Linux (an assumption helped along when I accidentally ran my machine without the VMware modules and it still panicked). After some diagnostic work, I reduced things down to a belief of 'ZoL and the latest Fedora kernels don't like each other', went to report an issue, and found ZoL issue #7723 and thus Fedora #1598462. Which led to my tweet:

Today I learned that Fedora 27 and 28 kernels after 4.17.3 are known to panic under high IO load. Better late than never, but I could have used that knowledge before upgrading the office machine to 4.17.5.

So, yeah. To my surprise, this actually is a (general) Linux kernel bug, not a bug in any of the third party modules I use. This feels like the equivalent of finding a genuine compiler bug.

(I can be pretty sure that I'm hitting the same bug, because I have a basic netconsole set up and my panic messages match the bug report's. They also run through ZFS functions, which didn't help my initial suspicions.)
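
For anyone who hasn't used netconsole, a minimal sketch of loading it looks something like the following. Every address, port, interface name and MAC here is a hypothetical placeholder, not the setup described in this entry; only the parameter layout comes from the kernel's netconsole documentation.

```shell
#!/bin/sh
# All values below are hypothetical placeholders; substitute your own.
SRC_PORT=6665                 # local UDP port to send from
SRC_IP=192.0.2.10             # IP address of the machine that may panic
DEV=eth0                      # network interface to send on
TGT_PORT=6666                 # UDP port on the machine collecting messages
TGT_IP=192.0.2.20             # IP address of the collecting machine
TGT_MAC=00:11:22:33:44:55     # MAC of the collector (or of the gateway to it)

# netconsole's module parameter has the form
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
NETCONSOLE_OPTS="${SRC_PORT}@${SRC_IP}/${DEV},${TGT_PORT}@${TGT_IP}/${TGT_MAC}"

# Print the command instead of running it, since loading the module needs root.
echo "modprobe netconsole netconsole=${NETCONSOLE_OPTS}"
```

The collecting machine then needs something listening on that UDP port, such as a syslog daemon or a netcat variant, so that the panic messages survive the crash of the sender.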

PS: What made this more peculiar is that I've been running the Fedora 27 4.17.5 kernel at home without problems. But then, I don't have good home backups and I don't think I've done anything else recently to stress the home machine's IO. I should revert to 4.17.3 anyway.
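
A rough way to check whether a machine is running a kernel in the suspect range is to compare version strings with GNU sort's -V option. This is only a sketch: the function name is made up for illustration, and it assumes that 'after 4.17.3' means any newer version, with no upper bound for a fixed release.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the given kernel release is newer than 4.17.3.
# Assumption: every version after 4.17.3 is treated as suspect.
is_after_4_17_3() {
    # Strip the release suffix, e.g. "4.17.5-100.fc27.x86_64" -> "4.17.5".
    ver=${1%%-*}
    # 4.17.3 itself is not "after" 4.17.3.
    [ "$ver" = "4.17.3" ] && return 1
    # sort -V orders version strings; if 4.17.3 sorts first, ver is newer.
    lowest=$(printf '%s\n%s\n' "4.17.3" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "4.17.3" ]
}

if is_after_4_17_3 "$(uname -r)"; then
    echo "running kernel $(uname -r) is newer than 4.17.3"
fi
```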

PPS: 4.17.7 kernels are in Fedora Bodhi but apparently not yet in the updates-testing DNF repo for Fedora 27. It looks like the most convenient way to get things from Bodhi is with the bodhi client program. I'm using it like so:

bodhi updates query --packages kernel --releases f27 --status pending
bodhi updates download --builds kernel-4.17.7-100.fc27

(You could leave out the '--status pending', but if you do you get a big list of past updates that aren't very interesting. If I'm fetching something from Bodhi it's because I can't get it anywhere else, so it's probably an update that's so new that it's not even in the updates-testing repo.)
