My new Ryzen desktop is causing Linux to hang (and it's frustrating)
Normally I try to stick to a sunny tone here. Today is an unfortunate exception, since it's a problem that I'm very close to and that I have no solution for.
Last Friday, I assembled the hardware for my new office workstation, updated the BIOS, and over the weekend let it sit powered on with a scratch Fedora install, periodically running some burn-in tests like mprime. Everything went fine. Normally I would probably have let the assembled machine sit for a while and do additional burn-in tests before doing anything more, but over the weekend my current (old) workstation showed worrying signs of more flakiness, so on Monday I swapped my disks over to the new Ryzen-based hardware. Everything came up quite easily and it all looked good (and clearly faster), right up until the machine started locking up.
At first I thought I had a culprit in the amdgpu kernel driver used by the new machine's Radeon RX 550 based graphics card, and I turned up a Fedora bug with a workaround. Unfortunately that doesn't appear to be a complete fix, because the machine has hung several times since then. For the latest hangs I've had netconsole enabled, and I've actually gotten output; unfortunately this has just made it more frustrating, because it is just a steady stream of 'watchdog: BUG: soft lockup - CPU#4 stuck for 23s!' reports.
(These reports are interesting in a way, because apparently the system is not so stuck that it cannot increment the timer. However, it is so stuck that it doesn't respond to the network or to the console, and it doesn't seem to notice a console Magic SysRq.)
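Since the soft lockup reports all share one format, a quick tally can show whether it is always the same CPU that gets stuck and how long the stalls grow. This is only a minimal Python sketch: `summarize_lockups` is a hypothetical helper name, and the regular expression assumes nothing beyond the message format quoted above.

```python
import re
from collections import Counter

# Matches the report format quoted in the text, e.g.
# "watchdog: BUG: soft lockup - CPU#4 stuck for 23s!"
LOCKUP_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s!")


def summarize_lockups(lines):
    """Return (reports per CPU, longest reported stall in seconds per CPU)."""
    counts = Counter()
    longest = {}
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m is None:
            continue  # not a soft lockup report; ignore it
        cpu, secs = int(m.group(1)), int(m.group(2))
        counts[cpu] += 1
        longest[cpu] = max(longest.get(cpu, 0), secs)
    return counts, longest
```

Feeding it the lines of a saved netconsole capture (for example, an open file object) would give a per-CPU count that is easier to compare between hangs than the raw stream.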
In both sets of netconsole traces I've collected so far, I see
things running through cross-CPU communication and often TLB
stuff in general, and specifically:
Call Trace:
  native_flush_tlb_others+0xd4/0x130
  flush_tlb_mm_range+0xae/0x120
  tlb_flush_mmu_tlbonly+0x80/0xe0
  arch_tlb_finish_mmu+0x3f/0x80
  tlb_finish_mmu+0x23/0x30
  unmap_region+0xf7/0x130
  ? __vma_rb_erase+0x1f1/0x270
  do_munmap+0x27c/0x460
  vm_munmap+0x69/0xb0
  SyS_munmap+0x22/0x30
  entry_SYSCALL_64_fastpath+0x20/0x83
This is interesting because of an old reddit post that blames this on 'Core C6 State', and there's also this bug report. On the other hand, the machine sat idle all weekend and didn't hang; in fact, it would have been more idle on the weekend than it was when it hung recently. However, I'm reduced to grasping at straws here.
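To check whether the TLB flush path really shows up in every hang, the call traces can be broken into frames and intersected. The sketch below is an illustration, not a tested tool: `parse_frames` and `common_symbols` are hypothetical names, and the regular expression assumes the `symbol+0xoffset/0xsize` frame format seen in the trace above. A leading `?` marks a frame the kernel's unwinder could not verify, so those frames are flagged as unreliable.

```python
import re

# One frame looks like "do_munmap+0x27c/0x460", optionally prefixed by "? "
FRAME_RE = re.compile(r"(\?\s+)?([A-Za-z_][\w.]*)\+0x([0-9a-f]+)/0x([0-9a-f]+)")


def parse_frames(trace_text):
    """Split a kernel call trace into a list of frame dictionaries."""
    frames = []
    for m in FRAME_RE.finditer(trace_text):
        frames.append({
            "symbol": m.group(2),
            "offset": int(m.group(3), 16),   # offset into the function
            "size": int(m.group(4), 16),     # total size of the function
            "reliable": m.group(1) is None,  # False for "?" frames
        })
    return frames


def common_symbols(traces):
    """Return symbols from reliable frames that occur in every trace."""
    sets = [
        {f["symbol"] for f in parse_frames(t) if f["reliable"]}
        for t in traces
    ]
    return set.intersection(*sets) if sets else set()
```

If `common_symbols` over all collected traces keeps returning functions such as `flush_tlb_mm_range`, that would at least support the idea that the hangs share a cause.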
The sensible thing to do right now is probably to swap my disks back into my old hardware until I have more time to deal with the problem (I have to get things stabilized tomorrow). But it is tempting to grasp at a number of straws:
- swap the RX 550 card out for the very basic card in my old
machine. This should completely eliminate both
amdgpu and the new GPU hardware itself as a source of issues.
- switch back to a kernel before CONFIG_RETPOLINE, because I use a number of out of tree modules and I've noticed their build process muttering about my gcc not having the needed support for this. I'm using the latest Fedora released gcc and you'd hope that that would be good enough, but I have no idea what's going on.
- go through the BIOS to turn off 'Core C6 State' and any other fancy tuning options (and verify that it hasn't decided to silently turn on some theoretically mild and harmless automatic overclocking options). It's possible that the BIOS is deciding to do things that Linux objects to, although I don't know why it would have started to fail only once I swapped disks around. (The paranoid person wonders about UEFI versus MBR booting, but I'm not sure I'm that paranoid.)
(If I did all of this and the machine hung anyway, well, I'd be able to swap my disks back into my old desktop with no regrets.)
In the longer term, troubleshooting this and reporting any issues is probably going to be quite complicated. One of the problems is that I absolutely have to have one out of kernel module (ZFS on Linux) and I very much want another one (WireGuard). I suspect that the presence of these will cause any bug reports to be rejected more or less out of hand. In an ideal world this problem will reproduce itself on a scratch Fedora install with a stock kernel environment that's doing things like running graphics stress programs, but I'm not going to hold my breath on that. It seems quite possible that it will only happen if I'm actually using the machine, which has all sorts of problems.
(I have one complicated idea but it is complicated and rather annoying.)
The whole thing is frustrating and puzzling. We have a stable Ubuntu
machine with a Ryzen 1800X and the same motherboard (but a different
GPU), and this machine itself seemed fine right up until I swapped
in my existing disks. Even post-swap it was perfectly fine with a
mprime -t run over last night. But if I use it, it hangs
sooner or later, and it now seems to be hanging even when I don't
use it.
(And it appears that this motherboard doesn't have a hardware watchdog timer that's currently supported by Linux. I tried enabling the software watchdog, but it didn't trigger for literally hours, and when it finally did, it apparently didn't manage to actually reboot the system, which is perhaps not too surprising under the circumstances.)
PPS: This does put a rather large crimp in my Ryzen temptation, especially if this is something systemic and widespread.
Sidebar: It's possible that I've had multiple issues
I may have hit both an amdgpu issue with Radeon RX 550s, which I've
now mitigated, and some sort of issue with the BIOS putting a chunk
of the machine to sleep and then Linux not being able to wake it
up again. My initial hangs definitely happened while I was in front
of the machine actively using it, but I believe that the hangs since
I set amdgpu.dpm=0 this morning have been when I wasn't around
the machine and it was at least partially idle. These are the only
hangs that I have netconsole logs for, too, and they show that the
machine is partially alive instead of totally hung.
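The two-issue theory only holds if amdgpu.dpm=0 was really in effect for the later hangs, so it is worth confirming after each reboot. The following is a minimal sketch that assumes the setting was passed on the kernel command line (the text does not say how it was applied); `parse_cmdline` and `read_cmdline` are hypothetical helper names, and values containing quoted spaces are not handled.

```python
def parse_cmdline(text):
    """Parse a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read and parse the running kernel's command line (Linux only)."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

On the running machine, `read_cmdline().get("amdgpu.dpm")` returning `"0"` would confirm that the workaround was active for that boot.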