My Ryzen-based Linux office machine appears to finally be stable
Back in January, I switched over to my Ryzen-based office workstation and unfortunately more or less immediately hit problems. The most pernicious was a hang that happened under some circumstances when the machine became idle, which turned out to be a known issue that a fair number of people were running into (Fedora, kernel.org, Ubuntu). From the bug reports about the issue, I was able to find some kernel parameters that stabilized my system, but I didn't consider this really satisfactory for various reasons.
For a long time these magic kernel command line parameters and similar tricks were the only workarounds available, at least to me. However, there had long been rumors of a magic AMD-provided firmware option that could work around the problem, generally exposed to you and me in a BIOS setting called 'Power Supply Idle Control', which you allegedly wanted to set to 'Typical current idle'. This apparently became available starting with AGESA 184.108.40.206a, which various motherboard vendors rolled into their overall BIOS at very different times. For bonus fun, apparently not all BIOS vendors even expose these AMD firmware settings, although enthusiast motherboards usually do.
(AMD may have released this as far back as last December, but on the ASUS Prime X370-PRO it appeared no earlier than BIOS 4008, from mid-April, and perhaps required the June 2nd BIOS 4011.)
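If you want to know which BIOS version a machine is running without rebooting into the setup screens, Linux normally exposes the DMI strings under /sys/class/dmi/id. The following is a minimal Python sketch of such a check, not a definitive tool. The threshold of 4008 comes from the note above and only means something on the Prime X370-PRO, and the assumption that the board reports its BIOS version as a plain number like '4011' may not hold for other vendors. The function names are made up for illustration.

```python
from pathlib import Path

# Standard Linux sysfs directory for DMI/SMBIOS identification strings.
DMI_DIR = Path("/sys/class/dmi/id")

# Assumption: the earliest Prime X370-PRO BIOS that may carry the option.
# This number is meaningless for any other motherboard.
MIN_BIOS = 4008


def read_dmi(field, dmi_dir=DMI_DIR):
    """Return one DMI field such as 'bios_version', or None if unreadable."""
    try:
        return (Path(dmi_dir) / field).read_text().strip()
    except OSError:
        return None


def bios_at_least(version, minimum=MIN_BIOS):
    """Compare a purely numeric BIOS version string against a minimum.

    Returns True or False for numeric strings and None when the version
    is missing or not a plain number (so no comparison is attempted).
    """
    if version is None or not version.isdigit():
        return None
    return int(version) >= minimum


if __name__ == "__main__":
    board = read_dmi("board_name")
    version = read_dmi("bios_version")
    print(f"board: {board}, BIOS: {version}")
    print(f"at least {MIN_BIOS}: {bios_at_least(version)}")
```

A new enough BIOS only tells you that the option might be there. Whether it is actually exposed and what it is set to still has to be checked in the BIOS setup itself.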
I've been running my Fedora 27 Ryzen workstation with only this BIOS setting (ie, with no more special kernel command line parameters) since June 11th, using Asus's Prime X370-PRO BIOS version 4011. Although Fedora keeps coming out with kernel updates that get me to reboot the machine, it has been stable overnight and over weekends, which is something that it couldn't manage before on the rare occasions when I took out my kernel parameters workaround as an experiment. Over this time I've used both the Fedora 27 4.16.x kernel and just recently the 4.17.x kernel from the updates-testing repo; both have been stable and free of hangs (so far).
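When testing the BIOS setting on its own, it is worth confirming that the running kernel really was booted without the old workaround parameters, since they can linger in the bootloader configuration. Here is a small Python sketch that looks through /proc/cmdline for them. It only knows about rcu_nocbs by default, because that is the parameter named in this entry; any other names you pass in are your own additions, and the function names and the '0-15' value in the comment are purely illustrative.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Bare flags without '=' map to None. Quoted values containing
    spaces are not handled; this is only a rough sketch.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def leftover_workarounds(text, names=("rcu_nocbs",)):
    """Return the listed workaround parameters still present in text."""
    params = parse_cmdline(text)
    return {name: params[name] for name in names if name in params}


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""
    # With a workaround still active this might show something like
    # {'rcu_nocbs': '0-15'} (an illustrative value, not a recommendation).
    found = leftover_workarounds(current)
    print(found if found else "no workaround parameters found")
```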
Given my experience so far and that most people who've tried this BIOS option have also reported good results with it, I'm cautiously optimistic that my machine is now stable without needing kernel behavior changes. I haven't re-done my power measurements so I have no idea if the machine uses somewhat more power when deeply idle, and honestly I don't care.
This has been a long time coming, but at least it seems to finally be here.
(It's possible that I could have done this back in mid-April, but the timing was bad to try it out at the time for reasons beyond the scope of this entry. In general I haven't been feeling very enthusiastic about taking stability risks with this machine; once the kernel parameters seemed to work I was willing to let things sit for a while instead of rushing into more experimentation and possible failures and frustrations.)
As a side note, finding the option in your BIOS is generally a bit tricky because it's usually hiding inside an AMD-provided blob of settings. On the Prime X370-PRO (which I believe is typical), you have to go to the 'Advanced' menu of additional settings, then go down to the bottom to something called 'AMD CBS' or 'CBS', and expand it to actually see the setting. Unlike vendor-provided BIOS settings, there probably isn't any documentation.
(The stuff in the AMD CBS submenu is apparently something AMD supplies to vendors as basically a black box blob that they insert somewhere in their UEFI menus. What AMD includes in the settings varies from AGESA version to AGESA version and they're generally mostly undocumented.)
Sidebar: Why I switched from kernel parameters to the BIOS setting
The short version is that I considered the kernel parameters to be
fragile magic, specifically the
rcu_nocbs setting, since it
pretty much had to be staving off the hangs only through some
indirect and perhaps coincidental effect on the overall system's
behavior. The problem with indirect, undesigned, and coincidental
effects is that they can easily go away or change when people make
changes elsewhere in the kernel or the system.

The AMD BIOS setting is its own sort of magic, but at least it's
direct magic and hopefully it's at less risk of being destabilized
by kernel or system changes.