2021-05-26
Working around an AMDGPU automatic fan control problem on my Radeon RX 550
Back at the end of March, I discovered that the fans on my Radeon RX 550 weren't running under Fedora 33's 5.11.7 kernel, and also that Linux's hwmon system was willing to claim that they were (at 2100 RPM). At the time I tracked this down to something to do with pulse width modulation (PWM) fan control, where the kernel was deciding that the card's PWM duty cycle should be '0', i.e. never on. Unfortunately, this problem has continued in every Fedora kernel since then, up to and including the current 5.12.6. Today, sparked by updating the bug I filed, I spent some time investigating this and found a workaround, although not a fix.
The kernel module responsible for handling my card's fans is the amdgpu GPU driver module. Its documentation has a section on thermal controls and monitoring, which covers the GPU fan interface exposed through sysfs. As is the usual case, amdgpu set the GPU fan control to 'automatic fan speed control', which is the '2' value in the pwm1_enable sysfs file for the card. In 5.11 and later, the kernel boots up with the PWM duty cycle at 0%, which is to say a 0 value in pwm1.
(All of this is found in /sys/class/drm/card?/device/hwmon/hwmon?/. For my machine, it's 'card0' and 'hwmon2'.)
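Since the card and hwmon numbers vary between machines, here is a minimal Python sketch of finding the right directory. The function name find_gpu_fan_dirs and its sys_root parameter are my own inventions for illustration. It also assumes that checking for a pwm1_enable file is a good enough way to pick out a hwmon directory with fan control.

```python
import glob
import os


def find_gpu_fan_dirs(sys_root="/sys"):
    """Return hwmon directories under DRM cards that expose pwm1_enable.

    sys_root is a hypothetical parameter so the search can be pointed
    at a fake tree for testing; on a real system it is just /sys.
    """
    pattern = os.path.join(
        sys_root, "class", "drm", "card*", "device", "hwmon", "hwmon*"
    )
    # Assumption: a directory with a pwm1_enable file has PWM fan control.
    return sorted(
        d for d in glob.glob(pattern)
        if os.path.isfile(os.path.join(d, "pwm1_enable"))
    )


if __name__ == "__main__":
    for hwmon_dir in find_gpu_fan_dirs():
        print(hwmon_dir)
```

On my machine I would expect this to print the card0 and hwmon2 path, but I haven't relied on it for anything.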
While the GPU fan was under automatic control, writing any value to hwmon2/pwm1 did nothing; it stayed stuck at 0 (and the fan stayed inactive). The first workaround was to take manual control of the fan by writing 1 to hwmon2/pwm1_enable and then writing some value to hwmon2/pwm1, say the typical '81' of the 5.10.x kernels. This caused the fan to start going properly at its usual 800 RPM. The second workaround is that once the fan was started this way, I could write a '2' to hwmon2/pwm1_enable, theoretically switching back to automatic fan control, and the amdgpu driver properly took over again. In automatic mode with what I can only describe as a woken up driver, the driver (or the BIOS) is actually being more aggressive with the GPU fan than in 5.10; the fan's currently running at 1800 RPM instead of 800 RPM, and the GPU is 2 degrees C cooler than it was before.
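The two workarounds together amount to three sysfs writes, which can be sketched in Python as follows. The names wake_fan and write_sysfs are hypothetical, and the settle_seconds pause between the manual write and the switch back to automatic is purely my guess; I did the steps by hand and don't know if any delay is needed. Writing to these files requires root.

```python
import os
import time

PWM_MANUAL = "1"  # manual fan control
PWM_AUTO = "2"    # automatic fan speed control


def write_sysfs(path, value):
    """Write a single value to a sysfs-style file."""
    with open(path, "w") as f:
        f.write(str(value))


def wake_fan(hwmon_dir, pwm=81, settle_seconds=2.0):
    """Start the fan under manual control, then return it to automatic.

    pwm=81 is the value I saw under the 5.10.x kernels. settle_seconds
    is an assumption, not something the driver is known to require.
    """
    enable = os.path.join(hwmon_dir, "pwm1_enable")
    write_sysfs(enable, PWM_MANUAL)
    write_sysfs(os.path.join(hwmon_dir, "pwm1"), pwm)
    time.sleep(settle_seconds)
    write_sysfs(enable, PWM_AUTO)
```

For my card this would be called as wake_fan("/sys/class/drm/card0/device/hwmon/hwmon2"), presumably from something that runs once at boot.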
(Something is actively controlling the fan because the value of pwm1 shifts around on a rapid basis, going from 94 to 124 and back every few seconds.)
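To keep an eye on this without repeatedly running cat, a small Python sketch can read the relevant files in one go. The function names here are made up. The fan1_input (RPM) and temp1_input (millidegrees C) files are the standard hwmon names, which as far as I know amdgpu provides, but this assumes your card exposes both.

```python
import os


def read_int(path):
    """Read a sysfs-style file that holds a single integer."""
    with open(path) as f:
        return int(f.read().strip())


def read_fan_state(hwmon_dir):
    """Return the fan control mode, PWM value, fan RPM and GPU temperature."""
    return {
        "pwm_mode": read_int(os.path.join(hwmon_dir, "pwm1_enable")),
        "pwm": read_int(os.path.join(hwmon_dir, "pwm1")),
        "fan_rpm": read_int(os.path.join(hwmon_dir, "fan1_input")),
        # hwmon reports temperatures in millidegrees Celsius.
        "temp_c": read_int(os.path.join(hwmon_dir, "temp1_input")) / 1000.0,
    }
```

Calling this in a loop with a short sleep would show whether pwm1 is still moving around or has gone back to being stuck at 0.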
This still feels like a fragile workaround to me. The automatic fan control has already failed once (on boot), so I don't have complete confidence that it won't fail again at some point. One option is to switch to manual fan control through a daemon that monitors the GPU temperature and sets the fan speed following some set-points. The Arch wiki has a section on AMDGPU sysfs fan control with some options, and I've also read 'Controlling the fan curve of an AMD GPU on Pop!_OS'. I'm currently vaguely biased towards amdgpu-fan, but I haven't tried any of the options out.
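To illustrate what 'following some set-points' means, here is a rough Python sketch of the core calculation such a daemon might do: linear interpolation between (temperature, PWM) pairs, clamped at both ends. This is not how amdgpu-fan or any other specific tool works, the function name pwm_for_temp is hypothetical, and the example curve values are invented rather than anything I've tuned for my card.

```python
def pwm_for_temp(temp_c, curve):
    """Map a temperature to a PWM value (0-255) using set-points.

    curve is a list of (temp_c, pwm) pairs. Temperatures below the
    first set-point or above the last are clamped to the end values;
    in between, the PWM value is linearly interpolated.
    """
    points = sorted(curve)
    if temp_c <= points[0][0]:
        return points[0][1]
    for (t_lo, p_lo), (t_hi, p_hi) in zip(points, points[1:]):
        if temp_c <= t_hi:
            frac = (temp_c - t_lo) / (t_hi - t_lo)
            return round(p_lo + frac * (p_hi - p_lo))
    return points[-1][1]


# Invented example set-points, not recommendations for any real card.
EXAMPLE_CURVE = [(40, 80), (60, 140), (80, 255)]
```

A daemon built on this would read the GPU temperature periodically, compute the PWM value, and write it to pwm1 with pwm1_enable set to 1. It would also need to decide what to do if it crashes, since a fan left in manual mode stays at whatever was last written.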
(There's also the Arch wiki amdgpu page.)
PS: Setting the kernel driver's 'runpm' parameter to 0 didn't fix the problem. Once I read the full description of the module parameter in the amdgpu documentation, this wasn't all that surprising.