Working around an AMDGPU automatic fan control problem on my Radeon RX 550

May 26, 2021

Back at the end of March, I discovered that the fans on my Radeon RX 550 weren't running under Fedora 33's 5.11.7 kernel, and also that Linux's hwmon system was willing to claim that they were (at 2100 RPM). At the time I tracked this down to something to do with pulse width modulation (PWM) fan control, where the kernel was deciding that the card's PWM duty cycle should be '0', i.e. never on. Unfortunately, this problem has continued in every Fedora kernel since then, up to and including the current 5.12.6. Today, sparked by updating the bug I filed, I spent some time investigating this and found a workaround, although not a fix.

The kernel module responsible for handling my card's fans is the amdgpu GPU driver module. Its documentation has a section on thermal controls and monitoring, which covers the GPU fan interface exposed through sysfs. As is usual, amdgpu sets the GPU fan control to 'automatic fan speed control', which is the '2' value in the pwm1_enable sysfs file for the card. In 5.11 and later, the kernel boots up with the PWM duty cycle at 0%, which is to say a 0 value in pwm1.

(All of this is found in /sys/class/drm/card?/device/hwmon/hwmon?/. For my machine, it's 'card0' and 'hwmon2'.)
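As a rough illustration of what looking at these files involves, here is a minimal Python sketch that finds the hwmon directory and reads the fan state. The helper names (find_hwmon_dirs, fan_state) are made up for this example. The 1 and 2 values of pwm1_enable are the ones discussed in this entry; the meaning given for 0, and the fan1_input file as the source of the reported RPM, are my reading of the amdgpu documentation, so treat both as assumptions.

```python
from pathlib import Path

# pwm1_enable values. 1 and 2 are the ones used in this entry; the label
# for 0 is an assumption based on the amdgpu documentation.
PWM_MODES = {0: "no fan speed control", 1: "manual", 2: "automatic"}


def find_hwmon_dirs(root="/sys/class/drm"):
    """Return every card*/device/hwmon/hwmon* directory under root."""
    return sorted(Path(root).glob("card*/device/hwmon/hwmon*"))


def read_int(path):
    """Read a sysfs-style file that holds a single integer."""
    return int(Path(path).read_text().strip())


def fan_state(hwmon_dir):
    """Collect the fan control mode, PWM duty value and reported RPM."""
    hwmon = Path(hwmon_dir)
    mode = read_int(hwmon / "pwm1_enable")
    return {
        "mode": PWM_MODES.get(mode, f"unknown ({mode})"),
        "pwm": read_int(hwmon / "pwm1"),        # raw duty cycle value, 0 means off
        "rpm": read_int(hwmon / "fan1_input"),  # what hwmon claims the fan is doing
    }
```

Calling fan_state() on each directory returned by find_hwmon_dirs() should show the situation described here: a mode of "automatic", a pwm of 0, and an RPM figure that may not reflect reality.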

While the GPU fan was under automatic control, writing any value to hwmon2/pwm1 did nothing; it stayed stuck at 0 (and the fan stayed inactive). The first workaround was to take manual control of the fan by writing 1 to hwmon2/pwm1_enable and then writing some value to hwmon2/pwm1, say the typical '81' of the 5.10.x kernels. This caused the fan to start going properly at its usual 800 RPM. The second workaround was that once the fan was started this way, I could write a '2' to hwmon2/pwm1_enable, theoretically switching back to automatic fan control, and the amdgpu driver properly took over again. In automatic mode with what I can only describe as a woken up driver, the driver (or the BIOS) is actually being more aggressive with the GPU fan than in 5.10; the fan's currently running at 1800 RPM instead of 800 RPM, and the GPU is 2 degrees C cooler than it was before.

(Something is actively controlling the fan because the value of pwm1 shifts around rapidly, going from 94 to 124 and back every few seconds.)
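The sequence of writes in the workaround can be sketched in Python as follows. This is only an illustration of the steps above, not something I run; the function name kick_fan and the settle_seconds pause between the steps are assumptions of the sketch (I did the writes by hand, with no particular timing), and writing to the real sysfs files needs root.

```python
import time
from pathlib import Path

MANUAL = 1     # pwm1_enable value for manual fan control
AUTOMATIC = 2  # pwm1_enable value for automatic fan control


def kick_fan(hwmon_dir, pwm=81, settle_seconds=5.0):
    """Start the fan under manual control, then hand it back to the driver.

    pwm defaults to 81, the value typically seen under 5.10.x kernels.
    settle_seconds is a hypothetical pause, not something the driver
    is known to require.
    """
    if not 0 <= pwm <= 255:
        raise ValueError("pwm must be between 0 and 255")
    hwmon = Path(hwmon_dir)
    # Step 1: take manual control and set a non-zero duty value.
    (hwmon / "pwm1_enable").write_text(f"{MANUAL}\n")
    (hwmon / "pwm1").write_text(f"{pwm}\n")
    # Give the fan a moment to spin up before switching modes.
    time.sleep(settle_seconds)
    # Step 2: switch back to automatic fan control.
    (hwmon / "pwm1_enable").write_text(f"{AUTOMATIC}\n")
```

On my machine this would be called as kick_fan("/sys/class/drm/card0/device/hwmon/hwmon2").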

This still feels like a fragile workaround to me. The automatic fan control has already failed once (on boot), so I don't have complete confidence that it won't fail again at some point. One option is to switch to manual fan control through a daemon that monitors the GPU temperature following some set-points. The Arch wiki has a section on AMDGPU sysfs fan control with some options and I've also read Controlling the fan curve of an AMD GPU on Pop!_OS. I'm currently vaguely biased towards amdgpu-fan, but I haven't tried any of the options out.

(There's also the Arch wiki amdgpu page.)
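To make the set-point idea concrete, here is a minimal sketch of the core of such a daemon: a function that linearly interpolates a PWM value between set-points, plus one step that applies it. This is not how amdgpu-fan or any of the linked tools actually work. The CURVE numbers are invented for illustration and are not tuned for any card, and the sketch assumes the GPU temperature is available in temp1_input in millidegrees Celsius, which is the usual hwmon convention.

```python
from pathlib import Path

# Hypothetical set-points as (temperature in C, PWM value 0-255).
# These numbers are illustrative only, not tuned for a real card.
CURVE = [(40, 60), (60, 120), (75, 200), (85, 255)]


def pwm_for_temp(temp_c, curve=CURVE):
    """Linearly interpolate a PWM value for temp_c between set-points."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    for (t_lo, p_lo), (t_hi, p_hi) in zip(curve, curve[1:]):
        if temp_c <= t_hi:
            frac = (temp_c - t_lo) / (t_hi - t_lo)
            return round(p_lo + frac * (p_hi - p_lo))
    # Hotter than the last set-point: run at that set-point's value.
    return curve[-1][1]


def apply_once(hwmon_dir, curve=CURVE):
    """Read the GPU temperature once and set the fan PWM to match."""
    hwmon = Path(hwmon_dir)
    # Assumption: temp1_input reports millidegrees Celsius.
    temp_c = int((hwmon / "temp1_input").read_text()) / 1000
    pwm = pwm_for_temp(temp_c, curve)
    (hwmon / "pwm1_enable").write_text("1\n")  # manual fan control
    (hwmon / "pwm1").write_text(f"{pwm}\n")
    return pwm
```

A real daemon would call apply_once() in a loop with a sleep between iterations, and would want some hysteresis so that the fan speed doesn't bounce around at set-point boundaries.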

PS: Setting the kernel driver's 'runpm' parameter to 0 didn't fix the problem. Once I read the full description of the module parameter in the amdgpu documentation, this wasn't all that surprising.

Written on 26 May 2021.