Wandering Thoughts archives

2021-03-26

Linux's hardware monitoring can lie to you

Let's start with my tweets:

Fedora 33's 5.11.7 kernel seems to do a real number on power/fan/temperature levels of my Radeon RX 550 sitting idle in framebuffer text console. Fan RPMs went from 780 or so to 2100 and temperature jumped from 29C to 31C and climbing.

[...]

Ah. The reason my GPU's temperature is steadily climbing despite the fans running at 2100 RPM or so, as reported by Linux hardware monitoring, is because the fans are in fact not running at all.

The Linux kernel exposes hardware monitoring information in /sys, as covered in some kernel documentation, although you need the relevant drivers to support this. My office machine has an AMD Radeon RX 550, and the kernel's amdgpu driver module for it exposes various sensor information through this general hardware monitoring interface. Lm_sensors shows the driver's sensors as 'vddgfx' (in volts), 'fan1', 'edge' (temperature), and 'power1' (in watts).

(The exact /sys path for a GPU is somewhat arcane, but you can usually get to it with /sys/class/drm/card0/device/hwmon/ and then some numbered 'hwmonN' subdirectory. There's also /sys/kernel/debug/dri/0 with various things, including an amdgpu_pm_info file that reports things in text.)
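(As a rough illustration of poking at this by hand, here is a minimal Python sketch that finds the 'hwmonN' subdirectory and reads a few values. It assumes the standard hwmon sysfs file names 'fan1_input' (RPM), 'temp1_input' (millidegrees C), and 'pwm1'. The function names are my own invention, and a particular driver may not provide all of these files, in which case the value comes back as None.)

```python
from pathlib import Path


def find_hwmon_dirs(device_dir="/sys/class/drm/card0/device"):
    """Return the hwmonN directories under a device's hwmon/ subdirectory."""
    hwmon_root = Path(device_dir) / "hwmon"
    if not hwmon_root.is_dir():
        return []
    return sorted(p for p in hwmon_root.iterdir() if p.name.startswith("hwmon"))


def read_int(path):
    """Read one integer from a sysfs file; None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_sensors(hwmon_dir):
    """Read fan RPM, temperature, and raw PWM from one hwmon directory."""
    d = Path(hwmon_dir)
    fan = read_int(d / "fan1_input")      # RPM as reported by the driver
    temp = read_int(d / "temp1_input")    # millidegrees Celsius
    pwm = read_int(d / "pwm1")            # raw PWM value
    return {
        "fan_rpm": fan,
        "temp_c": None if temp is None else temp / 1000,
        "pwm": pwm,
    }


if __name__ == "__main__":
    # card0 is only the usual case; the GPU may be a different card number.
    for hwmon_dir in find_hwmon_dirs():
        print(hwmon_dir, read_sensors(hwmon_dir))
```

(The point of reading 'pwm1' alongside 'fan1_input' is that neither number means much in isolation.)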

My GPU's fan (really fans) seems to use pulse width modulation (PWM), based on the PWM-related sensor information that shows up in amdgpu's hwmon directory. Under 5.11.7 (and 5.11.8), the PWM value appears to be 0 (instead of its usual '81'). I suspect that this means that regardless of the reported RPMs, the PWM duty cycle was 0% and so the fans weren't turning. Why the GPU and the amdgpu driver together reported 2100 RPM instead of some other value, I have no idea (and it wasn't a constant 2100 RPM; it fluctuated a bit).
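(To make the cross-check concrete, here is a small Python sketch. It assumes the usual hwmon convention that a 'pwm1' value runs from 0 to 255, so the usual '81' would be a duty cycle of roughly 32%. The function names and the rule itself are mine, not anything lm_sensors or the kernel provides, and a fan that reports 0 RPM at a PWM of 0 may simply be stopped on purpose.)

```python
def pwm_percent(pwm_raw, pwm_max=255):
    """Convert a raw hwmon PWM value to a duty cycle percentage.

    Assumes the common 0..pwm_max scale; 255 is the usual hwmon maximum.
    """
    return 100.0 * pwm_raw / pwm_max


def fan_reading_suspicious(fan_rpm, pwm_raw):
    """True if the fan claims to be spinning while the PWM duty cycle is 0."""
    return pwm_raw == 0 and fan_rpm > 0


# The numbers from this entry: normal operation versus 5.11.7.
print(round(pwm_percent(81), 1), fan_reading_suspicious(780, 81))
print(round(pwm_percent(0), 1), fan_reading_suspicious(2100, 0))
```

(This only catches the specific mismatch I ran into; it's a sanity check, not a way of knowing what the fans are really doing.)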

At a minimum, this tells me that straightforward interpretations of hwmon values may be misleading, because you need to look at other hwmon values for context. More generally, hwmon values are only as trustworthy as the combination of the hardware and the driver reporting them, and clearly some combinations don't report useful values. Common tools, like lm_sensors, may not cover corner cases (such as the PWM duty cycle being 0), so looking at their output may mislead you about the state of your hardware. In the end, nothing beats actually looking at things in person, which is a little bit alarming in these work-from-home times when that's a bit difficult.

(The good news is that the Prometheus host agent does capture the hwmon PWM value, so you can go back and look for anomalies.)

linux/HwmonCanLie written at 00:15:17

