Wandering Thoughts archives

2021-09-28

Avoiding flawed server ACPI information for power monitoring

Today I noticed that one of our servers was regularly logging a burst of kernel messages about problems with ACPI data. These messages looked like:

ACPI Error: No handler for Region [POWR] (000000000c9d7b92) [IPMI] (20190816/evregion-129)
ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-261)
No Local Variables are initialized for Method [_PMM]
No Arguments are initialized for method [_PMM]
ACPI Error: Aborting method \_SB.PMI0._PMM due to previous error (AE_NOT_EXIST) (20190816/psparse-529)
ACPI Error: AE_NOT_EXIST, Evaluating _PMM (20190816/power_meter-325)

This surprised me, because in this day and age I would expect servers like this (a current model from a name brand vendor) not to have ACPI problems, especially with Linux. But here we are. This particular set of ACPI error reports happens because the Prometheus host agent tries to read power usage information from /sys/class/hwmon that is theoretically available through ACPI.

In modern kernels, the acpi_power_meter kernel module is what extracts this information from the depths of ACPI (or tries to); it is, to quote it, "a hwmon driver for ACPI 4.0 power meters". As with everything that comes from ACPI, the driver does this by asking the kernel's general ACPI subsystem to perform ACPI magic, and it's this that is failing because Linux feels the BIOS's ACPI data has problems. Unfortunately there's no good way to fix bad ACPI data like this; all we can do is stop looking at it. In this case, the best way to do that is to unload the acpi_power_meter module and blacklist it so that it won't be reloaded on reboot. One set of directions for this is in this Prometheus host agent issue.
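As a rough illustration of the unload-and-blacklist step, here is a minimal Python sketch. It only inspects state and prints what one would do; it does not change anything itself. The helper names (module_loaded, blacklist_conf) and the file name under /etc/modprobe.d are my own assumptions for illustration, not something taken from the Prometheus issue. The "blacklist" directive and "modprobe -r" are standard modprobe features.

```python
from pathlib import Path

MODULE = "acpi_power_meter"
# Hypothetical file name; any *.conf file under /etc/modprobe.d is read.
CONF_PATH = Path("/etc/modprobe.d/blacklist-acpi-power-meter.conf")


def module_loaded(proc_modules_text: str, module: str = MODULE) -> bool:
    """Return True if `module` is listed in the text of /proc/modules.

    Each line of /proc/modules starts with the module name, followed by
    whitespace-separated fields (size, use count, and so on).
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == module:
            return True
    return False


def blacklist_conf(module: str = MODULE) -> str:
    """Return modprobe.d file content that stops automatic loading of `module`."""
    return (
        f"# Flawed ACPI power meter data; keep {module} from auto-loading.\n"
        f"blacklist {module}\n"
    )


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    text = proc_modules.read_text() if proc_modules.exists() else ""
    if module_loaded(text):
        # Printed, not executed: unloading needs root and is your call.
        print(f"currently loaded; unload with: modprobe -r {MODULE}")
    else:
        print(f"{MODULE} is not currently loaded")
    print(f"suggested content for {CONF_PATH}:")
    print(blacklist_conf(), end="")
```

Note that a "blacklist" line only stops the module from being loaded automatically through its aliases; an explicit modprobe of the module by name will still load it.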

(Since the module seems to not be able to do anything due to the bad ACPI information, I don't feel too bad about blocking it entirely.)

As a side note, this is another case of a set of kernel error messages that should be rate-limited but aren't. The BIOS's ACPI data is rather unlikely to change while the kernel is running, so this error is essentially permanent until reboot. Reporting it every time something peers at the /sys hwmon files is not particularly useful and is a great way to have your kernel messages spammed, driving out more important things.
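To make "rate-limited" concrete, here is a small Python sketch of the general idea: allow a limited burst of messages per key in each time window and suppress the rest. This is not the kernel's code. It is only loosely modeled on the idea behind the kernel's rate-limited printk helpers, and the class name, the per-key bookkeeping, and the injectable clock are my own assumptions for illustration.

```python
import time


class RateLimiter:
    """Allow at most `burst` messages per `interval` seconds for each key."""

    def __init__(self, interval=5.0, burst=10, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock
        # key -> (window start time, messages emitted, messages suppressed)
        self._windows = {}

    def allow(self, key) -> bool:
        """Return True if a message for `key` should be emitted now."""
        now = self.clock()
        start, emitted, suppressed = self._windows.get(key, (now, 0, 0))
        if now - start >= self.interval:
            # The old window has expired; start a fresh one.
            start, emitted, suppressed = now, 0, 0
        if emitted < self.burst:
            self._windows[key] = (start, emitted + 1, suppressed)
            return True
        self._windows[key] = (start, emitted, suppressed + 1)
        return False

    def suppressed(self, key) -> int:
        """Return how many messages for `key` were dropped in its current window."""
        return self._windows.get(key, (0.0, 0, 0))[2]
```

With something like this keyed on the failing ACPI method, a permanent error would be reported a few times per window instead of on every read of the /sys hwmon files.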

linux/ACPIFlawedPowerMonitoring written at 23:55:05

