2021-09-28
Avoiding flawed server ACPI information for power monitoring
Today I noticed that one of our servers was regularly logging a burst of kernel messages about problems with ACPI data. These messages looked like:
ACPI Error: No handler for Region [POWR] (000000000c9d7b92) [IPMI] (20190816/evregion-129)
ACPI Error: Region IPMI (ID=7) has no handler (20190816/exfldio-261)
No Local Variables are initialized for Method [_PMM]
No Arguments are initialized for method [_PMM]
ACPI Error: Aborting method \_SB.PMI0._PMM due to previous error (AE_NOT_EXIST) (20190816/psparse-529)
ACPI Error: AE_NOT_EXIST, Evaluating _PMM (20190816/power_meter-325)
This surprised me, because in this day and age I would expect servers
like this (a current model from a name brand vendor) to not have
ACPI problems, especially with Linux. But here we are. This particular
set of ACPI error reports is happening because the Prometheus host
agent was trying to read power usage information from /sys/class/hwmon
that was theoretically available through ACPI.
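To see which hwmon devices on a machine offer power readings at all, something like the following can help. This is a minimal sketch, not what the Prometheus host agent itself does; the function name find_power_sensors is made up for illustration, and it assumes the usual sysfs layout of one hwmonN directory per device with a `name` file and `power*_` attribute files. It deliberately only lists the power files and never reads their values, since reading them appears to be what sets off the ACPI errors.

```python
import os

def find_power_sensors(hwmon_root="/sys/class/hwmon"):
    """Map each hwmon entry to (device name, list of power*_ attribute files).

    Hypothetical helper: only lists files, never reads power values.
    """
    found = {}
    try:
        entries = sorted(os.listdir(hwmon_root))
    except OSError:
        return found  # no hwmon support, or not a Linux system
    for entry in entries:
        dev_dir = os.path.join(hwmon_root, entry)
        try:
            with open(os.path.join(dev_dir, "name")) as f:
                name = f.read().strip()
        except OSError:
            name = "(unknown)"
        try:
            files = sorted(os.listdir(dev_dir))
        except OSError:
            continue
        # Requiring "_" skips the unrelated "power" runtime-PM directory.
        power_files = [fn for fn in files
                       if fn.startswith("power") and "_" in fn]
        if power_files:
            found[entry] = (name, power_files)
    return found

if __name__ == "__main__":
    for dev, (name, files) in find_power_sensors().items():
        print(dev, name, ", ".join(files))
```

Any device this reports is a candidate for what the host agent was reading; which kernel driver is behind it still has to be checked separately.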
In modern kernels, the acpi_power_meter kernel module is what
extracts this information from the depths of ACPI (or tries to);
it is, to quote it, "a hwmon driver for ACPI 4.0 power meters". As
with all information that comes from ACPI, the driver does this by
asking the kernel's general ACPI subsystem to perform ACPI magic, and
it's this that is failing because Linux feels the BIOS's ACPI data has
problems (going by the messages, the _PMM method wants an IPMI region
that has no handler).
Unfortunately there's no good way to fix bad ACPI data like this;
all we can do is stop looking at it. In this case, the best way to
do that is to unload the acpi_power_meter module and blacklist it
so that it won't be reloaded on reboot. One set of directions for
this is in this Prometheus host agent issue.
(Since the module seems to not be able to do anything due to the bad ACPI information, I don't feel too bad about blocking it entirely.)
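The general shape of the unload-and-blacklist step can be sketched as follows. This is my own illustration under some assumptions, not a copy of the directions in the linked issue: the helper names and the file name under /etc/modprobe.d are hypothetical, and the script only prints what it would suggest instead of changing anything. It relies on /proc/modules listing one loaded module per line with the module name as the first field.

```python
def module_loaded(name, proc_modules_text):
    """Report whether `name` appears as a loaded module in /proc/modules text."""
    wanted = name.replace("-", "_")  # /proc/modules uses underscores
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == wanted:
            return True
    return False

def blacklist_conf(name):
    """Return the modprobe.d line that blacklists `name`."""
    return "blacklist %s\n" % name

if __name__ == "__main__":
    module = "acpi_power_meter"
    # Hypothetical file name; any *.conf file in /etc/modprobe.d should do.
    conf_path = "/etc/modprobe.d/blacklist-acpi-power-meter.conf"
    try:
        with open("/proc/modules") as f:
            loaded = module_loaded(module, f.read())
    except OSError:
        loaded = False
    if loaded:
        print("module is loaded; unload it as root with: modprobe -r", module)
    print("to keep it from being loaded automatically, put this in", conf_path)
    print(blacklist_conf(module), end="")
```

Note that a modprobe.d blacklist only stops the module from being loaded automatically through its aliases; an explicit modprobe of the module name will still load it.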
As a side note, this is another case of a set of kernel error messages
that should be rate-limited but aren't. The BIOS's ACPI data is rather
unlikely to change while the kernel is running, so this error is
essentially permanent until reboot. Reporting it every time something
peers at the /sys hwmon files is not particularly useful and is
a great way to have your kernel messages spammed, driving out more
important things.
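To make "rate-limited" concrete, here is a small user-space illustration of the idea. It is only a sketch of the general technique and not how the kernel implements its own message rate limiting; the function name rate_limit and the (time, message) event format are assumptions made for the example. A message is passed through the first time it is seen and then suppressed until at least `interval` seconds have gone by since it was last emitted.

```python
def rate_limit(events, interval):
    """Yield (time, message) events, dropping any repeat of a message that
    arrives less than `interval` seconds after it was last emitted."""
    last_emitted = {}
    for when, message in events:
        previous = last_emitted.get(message)
        if previous is None or when - previous >= interval:
            last_emitted[message] = when
            yield when, message

if __name__ == "__main__":
    burst = [
        (0, "ACPI Error: AE_NOT_EXIST, Evaluating _PMM"),
        (15, "ACPI Error: AE_NOT_EXIST, Evaluating _PMM"),
        (30, "ACPI Error: AE_NOT_EXIST, Evaluating _PMM"),
        (3600, "ACPI Error: AE_NOT_EXIST, Evaluating _PMM"),
    ]
    for when, message in rate_limit(burst, interval=3600):
        print(when, message)
```

With an error that can't change until reboot, even a very long interval loses no information.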