Uncertainties and issues in using IPMI temperature data

August 12, 2024

In a comment on my entry about a machine room temperature distribution surprise, tbuskey suggested (in part) using the temperature sensors that many server BMCs support and make visible through IPMI. As it happens, I have flirted with this and have some pessimistic views on it in practice in a lot of circumstances (although I'm less pessimistic now that I've looked at our actual data).

The big issue we've run into is limitations in what temperature sensors are available with any particular IPMI, which varies both between vendors and between server models even for the same vendor. Some of these sensors are clearly internal to the system and some are often vaguely described (at least in IPMI sensor names), and it's hit or miss if you have a sensor that either explicitly labels itself as an 'ambient' temperature or that is probably this because it's called an 'inlet' temperature. My view is that only sensors that report on ambient air temperature (at the intake point, where it is theoretically cool) are really useful, even for relative readings. Internal temperatures may not rise very much even if the ambient temperature does, because the system may respond with measures like ramping up fan speed; obviously this has limits, but you'd generally like to be alerted before things have gotten that bad.

(Out of the 85 of our servers that are currently reporting any IPMI temperatures at all, only 53 report an inlet temperature and only nine report an 'ambient' temperature. One server reports four inlet temperatures: 'ambient', two power supplies, and a 'board inlet' temperature. Currently its inlet ambient is 22C, the board inlet is 32C, and the power supplies are 31C and 36C.)
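As a rough illustration of the sensor triage described above, here is a minimal Python sketch that sorts temperature sensors by what their names claim to measure. The sensor names are hypothetical (real names vary by vendor and model), the function names are my own, and matching on the words 'ambient' and 'inlet' is an assumption about naming, not something any BMC guarantees.

```python
def classify_sensor(name):
    """Guess what a temperature sensor measures from its IPMI name alone."""
    lowered = name.lower()
    if "ambient" in lowered:
        return "ambient"
    if "inlet" in lowered:
        return "inlet"
    return "other"


def intake_candidates(readings):
    """Return (name, celsius) pairs that might reflect intake air.

    Sensors labelled 'ambient' come first, then 'inlet' ones, each group
    ordered from coolest to warmest. Everything else is dropped.
    """
    order = {"ambient": 0, "inlet": 1}
    picked = [
        (name, temp)
        for name, temp in readings.items()
        if classify_sensor(name) in order
    ]
    return sorted(picked, key=lambda item: (order[classify_sensor(item[0])], item[1]))


# Hypothetical sensor names. The first four values are the readings quoted
# above for the one server; the CPU value is made up for illustration.
example = {
    "Inlet Ambient": 22,
    "Board Inlet": 32,
    "PSU1 Inlet": 31,
    "PSU2 Inlet": 36,
    "CPU1 Temp": 60,
}
print(intake_candidates(example))
```

On that one server, the sensors with 'inlet' in their names sit roughly 10C or more above the ambient reading, which is a reminder that a name match only nominates a sensor as a candidate and doesn't prove it measures room air.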

The next issue I'm seeing in our data is that either we have temperature differences of multiple degrees C between machines higher and lower in racks, or the inlet temperature sensors aren't necessarily all that accurate (even within the same model of server, which will all have the 'inlet' temperature sensor in the same place). I'd be a bit surprised if our machine room ambient air did have this sort of temperature gradient, but I've been surprised before. But that probably means that you have to care about where in the rack your indicator machines are, not just where in the room.

(And where in the room probably matters too, as discussed. I see about a 5C swing in inlet temperatures between the highest and lowest machines in our main machine room.)
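One way to check whether rack height matters is to compare the top and bottom machines in each rack. The following is a minimal sketch under stated assumptions: you already have (rack, rack unit, inlet temperature) triples from somewhere, rack units count upward from the bottom, and all of the sample values are invented rather than taken from our data.

```python
from collections import defaultdict


def top_minus_bottom(samples):
    """Per rack, inlet temperature of the highest machine minus the lowest.

    samples is an iterable of (rack, rack_unit, inlet_celsius) tuples.
    Racks with fewer than two machines are skipped, since they have no
    gradient to measure.
    """
    per_rack = defaultdict(list)
    for rack, unit, temp in samples:
        per_rack[rack].append((unit, temp))

    gradients = {}
    for rack, entries in per_rack.items():
        if len(entries) < 2:
            continue
        entries.sort()  # order by rack unit, lowest position first
        _, bottom_temp = entries[0]
        _, top_temp = entries[-1]
        gradients[rack] = top_temp - bottom_temp
    return gradients


# Invented sample data: (rack name, rack unit, inlet temperature in C).
samples = [
    ("rack-a", 2, 20.0),
    ("rack-a", 20, 22.0),
    ("rack-a", 40, 24.5),
    ("rack-b", 10, 21.0),
]
print(top_minus_bottom(samples))
```

A consistently positive difference across racks would point toward a real vertical gradient, while differences that vary in sign would be more consistent with inaccurate sensors. Either way, this only compares two machines per rack, so it is a quick check and not a measurement.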

We push all of the IPMI readings we can get (temperature and otherwise) into our Prometheus environment and we use some of the IPMI inlet temperature readings to drive alerts. But we consider them only a backup to our normal machine room temperature monitoring, which is done by dedicated units that we trust; if we can't get readings from the main unit for some reason, the IPMI readings mean we'll at least get alerts if something also goes wrong with the air conditioning. I wouldn't want to use IPMI readings as our primary temperature monitoring unless I had no other choice.
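The shape of such a backup alert can be sketched in a few lines of Python. This is not our actual alert rule. The threshold, the host names, and especially the idea of requiring several hot machines before alerting (to avoid trusting one possibly inaccurate sensor) are all assumptions made for illustration.

```python
def inlet_alert(inlet_temps, threshold_c=30.0, min_hosts=3):
    """Decide whether IPMI inlet readings suggest a room-level problem.

    inlet_temps maps host name to inlet temperature in Celsius. Returns
    (should_alert, hot_hosts). Both threshold_c and min_hosts are
    hypothetical defaults, not recommended values.
    """
    hot = sorted(host for host, temp in inlet_temps.items() if temp >= threshold_c)
    return len(hot) >= min_hosts, hot


# Invented readings: one warm outlier should not trigger an alert.
normal = {"srv1": 22.0, "srv2": 23.5, "srv3": 31.0, "srv4": 21.0}
print(inlet_alert(normal))

# Invented readings: most machines are hot, as if the cooling had failed.
failing = {"srv1": 33.0, "srv2": 34.5, "srv3": 36.0, "srv4": 29.0}
print(inlet_alert(failing))
```

Requiring agreement between machines trades some sensitivity for fewer false alarms, which seems like the right trade for a backup signal.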

(The other aspect of using IPMI temperature measurements is that either the server has to be up or you have to be able to talk to its BMC over the network, depending on how you're collecting the readings. We generally collect IPMI readings through the host agent, using an appropriate ipmitool sub-command. Doing this through the host agent has the advantage that the BMC doesn't even have to be connected to the network, and usually we don't care about BMC sensor readings for machines that are not in service.)
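For illustration, here is a minimal Python sketch of turning ipmitool's text output into numbers. It assumes five-column, pipe-separated lines of the sort that 'ipmitool sdr type Temperature' commonly prints. The sample text and sensor names are made up, the exact layout can differ between ipmitool versions and BMCs, and this is not the collection code we actually use.

```python
import re

# Matches a value column such as "22 degrees C" (an assumed format).
_READING = re.compile(r"^(-?\d+(?:\.\d+)?) degrees C$")


def parse_temperature_lines(text):
    """Extract {sensor name: Celsius} from ipmitool-style sensor lines.

    Lines without exactly five pipe-separated fields, and lines whose
    value column is not a Celsius reading (for example 'Disabled' or
    'No Reading'), are skipped.
    """
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue
        name, _sensor_id, _status, _entity, value = fields
        match = _READING.match(value)
        if match is None:
            continue
        readings[name] = float(match.group(1))
    return readings


# Made-up sample text in the assumed format.
sample = """\
Inlet Temp       | 04h | ok  |  7.1 | 22 degrees C
Exhaust Temp     | 01h | ok  |  7.1 | 35 degrees C
Temp             | 0Eh | ns  |  3.1 | Disabled
"""
print(parse_temperature_lines(sample))
```

In real use the text would come from running ipmitool on the host itself (for example through subprocess.run), which is what lets this work even when the BMC has no network connection.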


Comments on this page:

By Walex at 2024-08-13 05:12:27:

"temperature differences of multiple degrees C between machines higher and lower in racks"

In a not-so-funny case I am aware of, a quite large cluster was densely packed into racks, with 41U of worker nodes and a 1U "top-of-rack" switch at the top (a rather dumb idea that so many people blindly adopt). When the top 10-15 worker nodes were switched on, the rising heat was such that the top-of-rack switch stopped working, making the whole rack inaccessible. So they had to keep around 1/3 of all worker nodes permanently switched off.

"a 5C swing in inlet temperatures between the highest and lowest machines in our main machine room.)"

You are lucky; a machine room I have "inherited" has rather uneven airflow, so I see differences of 12-14C. That is a bit extreme.

In general, given how badly so many machine rooms are designed, I like in-rack cooling (which unfortunately makes cabling harder, especially with "top-of-rack" switches).

But that is a pointless preference because on-premises machine rooms, for universities or small to medium businesses, are simply disappearing, replaced by "cloud" in the worst case and colocation in the best case.

Written on 12 August 2024.
Last modified: Mon Aug 12 23:24:40 2024