Uncertainties and issues in using IPMI temperature data
In a comment on my entry about a machine room temperature distribution surprise, tbuskey suggested (in part) using the temperature sensors that many server BMCs support and make visible through IPMI. As it happens, I have flirted with this and have some pessimistic views on it in practice in a lot of circumstances (although I'm less pessimistic now that I've looked at our actual data).
The big issue we've run into is limitations in what temperature sensors are available with any particular IPMI, which varies both between vendors and between server models even for the same vendor. Some of these sensors are clearly internal to the system and some are often vaguely described (at least in IPMI sensor names), and it's hit or miss if you have a sensor that either explicitly labels itself as an 'ambient' temperature or that is probably this because it's called an 'inlet' temperature. My view is that only sensors that report on ambient air temperature (at the intake point, where it is theoretically cool) are really useful, even for relative readings. Internal temperatures may not rise very much even if the ambient temperature does, because the system may respond with measures like ramping up fan speed; obviously this has limits, but you'd generally like to be alerted before things have gotten that bad.
(Out of the 85 of our servers that are currently reporting any IPMI temperatures at all, only 53 report an inlet temperature and only nine report an 'ambient' temperature. One server reports four inlet temperatures: 'ambient', two power supplies, and a 'board inlet' temperature. Currently its ambient inlet is 22C, the board inlet is 32C, and the power supplies are 31C and 36C.)
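As a rough illustration of how hit or miss this is, here is a minimal Python sketch of picking intake-style sensors out of IPMI output by name. It assumes the five-column text format that `ipmitool sdr type Temperature` prints; the sample sensor names and values are made up for illustration, and real BMCs label things in many other ways.

```python
import re

# Sample text in the style of `ipmitool sdr type Temperature` output.
# Sensor names and values are illustrative, not taken from a real server.
SAMPLE = """\
Inlet Temp       | 04h | ok  |  7.1 | 22 degrees C
Exhaust Temp     | 01h | ok  |  7.1 | 35 degrees C
Temp             | 0Eh | ok  |  3.1 | 48 degrees C
PSU1 Inlet Temp  | 60h | ns  | 10.1 | No Reading
"""

def parse_temperatures(text):
    """Return {sensor_name: degrees_C} for sensors with a numeric reading.

    If a BMC repeats a sensor name, the last reading wins.
    """
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue
        m = re.fullmatch(r"(-?\d+(?:\.\d+)?) degrees C", fields[4])
        if m:
            readings[fields[0]] = float(m.group(1))
    return readings

def intake_sensors(readings):
    """Keep only sensors whose name contains 'inlet' or 'ambient'."""
    return {name: temp for name, temp in readings.items()
            if re.search(r"inlet|ambient", name, re.IGNORECASE)}

print(intake_sensors(parse_temperatures(SAMPLE)))
```

A plain name match like this would also pick up power supply and board inlet sensors, which (as on the server above) can read ten degrees or more hotter than the real intake air, so in practice you probably need per-model knowledge of which sensor to trust.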
The next issue I'm seeing in our data is that either we have temperature differences of multiple degrees C between machines higher and lower in racks, or the inlet temperature sensors aren't necessarily all that accurate (even within the same model of server, which will all have the 'inlet' temperature sensor in the same place). I'd be a bit surprised if our machine room ambient air did have this sort of temperature gradient, but I've been surprised before. But that probably means that you have to care about where in the rack your indicator machines are, not just where in the room.
(And where in the room probably matters too, as discussed. I see about a 5C swing in inlet temperatures between the highest and lowest machines in our main machine room.)
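To see whether a swing like this comes from rack position or from the room, one option is to compare the spread within each rack against the spread overall. The following is a small sketch under the assumption that you have each machine's rack and rack unit on hand; the hosts, racks, and temperatures are hypothetical, not our data.

```python
from collections import defaultdict

# Hypothetical readings: (host, rack, rack unit, inlet temperature in C).
READINGS = [
    ("srv-a", "r1", 4, 19.0),
    ("srv-b", "r1", 20, 21.0),
    ("srv-c", "r1", 38, 24.0),
    ("srv-d", "r2", 6, 20.0),
    ("srv-e", "r2", 36, 23.0),
]

def swing(temps):
    """Difference between the hottest and coolest reading."""
    temps = list(temps)
    return max(temps) - min(temps)

def swing_by_rack(readings):
    """Return {rack: swing} so vertical gradients show up per rack."""
    by_rack = defaultdict(list)
    for _host, rack, _unit, temp in readings:
        by_rack[rack].append(temp)
    return {rack: swing(temps) for rack, temps in by_rack.items()}

print("room swing:", swing(temp for *_rest, temp in READINGS))
print("per rack:", swing_by_rack(READINGS))
```

If the per-rack swings are nearly as large as the room-wide one, rack height (or sensor inaccuracy) is the more likely explanation than position in the room.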
We push all of the IPMI readings we can get (temperature and otherwise) into our Prometheus environment and we use some of the IPMI inlet temperature readings to drive alerts. But we consider them only a backup to our normal machine room temperature monitoring, which is done by dedicated units that we trust; if we can't get readings from the main unit for some reason, the IPMI readings mean we'll at least get alerts if something also goes wrong with the air conditioning. I wouldn't want to use IPMI readings as our primary temperature monitoring unless I had no other choice.
(The other aspect of using IPMI temperature measurements is that either the server has to be up or you have to be able to talk to its BMC over the network, depending on how you're collecting the readings. We generally collect IPMI readings through the host agent, using an appropriate ipmitool sub-command. Doing this through the host agent has the advantage that the BMC doesn't even have to be connected to the network, and usually we don't care about BMC sensor readings for machines that are not in service.)
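For illustration, collecting on the host itself might look something like the Python sketch below, which runs ipmitool locally and renders the readings in the Prometheus text format. This is not a description of our actual setup: the choice of `ipmitool sdr type Temperature`, the metric name `ipmi_temperature_celsius`, and the `sensor` label are all assumptions made for the example.

```python
import subprocess

def read_local_sdr():
    """Run ipmitool against the local BMC (no network involved).

    This typically needs root and the kernel IPMI driver loaded.
    """
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout

def to_prometheus(sdr_text, metric="ipmi_temperature_celsius"):
    """Render numeric temperature rows as Prometheus text-format lines.

    The metric and label names are hypothetical choices for this sketch.
    """
    lines = []
    for row in sdr_text.splitlines():
        fields = [f.strip() for f in row.split("|")]
        if len(fields) != 5 or not fields[4].endswith(" degrees C"):
            continue  # skips rows such as 'No Reading'
        try:
            value = float(fields[4].split()[0])
        except ValueError:
            continue
        # Escape backslashes and double quotes inside the label value.
        name = fields[0].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{metric}{{sensor="{name}"}} {value}')
    return ("\n".join(lines) + "\n") if lines else ""

if __name__ == "__main__":
    print(to_prometheus(read_local_sdr()), end="")
```

The output could be written to a file that the host agent picks up, or served however your agent prefers; the trade-off is the one already mentioned, that you get nothing when the server itself is down.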