Wandering Thoughts archives

2024-08-27

Some reasons why we mostly collect IPMI sensor data locally

Most servers these days support IPMI and can report various sensor readings through it, which you often want to use. In general, you can collect IPMI sensor readings either on the host itself through the host OS, or over the network using standard IPMI networking protocols (there are several generations of them). Here, we have almost always collected this information locally (and then fed it into our Prometheus based monitoring system), for an assortment of reasons, some of them general and some of them specific to us.
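To make the two collection paths concrete, here is a minimal Python sketch that builds an `ipmitool` command line for each. It assumes ipmitool is the tool in use (which tool we actually run isn't the point here), that local access goes through the host OS driver via ipmitool's `open` interface, and that network access uses IPMI v2.0 RMCP+ (`lanplus`), one of the protocol generations mentioned above. The helper name, the BMC hostname, and the user name are hypothetical.

```python
import shlex


def ipmitool_sensor_cmd(bmc_host=None, user=None):
    """Return an ipmitool argv that reads all sensors (hypothetical helper).

    bmc_host=None: local collection through the host OS IPMI driver,
    using ipmitool's 'open' interface (e.g. /dev/ipmi0 on Linux).
    bmc_host set: network collection over IPMI v2.0 RMCP+ ('lanplus').
    """
    if bmc_host is None:
        return ["ipmitool", "-I", "open", "sensor"]
    if not user:
        raise ValueError("network collection needs an IPMI user name")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,  # the BMC's own network address, not the host's
        "-U", user,
        "-L", "USER",    # request a USER-level session; assumed enough for sensor reads
        "-E",            # read the password from $IPMI_PASSWORD instead of argv
        "sensor",
    ]


if __name__ == "__main__":
    # Hypothetical BMC hostname and user name, for illustration only.
    print(shlex.join(ipmitool_sensor_cmd()))
    print(shlex.join(ipmitool_sensor_cmd("bmc1.example.org", "monitor")))
```

The network form shows the extra prerequisites: the BMC has to be reachable from the collecting machine, and it needs an IPMI user with suitable privileges. The local form needs neither.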

When we collect IPMI sensor data locally, we export it through the standard Prometheus host agent, which can read additional metrics from text files that you provide (its textfile collector). Although there is a 'standard' third party network IPMI metrics exporter, we ended up rolling our own for various reasons (through a Prometheus exporter that can run scripts for us). So we could collect IPMI sensor data either way, but we almost entirely collect the data locally.
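As a rough illustration of the local path, the following Python sketch turns `ipmitool sensor` output into a file for the host agent's textfile collector. It is not our actual script: the metric name `ipmi_sensor_value`, the file name `ipmi.prom`, the helper names, and the assumed pipe-separated column layout of `ipmitool sensor` output are all assumptions. It writes to a temporary file and renames it into place, so the host agent never reads a half-written file.

```python
import os
import subprocess
import tempfile


def parse_ipmitool_sensor(text):
    """Parse `ipmitool sensor` output into (name, value, units) tuples.

    Assumes pipe-separated rows whose first three columns are sensor name,
    reading, and units. Rows with non-numeric readings ('na', discrete
    hex values such as '0x1') are skipped.
    """
    readings = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue
        name, raw, units = fields[0], fields[1], fields[2]
        try:
            value = float(raw)
        except ValueError:
            continue  # absent, unreadable, or discrete sensor
        readings.append((name, value, units))
    return readings


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(readings):
    """Render readings as a gauge under a hypothetical metric name."""
    lines = [
        "# HELP ipmi_sensor_value Current reading of an IPMI sensor.",
        "# TYPE ipmi_sensor_value gauge",
    ]
    for name, value, units in readings:
        lines.append('ipmi_sensor_value{sensor="%s",units="%s"} %s'
                     % (escape_label(name), escape_label(units), repr(value)))
    return "\n".join(lines) + "\n"


def write_textfile(directory, filename, content):
    """Atomically replace directory/filename with content.

    The temporary name does not end in '.prom', so the textfile collector
    ignores it until the final rename.
    """
    fd, tmp = tempfile.mkstemp(dir=directory, prefix="." + filename + ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.chmod(tmp, 0o644)  # mkstemp uses 0600; the agent may run as another user
        os.replace(tmp, os.path.join(directory, filename))
    except BaseException:
        os.unlink(tmp)
        raise


def collect(directory, filename="ipmi.prom"):
    """Hypothetical driver: read the local BMC and publish its sensors."""
    out = subprocess.run(["ipmitool", "sensor"], capture_output=True,
                         text=True, check=True).stdout
    write_textfile(directory, filename, render_metrics(parse_ipmitool_sensor(out)))
```

On a host you might run something like `collect("/var/lib/prometheus/node-exporter")` from cron. That directory is assumed here to match the host agent's `--collector.textfile.directory` setting; check what your own systems use.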

(These days it is a standard part of our general Ubuntu customizations to set up sensor data collection from the IPMI if the machine has one.)

The generic reasons for not collecting IPMI sensor data over the network are that your server BMCs might not be on the network at all (perhaps they don't have a dedicated BMC network interface), or that you've sensibly put them on a secured network and your monitoring system doesn't have access to it. We have two additional reasons for preferring local IPMI sensor data collection.

First, even when our servers have dedicated management network ports, we don't always bother to wire them up; it's often just extra work for relatively little return (and it exposes the BMC to the network, which is not always a good thing). Second, when we collect IPMI sensor data through the host, we automatically start and stop collecting sensor data for the host when we start or stop monitoring the host in general (and we know for sure that the IPMI sensor data really matches that host). We almost never care about IPMI data when either the host isn't otherwise being monitored or the host is off.

Our system for collecting IPMI sensor data over the network actually dates from when this wasn't true, because we once had some (donated) blade servers that periodically and mysteriously locked up under some conditions that seemed related to load (so much so that we built a system to automatically power cycle them via IPMI when they hung). One of the things we were very interested in was whether these blade servers were hitting temperature or fan limits when they hung. Since the machines had hung, we couldn't collect IPMI information through their host agent; getting it from the IPMI over the network was our only option.

(This history has created a peculiarity, which is that our script for collecting network IPMI sensor data reused the IPMI user that already existed at the time to remotely power cycle the C6220 blades. So now anything we want to remotely collect IPMI sensor data from has a weird 'reboot' user, which these days doesn't necessarily have enough IPMI privileges to actually reset the machine.)

PS: We currently haven't built a local IPMI sensor data collection system for our OpenBSD machines, although OpenBSD can certainly talk to a local IPMI, so we collect data from a few of those machines over the network.

sysadmin/IPMISensorDataLocalVsRemote written at 22:40:24

