The attractions of reading sensor information from IPMIs
Most modern servers have some sort of onboard IPMI (sometimes called a service processor), and commonly the IPMI has access to information from various sensors, which it can provide to you on demand. Usually you can query the IPMI both locally and over the network (if you've connected the IPMI to the network at all). In addition, CPUs, motherboards, and various components such as NVMe drives and GPU cards can have their own sensors, which the server's operating system can expose to you. On Linux, this is done through the hwmon subsystem, using hwmon drivers for the various accessible sensor chips and sensors. Although it's generally a lot easier to use Linux's hwmon interface than to query your IPMI (and a lot more things will automatically look at it, such as host agents for metrics systems), there are still reasons to want to get sensor information from your server's IPMI.
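For comparison, here is roughly what the "easy" hwmon side looks like. This is a minimal sketch in Python, assuming the usual sysfs layout under /sys/class/hwmon, where each chip directory has a name file and per-sensor files such as temp1_input (millidegrees Celsius), in0_input (millivolts), and fan1_input (RPM), with optional matching _label files. The function name read_hwmon and the SCALES table are my own inventions for illustration, and the sketch deliberately ignores other sensor kinds such as power and current.

```python
from pathlib import Path

# Divisors that turn raw hwmon integers into conventional units:
# temp*_input is millidegrees Celsius, in*_input is millivolts,
# and fan*_input is already in RPM.
SCALES = {"temp": 1000.0, "in": 1000.0, "fan": 1.0}


def read_hwmon(root="/sys/class/hwmon"):
    """Return a list of (chip, label, kind, value) tuples (hypothetical helper)."""
    readings = []
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("*_input")):
            stem = input_file.name[: -len("_input")]  # e.g. "temp1"
            kind = stem.rstrip("0123456789")          # e.g. "temp"
            if kind not in SCALES:
                continue  # skip sensor kinds this sketch does not handle
            try:
                raw = int(input_file.read_text().strip())
            except (OSError, ValueError):
                # Some sensors fail or return nothing when read.
                continue
            label_file = chip_dir / f"{stem}_label"
            label = label_file.read_text().strip() if label_file.exists() else stem
            readings.append((chip, label, kind, raw / SCALES[kind]))
    return readings
```

Calling read_hwmon() on a real machine gives you whatever the kernel's drivers happen to expose. When there is no _label file, all you get is a generic name like fan1, which is part of why the IPMI's view can be more useful.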
The first reason is that you may not have a choice. On servers, some sensors may only be reported to the IPMI and not to the main server motherboard. I think this is especially common for things like power supply information and fan RPMs, where it may be significantly more complicated to provide readings to two places. If you don't go out and talk to the IPMI, all you may get is some basic temperature information and perhaps a few voltages. As far as I can tell, this is the case for many of our Dell servers.
A big reason to read sensor information from the IPMI even if you have a choice is that, unlike the kernel, the IPMI can generally be counted on to know what sensors it actually has, what they're all called (including things like which fan RPM sensor is for which fan), and how to get correct readings from all of them. All of these are areas where Linux and other operating systems can have problems even if there are motherboard sensors. On Linux, you need a driver for your sensor chipset, then you need to reverse engineer what sensor is where (or what), and you may also need to know magic transformations to get correct sensor readings. And even under the best circumstances, kernel sensor readings can sometimes go crazy. In theory, the IPMI has all of the magic hardware-specific knowledge necessary to sort all of this out (at least for onboard hardware; you're probably on your own for, say, an add-in GPU).
If you talk to the IPMI over the network, you can get at least some sensor information even if the server has hung or locked up, or the on-host metrics agent isn't answering you (perhaps because the server is overloaded). This may give you valuable clues as to why a server has suddenly become unresponsive, or at least let you rule some things out. It can also be your only option for getting sensor metrics if you can't run an agent on the host itself for some reason. Over-the-network IPMI sensor collection will also give you some information if the main host is powered off, although how useful this is may vary. Hopefully you'll never have to care about remotely reading the ambient temperature around a powered-off server.
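As an illustration of over-the-network collection, here is a minimal sketch in Python that shells out to ipmitool and parses the output of its sensor subcommand. It assumes the IPMI speaks IPMI v2.0 (the lanplus interface) and that the output is the usual pipe-separated format, where the first four fields are the sensor name, the reading, the unit, and the status. The function names, and whatever host, user, and password file you pass in, are hypothetical. A real collector would need to care about timeouts, retries, and how slow some IPMIs are to answer.

```python
import subprocess


def parse_ipmitool_sensor(text):
    """Parse `ipmitool sensor` output into a list of dicts (hypothetical helper)."""
    sensors = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4:
            continue  # blank or unexpected line
        name, reading, unit, status = fields[:4]
        try:
            value = float(reading)
        except ValueError:
            # "na" means no reading; discrete sensors report hex such as "0x0".
            value = None
        sensors.append({"name": name, "value": value, "unit": unit, "status": status})
    return sensors


def read_remote_sensors(host, user, password_file, timeout=30):
    """Query a remote IPMI; -f reads the password from a file."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
           "-f", password_file, "sensor"]
    out = subprocess.run(cmd, capture_output=True, text=True,
                         timeout=timeout, check=True)
    return parse_ipmitool_sensor(out.stdout)
```

Since this never touches the host operating system, it keeps working when the server itself is hung, overloaded, or powered off, provided the IPMI still has power and is reachable on the network.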