How we monitor the temperature of our machine rooms

September 9, 2022

As I mentioned recently, we have machine rooms (many of them rather old) and with them a setup that monitors their temperatures, along with the temperatures of some of our important "wiring closets". The functional difference between a machine room and a wiring closet is that wiring closets are smaller and only have switches in them, not servers (and generally they have two-post aka "telco" racks instead of four-post server racks). We are what I'd consider a mid-sized organization, and here that means that we have real machine rooms (with dedicated AC and generally with raised floors) but not the latest and most modern datacenter grade equipment and setups. Including, of course, our temperature monitoring.

The actual temperature reading is done by network-accessible temperature sensor units, like the Control By Web X-DAQ-2R1-4T-E, which are basically boxes that you mount somewhere and plug into the network (and sometimes power). Each unit has some number of temperature probes connected to it by wires (which can be pretty long wires), and the unit reports the readings of all connected probes via either HTTP or SNMP, depending on the unit. I suspect that the actual units are small embedded Linux machines, and thus are guaranteed to be running ancient versions of Linux. Our units have been very reliable so far, which is good because they're all at least fifteen years old and I'm not sure what modern replacements are available.

(Most or all of our units use Power over Ethernet, which was once very convenient because we already had a network with PoE switches, but which now means we have a more or less dedicated set of switches for them.)

These units aren't cheap; based on looking at list prices for current equivalents that I could easily find, you're probably looking at upwards of $300 US for a reasonably equipped setup. Plus the units need their own network connection on an isolated or secure network, because I certainly wouldn't trust their network stack. The cost has limited how many of them we have and where we put them, and means that we don't have any spares. There are probably less expensive options if you want a single temperature sensor somewhere (even without going the DIY route). On the other hand, these have been solid, trouble-free performers and we trust their temperature readings to be pretty accurate.

To get temperature readings from the units into our Prometheus system, I wrote some brute-force scripts that either scrape their built-in HTTP servers or query them by SNMP (for the one unit that really wants us to do that). Each script collects the data, generates Prometheus metrics from it, and sends the metrics to Pushgateway, where Prometheus scrapes them. There's no particularly strong reason to use Pushgateway over, say, the host agent's "textfile" collector; it's just how we started out. Once the temperature readings are in Prometheus, we use them to trigger alerts through some alert rules. We also have alerts if the temperature readings are stale or missing, and we have Prometheus ping all of the sensor units and alert if one of them stops responding.
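A minimal sketch of the format-and-push half of such a script might look like the following Python. It is an illustration under assumptions, not the actual scripts: the metric name, the `probe` label, the probe names, and the gateway URL are all hypothetical, and fetching the readings from a unit is left out because it is specific to each device. The Pushgateway URL shape (`/metrics/job/<job>/instance/<instance>`) and the text exposition format are the standard ones.

```python
import urllib.request
from urllib.parse import quote

METRIC = "machineroom_temperature_celsius"  # hypothetical metric name


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metrics(readings: dict[str, float]) -> str:
    """Render {probe name: degrees C} as Prometheus text-format metrics."""
    lines = [
        f"# HELP {METRIC} Temperature reported by a sensor probe.",
        f"# TYPE {METRIC} gauge",
    ]
    for probe, celsius in sorted(readings.items()):
        lines.append(f'{METRIC}{{probe="{escape_label(probe)}"}} {celsius}')
    return "\n".join(lines) + "\n"


def push(body: str, gateway: str, job: str, instance: str) -> None:
    """PUT the metrics to Pushgateway, replacing this job/instance group.

    Assumes job and instance are simple names without slashes.
    """
    url = (
        f"{gateway}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )
    req = urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


if __name__ == "__main__":
    # Real readings would come from the unit over HTTP or SNMP;
    # these probe names and values are made up.
    text = format_metrics({"rack-top": 23.5, "ac-intake": 18.0})
    print(text, end="")
    # Hypothetical gateway address; uncomment to actually push.
    # push(text, "http://pushgateway.example:9091", "tempsensors", "room1-unit")
```

Using PUT rather than POST replaces every metric in the group, so a probe that disappears from the unit's output also disappears from Pushgateway instead of lingering at its last value. That matters if you want "missing" alerts to fire.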

Getting these temperature readings requires a fair amount of infrastructure to be working; there's the temperature unit itself, its PoE switch, and the network switches between it and the Prometheus server (although mostly not a router or firewall; the Prometheus server is directly on most of the networks the temperature sensors are on). Because we consider machine room temperature monitoring to be relatively critical, we've recently been looking at backup temperature data sources. One of them is that some of our servers have IPMI temperature sensors for the 'ambient' or 'inlet' temperature. We don't currently trust these readings as much as we trust the sensor units, but we can at least trigger alerts based on clearly extreme readings.
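As a rough sketch of the IPMI idea, the Python below pulls inlet and ambient temperatures out of `ipmitool sdr type Temperature` output and flags only clearly extreme values. Several things here are assumptions: that `ipmitool` is what you use to read the BMC, that its output has the pipe-separated shape shown in `SAMPLE` (sensor names and layout vary by vendor, and the sample values are made up), and the 40 C limit, which is just a placeholder.

```python
import subprocess

# Illustrative output only; real sensor names and values differ per server.
SAMPLE = """\
Inlet Temp       | 04h | ok  |  7.1 | 23 degrees C
Exhaust Temp     | 01h | ok  |  7.1 | 36 degrees C
Temp             | 0Eh | ok  |  3.1 | 52 degrees C
"""


def parse_ipmi_temperatures(output: str) -> dict[str, float]:
    """Map sensor name to degrees C, skipping lines without a reading."""
    readings = {}
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5 or not fields[-1].endswith("degrees C"):
            continue
        try:
            readings[fields[0]] = float(fields[-1].split()[0])
        except ValueError:
            continue
    return readings


def inlet_readings(readings: dict[str, float]) -> dict[str, float]:
    """Keep only sensors whose name mentions 'inlet' or 'ambient'."""
    return {
        name: temp
        for name, temp in readings.items()
        if "inlet" in name.lower() or "ambient" in name.lower()
    }


def extreme_readings(readings: dict[str, float], limit_c: float = 40.0) -> dict[str, float]:
    """Return readings at or above a (placeholder) alarm limit."""
    return {name: temp for name, temp in readings.items() if temp >= limit_c}


def read_local_ipmi() -> str:
    """Run ipmitool against the local BMC; needs ipmitool and privileges."""
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    temps = inlet_readings(parse_ipmi_temperatures(SAMPLE))
    print(temps, extreme_readings(temps))
```

Filtering by sensor name first matters because CPU and exhaust sensors routinely run far hotter than the room, so a single room-temperature limit only makes sense for the inlet or ambient sensor.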

(Hence also part of our recent interest in USB temperature sensors.)

PS: There are more DIY approaches to temperature monitoring units, but if you have a genuine machine room I'd strongly suggest paying the money for a dedicated unit. Among other things, I suspect that you'll get more trustworthy temperature readings. And in many settings, the cost of your time to build out and maintain a DIY solution is more than a dedicated unit will cost.
