How we monitor the temperature of our machine rooms

September 9, 2022

As I mentioned recently, we have machine rooms (many of them rather old) and with them a setup that monitors their temperatures, along with the temperatures of some of our important "wiring closets". The functional difference between a machine room and a wiring closet is that wiring closets are smaller and only have switches in them, not servers (and generally they have two-post aka "telco" racks instead of four-post server racks). We are what I'd consider a mid-sized organization, and here that means that we have real machine rooms (with dedicated AC and generally with raised floors) but not the latest and most modern datacenter grade equipment and setups. Including, of course, our temperature monitoring.

The actual temperature reading is done by network-accessible temperature sensor units, like the Control By Web X-DAQ-2R1-4T-E, which are basically boxes that you mount somewhere and plug into the network (and sometimes power). Each of these has some number of temperature probes connected to them by wires (which can be pretty long wires), and then the unit reports the readings of all connected temperature probes via either HTTP or SNMP depending on the unit. I suspect that the actual units are small embedded Linux machines, and thus are guaranteed to be running ancient versions of Linux. Our units have been very reliable so far, which is good because they're all at least fifteen years old and I'm not sure what modern replacements are available.

(Most or all of our units use Power over Ethernet, which was once very convenient because we already had a network with PoE switches, but which now means we have a more or less dedicated set of switches for them.)

These units aren't cheap; based on looking at list prices for current equivalents that I could easily find, you're probably looking at upwards of $300 US for a reasonably equipped setup. Plus the units need their own network connection on an isolated or secure network, because I certainly wouldn't trust their network stack. This has limited how many of them we have and where we put them, and means that we don't have any spares. There are probably less expensive options if you want a single temperature sensor somewhere (even without going the DIY route). On the other hand, these have been solid, trouble-free performers and we trust their temperature readings to be pretty accurate.

To get temperature readings from the units into our Prometheus system, I wrote some brute-force scripts that either scrape their built-in HTTP servers or query them by SNMP (for the one unit that really wants us to do that). Each script collects the data, generates Prometheus metrics from it, and sends the metrics to Pushgateway, where Prometheus scrapes them. There's no particularly strong reason to use Pushgateway over, say, the host agent's "textfiles" collector; it's just how we started out. Once the temperature readings are in Prometheus, we use them to trigger alerts through some alert rules. We also have alerts if the temperature readings are stale or missing, and we have Prometheus ping all of the sensor units and alert if one of them stops responding.
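As a rough illustration of the "generate metrics and push them" half of such a script, here is a minimal Python sketch. It is not our actual script. The metric name, the probe label, the job and instance names, and the Pushgateway address are all made up for the example, and it assumes the readings have already been extracted from the unit into a simple dictionary. The only parts taken from Pushgateway itself are the text exposition format and the /metrics/job/<job>/<label>/<value> URL scheme.

```python
import urllib.request
from urllib.parse import quote


def render_metrics(readings, metric="machineroom_temperature_celsius"):
    """Render {probe_name: degrees_C} in the Prometheus text exposition format.

    The metric name and the 'probe' label are hypothetical choices.
    """
    lines = [f"# TYPE {metric} gauge"]
    for probe, celsius in sorted(readings.items()):
        # Label values must have backslash, double quote and newline escaped.
        label = (
            probe.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        lines.append(f'{metric}{{probe="{label}"}} {celsius}')
    return "\n".join(lines) + "\n"


def pushgateway_url(base, job, instance):
    """Build the grouping-key URL for one sensor unit.

    Assumes simple job and instance names without '/' in them.
    """
    return (
        f"{base.rstrip('/')}/metrics"
        f"/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"
    )


def push(base, job, instance, readings, timeout=10):
    """PUT the metrics, replacing whatever was last pushed for this group."""
    req = urllib.request.Request(
        pushgateway_url(base, job, instance),
        data=render_metrics(readings).encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical readings; a real script would get these from the unit.
    print(render_metrics({"rack3-top": 21.5, "rack7-intake": 19.0}), end="")
```

A script like this would call push() with the Pushgateway address once per polling cycle. Pushgateway also records a push_time_seconds metric for each group, which is one possible basis for a "readings are stale" alert.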

Getting these temperature readings requires a fair amount of infrastructure to be working; there's the temperature unit itself, its PoE switch, and the network switches between it and the Prometheus server (although mostly not a router or firewall; the Prometheus server is directly on most of the networks the temperature sensors are on). Because we consider machine room temperature monitoring to be relatively critical, we've recently been looking at backup temperature data sources. One of them is that some of our servers have IPMI temperature sensors for the 'ambient' or 'inlet' temperature. We don't currently trust these readings as much as we trust the sensor units, but we can at least trigger alerts based on clearly extreme readings.
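To make the "alert only on clearly extreme readings" idea concrete, here is a small Python sketch. It assumes pipe-separated sensor lines of the general shape that ipmitool prints for temperature sensors; the exact layout and sensor names vary between BMCs, so the sample text, the name hints and the 40 C limit are all illustrative assumptions rather than anything we actually run.

```python
import re

# Matches readings such as "23 degrees C" in the last column.
_READING = re.compile(r"(-?\d+(?:\.\d+)?)\s*degrees C")


def parse_ipmi_temperatures(text):
    """Return {sensor_name: degrees_C} from pipe-separated sensor lines.

    Lines without a numeric reading (for example "No Reading") are skipped.
    """
    temps = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2:
            continue
        match = _READING.search(fields[-1])
        if match:
            temps[fields[0]] = float(match.group(1))
    return temps


def extreme_readings(temps, name_hints=("ambient", "inlet"), limit_c=40.0):
    """Pick ambient/inlet sensors reading at or above limit_c.

    The hints and the limit are arbitrary example values.
    """
    return {
        name: value
        for name, value in temps.items()
        if any(hint in name.lower() for hint in name_hints) and value >= limit_c
    }


if __name__ == "__main__":
    # Made-up sample text, not captured from a real server.
    sample = (
        "Inlet Temp       | 04h | ok  |  7.1 | 23 degrees C\n"
        "CPU1 Temp        | 0Eh | ok  |  3.1 | 61 degrees C\n"
        "Exhaust Temp     | 01h | ns  |  7.1 | No Reading\n"
    )
    print(extreme_readings(parse_ipmi_temperatures(sample)))
```

Filtering by sensor name keeps hot CPU sensors from being mistaken for room temperature, and a deliberately high limit reflects how little precision is being asked of these sensors.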

(Hence also part of our recent interest in USB temperature sensors.)

PS: There are more DIY approaches to temperature monitoring units, but if you have a genuine machine room I'd strongly suggest paying the money for a dedicated unit. Among other things, I suspect that you'll get more trustworthy temperature readings. And in many settings, the cost of your time to build out and maintain a DIY solution is more than a dedicated unit will cost.

Comments on this page:

I recently wrote about the problems I've had with air temperature sensing devices (especially ones with bigger micros/SoCs such as networked ones). You might find the advice in this useful if/when window shopping for new ones.

TL;DR: Measuring ambient air temperature is very easy to do wrong. Most of the devices I've looked at or used give wildly different results (eg an error of 5degC) unless you force airflow over them. You can't calibrate this out because position, airflow and self-heating change unpredictably over time.

This might (?) be fine in a closet environment where you can mount it nearer a fan's stream? Or perhaps an error of 5degC doesn't matter too much in your application (whilst in my applications of indoor/human/AC monitoring it's almost all of the expected temperature swing range).

Everything in this market is either IoT (cheap but unreliable) or expensive (might be good, might be trash). Price alone is not a good indicator; I've dealt with many expensive sensors that don't seem like they were tested beyond a simple home LAN with no routing and the proprietary server software running on a dev's laptop for a few hours.

By cks at 2022-09-10 22:54:20:

It's useful for us to have pretty accurate room temperature readings, because then we can tell the facilities people more about what's going on (and also have confidence about smaller anomalies than outright AC failure). But, fortunately, as a first line of defense and alerting, what we care most about is a clear relative change in temperature. If a sensor reads high or low under normal room conditions (even extremely so) but will climb significantly when the AC dies and the room temperature goes up a bunch, we can work with that.
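Here is a minimal Python sketch of what alerting on relative change could look like: compare the current reading to the median of recent readings from the same probe, so that a probe which always reads a few degrees off still trips the alert when the room heats up. The function names and the 5 degree limit are made up for the example; in practice this would more likely be written as a Prometheus alert rule than as standalone code.

```python
from statistics import median


def temperature_rise(history, current):
    """Degrees by which `current` exceeds the median of recent readings."""
    return current - median(history)


def ac_probably_failed(history, current, rise_limit=5.0):
    """True if the probe has climbed at least rise_limit above its usual level.

    rise_limit is an arbitrary example threshold.
    """
    return temperature_rise(history, current) >= rise_limit


if __name__ == "__main__":
    # A probe that consistently reads high, in an otherwise normal room.
    usual = [31.0, 31.5, 30.5, 31.0]
    print(ac_probably_failed(usual, 32.0))  # small wobble
    print(ac_probably_failed(usual, 38.0))  # clear climb
```

The median is used as the baseline so that one or two odd readings in the history don't shift it much.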

We do have to place the temperature probes in useful places, and we haven't always been completely successful in that. Moving the existing probes would be a fair amount of work, so they're probably staying where they are. We have at least seen their readings go up when the AC has had problems.

(I don't think we've actually ever tried to carefully cross-check our existing temperature probe readings against other sources of truth, so we don't know how accurate they are, although I think they're probably okay.)

Ah, sorry. I missed that you are using external probes. That avoids 90% of the problems.

With external probes I would completely trust the manufacturer's datasheet for the product. The only worry left is position relative to aircon vs heat exhaust flows, but it seems like you already have something that works.

If it works for you then stick with it.

By cks at 2022-09-11 06:49:26:

The industrial units all seem to use external probes, quite probably partly for the reasons your article talks about (although the product pages also talk about things like putting the probe in a refrigerated area with the unit outside). However, what you wrote about is definitely something we'll have to remember for USB-based things like the TEMPer2, which have an onboard sensor (as well as a probe in their case; some of the series only have an onboard sensor). Fortunately the TEMPer2 has somewhat odd results in general so I'm already biased to not trust it too much.

If you're going as far as using USB devices and polling them: I'd suggest also taking a look at the DS18B20.

You can get them in pre-made metal pills with wires already attached. Raspis speak 1-wire natively and the sysfs interface is very very simple (it spits the decoded degC out when you read the file). Crimp the 3 wires (or 2 wires + a resistor) onto a header and you're ready.

Alas I suspect you might be thinking of plugging these USB devices into existing servers, rather than adding new computers/SBCs just to do temperature measurements, so this might not suit.

With this method you don't need USB drivers or any vendor-made software. One line of shell is enough to do a reading. Also there will be other SBCs with working 1-wire kernel implementations, so you're not beholden to the ebb and flow of raspi stocking levels.
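For anyone who would rather poll from a script than from the shell, the same reading can be sketched in Python. This assumes the Linux w1_therm driver's w1_slave file under /sys/bus/w1/devices, where the first line ends in YES when the CRC check passed and the second line carries t= followed by the temperature in thousandths of a degree C, as far as I know. The sample bytes in the demo are invented, and the sysfs path pattern is an assumption about a typical setup.

```python
import glob


def parse_w1_slave(text):
    """Return degrees C from w1_slave contents, or None if unusable.

    None is returned when the CRC line does not end in YES or when no
    't=' value is present.
    """
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].rstrip().endswith("YES"):
        return None
    _, sep, raw = lines[1].rpartition("t=")
    if not sep:
        return None
    # The driver reports thousandths of a degree C.
    return int(raw) / 1000.0


def read_all(pattern="/sys/bus/w1/devices/28-*/w1_slave"):
    """Read every DS18B20 (1-wire family code 28) matching the pattern."""
    readings = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            # Key each reading by the device directory name.
            readings[path.split("/")[-2]] = parse_w1_slave(f.read())
    return readings


if __name__ == "__main__":
    # Invented sample contents, not captured from real hardware.
    sample = (
        "72 01 4b 46 7f ff 0e 10 57 : crc=57 YES\n"
        "72 01 4b 46 7f ff 0e 10 57 t=23125\n"
    )
    print(parse_w1_slave(sample))
    print(read_all())
```

On a machine with no 1-wire devices, read_all() simply returns an empty dictionary.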

Written on 09 September 2022.

Last modified: Fri Sep 9 22:07:44 2022
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.