Thoughts on potentially realistic temperature trip limits for hardware

April 21, 2024

Today one of the machine rooms that we have network switches in experienced some kind of air conditioning issue. During the issue, one of our temperature monitors recorded a high temperature of 44.1 C (it normally sees the temperature as consistently below 20 C). The internal temperatures of our network switches undoubtedly got much higher than that, seeing as the one that I can readily check currently reports an internal temperature of 41 C while our temperature monitor says the room temperature is just under 20 C. Despite likely reaching a very high internal temperature, this switch (and probably others) did not shut itself down for protection.
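As a rough back-of-the-envelope estimate of how hot that switch may have gotten, here is a minimal sketch that assumes the internal temperature sits at a constant offset above the room temperature. That assumption is purely illustrative (real airflow and fan behaviour are unlikely to be that simple), the function name is made up for this example, and 20 C is a rounded stand-in for 'just under 20 C'.

```python
def estimate_internal_temp(room_now_c, internal_now_c, room_peak_c):
    """Estimate the peak internal temperature in Celsius.

    Assumes (hypothetically) that the internal temperature stays a
    constant offset above the room temperature. This is a guess, not
    a measurement.
    """
    offset = internal_now_c - room_now_c
    return room_peak_c + offset


# Figures from the text: roughly 20 C room and 41 C internal right now,
# and a 44.1 C room temperature at the peak of the AC issue.
peak = estimate_internal_temp(20.0, 41.0, 44.1)
print(f"estimated peak internal temperature: {peak:.1f} C")
```

Under that assumption the switch would have been somewhere around 65 C internally at the peak, which is only a guess at the order of magnitude.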

It's not news to system administrators that when hardware has temperature limits at all, those limits are generally set absurdly high. We know from painful experience that our switches experience failures and other problems when they get sufficiently hot during AC issues such as this, but I don't think we've ever seen a switch (or a server) shut down because of too-high temperatures. I'm sure that some of them will power themselves off if cooked sufficiently, but by that point a lot of damage will already be done.

So hardware vendors should set realistic temperature limits and we're done, right? Well, maybe not so fast. First off, there's some evidence that the ambient and internal air temperatures we think of as typical are too conservative. Google says they run data centers at 80 F or up to 95 F, depending on where you look, although this is with Google's custom hardware instead of off-the-shelf servers. Second, excess temperature in general is usually an exercise in probabilities and probable lifetimes; often the hotter you run systems, the sooner they will fail (or become more likely to fail). This gives you a tradeoff between intended system lifetime and operating temperature, where the faster you expect to replace hardware (e.g. in N years) the hotter you can probably run it (because you don't care if it starts dying after N+1 instead of N+2 years; in either case it'll be replaced by then).
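To put Google's Fahrenheit figures on the same scale as our Celsius readings, here is a small conversion sketch. The function name is just for illustration; the formula is the standard Fahrenheit to Celsius conversion.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# The two data center temperatures mentioned above.
for temp_f in (80.0, 95.0):
    print(f"{temp_f:.0f} F = {fahrenheit_to_celsius(temp_f):.1f} C")
```

That works out to roughly 26.7 C and 35 C, so even the higher figure is well below the 44.1 C our temperature monitor recorded.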

And on the third hand, hardware vendors probably don't want to try to make tables and charts that explain all of this and, more importantly, more or less promise certain results from running their hardware at certain temperatures. It's much simpler and safer to promise less and then leave it up to (large) customers to conduct their own experiments and come up with their own results.

Even if a hardware vendor took the potential risk of setting 'realistic' temperature limits on their hardware, those limits might not suit us. They might still be way too high for us, because we want to run our hardware much longer than the hardware vendor expects. Alternately, they could be too conservative and low, because we would rather take a certain amount of risk to our hardware than have everything aggressively shut down in the face of air conditioning problems (ones that aren't yet what we consider too severe) and take us entirely off the air.

(And of course we haven't even considered modifying any firmware temperature limits on systems where we could potentially do that. We lack the necessary data to do anything sensible, so we just stick with whatever the vendor has set.)


Comments on this page:

By Miksa at 2024-04-23 08:34:13:

I wouldn't be that worried about damage due to heat. We also had a cooling malfunction a few years ago that topped out at 60° Celsius, and I don't remember it causing any damage, although a bunch of servers turned themselves off as a precautionary measure around 50° Celsius. The problem started around 9 on Saturday morning, reached 40 around 15 when servers started raising alerts, and 50 around 18 when the first servers powered off. On Sunday at midday a couple of others and I came to investigate. The maximum temperature of 60° Celsius was reached around 13, and after that we opened the doors to the outside. At 16 we found the root cause, the chillers came back online, and the cooling sped up. About an hour later we started turning the servers back on.

Written on 21 April 2024.

Last modified: Sun Apr 21 22:46:10 2024