Machine room temperatures and the value of long Prometheus metrics history
We have a few machine rooms. These aren't high-tech, modern server rooms, which is not surprising since they've generally been there for decades. As part of this, our machine rooms don't really have a specific set temperature that they're supposed to stay at. They're not supposed to get too hot, but the actual temperature they're at varies over the year and depends on a lot of things, including what we're running in them at the moment. To make sure that everything is (still) working, we have temperature sensors in the machine rooms that feed into our Prometheus setup.
Recently we were looking at our dashboards and noticed that one of the machine rooms had an oddly high temperature. It wasn't alarmingly high, and we could see it going up and then jumping back down in the familiar pattern that we see in all of our machine rooms as the AC cycles on and off. But it felt like the temperature of that machine room should be lower, and that maybe something was wrong. Since we have a long metrics history (we keep years' worth of Prometheus metrics), we started looking at historical temperature data for this machine room, both from earlier this year and from this time in previous years (to see if this was something that had happened at this time of year before).
Looking at historical data showed a clear difference in the pattern of temperatures between the recent past and the time before it, especially in the minimum temperatures; starting in late June, they began drifting slowly upward. This is a pattern we've never seen before, and it's a pattern we don't see in the temperatures of our other machine room in the same building. We don't know if this is really a problem or if things are still okay and the AC is behaving safely and as expected, but at least we know that there's something clearly exceptional going on.
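As a rough illustration of what "minimum temperatures drifting slowly upward" means in numbers, here is a minimal Python sketch that reduces raw temperature samples to one minimum per day and then fits a straight line through those minimums. This is not how we actually looked at the data (we looked at graphs); the function names and the idea of using a least-squares slope are my assumptions for the example.

```python
from datetime import datetime, timezone


def daily_minimums(samples):
    """Reduce (unix_timestamp_seconds, temperature) pairs to one minimum per UTC day.

    Returns a list of (date, min_temperature) tuples sorted by date.
    """
    mins = {}
    for ts, temp in samples:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        if day not in mins or temp < mins[day]:
            mins[day] = temp
    return sorted(mins.items())


def slope_per_day(daily):
    """Least-squares slope of the daily minimums, in degrees per day.

    A persistently positive slope over weeks hints at upward drift.
    """
    if len(daily) < 2:
        return 0.0
    first_day = daily[0][0]
    xs = [(day - first_day).days for day, _ in daily]
    ys = [temp for _, temp in daily]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den if den else 0.0
```

Using the daily minimum instead of the average filters out most of the AC's sawtooth cycling, which is why the minimums were where the change stood out most clearly for us.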
(And if there is a real problem, we've been given a chance to fix it before the temperature drifts so high it's a real problem and triggers our alarms. Well, we've been given a chance to call in the people who are responsible for the AC so they can fix it. Who is responsible for what in a university building can be complicated and a little tangled.)
However, getting this confidence took quite a deep metrics history, far longer than the 15-day retention that Prometheus defaults to. Right now, going back 90 days is barely enough to show the clear start of the deviation with some time before it, which means we really want to point at more than 90 days of data to show that this wasn't happening before then in a smaller form. Being able to go back years (our metrics go back to late 2018) means we can more readily see how unusual this is.
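If you want to pull this sort of long-range view out of Prometheus yourself, here is a small Python sketch that builds a URL for the HTTP API's /api/v1/query_range endpoint, asking for one daily-minimum point per day. The server address and the metric name in the usage example are hypothetical; substitute whatever your temperature sensors actually export.

```python
from urllib.parse import urlencode

SECONDS_PER_DAY = 86400


def long_range_query_url(base_url, metric, end_ts, days):
    """Build a query_range URL that returns one daily-minimum point per day.

    base_url is the Prometheus server, metric is a gauge name (hypothetical
    here), end_ts is a Unix timestamp in seconds, days is how far to go back.
    """
    params = {
        # Lowest value seen in the 24 hours before each evaluation point.
        "query": f"min_over_time({metric}[1d])",
        "start": end_ts - days * SECONDS_PER_DAY,
        "end": end_ts,
        # Step in seconds, so we get one evaluated point per day.
        "step": SECONDS_PER_DAY,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Example with a made-up server and metric name:
# long_range_query_url("http://prom.example:9090",
#                      "machineroom_temperature_celsius",
#                      1700000000, 365)
```

A one-day step keeps a multi-year query small (a year is only 365 points per series), so it stays well under the limit on points per series that Prometheus enforces on range queries. Of course, this only works if the data is still there, which is where retention comes in.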
Relatively short metrics retention works if the change you're looking at or into is obvious and big, and you catch it soon enough (and sometimes it's all that you can afford). But not all changes happen that fast; sometimes things just drift quietly over time. This incident shows me once again that it's useful to have a real historical reference, so that you can go back and see how things used to be long enough ago that you've forgotten.