2020-10-10
Wanting to be able to monitor for electrical power quality issues
Today, we had what you could call a "power event" at work. There was some sort of power blip in the electrical feeds from Toronto Hydro to multiple university buildings, and a number of our servers rebooted (not all of them, though). We've seen this kind of thing happen before so in one sense it's not surprising, but this time around the actual experience was rather alarming and confusing because it happened during the working day and we were all working remotely, so we couldn't go down to our machine room to see what was going on.
In the aftermath (and after my recent experience setting up monitoring of my home UPS), I have a new desire for us to be able to know what power blips and power issues have happened at work, in a way that gets recorded and is accessible remotely if enough of the network is still up. Currently, about all we can see is whatever various IPMI systems have recorded in their logs, and generally that amounts to not very much. And if the issue wasn't severe enough for the servers to notice, we don't know about it at all.
Monitoring a sufficiently capable UPS will definitely tell us about power failures (assuming that the system doing the monitoring stays up long enough to record what the UPS tells it). However, I don't know if inexpensive UPSes report smaller power blips or power issues that they detect and clean up, or how much load you have to have on them before they'll report things. While we can induce an outright power failure (as the UPS sees it) by just unplugging the UPS, I don't think we can test other sorts of power issues. Still, setting up communication with one of our existing UPSes that is capable of it would be a step forward and give us some information.
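As a rough sketch of what "communication with a UPS" could look like, here is one way to pull power events out of a UPS's reported status. This assumes the UPS is monitored through Network UPS Tools (NUT), which is my assumption for illustration and not necessarily what we would use; it relies on NUT's `upsc` command printing `name: value` lines and on the `ups.status` flags `OB`, `LB`, `TRIM` and `BOOST`. The UPS name `myups@localhost` is made up, and whether a given inexpensive UPS ever reports `TRIM` or `BOOST` is exactly the open question above.

```python
import subprocess

# ups.status flags that suggest a power event. OB and LB mean the UPS is
# running on battery; TRIM and BOOST mean it is correcting high or low
# input voltage without switching to battery.
EVENT_FLAGS = {
    "OB": "on battery (input power lost)",
    "LB": "battery low",
    "TRIM": "trimming high input voltage",
    "BOOST": "boosting low input voltage",
}


def parse_upsc(text):
    """Turn the 'name: value' lines that upsc prints into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            values[name.strip()] = value.strip()
    return values


def power_events(values):
    """Return descriptions of any event flags present in ups.status."""
    flags = values.get("ups.status", "").split()
    return [EVENT_FLAGS[flag] for flag in flags if flag in EVENT_FLAGS]


def read_ups(ups="myups@localhost"):
    """Run upsc for one UPS (hypothetical name) and parse its output."""
    result = subprocess.run(
        ["upsc", ups], capture_output=True, text=True, check=True
    )
    return parse_upsc(result.stdout)
```

Running `power_events(read_ups())` periodically and logging any non-empty result somewhere remotely accessible would give at least a basic record of what the UPS noticed.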
Some cursory Internet searches suggest that you can definitely get products that monitor power quality in general (although who knows if we could get them to talk to a Linux server). Unsurprisingly, they are not all that inexpensive. If you need one, you're likely willing to pay the price, but we probably aren't; we have a casual interest, not a suspicion that our power is unstable in general.
(Based on the lack of information in IPMI logs from both machines that rebooted and ones that stayed up, it seems that server PSUs likely can't be used to monitor for this sort of stuff even with a cooperative IPMI. I can't blame them; the PSU's job is to keep the power on, not to report on how good it is.)
Sidebar: What we saw from outside
The initial symptoms were that a whole bunch of machines abruptly dropped off the network (conveniently this did not include our Prometheus server, so we got an alert that something was up). There are quite a number of potential causes of some but not all of your servers abruptly disappearing, including that you've had one or more switches fail (or reboot) or that you've lost some electrical circuits or rack PDUs. Things only got less alarming when servers started coming back up and reporting that they had just rebooted, but then we spent some time looking both at affected machines and ones not affected to try to figure out what had happened.
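One way to narrow down those potential causes is to check whether the machines that dropped off share a switch or a PDU. The following is only a sketch under stated assumptions: the inventory format, the host names, and the switch and PDU names are all hypothetical, and it assumes you have such a mapping recorded somewhere in the first place.

```python
from collections import defaultdict


def suspect_components(inventory, down_hosts):
    """Return the (kind, name) components whose attached hosts are all down.

    inventory maps host -> {"switch": name, "pdu": name}; this layout is
    a made-up example, not a real inventory format.
    """
    attached = defaultdict(set)
    for host, parts in inventory.items():
        for kind, name in parts.items():
            attached[(kind, name)].add(host)
    down = set(down_hosts)
    return sorted(comp for comp, hosts in attached.items() if hosts <= down)


# Hypothetical inventory of four servers on two switches and two PDUs.
inventory = {
    "web1": {"switch": "sw-a", "pdu": "pdu-1"},
    "web2": {"switch": "sw-a", "pdu": "pdu-2"},
    "db1": {"switch": "sw-b", "pdu": "pdu-1"},
    "db2": {"switch": "sw-b", "pdu": "pdu-2"},
}

print(suspect_components(inventory, ["web1", "db1"]))  # both are on pdu-1
print(suspect_components(inventory, ["web1", "db2"]))  # nothing in common
```

If no single switch or PDU accounts for the set of missing machines, as in the second call, that points away from a single failed component and toward something like a power blip that only some servers rode out.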