UPSes: defense against problems, or sources of them?
Here is something that we have been forced to think about lately: are UPSes really a good insurance policy against power problems, or are they instead an extra source of problems? In short, does using UPSes really increase your net reliability?
The problem with UPSes used by themselves is that they are another piece of machinery that can fail (and a moderately complicated piece of machinery at that). And UPSes do fail; for example, we recently had an incident where a UPS reset itself out of the blue, briefly dropping power to everything connected to it (and it was not a power overload situation).
(Even when they don't fail outright, UPS batteries eventually age into uselessness and must be replaced, which generally requires you to take the UPS out of service.)
So the real question is how the MTBF of UPSes compares to the mean time between power failures. For us, the mean time between power failures appears to be visibly longer than the MTBF of our UPSes; since we put our current crop of UPSes into production, we have had no power failures and at least one UPS failure. At the moment this appears to make UPSes a net negative, in that we are more likely to have power problems caused by UPSes than by actual power loss.
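To make the comparison concrete, here is a rough sketch of the arithmetic. It assumes failures arrive at a constant average rate (one per MTBF), that a UPS in the critical path turns every UPS failure into an outage, and that a working UPS rides through some fraction of power failures. The function names, the coverage parameter, and all of the numbers are made up for illustration; they are not measurements from our environment.

```python
# Hypothetical model: failures arrive at a constant average rate (1 / MTBF).
# All names and numbers here are illustrative assumptions.

def outages_per_year(ups_mtbf_years, power_mtbf_years, coverage=1.0):
    """Return (without_ups, with_ups) expected power outages per year.

    coverage is the assumed fraction of utility power failures that a
    working UPS successfully rides through.
    """
    power_rate = 1.0 / power_mtbf_years   # utility failures per year
    ups_rate = 1.0 / ups_mtbf_years       # UPS-caused outages per year
    without_ups = power_rate
    # With a UPS in the critical path, every UPS failure is an outage,
    # plus whatever utility failures the UPS does not cover.
    with_ups = ups_rate + (1.0 - coverage) * power_rate
    return without_ups, with_ups


def ups_is_net_win(ups_mtbf_years, power_mtbf_years, coverage=1.0):
    """True if the UPS lowers the expected number of outages per year."""
    without_ups, with_ups = outages_per_year(
        ups_mtbf_years, power_mtbf_years, coverage)
    return with_ups < without_ups


if __name__ == "__main__":
    # Rare power failures (one per 20 years), a UPS fault every 5 years.
    print(outages_per_year(5, 20))
    print(ups_is_net_win(5, 20))
    # Frequent power failures (one every 6 months), same UPS.
    print(ups_is_net_win(5, 0.5))
```

Under this simple model, a UPS in the critical path only pays off when it fails less often than the power failures it actually covers. This counts outages only; it ignores how long each one lasts.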
The way around this is to arrange for the UPS not to be on the critical path, so that things don't go down if it fails. However, this takes extra hardware for every machine; you need dual power supplies or the equivalent, so that the machine still gets power even if the UPS fails. This is generally somewhat expensive.
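As a rough illustration of why taking the UPS off the critical path helps, here is a sketch that compares the fraction of time a machine is without power in three setups: mains only, UPS only, and dual supplies with one on the UPS and one on mains. It assumes the two feeds fail independently (which is not true if, say, they share a breaker) and that a working UPS rides through every mains outage. The MTBF and repair-time figures are invented for the example.

```python
# Illustrative sketch; the figures and the independence assumption are
# hypothetical, not measured values.

HOURS_PER_YEAR = 365.25 * 24


def unavailability(mtbf_hours, mttr_hours):
    """Long-run fraction of time a component is down."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def dual_feed_unavailability(u_ups, u_mains):
    """Machine is down only when both feeds are down at once.

    Assumes the two feeds fail independently of each other.
    """
    return u_ups * u_mains


def downtime_hours_per_year(u):
    """Convert an unavailability fraction to expected hours per year."""
    return u * HOURS_PER_YEAR


if __name__ == "__main__":
    # Made-up numbers: mains fails every ~10 years for 2 hours,
    # the UPS fails every ~3 years and takes 4 hours to fix or bypass.
    u_mains = unavailability(10 * HOURS_PER_YEAR, 2)
    u_ups = unavailability(3 * HOURS_PER_YEAR, 4)
    print("mains only:", downtime_hours_per_year(u_mains))
    print("UPS only:  ", downtime_hours_per_year(u_ups))
    print("dual feed: ", downtime_hours_per_year(
        dual_feed_unavailability(u_ups, u_mains)))
```

The point of the dual-feed case is that the product of two small fractions is much smaller than either one, which is where the extra hardware earns its cost (provided the independence assumption roughly holds).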
(You can apparently get external power units that give you dual power sources, so that you can protect even 1U servers, basic switches, and other things that don't normally have an option for dual power supplies.)
Whenever you are considering spending extra money, you wind up asking yourself how much extra uptime that money is buying you. If power failures are extremely rare, the answer may well be 'not much'. Certainly this issue has given us some things to think about.
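One crude way to put a number on this is to divide the extra spending by the downtime it is expected to avoid. The sketch below does only that; the function name and the dollar and hour figures are hypothetical, and the downtime estimates would have to come from your own history or from a model you trust.

```python
import math


def cost_per_hour_saved(extra_cost, downtime_before_hours, downtime_after_hours):
    """Extra money spent per hour of downtime avoided.

    Returns infinity if the spending avoids no downtime at all.
    """
    saved = downtime_before_hours - downtime_after_hours
    if saved <= 0:
        return math.inf
    return extra_cost / saved


if __name__ == "__main__":
    # Made-up example: $3000 of extra hardware, expected downtime over
    # the hardware's life dropping from 2 hours to half an hour.
    print(cost_per_hour_saved(3000, 2.0, 0.5))
```

If power failures are extremely rare, the avoided downtime is tiny and this figure becomes very large, which is the 'not much' answer in numeric form.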
(Paying extra for genuine UPS insurance, dual power supplies and all, may be worth it if it lets you run machines in otherwise unsafe configurations for extra performance, for example having disk write caches turned on. But this probably turns it into a question of how much the extra performance is worth to you, not how much the reliability is.)