2010-05-29
UPSes: defense against problems, or sources of them?
Here is something that we have been forced to think about lately: are UPSes really a good insurance policy against power problems, or are they instead an extra source of problems? In short, does using UPSes really increase your net reliability?
The problem with UPSes used by themselves is that they are another piece of machinery to fail (and they are a moderately complicated piece of machinery at that). And UPSes do fail; for example, we recently had an incident where a UPS reset itself out of the blue, briefly dropping power to everything connected to it (and it was not a power overload situation).
(Even when they don't fail outright, UPS batteries eventually age into uselessness and must be replaced, which generally requires you to take the UPS out of service.)
So the real question is what the MTBF of UPSes is compared to the mean time between power failures. For us, the mean time between power failures seems to be very large and visibly larger than the MTBF of our UPSes; since we put our current crop of UPSes into production we have had no power failures and at least one UPS failure. At the moment this appears to make UPSes a net negative, in that we are more likely to have power problems caused by UPSes than by actual power loss.
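(As a rough sketch of the comparison, here is the arithmetic in Python. The numbers are entirely hypothetical placeholders, not measurements from our environment, and the model assumes that a UPS in the critical path converts every UPS failure into a power loss for the machines behind it.)

```python
def events_per_year(mean_years_between_events: float) -> float:
    """Convert a mean time between events (in years) into a yearly rate."""
    return 1.0 / mean_years_between_events


def outage_rate_with_ups(power_mtbf_years: float,
                         ups_mtbf_years: float,
                         ups_coverage: float = 1.0) -> float:
    """Yearly rate of power loss seen by a machine behind a single UPS.

    ups_coverage is the (assumed) fraction of real power failures that
    the UPS successfully rides through.
    """
    uncovered_power_failures = (1.0 - ups_coverage) * events_per_year(power_mtbf_years)
    ups_caused_failures = events_per_year(ups_mtbf_years)
    return uncovered_power_failures + ups_caused_failures


# Hypothetical figures, chosen only to illustrate the comparison.
power_mtbf_years = 10.0   # assumed mean time between real power failures
ups_mtbf_years = 4.0      # assumed mean time between UPS-caused outages

without_ups = events_per_year(power_mtbf_years)
with_ups = outage_rate_with_ups(power_mtbf_years, ups_mtbf_years)

# The UPS is a net negative whenever it causes more outages than it prevents.
ups_is_net_negative = with_ups > without_ups
```

(With these made-up figures the UPS more than doubles the expected outage rate. The conclusion flips as soon as the UPS MTBF exceeds the mean time between power failures.)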
The way around this is to arrange for the UPS not to be a critical path component, so that if it fails things don't go down. However, this takes extra hardware for every machine; you need dual power supplies or the equivalent, so that you can have the machine still getting power even if the UPS fails. This is generally somewhat expensive.
(You can apparently get external power units that give you dual power sources, so that you can protect even 1U servers, basic switches, and other things that don't normally have an option for dual power supplies.)
When you want to spend extra money, you wind up asking yourself how much extra uptime your money is buying you. If power failures are extremely rare the answer may well be 'not much'. Certainly this issue has given us some things to think about.
(Paying extra for genuine UPS insurance, dual power supplies and all, may be worth it if it lets you run machines in otherwise unsafe configurations for extra performance, for example having disk write caches turned on. But this probably turns it into a question of how much the extra performance is worth to you, not how much the reliability is.)
Some comments on spam scoring and anti-spam tools in general
Here's something important if you're designing or considering a new anti-spam system (as we may be at some point). It may sound obvious, but I think it's not:
If you run an email system, part of your job is filtering spam for your users.
It used to be that you could provide your users a collection of anti-spam options and tools and settings and so on, and consider your work done. Those days are over and gone. Much as with computer security, users have neither the expertise to make sensible decisions about this stuff nor any interest in acquiring it. Dumping a bunch of tools in their laps and running away is more or less the equivalent of doing no spam filtering whatsoever, and is about as unacceptable in practice to most people.
(So why did we get away with doing just that for quite a while? I think it's a number of reasons; for a long time it wasn't a big problem, and for a fair while after that no one demonstrated to users that you could actually make the spam problem go away. Nowadays, lots of people have experience with places like GMail that have spam mostly fixed, so they know it can be done. And if GMail can do it, why should they settle for less elsewhere?)
There are some immediate corollaries to this. One of them is what I noted in passing recently: spam scoring is effectively spam filtering. Users will directly take your spam score and filter on it (and then judge you on how well it works), especially if you explicitly mark things that have hit some threshold score (for example, by changing the email's Subject:). The same is true if you provide a standard 'here is how to do filtering' configuration that users can adopt and customize, because most users won't; whatever this configuration is is effectively your spam filtering.
(And if it doesn't work or malfunctions, yes, you will be blamed.)
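(To make the 'marking things that hit a threshold' case concrete, here is a minimal sketch using Python's standard email package. The threshold value, the "[SPAM] " tag, and the X-Spam-Score header name are all assumptions chosen for illustration, not a description of any particular anti-spam system.)

```python
from email.message import EmailMessage

SPAM_THRESHOLD = 5.0     # hypothetical cutoff; picking it is the political part
SUBJECT_TAG = "[SPAM] "  # hypothetical marker prepended to the Subject:


def mark_if_spam(msg: EmailMessage, score: float) -> EmailMessage:
    """Record the score and tag the Subject: once it crosses the threshold."""
    msg["X-Spam-Score"] = f"{score:.1f}"
    if score >= SPAM_THRESHOLD:
        subject = str(msg.get("Subject", ""))
        if not subject.startswith(SUBJECT_TAG):
            # Headers must be deleted before being re-set, otherwise the
            # message would end up with two Subject: headers.
            del msg["Subject"]
            msg["Subject"] = SUBJECT_TAG + subject
    return msg


# Illustrative messages with made-up scores.
spammy = EmailMessage()
spammy["Subject"] = "Cheap watches"
mark_if_spam(spammy, 7.3)

clean = EmailMessage()
clean["Subject"] = "Meeting notes"
mark_if_spam(clean, 1.2)
```

(Once the Subject: is rewritten like this, a one-line user filter on the tag is the obvious next step, which is exactly why the threshold you picked becomes your spam filtering in practice.)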
The usual answer to this is that you won't work out a score (with its possibly charged politics and potential constant demands for tuning); instead, whatever software you're using will just tag messages with various characteristics and leave it to users to decide which ones are bad enough to filter on. This doesn't work, because at best you're back to dumping a bunch of tools in people's laps and running away. (At worst they will seize on some obvious bit of the tagging, decide to filter on it, and then blame you when things explode.)
PS: regardless of who is really at 'fault', users feeling that they are getting a terrible, unusable email system is never a good thing. You want to avoid it if at all possible, and remember that the hard problems are the social ones. This applies at multiple levels here.