How fast fileserver failover could matter to us

May 23, 2016

Our current generation fileservers don't have any kind of failover system, just like our original generation. A few years ago I wrote that we don't really miss failover, although I allowed that I might be overlooking situations where we'd have used failover if we had it. So, yeah, about that: on reflection, I think there is a relatively important situation where we could really use fast, reliable cooperative failover (when both the old and new hosts of a virtual fileserver are working properly).

Put simply, the advantage of fast cooperative failover is that it makes a number of things a lot less scary, because you can effectively experiment (assuming that the failover is basically user transparent). For instance, trying a new version of OmniOS in production, where it's very unlikely that it will crash outright but possible that we'll experience performance problems or other anomalies. With fast failover, we could roll a virtual fileserver on to a server running the new OmniOS, watch it, and have an immediate and low impact way out if something comes up.

(At one point this would have made our backups explode because the backups were tied to the real hosts involved. However we've changed that these days and backups are relatively easy to shift around.)

It's possible that we should take another look at failover in our current environment, since a lot of water has gone under the bridge since we last gave up on it. This also sparks a more radical thought; if we're going to use failover mostly as a way to do experiments, perhaps we should reorganize things so that some of our virtual fileservers are smaller than they are now so moving one over affects fewer people. Or at least we could have a 'sysadmin virtual fileserver' so we can test it with ourselves only at first.

(Our current overall architecture is sort of designed with the idea that a host has only one virtual fileserver and virtual fileservers don't really share disks with other ones, but we might be able to do some tweaks.)

All of this is a bit blue sky, but at the very least we should do a bit of testing to see how much time a cooperative fileserver failover might take in our current environment. I should also keep an eye out for future OmniOS changes that might improve it.

(As usual, re-checking one's core assumptions periodically is probably a good idea. Ideally we would have done some checking of this when we were initially testing each OmniOS version, but well. Hopefully next time, if there is one.)

Written on 23 May 2016.
« Our problem with OmniOS upgrades: we'll probably never do any more
SELinux is beyond saving at this point »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon May 23 22:24:23 2016
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.