Why OmniOS boot environments don't solve our upgrade issues

October 7, 2016

As I sort of mentioned yesterday and have noted in passing in various other entries, we don't upgrade our OmniOS fileservers. Unlike with our iSCSI backends, this is not a situation where there are almost no benefits to upgrading; even within a single OmniOS release there are often reasonably attractive updates and bug fixes that might matter to us. Instead the blocker is pretty simple: upgrading a fileserver is a risky thing, because there would be major disruption if something went wrong and the system crashed, became balky, or simply slowed down too much.

(And past incidents have convinced us that we can't duplicate our production load in a test environment. We can do some tests, but not enough.)

In theory, Solaris / Illumos / OmniOS boot environments look like the solution to this. Upgrade in a new boot environment, switch over to it, and if things go wrong we can switch right back. Setting aside issues like how much a boot environment does or doesn't capture (and what we might want to retain if we had to roll back), there's a deeper problem in that 'switch over to it' bit I casually tossed off. Switching boot environments takes a reboot.
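
(To make the reboot problem concrete, here is a rough Python sketch of the upgrade and rollback sequences. It only builds the command lists and runs nothing. The boot environment names are made up for illustration, and the exact commands are my assumption of a typical pkg and beadm workflow, not a record of what we actually run.)

```python
def upgrade_plan(new_be, old_be):
    """Build (upgrade, rollback) command lists; nothing is executed."""
    upgrade = [
        # Apply updates into a fresh clone of the current boot environment.
        ["pkg", "update", "--be-name", new_be],
        # Made explicit here; as far as I know pkg normally marks the
        # new BE as active on next boot by itself.
        ["beadm", "activate", new_be],
        # The new BE only takes effect after a reboot.
        ["init", "6"],
    ]
    rollback = [
        # Going back means re-activating the old BE ...
        ["beadm", "activate", old_be],
        # ... and rebooting again.
        ["init", "6"],
    ]
    return upgrade, rollback


def reboots_needed(steps):
    """Count the reboot steps in a command list."""
    return sum(1 for argv in steps if argv == ["init", "6"])


if __name__ == "__main__":
    # Hypothetical BE names, purely for illustration.
    upgrade, rollback = upgrade_plan("omnios-upgraded", "omnios-current")
    for label, steps in (("upgrade", upgrade), ("rollback", rollback)):
        print(label, "reboots:", reboots_needed(steps))
        for argv in steps:
            print("   ", " ".join(argv))
```

Both lists end in a reboot, so a failed upgrade costs us two fileserver reboots, not one.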

Reboots aren't fast or transparent, even for NFS fileservers. If we have to reboot a fileserver, we can count on user-visible interruptions to several services as NFS client machines grind to a halt trying to get files from the rebooting fileserver. Our IMAP server, our Samba server, our departmental web server, and our primary login server would all stall out and probably not recover until five or ten minutes after the fileserver was back up.

Boot environments certainly lower the risks here; it's unquestionably better to have a flawed but relatively fast fallback instead of either no fallback or a slow one. But it's not good enough to take the risk down to a relatively trivial level, and that means it doesn't really solve our issues with upgrading our OmniOS fileservers.

(This is probably one of those times where having fast (virtual) fileserver failover would make a real difference. Fileserver failover that was fast enough and good enough to be effectively transparent would significantly lower the risks and the disruption(s). Well, assuming that an upgraded fileserver didn't fall over in a way that got in the way of fast failover, and that it was courteous enough to fail during the working day when we were paying attention.)

Last modified: Fri Oct 7 00:41:56 2016