Why OmniOS boot environments don't solve our upgrade issues

October 7, 2016

As I sort of mentioned yesterday and have noted in passing in various other entries, we don't upgrade our OmniOS fileservers. Unlike our iSCSI backends, this is not a situation where there are almost no benefits to upgrading; even within a single OmniOS release there are often reasonably attractive updates and bug fixes that might matter to us. Instead the blocker is pretty simple: upgrading a fileserver is a risky thing where there would be major disruption if something went wrong and the system crashed, became balky, or simply slowed down too much.

(And past incidents have convinced us that we can't duplicate our production load in a test environment. We can do some tests, but not enough.)

In theory, Solaris / Illumos / OmniOS boot environments look like the solution to this. Upgrade in a new boot environment, switch over to it, and if things go wrong we can switch right back. Setting aside issues like how much a boot environment does or doesn't capture (and what we might want to retain if we had to roll back), there's a deeper problem in that 'switch over to it' bit I casually tossed off. Switching boot environments takes a reboot.
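(To make the reboot point concrete, here is a rough Python sketch of the command sequences involved. It only builds the command lists and never runs them. The boot environment names are hypothetical, and I'm assuming the usual `pkg update --be-name`, `beadm activate`, and `init 6` workflow rather than describing anything we actually run.)

```python
# A sketch only: this plans commands, it does not execute them.
# BE names passed in are hypothetical placeholders.

REBOOT = ["init", "6"]


def upgrade_plan(new_be):
    """Commands to upgrade into a new boot environment and switch to it."""
    return [
        # Apply the updates to a fresh BE instead of the running one.
        ["pkg", "update", "--be-name", new_be],
        # Look over the BEs before committing to the switch.
        ["beadm", "list"],
        # Actually switching to the new BE takes a reboot.
        REBOOT,
    ]


def rollback_plan(old_be):
    """Commands to go back to the previous boot environment."""
    return [
        # Mark the old BE as the one to boot next time.
        ["beadm", "activate", old_be],
        # Going back also takes a reboot.
        REBOOT,
    ]


def reboots(plan):
    """Count how many reboots a plan contains."""
    return sum(1 for cmd in plan if cmd == REBOOT)
```

Counting the reboots in each plan gives one for the upgrade and one more for a rollback, so an upgrade that goes badly costs two user-visible interruptions, not one.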

Reboots aren't fast or transparent, even for NFS fileservers. If we have to reboot a fileserver, we can count on user-visible interruptions on several services as NFS client machines grind to a halt trying to get files from the rebooting fileserver. Our IMAP server, our Samba server, our departmental web server, and our primary login server would all stall out and probably not recover until five or ten minutes after the fileserver was back up.

Boot environments certainly lower the risks here; it's unquestionably better to have a flawed but relatively fast fallback instead of either no fallback or a slow one. But it's not good enough to take the risk down to a relatively trivial level, and that means it doesn't really solve our issues with upgrading our OmniOS fileservers.

(This is probably one of those times where having fast (virtual) fileserver failover would make a real difference. Fileserver failover that was fast enough and good enough to be effectively transparent would significantly lower the risks and the disruption(s). Well, assuming that an upgraded fileserver didn't fall over in a way that got in the way of fast failover, and that it was courteous enough to fail during the working day when we were paying attention.)

Comments on this page:

How difficult would it be to get buy-in for one or two maintenance windows per year?

I worked at a place where for a while we didn't have any opportunities and it was difficult to get important fixes in. We managed to get weekend "company-wide" outage windows twice a year, and that allowed us to do important upgrades (like swapping network gear, upgrading file servers).

100% uptime is nice, but if you really want it, it gets expensive. You end up paying for it one way or another, I've found.

By cks at 2016-10-07 20:25:56:

There are two problems, broadly. First, even reboots scheduled well in advance are somewhat disruptive to our users, so the fewer of them the better. We want our services to be something that just works, not something that you have to interrupt once a week or once a month or the like. Second, the bigger issue isn't pre-scheduled downtime to do a reboot for an upgrade but the unscheduled or short-notice reboot if something goes wrong with the upgraded version.

Boot environments make the actual upgrade downtime somewhat shorter (depending on various factors), and they make it easier to recover from problems. But they don't make it completely non-disruptive.

(Making it not disruptive at all would create a real qualitative change; there would probably be a lot of things we'd be much more willing to do and try. But that's not very likely to be possible any time soon, if ever.)

Written on 07 October 2016.


Last modified: Fri Oct 7 00:41:56 2016