Our difficulties with OmniOS upgrades

March 16, 2015

We are not current on OmniOS and we've been having problems with it. At some point, well-meaning people are going to suggest that we update to the current release version with the latest updates, and mention that OmniOS makes this really quite easy with beadm and boot environments. Well, yes and no.

Yes, mechanically (as far as I know) OmniOS package updates and even release version updates are easy to do and easy to revert from. Boot environments and snapshots of them are a really nice thing and they enable relatively low-risk upgrades, experiments, and so on. Unfortunately the mechanics of an upgrade are in many ways the easy part. The hard part is that we are running (unique) production services that are directly exposed to users. In short, users very much notice if one of our servers goes down or doesn't work right.
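
(To make the 'mechanically easy' part concrete, an update driven through boot environments looks roughly like the sketch below. The boot environment names are invented for illustration, and you should verify the exact options against pkg(1) and beadm(1M) on your release.)

    # Rough sketch only; BE names are illustrative.
    beadm list                              # see existing boot environments

    # Apply updates into a fresh boot environment instead of the live one.
    pkg update --require-new-be --be-name omnios-upgrade

    beadm activate omnios-upgrade           # make it the default for the next boot
    init 6                                  # reboot into it

    # If the new environment misbehaves, reactivate the old BE and reboot.
    beadm activate omnios-old
    init 6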

The first problem is that this makes reboots noticeable, and since they're noticeable they have to be scheduled. Kernel and OmniOS release updates both require reboots (in fact I believe you really want to reboot basically immediately after doing them), which means pre-scheduled, pre-announced downtimes that are set up well in advance.

The second problem is that we don't want to put something into production and then find out that it doesn't work or that it has problems. This means updating is not as simple as updating the production server at a scheduled downtime; instead we need to put the update on a test server and then try our best to fully test it (both for load issues and to make sure that important functionality like our monitoring systems still work). This is not a trivial exercise; it's going to consume time, especially if we discover potential issues.

The final problem is that changes increase risk as well as potentially reducing it. Our testing is not and cannot be comprehensive, so applying an update to the production environment risks deploying something that is actually worse than what we have now. The last thing we need is for our current fileservers to get worse than they are now. This means that even considering updates involves a debate over what we're likely to get versus the risks we're taking on, one in which we need to persuade ourselves that the improvements in the update are worth the risk to a core piece of our infrastructure.

(In an ideal world, of course, an update wouldn't introduce new bugs and issues. We do not live in that world; even if people try to avoid it, such things can slip through.)

PS: Obviously, people with different infrastructure will have different tradeoffs here. If you can easily roll out an update on some production servers without anyone noticing when they're rebooted, monitor them in live production, and then fail them out again immediately if anything goes wrong, an OmniOS update is easy to try out as a pilot test and then either apply to your entire fleet or revert back from if you run into problems. This gets into the cattle versus pets issue, of course. If you have cattle, you can paint some of them pink without anyone caring very much.


Comments on this page:

By liam at unc edu at 2015-03-16 09:28:35:

This is standard system administration in many, many places. Is your problem that you can't get maintenance windows from your customers? Or that your customers don't understand the concept of an emergency outage to roll back a change that causes problems?

One of the downsides of Linux compared to Solaris and AIX is the difficulty of providing a clean rollback of a system upgrade. If you are having issues with OmniOS, which does have a rollback mechanism, why can't you just apply the practice you use for your Linux upgrades to your OmniOS boxes? Or is it that you create only 'cattle' with Linux, and leave OmniOS for 'pets'?

By cks at 2015-03-16 12:17:54:

All of these issues are surmountable but they make things non-easy (and really, non-trivial). As for emergency outages for rollbacks: we can have them, but then people may well get unhappy with us for needing them in the first place. From the perspective of users, what matters is us providing a reliable service; if we can't do that and if we're just flailing around (or if we do things that in practice make it worse), they're going to get very unhappy with us and start asking pointed questions about things.

On our Linux machines, we only have outages for non-optional security upgrades and we basically don't do rollbacks (partly this has been because we haven't encountered any fatal issues on them). For obvious reasons it would take a pretty severe problem for us to reintroduce a known, must-patch security issue.

By Erik Mathis at 2015-03-17 16:12:38:

This is a working example of why you make everything redundant from the get-go. Plan for and expect midday outages, updates, and emergency patch releases. It seems silly to me in 2015 to have only one server to do anything.

By cks at 2015-03-17 16:48:02:

It's extremely difficult to make a fileserver transparently redundant if you don't have fast failover, and unfortunately ZFS does not; pool import is a very slow process in at least some environments. Without transparent redundancy, users notice if you take a fileserver down for any amount of time.

(We have a hot spare fileserver, but even planned deliberate failover would probably be ten to twenty minutes and that much time is definitely user visible and has real effects on our overall environment even for users that are not on that fileserver.)
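
(For context, a planned failover is roughly the sequence sketched below, and the pool import in the middle is where most of that ten to twenty minutes goes. The server and pool names are invented for illustration.)

    # Hypothetical manual failover from fileserver fs1 to hot spare fs2.
    # On fs1: stop serving and cleanly release the pool.
    zfs unshare -a
    zpool export fs1-pool

    # On fs2: pick the pool up. On large or busy pools this import can
    # take many minutes, which is the user-visible part.
    zpool import fs1-pool
    zfs share -a          # re-export the NFS shares recorded in the pool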
