Our difficulties with OmniOS upgrades

March 16, 2015

We are not current on OmniOS and we've been having problems with it. At some point, well-meaning people are going to suggest that we update to the current release version with the latest updates, and mention that OmniOS makes this really quite easy with beadm and boot environments. Well, yes and no.

Yes, mechanically (as far as I know) OmniOS package updates and even release version updates are easy to do and easy to revert from. Boot environments and snapshots of them are a really nice thing and they enable relatively low-risk upgrades, experiments, and so on. Unfortunately the mechanics of an upgrade are in many ways the easy part. The hard part is that we are running (unique) production services that are directly exposed to users. In short, users very much notice if one of our servers goes down or doesn't work right.
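(To make the "easy part" concrete, here is a minimal Python sketch of what the mechanics might look like. It only builds and prints the commands instead of running them. The boot environment name `pre-update` and the helper names are made up for illustration, and the exact sequence is an assumption on my part, not a tested procedure; `pkg update` may also create a new boot environment on its own.)

```python
from typing import List

Command = List[str]


def plan_update(backup_be: str) -> List[Command]:
    """Return the commands for a guarded update, in order, without running them.

    backup_be is a hypothetical name for the safety copy of the current
    boot environment.
    """
    return [
        ["beadm", "create", backup_be],  # clone the currently active boot environment
        ["pkg", "update"],               # apply package updates
        ["init", "6"],                   # reboot into the updated environment
    ]


def plan_revert(backup_be: str) -> List[Command]:
    """Return the commands to fall back to the safety copy."""
    return [
        ["beadm", "activate", backup_be],  # make the old environment the default at next boot
        ["init", "6"],                     # reboot into it
    ]


def show(commands: List[Command]) -> str:
    """Render a plan as shell-like lines for review (a dry run)."""
    return "\n".join(" ".join(cmd) for cmd in commands)


if __name__ == "__main__":
    print(show(plan_update("pre-update")))
    print("# if the update misbehaves:")
    print(show(plan_revert("pre-update")))
```

Note that both plans end in a reboot, which is exactly where our real difficulties start.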

The first problem is that this makes reboots noticeable, and since they're noticeable they have to be scheduled. Kernel and OmniOS release updates both require reboots (in fact I believe you really want to reboot basically immediately after doing them), which means pre-scheduled, pre-announced downtimes that are set up well in advance.

The second problem is that we don't want to put something into production and then find out that it doesn't work or that it has problems. This means updating is not as simple as updating the production server at a scheduled downtime; instead we need to put the update on a test server and then try our best to fully test it (both for load issues and to make sure that important functionality like our monitoring systems still works). This is not a trivial exercise; it's going to consume time, especially if we discover potential issues.

The final problem is that changes can increase risk as well as potentially reducing it. Our testing is not and cannot be comprehensive, so applying an update to the production environment risks deploying something that will actually be worse than what we have now. The last thing we need is for our current fileservers to get worse than they are now. This means that even considering updates involves a debate over what we're likely to get versus the risks we're taking on, one in which we need to persuade ourselves that the improvements in the update are worth taking on the risks to a core piece of our infrastructure.

(In an ideal world, of course, an update wouldn't introduce new bugs and issues. We do not live in that world; even if people try to avoid it, such things can slip through.)

PS: Obviously, people with different infrastructure will have different tradeoffs here. If you can easily roll out an update on some production servers without anyone noticing when they're rebooted, monitor them in live production, and then fail them out again immediately if anything goes wrong, an OmniOS update is easy to try out as a pilot test and then either apply to your entire fleet or revert if you run into problems. This gets into the cattle versus pets issue, of course. If you have cattle, you can paint some of them pink without anyone caring very much.
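(As a rough sketch of that pilot approach, under the assumption that you have per-host `update`, `healthy`, and `revert` operations available: all three are hypothetical callables that you would have to supply, and nothing here reflects how we actually run things.)

```python
from typing import Callable, List, Sequence, Tuple


def pilot_rollout(
    hosts: Sequence[str],
    pilot_count: int,
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
    revert: Callable[[str], None],
) -> Tuple[str, List[str]]:
    """Update a few pilot hosts first; continue only if they all stay healthy.

    Returns ("rolled_out", updated_hosts) or ("reverted", reverted_hosts).
    """
    pilot = list(hosts[:pilot_count])
    rest = list(hosts[pilot_count:])
    updated: List[str] = []

    for host in pilot:
        update(host)
        updated.append(host)
        if not healthy(host):
            # Something went wrong: back out every host touched so far.
            for touched in updated:
                revert(touched)
            return ("reverted", updated)

    # The pilot hosts look fine, so update the remainder of the fleet.
    for host in rest:
        update(host)
        updated.append(host)
    return ("rolled_out", updated)
```

The catch for pets is the `revert` step: it only helps if nobody notices the host going away and coming back.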

