Wandering Thoughts archives

2015-03-16

Our difficulties with OmniOS upgrades

We are not current on OmniOS and we've been having problems with it. At some point, well-meaning people are going to suggest that we update to the current release version with the latest updates, and mention that OmniOS makes this really quite easy with beadm and boot environments. Well, yes and no.

Yes, mechanically (as far as I know) OmniOS package updates and even release version updates are easy to do and easy to revert from. Boot environments and snapshots of them are a really nice thing and they enable relatively low-risk upgrades, experiments, and so on. Unfortunately the mechanics of an upgrade are in many ways the easy part. The hard part is that we are running (unique) production services that are directly exposed to users. In short, users very much notice if one of our servers goes down or doesn't work right.
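
As a concrete sketch of the mechanical side, the update itself is roughly the following (the boot environment name is just an example, and moving to a new OmniOS release also involves pointing pkg at that release's publisher, which I'm skipping here):

    # see what boot environments currently exist
    beadm list
    # apply the updates into a new boot environment rather than the live
    # one; 'omnios-upgrade' is an example name
    pkg update --be-name omnios-upgrade
    # pkg normally activates the new boot environment for the next boot,
    # so rebooting switches over to it
    init 6
    # if things go badly, re-activate your previous boot environment and
    # reboot again to get back to where you were
    beadm activate <previous-be>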

The first problem is that this makes reboots noticeable, and since they're noticeable they have to be scheduled. Kernel and OmniOS release updates both require reboots (in fact, I believe you really want to reboot basically immediately after doing them), which means pre-scheduled, pre-announced downtimes that are set up well in advance.

The second problem is that we don't want to put something into production and then find out that it doesn't work or that it has problems. This means an update is not as simple as applying it to the production server at a scheduled downtime; instead we need to put the update on a test server and then try our best to test it thoroughly (both for load issues and to make sure that important functionality, like our monitoring systems, still works). This is not a trivial exercise; it's going to consume time, especially if we discover potential issues.

The final problem is that changes increase risk as well as potentially reducing it. Our testing is not and cannot be comprehensive, so applying an update to the production environment risks deploying something that will actually be worse than what we have now. The last thing we need is for our current fileservers to get worse than they already are. This means that even considering updates involves a debate over what we're likely to gain versus the risks we're taking on, one in which we need to persuade ourselves that the improvements in the update justify the risk to a core piece of our infrastructure.

(In an ideal world, of course, an update wouldn't introduce new bugs and issues. We do not live in that world; even if people try to avoid it, such things can slip through.)

PS: Obviously, people with different infrastructure will have different tradeoffs here. If you can easily roll out an update on some production servers without anyone noticing when they're rebooted, monitor them in live production, and then fail them out again immediately if anything goes wrong, an OmniOS update is easy to try out as a pilot test and then either apply to your entire fleet or roll back if you run into problems. This gets into the cattle versus pets issue, of course. If you have cattle, you can paint some of them pink without anyone caring very much.

OmniOSUpgradeDifficulties written at 00:35:39

2015-03-08

Why ZFS's 'directory must be empty' mount restriction is sensible

If you've used ZFS for a while, you may have run across the failure mode where some of your ZFS filesystems don't mount because the mount point directories have accidentally wound up with something in them. This isn't a general Unix restriction (even on Solaris); it's an extra limit that ZFS has added. And I actually think that it's a sensible restriction, although it gets in my way on rare occasions.
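
As a concrete sketch of what this failure mode looks like (the filesystem name and the stray file are made-up examples, and the exact error wording may vary between ZFS versions):

    # something got created under the mount point while the filesystem
    # was unmounted
    $ ls -A /tank/scratch
    somefile
    # so the filesystem now refuses to mount there
    $ zfs mount tank/scratch
    cannot mount '/tank/scratch': directory is not empty
    # 'zfs list -r -o name,mountpoint,mounted tank' shows which
    # filesystems did and didn't wind up mounted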

The problem with mounting a filesystem over a directory that has things in it is that those things immediately become inaccessible (unless you do crazy hacks). Unix lets you do this anyway for the same reason it lets you do other apparently crazy things; it assumes you know what you're doing and have a good reason for it.

The problem with allowing this for ZFS mounts as well is that ZFS mounts are generally implicit, not explicit, and as a result they can basically appear out of nowhere. If you import a pool, all of its filesystems normally get automatically mounted at whatever their declared mountpoints are. When you imported that old pool to take a look at it (or maybe you're failing over a pool from one machine to another), did you remember that it had a filesystem with an unusual mountpoint, one that has since become a normal directory with things in it?

(As it stands, you can't even find out this information about an unimported pool. You'd have to use a relatively unusual import command and then poke around the imported but hopefully inactive pool. Note that just 'zpool import -N' isn't quite enough.)
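
One hedged sketch of how you might do that inspection, with a made-up pool name and assuming the pool isn't in active use on another machine:

    # import the pool without mounting anything, read-only, and with an
    # alternate root so that any later mounts can't land on top of the
    # live system
    zpool import -N -o readonly=on -R /mnt/inspect oldpool
    # look over the filesystems and their mountpoint settings
    zfs list -r -o name,mountpoint,canmount oldpool
    # then let go of the pool again
    zpool export oldpool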

Given the potential risks created by magically appearing filesystems, ZFS's behavior here is at least defensible. Unix's traditional behavior of allowing such mounts required you to explicitly request a specific mount (more or less; let's wave our hands about /etc/fstab), so there was a pretty good chance that you really meant it even if you were going to make some stuff inaccessible. With ZFS pool imports, perhaps not so much.

(You can also get this magically appearing mount effect by assigning the mountpoint property, but you can argue that this is basically the same thing as doing an explicit mount. You're at least doing it to a specific filesystem, instead of just taking whatever filesystems happen to be in a pool with whatever mountpoints they happen to have.)
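
A minimal illustration with made-up names (note that if the filesystem is currently mounted, ZFS will immediately try to remount it at the new location, which runs into the same 'must be empty' requirement):

    zfs set mountpoint=/srv/data tank/data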

ZFSSensibleMountRestriction written at 01:37:39

