Almost all of our OmniOS machines are now out of production

June 3, 2019

Last Friday, my co-workers migrated the last filesystem from our HD-based OmniOS fileservers to one of our new Linux fileservers. With this, the only OmniOS fileserver left in production is serving a single filesystem, our central administrative filesystem, which is extremely involved to move because everything uses it all the time and knows where it is (and of course it's where our NFS automounter replacement lives, along with its data files). Moving that filesystem is going to take a bunch of planning and a significant downtime, and it will only happen after I come back from vacation.

(Unlike last time around, we haven't destroyed any pools or filesystems yet in the old world, since we didn't run into any need to.)

This migration has been in process in fits and starts since late last November, so it's taken about seven months to finish. This isn't because we have a lot of data to move (comparatively speaking); instead it's because we have a lot of filesystems with a lot of users. First you have to schedule a time for each filesystem that the users don't object to (and sometimes things come up so your scheduled time has to be abandoned), and then moving each filesystem takes a certain amount of time and boring work (so often people only want to do so many a day, so they aren't spending all of their day on this stuff). Also, our backup system is happier when we don't suddenly give it massive amounts of 'new' data to back up in a single day.

(I think this is roughly comparable to our last migration, which seems to have started at the end of August of 2014 and finished in mid-February of 2015. We've added significantly more filesystems and disk space since then.)

The MVP of the migration is clearly 'zfs send | zfs recv' (as it always has been). Having to do the migrations with something like rsync would likely have been much more painful for various reasons; ZFS snapshots and ZFS send are things that just work, and they come with solid and extremely reassuring guarantees. Part of their importance was that the speed of an incremental ZFS send meant that the user-visible portion of a migration (where we had to take their filesystem away temporarily) could be quite short (short enough to enable opportunistic migrations, if we could see that no one was using some of the filesystems).

At this point we've gotten somewhere around four and a half years of lifetime out of our OmniOS fileservers. This is probably around what we wanted to get, especially since we never replaced the original hard drives and so they're starting to fall out of warranty coverage and hit what we consider their comfortable end of service life. Our first generation Solaris fileservers were stretched much longer, but they had two generations of HDs and even then we were pushing it toward the end of their service life.

(The actual server hardware for both the OmniOS fileservers and the Linux iSCSI backends seems fine, so we expect to reuse it in the future once we migrate the last filesystem and then tear down the entire old environment. We will probably even reuse the data HDs, but only for less important things.)

I think I feel less emotional about this migration away from OmniOS than I did about our earlier migration from Solaris to OmniOS. Moving away from Solaris marked the end of Sun's era here (even if Sun had been consumed by Oracle by that point), but I don't have that sort of feelings about OmniOS. OmniOS was always a tool to me, although unquestionably a useful one.

(I'll write a retrospective on our OmniOS fileservers at some point, probably once the final filesystem has migrated and everything has been shut down for good. I want to have some distance and some more experience with our Linux fileservers first.)

PS: To give praise where it belongs, my co-workers did basically all of the hard, grinding work of this migration, for various reasons. Once things got rolling, I got to mostly sit back and move filesystems when they told me one was scheduled and I should do it. I also cleverly went on vacation during the final push at the end.

Written on 03 June 2019.
« Exploring the start time of Prometheus alerts via ALERTS_FOR_STATE
Go channels work best for unidirectional communication, not things with replies »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Jun 3 21:44:16 2019
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.