Wandering Thoughts archives

2019-06-27

Our last OmniOS fileserver is now out of production (and service)

On Twitter, I noted a milestone last evening:

This evening we took our last OmniOS fileserver out of production and powered it off (after a great deal of slow work; all told this took more than a year). They've had a good run, so thank you Illumos/OmniOS/OmniTI/etc for the generally quiet and reliable service.

We still haven't turned any of our iSCSI backends off (they're Linux, not OmniOS), but that will be next, probably Friday (the delay is just in case). Then we'll get around to recycling all of the hardware for some new use, whatever it will turn out to be.

When we blank out the OmniOS system disks as part of recycling the hardware, that really will be the end of the line for the whole second generation of our fileserver infrastructure, and the last lingering traces of our long association with Sun will be gone, swallowed by time.

It's been pointed out to me by @oclsc that since we're still using ZFS (now ZFS on Linux), we still have a tie to Sun's lineage. It doesn't really feel the same, though; open source ZFS is sort of a lifeboat pushed out of Sun toward the end, not Sun(ish) itself.

(This is probably about as fast as I should have expected from having almost all of the OmniOS fileservers out of production at the end of May. Things always come up.)

Various people and groups at the department have been buying Sun machines and running Sun OSes (first SunOS and then Solaris) almost from the beginning of Sun. I don't know if we bought any Sun 1s, but I do know that some Sun 2s were bought, and Sun 3s and onward were a big presence for many years (eventually only as servers, although we did have some Sunrays). With OmniOS going out of service, that is the end of our use of that lineage of Unix.

(Of course Sun itself has been gone for some time, consumed by Oracle. But our use of its lineage lived on in OmniOS, since Illumos is more or less Solaris in open source form (and improved from when it was abandoned by its corporate parent).)

I have mixed feelings about OmniOS and I don't have much sentimentality about Solaris itself (it's complicated). But I still end up feeling that there is a weight of history that has shifted here in the department, at the end of a long slow process. Sun is woven through the history of the department's computing, and now all that remains of that is our use of ZFS.

(For all that I continue to think that ZFS is your realistic choice for an advanced filesystem, I also think that we probably wouldn't have wound up using it if we hadn't started with Solaris.)

OmniOSEndOfService written at 22:29:11

2019-06-26

A hazard of our old version of OmniOS: sometimes powering off doesn't

Two weeks ago, I powered down all of our OmniOS fileservers that are now out of production, which is most of them. By that, I mean that I logged in to each of them via SSH and ran 'poweroff'. The machines disappeared from the network and I thought nothing more of it.

This Sunday morning we had a brief power failure. In the aftermath of the power failure, three out of four of the OmniOS fileservers reappeared on the network, which we knew mostly because they sent us some email (there were no bad effects of them coming back). When I noticed them back, I assumed that this had happened because we'd set their BIOSes to 'always power on after a power failure'. This is not too crazy a setting for a production server you want up at all costs because it's a central fileserver, but it's obviously no longer the setting you want once they go out of production.
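On server hardware whose lights-out management speaks IPMI, this power restore policy can often be inspected and changed remotely rather than through the BIOS setup screens. A minimal sketch with ipmitool, where the BMC hostname and credentials are made-up placeholders:

    # show chassis state, which includes the current 'Power Restore Policy'
    ipmitool -I lanplus -H fs1-lom.example.com -U admin -P secret chassis status

    # tell the BMC that the machine should stay off after a power failure
    ipmitool -I lanplus -H fs1-lom.example.com -U admin -P secret chassis policy always-off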

Today, I logged in to the three that had come back, ran 'poweroff' on them again, and then later went down to the machine room to pull out their power cords. To my surprise, when I looked at the physical machines, they had little green power lights that claimed they were powered on. When I plugged in a roving display and keyboard to check their state, I discovered that all three were still powered on and sitting displaying an OmniOS console message to the effect that they were powering off. Well, they might have been trying to power off, but they weren't achieving it.

I rather suspect that this is what happened two weeks ago, and why these machines all sprang back to life after the power failure. If OmniOS never actually powered the machines off, even a BIOS setting of 'resume last power state after a power failure' would have powered the machines on again, which would have booted OmniOS back up again. Two weeks ago, I didn't go look at the physical servers or check their power state through their lights out management interface; it never occurred to me that 'poweroff' on OmniOS sometimes might not actually power the machine off, especially when the machines did drop off the network.
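The general lesson is to verify the actual chassis power state instead of trusting what 'poweroff' reports. If the lights-out management interfaces speak IPMI, a quick check might look like this sketch (the hostnames and credentials are hypothetical):

    # ask each BMC whether its chassis is really powered off
    for lom in fs1-lom fs2-lom fs3-lom fs4-lom; do
        printf '%s: ' "$lom"
        ipmitool -I lanplus -H "$lom" -U admin -P secret chassis power status
    done
    # a machine that never actually turned off reports 'Chassis Power is on'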

(One out of the four OmniOS servers didn't spring back to life after the power failure, and was powered off when I looked at the hardware. Perhaps its BIOS was set very differently, or perhaps OmniOS managed to actually power it off. They're all the same hardware and the same OmniOS version, but the server that probably managed to power off had no active ZFS pools on our iSCSI backends; the other three did.)

At this point, this is only a curiosity. If all goes well, the last OmniOS fileserver will go out of production tomorrow evening. It's being turned off as part of that, which means that I'm going to have to check that it actually powered off (and I'd better add that to the checklist I've written up).

HazardOfNoPoweroff written at 01:01:17

2019-06-03

Almost all of our OmniOS machines are now out of production

Last Friday, my co-workers migrated the last filesystem from our HD-based OmniOS fileservers to one of our new Linux fileservers. With this, the only OmniOS fileserver left in production is serving a single filesystem, our central administrative filesystem, which is extremely involved to move because everything uses it all the time and knows where it is (and of course it's where our NFS automounter replacement lives, along with its data files). Moving that filesystem is going to take a bunch of planning and a significant downtime, and it will only happen after I come back from vacation.

(Unlike last time around, we haven't destroyed any pools or filesystems yet in the old world, since we didn't run into any need to.)

This migration has been underway in fits and starts since late last November, so it's taken about seven months to finish. This isn't because we have a lot of data to move (comparatively speaking); instead it's because we have a lot of filesystems with a lot of users. First you have to schedule a time for each filesystem that the users don't object to (and sometimes things come up and your scheduled time has to be abandoned), and then moving each filesystem takes a certain amount of time and boring work (so people generally only want to do so many a day, rather than spending their whole day on this stuff). Also, our backup system is happier when we don't suddenly give it massive amounts of 'new' data to back up in a single day.

(I think this is roughly comparable to our last migration, which seems to have started at the end of August of 2014 and finished in mid-February of 2015. We've added significantly more filesystems and disk space since then.)

The MVP of the migration is clearly 'zfs send | zfs recv' (as it always has been). Having to do the migrations with something like rsync would likely have been much more painful for various reasons; ZFS snapshots and ZFS send are things that just work, and they come with solid and extremely reassuring guarantees. Part of their importance was that the speed of an incremental ZFS send meant that the user-visible portion of a migration (where we had to take their filesystem away temporarily) could be quite short (short enough to enable opportunistic migrations, if we could see that no one was using some of the filesystems).
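The overall pattern is a bulk full send done while the filesystem is still in use, followed by a fast incremental send during the brief user-visible downtime. A rough sketch, with hypothetical pool, filesystem, and host names:

    # while users are still active: copy the bulk of the data via a snapshot
    zfs snapshot fspool/homes@migrate-1
    zfs send fspool/homes@migrate-1 | ssh newfs zfs recv -F tank/homes

    # during the short downtime window, after access has been cut off:
    # a second snapshot captures what changed, and only that is sent
    zfs snapshot fspool/homes@migrate-2
    zfs send -i @migrate-1 fspool/homes@migrate-2 | ssh newfs zfs recv tank/homes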

At this point we've gotten somewhere around four and a half years of lifetime out of our OmniOS fileservers. This is probably around what we wanted to get, especially since we never replaced the original hard drives, so they're starting to fall out of warranty coverage and hit what we consider their comfortable end of service life. Our first-generation Solaris fileservers were stretched much longer, but they had two generations of HDs and even then we were pushing it toward the end of their service life.

(The actual server hardware for both the OmniOS fileservers and the Linux iSCSI backends seems fine, so we expect to reuse it in the future once we migrate the last filesystem and then tear down the entire old environment. We will probably even reuse the data HDs, but only for less important things.)

I think I feel less emotional about this migration away from OmniOS than I did about our earlier migration from Solaris to OmniOS. Moving away from Solaris marked the end of Sun's era here (even if Sun had been consumed by Oracle by that point), but I don't have that sort of feeling about OmniOS. OmniOS was always a tool to me, although unquestionably a useful one.

(I'll write a retrospective on our OmniOS fileservers at some point, probably once the final filesystem has migrated and everything has been shut down for good. I want to have some distance and some more experience with our Linux fileservers first.)

PS: To give praise where it belongs, my co-workers did basically all of the hard, grinding work of this migration, for various reasons. Once things got rolling, I got to mostly sit back and move filesystems when they told me one was scheduled and I should do it. I also cleverly went on vacation during the final push at the end.

OmniOSMostlyEndOfService written at 21:44:16
