A hazard of our old version of OmniOS: sometimes powering off doesn't work
Two weeks ago, I powered down all of our OmniOS fileservers that
are now out of production, which is
most of them. By that, I mean that I logged in to each of them via
SSH and ran 'poweroff'. The machines disappeared from the network
and I thought nothing more of it.
This Sunday morning we had a brief power failure. In the aftermath of the power failure, three out of four of the OmniOS fileservers reappeared on the network, which we knew mostly because they sent us some email (there were no bad effects of them coming back). When I noticed them back, I assumed that this had happened because we'd set their BIOSes to 'always power on after a power failure'. This is not too crazy a setting for a production server you want up at all costs because it's a central fileserver, but it's obviously no longer the setting you want once they go out of production.
Today, I logged in to the three that had come back, ran 'poweroff'
on them again, and then later went down to the machine room to pull
out their power cords. To my surprise, when I looked at the physical
machines, they had little green power lights that claimed they were
powered on. When I plugged in a roving display and keyboard to check
their state, I discovered that all three were still powered on and
sitting displaying an OmniOS console message to the effect that they
were powering off. Well, they might have been trying to power off,
but they weren't achieving it.
I rather suspect that this is what happened two weeks ago, and why
these machines all sprang back to life after the power failure. If
OmniOS never actually powered the machines off, even a BIOS setting
of 'resume last power state after a power failure' would have powered
the machines on again, which would have booted OmniOS back up again.
Two weeks ago, I didn't go look at the physical servers or check
their power state through their lights out management interface;
it never occurred to me that 'poweroff' on OmniOS sometimes might
not actually power the machine off, especially when the machines
did drop off the network.
(One out of the four OmniOS servers didn't spring back to life after the power failure, and was powered off when I looked at the hardware. Perhaps its BIOS was set very differently, or perhaps OmniOS managed to actually power it off. They're all the same hardware and the same OmniOS version, but the server that probably managed to power off had no active ZFS pools on our iSCSI backends; the other three did.)
At this point, this is only a curiosity. If all goes well, the last OmniOS fileserver will go out of production tomorrow evening. It's being turned off as part of that, which means that I'm going to have to check that it actually powered off (and I'd better add that to the checklist I've written up).
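Since 'poweroff' can't be entirely trusted here, checking the real power state is worth scripting as part of that checklist. A minimal sketch, assuming the machines' lights out management interfaces speak IPMI and ipmitool is installed; the LOM host names, user, and password file are all made-up examples:

```shell
#!/bin/sh
# Sketch: ask each machine's lights out management interface for its
# real power state over IPMI. Host names, user, and password file are
# invented examples; adjust for your own LOM setup.

# Turn ipmitool's 'Chassis Power is on/off' line into a one-word verdict.
classify_power() {
    case "$1" in
        *"is off") echo off ;;
        *"is on")  echo on ;;
        *)         echo unknown ;;
    esac
}

for lom in fs1-lom fs2-lom fs3-lom; do
    state=$(ipmitool -I lanplus -H "$lom" -U ADMIN \
                -f /root/.ipmipass chassis power status 2>/dev/null)
    echo "$lom: $(classify_power "$state")"
done
```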
Almost all of our OmniOS machines are now out of production
Last Friday, my co-workers migrated the last filesystem from our HD-based OmniOS fileservers to one of our new Linux fileservers. With this, the only OmniOS fileserver left in production is serving a single filesystem, our central administrative filesystem, which is extremely involved to move because everything uses it all the time and knows where it is (and of course it's where our NFS automounter replacement lives, along with its data files). Moving that filesystem is going to take a bunch of planning and a significant downtime, and it will only happen after I come back from vacation.
(Unlike last time around, we haven't destroyed any pools or filesystems yet in the old world, since we didn't run into any need to.)
This migration has been in process in fits and starts since late last November, so it's taken about seven months to finish. This isn't because we have a lot of data to move (comparatively speaking); instead it's because we have a lot of filesystems with a lot of users. First you have to schedule a time for each filesystem that the users don't object to (and sometimes things come up so your scheduled time has to be abandoned), and then moving each filesystem takes a certain amount of time and boring work (so often people only want to do so many a day, so they aren't spending all of their day on this stuff). Also, our backup system is happier when we don't suddenly give it massive amounts of 'new' data to back up in a single day.
(I think this is roughly comparable to our last migration, which seems to have started at the end of August of 2014 and finished in mid-February of 2015. We've added significantly more filesystems and disk space since then.)
The MVP of the migration is clearly 'zfs send | zfs recv' (as it
always has been). Having to do the
migrations with something like
rsync would likely have been much
more painful for various reasons; ZFS snapshots and ZFS send are
things that just work, and they come with solid and extremely
reassuring guarantees. Part of their importance was that the speed
of an incremental ZFS send meant that the user-visible portion of
a migration (where we had to take their filesystem away temporarily)
could be quite short (short enough to enable opportunistic migrations,
if we could see that no one was using some of the filesystems).
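The overall pattern is worth sketching. This is an illustration rather than our actual migration procedure; the filesystem, pool, and host names are invented, and run() merely echoes each command so the outline can be read through or dry-run safely:

```shell
#!/bin/sh
# The snapshot + incremental send migration pattern, in outline.
# run() just echoes each command; drop the echo to do it for real.
run() { echo "+ $*"; }

fs=tank/homes/h1
dest=newpool/homes/h1

# 1: a full send of an initial snapshot, done in advance while the
#    filesystem is still in service; this can take as long as it takes.
run "zfs snapshot $fs@migrate-1"
run "zfs send $fs@migrate-1 | ssh newfs zfs recv -F $dest"

# 2: at the scheduled time, unmount so nothing can change any more,
#    then send only the (small) differences since the first snapshot.
run "zfs unmount $fs"
run "zfs snapshot $fs@migrate-2"
run "zfs send -i $fs@migrate-1 $fs@migrate-2 | ssh newfs zfs recv $dest"
```

Because step 2 only transfers what changed since the first snapshot, the window where users lose access to their filesystem stays short.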
At this point we've gotten somewhere around four and a half years of lifetime out of our OmniOS fileservers. This is probably around what we wanted to get, especially since we never replaced the original hard drives and so they're starting to fall out of warranty coverage and hit what we consider their comfortable end of service life. Our first generation Solaris fileservers were stretched much longer, but they had two generations of HDs and even then we were pushing it toward the end of their service life.
(The actual server hardware for both the OmniOS fileservers and the Linux iSCSI backends seems fine, so we expect to reuse it in the future once we migrate the last filesystem and then tear down the entire old environment. We will probably even reuse the data HDs, but only for less important things.)
I think I feel less emotional about this migration away from OmniOS than I did about our earlier migration from Solaris to OmniOS. Moving away from Solaris marked the end of Sun's era here (even if Sun had been consumed by Oracle by that point), but I don't have that sort of feelings about OmniOS. OmniOS was always a tool to me, although unquestionably a useful one.
(I'll write a retrospective on our OmniOS fileservers at some point, probably once the final filesystem has migrated and everything has been shut down for good. I want to have some distance and some more experience with our Linux fileservers first.)
PS: To give praise where it belongs, my co-workers did basically all of the hard, grinding work of this migration, for various reasons. Once things got rolling, I got to mostly sit back and move filesystems when they told me one was scheduled and I should do it. I also cleverly went on vacation during the final push at the end.
Some things on how ZFS dnode object IDs are allocated (which is not sequentially)
One of the core elements of ZFS is the dnode, which defines a DMU object. Within a single filesystem or other object set, dnodes have an object number (aka object id). For dnodes that are files or directories in a filesystem, this is visible as their Unix inode number, but other internal things get dnodes and thus object numbers (for example, the dnode of the filesystem's delete queue). Object ids are 64-bit numbers, and many of them can be relatively small (especially if they are object ids for internal structures, again such as the delete queue). Very large dnode numbers are uncommon, and some files and directories from early in a filesystem's life can have very small object IDs.
(For instance, the object ID of my home directory on our ZFS fileservers is '5'. I'm the only user in this filesystem.)
You might reasonably wonder how ZFS object IDs are allocated. Inspection of a ZFS filesystem will show that they are clearly not allocated sequentially, but they're also not allocated randomly. Based on an inspection of the dnode allocation source code in dmu_object.c, there seem to be two things going on to spread dnode object ids around some (but not too much).
The first thing is that dnode allocation is done from per-CPU chunks of the dnode space. The size of each chunk is set by dmu_object_alloc_chunk_shift, which by default creates 128-dnode chunks. The motivation for this is straightforward: if all of the CPUs in the system were allocating dnodes from the same area, they would all have to contend over locks on this area. Spreading out into separate chunks reduces locking contention, which means that parallel or highly parallel workloads that frequently create files on a single filesystem don't bottleneck on a shared lock.
(One reason that you might create files a lot in a parallel workload is if you're using files on the filesystem as part of a locking strategy. This is still common in things like mail servers, mail clients, and IMAP servers.)
The second thing is, well, I'm going to quote the comment in the source code to start with:
Each time we polish off a L1 bp worth of dnodes (2^12 objects), move to another L1 bp that's still reasonably sparse (at most 1/4 full). Look from the beginning at most once per txg. If we still can't allocate from that L1 block, search for an empty L0 block, which will quickly skip to the end of the metadnode if no nearby L0 blocks are empty. This fallback avoids a pathology where full dnode blocks containing large dnodes appear sparse because they have a low blk_fill, leading to many failed allocation attempts. [...]
(In reading the code a bit, I think this comment means 'L2 block' instead of 'L0 block'.)
To understand a bit more about this, we need to know about two things. First, we need to know that dnodes themselves are stored in another DMU object, and this DMU object stores data in the same way as all others do, using various levels of indirect blocks. Then we need to know about indirect blocks themselves. L0 blocks directly hold data (in this case the actual dnodes), while L1 blocks hold pointers to L0 blocks and L2 blocks hold pointers to L1 blocks.
(You can see examples of this structure for regular files in the
zdb output in this entry and this
entry. If I'm doing the math right,
for dnodes an L0 block normally holds 32 dnodes and an L<N> block
can address up to 128 L<N-1> blocks, through block pointers.)
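If those sizes are right, the arithmetic can be checked directly. The 512-byte dnode size, 16 KB metadnode block size, and 128-byte block pointer size used here are my assumptions about the usual defaults:

```shell
# Checking the indirect-block arithmetic from the text, under assumed
# default sizes: 512-byte dnodes, 16 KB metadnode blocks, and 128-byte
# block pointers in indirect blocks.
dnode_size=512
block_size=$((16 * 1024))
blkptr_size=128

dnodes_per_l0=$((block_size / dnode_size))    # 32 dnodes per L0 block
l0_per_l1=$((block_size / blkptr_size))       # 128 L0 blocks per L1 bp
dnodes_per_l1=$((dnodes_per_l0 * l0_per_l1))  # 4096 = 2^12, matching the comment

echo "dnodes per L1 bp: $dnodes_per_l1"
```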
So, what appears to happen is that at first, the per-CPU allocator gets its chunks sequentially (for different CPUs, or the same CPU) from the same L1 indirect block, which covers 4096 dnodes. When we exhaust all of the 128-dnode chunks in a single group of 4096, we don't move to the sequentially next group of 4096; instead we search around for a sufficiently empty group, and switch to it (where a 'sufficiently empty' group is one with at most 1024 dnodes already allocated). If there is no such group, I think that we may wind up skipping to the end of the currently allocated dnodes and getting a completely fresh empty block of 4096.
If I'm right, the net effect of this is to smear out dnode allocations and especially reallocations over an increasingly large portion of the lower dnode object number space. As your filesystem gets used and files get deleted, many of the lower 4096-dnode groups will have some or even many free dnodes, but not the 3072 that they need to be eligible to be selected for further allocations. This can eventually push dnode allocations to relatively high object numbers even though you may not have anywhere near that many dnodes in use on the filesystem. This is not guaranteed, though, and you may still reuse dnode numbers.
(For example, I just created a new file in my home directory. My home directory's filesystem has 1983310 dnodes used right now, but the inode number (and thus dnode object number) that my new test file got was 1804696.)
One of our costs of using OmniOS was not having 10G networking
OmniOS has generally been pretty good to us over the lifetime of our second generation ZFS fileservers, but as we've migrated various filesystems from our OmniOS fileservers to our new Linux fileservers, it's become clear that one of the costs we paid for using OmniOS was not having 10G networking.
We certainly started out intending to have 10G networking on OmniOS; our hardware was capable of it, with Intel 10G-T chipsets, and OmniOS seemed happy to drive them at decent speeds. But early on we ran into a series of fatal problems with the Intel ixgbe driver which we never saw any fixes for. We moved our OmniOS machines (and our iSCSI backends) back to 1G, and they have stayed there ever since. When we made this move, we did not have detailed system metrics on things like NFS bandwidth usage by clients, and anyway almost all of our filesystems were on HDs, so 1G seemed like it should be fine. And indeed, we mostly didn't see obvious and glaring problems, especially right away.
What setting up a metrics system (even only on our NFS clients) and
then later moving some filesystems from OmniOS (at 1G) to Linux (at
10G) made clear was that on some filesystems, we had definitely
been hitting the 1G bandwidth limit and doing so had real impacts.
The filesystem this was most visible on is the one that holds
/var/mail, our central location for people's mailboxes (ie, their
IMAP inbox). This was always on SSDs even on OmniOS, and once we
started really looking it was clearly bottlenecked at 1G. It was
one of the early filesystems we moved to the Linux fileservers, and
the improvement was very visible. Our IMAP server, which has 10G
itself, now routinely has bursts of over 200 Mbps inbound and
sometimes sees brief periods of almost saturated network bandwidth.
More importantly, the IMAP server's performance is visibly better;
it is less loaded and more responsive, especially at busy times.
(A contributing factor to this is that any number of people have
very big inboxes, and periodically our IMAP server winds up having
to read through all of such an inbox. This creates a very asymmetric
traffic pattern, with huge inbound bandwidth from the
fileserver to the IMAP server but very little outbound traffic.)
It's less clear how much of a cost we paid for HD-based filesystems, but it seems pretty likely that we paid some cost, especially since our OmniOS fileservers were relatively large (too large, in fact). With lots of filesystems, disks, and pools on each fileserver, it seems likely that there would have been periods where each fileserver could have reached inbound or outbound network bandwidth rates above 1G, if they'd had 10G networking.
(And this excludes backups, where it seems quite likely that 10G would have sped things up somewhat. I don't consider backups as important as regular fileserver NFS traffic because they're less time and latency sensitive.)
At the same time, it's quite possible that this cost was still worth paying in order to use OmniOS back then instead of one of the alternatives. ZFS on Linux was far less mature in 2013 and 2014, and I'm not sure how well FreeBSD would have worked, especially if we insisted on keeping a SAN based design with iSCSI.
(If we had had lots of money, we might have attempted to switch to other 10G networking cards, probably SFP+ ones instead of 10G-T (which would have required switch changes too), or to commission someone to fix up the ixgbe driver, or both. But with no funds for either, it was back to 1G for us and then the whole thing was one part of why we moved away from Illumos.)
A ZFS resilver can be almost as good as a scrub, but not quite
We do periodic scrubs of our pools, roughly every four weeks on a revolving schedule (we only scrub one pool per fileserver at once, and only over the weekend, so we can't scrub all pools on one of our HD based fileservers in one weekend). However, this weekend scrubbing doesn't happen if there's something else more important happening on the fileserver. Normally there isn't, but one of our iSCSI backends didn't come back up after our power outage this Thursday night. We have spare backends, so we added one in to the affected fileserver and started the process of resilvering everything onto the new backend's disks to restore redundancy to all of our mirrored vdevs.
I've written before about the difference between scrubs and resilvers, which is that a resilver potentially reads and validates less than a scrub does. However, we only have two way mirrors and we lost one side of all of them in the backend failure, so resilvering all mirrors has to read all of the metadata and data on every remaining device of every pool. At first, I thought that this was fully equivalent to a scrub and thus we had effectively scrubbed all of our pools on that fileserver, putting us ahead of our scrub schedule instead of behind it. Then I realized that it isn't, because resilvering doesn't verify that the newly written data on the new devices is good.
ZFS doesn't have any explicit 'read after write' checks, although it will naturally do some amount of reads from your new devices just as part of balancing reads. So although you know that everything on your old disks is good, you can't have full confidence that your new disks have correct copies of everything. If something got corrupted on the way to the disk or the disk has a bad spot that wasn't spotted by its electronics, you won't know until it's read back, and the only way to force that is with an explicit scrub.
For our purposes this is still reasonably good. We've at least checked half of every pool, so right now we definitely have one good copy of all of our data. But it's not quite the same as scrubbing the pools and we definitely don't want to reset all of the 'last scrubbed at time X' markers for the pools to right now.
(If you have three or four way mirrors, as we have had in the past, a resilver doesn't even give you this because it only needs to read each piece of data or metadata from one of your remaining N copies.)
Our plan for handling TRIM'ing our ZFS fileserver SSDs
The versions of ZFS that we're running on our fileservers (both
the old and the new) don't support using TRIM
on drives in ZFS pools. Support for
TRIM has been in FreeBSD ZFS
for a while,
but it only just landed in the ZFS on Linux development version
and it's not in Illumos. Given our general upgrade plans, we're also not likely to get
TRIM support over the likely production lifetime of our current
ZFS SSDs through upgrading the OS and ZFS versions later. So you
might wonder what our plans are to deal with how SSD performance
can decrease when they think they're all filled up, if you don't
TRIM them or otherwise deallocate blocks every so often.
Honestly, the first part of our plan is to ignore the issue unless we see signs of performance problems. This is not ideal but it is the simplest approach. It's reasonably likely that our ZFS fileservers will be more limited by NFS and networking than by SSD performance, and as far as I understand things, nominally full SSDs mostly suffer from write performance issues, not read performance. Our current view (only somewhat informed by actual data) is that our read volume is significantly higher than our write volume. We certainly aren't currently planning any sort of routine preventative work here, and we wouldn't unless we saw problem signs.
If we do see problem signs and do need to clear SSDs, our plan is
to do the obvious brute force thing in a ZFS setup with redundancy.
Rather than try to
TRIM SSDs in place, we'll entirely spare out
a given SSD so that it has no live data on it, and then completely
clear it, probably using Linux's
blkdiscard. We might do this in place on
a production fileserver, or we might go to the extra precaution of
pulling the SSD out entirely, swapping in a freshly cleared one,
and clearing the old SSD on a separate machine. Doing this swap has
the twin advantages that we're not risking accidentally clearing
the wrong SSD on the fileserver and we don't have to worry about
the effects of an extra-long, extra-slow SATA command on the rest
of the system and the other drives.
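In outline, the clearing procedure looks something like this. A sketch only, with made-up pool and device names; run() merely echoes each command, so nothing here can actually discard a disk (blkdiscard irrecoverably throws away every block on the device it's given):

```shell
#!/bin/sh
# The brute-force SSD clearing plan in outline; pool and device names
# are invented. run() only echoes commands so this sketch is harmless.
run() { echo "+ $*"; }

pool=fs3-pool
old=/dev/disk/by-id/ata-SSD-OLD-SERIAL
new=/dev/disk/by-id/ata-SSD-NEW-SERIAL

# 1: spare the tired SSD out of the pool; ZFS resilvers onto the new one.
run "zpool replace $pool $old $new"

# 2: wait for 'zpool status' to show that the resilver has finished,
#    so the old SSD genuinely holds no live data.
run "zpool status $pool"

# 3: only then discard every block on the old SSD, ideally on a
#    separate machine so there's no chance of clearing the wrong disk.
run "blkdiscard $old"
```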
(This plan, such as it is, is not really new with our current generation Linux fileservers. We've had one OmniOS fileserver that used SSDs for a few special pools, and this was always our plan for dealing with any clear problems due to the SSDs slowing down due to being full up. We haven't had to use it, but then we haven't really gone looking for performance problems with its SSDs. They seem to still run fast enough after four or more years, and so far that's good enough for us.)
Drifting away from OmniOS (CE)
Toward the end of last year (2018), the OmniOS CE people got around to migrating the OmniOS user mailing list from its old home on OmniTI's infrastructure to a new home. When they did this, they opted not to move over the existing list membership; instead, people who were still interested had to actively subscribe themselves to the new mailing list. At first, when I got the notice about this I thought I'd subscribe to the new list. Then I thought about it a bit more and quietly let my subscription to omnios-discuss tacitly lapse when the old mailing lists were completely decommissioned at the end of the year.
The reality is that while we still run our OmniOS fileservers, this is only because our migration from them to our next generation of servers is a slow process. We have been quietly drifting away from OmniOS ever since we made the decision to use Linux instead in our next generation, and that has only sped up now that we have new fileservers in production. Our OmniOS machines are now in a de facto 'end of life' maintenance mode; we touch them as little as possible, and if they were to develop problems our response would be to accelerate the migration of filesystems away from them.
(On top of that, my ability to contribute to omnios-discuss has been tenuous in general for some time. Partly this is because we are so far behind in OmniOS versions (we're still on r151014, and yes we know that is well out of support at this point), and partly this is because my OmniOS knowledge is rusting away from disuse. The code for my DTrace scripts is increasingly a foreign land, for example (although I remember how to use them and we still rely on them for diagnostics at times).)
I feel sentimentally sad about this. Although we only ran it for one generation of fileservers, which will amount to five years or so by the time we're done, OmniOS itself was mostly quite good for us and the OmniTI and OmniOS people on omnios-discuss were great. It was a good experience, even though we paid a price for choosing OmniOS, and I'm still reasonably convinced that it was our best choice at the time we made it.
(I'll feel more sentimental when we turn off the first OmniOS ex-production machine, and again when the last one goes out of production, as our Solaris 10 machines eventually did. We'll be lucky if that happens before the end of summer, though.)
A bit more on ZFS's per-pool performance statistics
In my entry on ZFS's per-pool stats, I said:
In terms of Linux disk IO stats, the *time stats are the equivalent
of the use stat, and the *lentime stats are the equivalent of the
aveq field. There is no equivalent of the Linux ruse and wuse fields,
ie no field that gives you the total time taken by all completed
'wait' or 'run' IO. I think that there's ways to calculate much of
the same information you can get for Linux disk IO from what ZFS
(k)stats give you, but that's another entry.
The discussion of the
*lentime stats in the manpage and the relevant header
are very complicated and abstruse. I am sure they make sense to
people for whom the phrase 'a Riemann sum' is perfectly natural,
but I am not such a person.
Having ground through a certain amount of arguments with myself and
experimentation, I now believe that the ZFS *lentime stats are
functionally equivalent to the Linux ruse and wuse fields. They are
not quite identical, but you can use them to make the same sorts of
calculations that you can for Linux. In particular, I believe that
an almost completely accurate value for the average service time
for ZFS pool IOs is:
avgtime = (rlentime + wlentime) / (reads + writes)
The important difference between the ZFS
*lentime metrics and
wuse is that Linux's times include only
completed IOs, while the ZFS numbers also include the running time
for currently outstanding IOs (which are not counted in
writes). However, much of the time this is only going to be a
small difference and so the 'average service time' you calculate
will be almost completely right. This is especially true if you're
doing this over a relatively long time span compared to the actual
typical service time, and if there's been lots of IO over that time.
When there is a discrepancy, you're going to get an average service time that is somewhat higher than it really should be. This is not a terribly bad problem; at least it's not hiding issues by appearing too low.
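To make the formula concrete, here's a worked example with made-up kstat values. As I understand it, the *lentime kstats accumulate in nanoseconds, so the division gives an average service time in nanoseconds per IO:

```shell
# A worked example of avgtime = (rlentime + wlentime) / (reads + writes),
# with invented sample values; *lentime is taken to be in nanoseconds.
rlentime=1200000000    # ns accumulated for reads
wlentime=1800000000    # ns accumulated for writes
reads=20000
writes=10000

awk -v rl="$rlentime" -v wl="$wlentime" -v r="$reads" -v w="$writes" \
    'BEGIN { printf "avgtime: %.2f ms\n", (rl + wl) / (r + w) / 1e6 }'
# prints 'avgtime: 0.10 ms'
```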
'Scanned' versus 'issued' numbers for ZFS scrubs (and resilvers)
Sufficiently recent versions of ZFS have new 'zpool status' output
during scrubs and resilvers. The traditional old output looks like:
  scan: scrub in progress since Sat Feb 9 18:30:40 2019
        125G scanned out of 1.74T at 1.34G/s, 0h20m to go
        0B repaired, 7.02% done
(As you can probably tell from the IO rate, this is a SSD-based pool.)
The new output adds an additional '<X> issued at <RATE>' note in the second line, and in fact you can get some very interesting output in it:
  scan: scrub in progress since Sat Feb 9 18:36:33 2019
        215G scanned at 2.24G/s, 27.6G issued at 294M/s, 215G total
        0B repaired, 12.80% done, 0 days 00:10:54 to go
Or (with just the important line):
271G scanned at 910M/s, 14.5G issued at 48.6M/s, 271G total
In both cases, this claims to have 'scanned' the entire pool but has only 'issued' a much smaller amount of IO. As it turns out, this is a glaring clue as to what is going on, which is that these are the new sequential scrubs in action. Sequential scrubs (and resilvers) split the non-sequential process of scanning the pool into two sides, scanning through metadata to figure out what IOs to issue and then, separately, issuing the IOs after they have been sorted into order (I am pulling this from this presentation, via). A longer discussion of this is in the comment at the start of ZFS on Linux's dsl_scan.c.
This split is what the new 'issued' number is telling you about.
In sequential scrubs and resilvers, 'scanned' is how much metadata
and data ZFS has been able to consider and queue up IO for, while
'issued' is how much IO has been actively queued to vdevs. Note
that it is not physical IO; instead it is progress through what
'zpool list' reports as ALLOC space, as covered in my entry
on ZFS scrub rates and speeds.
(All of these pools I'm showing output from use mirrored vdevs, so the actual physical IO is twice the 'issued' figures.)
As we can see from these examples, it is possible for ZFS to completely 'scan' your pool before issuing much IO. This is generally going to require that your pool is relatively small and also that you have a reasonable amount of memory, because ZFS limits how much memory it will use for all of those lists of not yet issued IOs that it is sorting into order. Once your pool is fully scanned, the reported scan rate will steadily decay, because it's computed based on the total time the scrub or resilver has been running, not the amount of time that ZFS took to hit 100% scanned.
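For what it's worth, pulling the two figures apart from the new-format line is easy enough; here's a small sketch using the sample line from above:

```shell
# Split the 'scanned' and 'issued' figures out of the new-format
# second line of 'zpool status' scrub output.
line='215G scanned at 2.24G/s, 27.6G issued at 294M/s, 215G total'

echo "$line" | awk '{
    gsub(/,/, "")              # drop commas so the fields split cleanly
    print "scanned:", $1, "at", $4
    print "issued:", $5, "at", $8
}'
```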
(In the current ZFS on Linux code, this memory limit appears to be a per-pool one. On the one hand this means that you can scan several pools at once without one pool limiting the others. On the other hand, this means that scanning multiple pools at once may use more memory than you're expecting.)
Sequential scrubs and resilvers are in FreeBSD 12 and will appear in ZFS on Linux 0.8.0 whenever that is released (ZoL is currently at 0.8.0-rc3). It doesn't seem to be in Illumos yet, somewhat to my surprise.
A bit of Sun's history that still lingers on in Illumos
The uname command (and system call) exist to give you various
information about the machine you're on.
For example, what Unix it runs, which is handy if you have scripts
(or programs) that are run on multiple Unixes where you want to do
different things depending on which Unix you're on.
The result from 'uname -s', the name of the operating system, is
pretty straightforward (unlike some of the other uname options; go
ahead, try to guess what 'uname -i' is going to give you on a
random Unix). On FreeBSD you get
FreeBSD, on OpenBSD you get
OpenBSD, on Linux you get
Linux or, if you insist with 'uname -o',
GNU/Linux. On OmniOS and in fact any Illumos system, well:
$ uname -s
SunOS
Once upon a time there was Sun Microsystems, who made some of the first Unix workstations. Their Unix was a version of BSD Unix, and like basically every early Unix company they couldn't actually call it 'Unix' for various reasons. So they called it SunOS, and it had a storied history that is too long to cover here (especially SunOS 3.x and 4.x). It of course identified itself as 'SunOS' in various things, because that was its name.
In the early 1990s, Sun changed the name of their Unix from SunOS
to Solaris at the same time as they replaced the code base with one
based on System V Release 4 (which they had
had a hand in creating). Okay, officially 'SunOS 5' was there as a
component of this Solaris thing, but good luck finding much mention
of that or very many people who considered 'SunOS 5' to be a
continuation of SunOS. However, 'uname -s' (still) reported
'SunOS', possibly because of that marketing decision.
(I'm not sure if SunOS 3 or SunOS 4 had a
uname command, since
it came from System V. By the way, this history of 'SunOS 5' being
the base component of Solaris is probably why 'uname -r' reports
the release of Illumos as '5.11' instead of '11'.)
Once the early versions of Solaris reported themselves to be
'SunOS', Sun was stuck with it in the name of backward compatibility.
Scripts and programs that wanted to check for Solaris knew to check
for an OS name of SunOS and then a 'uname -r' of 5.* (and
as SunOS 4 faded away people stopped bothering with the second
check); changing the reported operating system name would break
them all. No one was going to do that, especially not Sun.
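The traditional check described here can be written as a small shell function (a sketch; the function name is mine):

```shell
#!/bin/sh
# The traditional 'is this Solaris?' check from the text: an OS name
# of SunOS plus a 'uname -r' of 5.* means Solaris or, these days, an
# Illumos distribution.
is_sunos5() {
    case "$1/$2" in
        SunOS/5.*) return 0 ;;
        *)         return 1 ;;
    esac
}

if is_sunos5 "$(uname -s)" "$(uname -r)"; then
    echo "Solaris or Illumos"
else
    echo "some other Unix"
fi
```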
When OpenSolaris spawned from Solaris, of course the 'uname -s'
output had to stay the same. When OpenSolaris became Illumos, the
same thing was true. And so today, our OmniOS machines cheerfully
tell us that they're running SunOS, an operating system name that
is now more than 30 years old. It's a last lingering trace of a
company that changed the world.
(In Illumos, this is hard-coded in uts/common/os/vers.c.)
(I was reminded of all of this recently as I was changing one of
our fileserver management scripts so that it would refuse to run
on anything except our OmniOS fileservers. Checking 'uname -s'
is really the correct way to do this, which caused me to actually
run it on our OmniOS machines for the first time in a while.)