A ZFS resilver can be almost as good as a scrub, but not quite
We do periodic scrubs of our pools, roughly every four weeks on a revolving schedule (we only scrub one pool per fileserver at once, and only over the weekend, so we can't scrub all pools on one of our HD based fileservers in one weekend). However, this weekend scrubbing doesn't happen if there's something else more important happening on the fileserver. Normally there isn't, but one of our iSCSI backends didn't come back up after our power outage this Thursday night. We have spare backends, so we added one in to the affected fileserver and started the process of resilvering everything onto the new backend's disks to restore redundancy to all of our mirrored vdevs.
I've written before about the difference between scrubs and resilvers, which is that a resilver potentially reads and validates less than a scrub does. However, we only have two way mirrors and we lost one side of all of them in the backend failure, so resilvering all mirrors has to read all of the metadata and data on every remaining device of every pool. At first, I thought that this was fully equivalent to a scrub and thus we had effectively scrubbed all of our pools on that fileserver, putting us ahead of our scrub schedule instead of behind it. Then I realized that it isn't, because resilvering doesn't verify that the newly written data on the new devices is good.
ZFS doesn't have any explicit 'read after write' checks, although it will naturally do some amount of reads from your new devices just as part of balancing reads. So although you know that everything on your old disks is good, you can't have full confidence that your new disks have correct copies of everything. If something got corrupted on the way to the disk or the disk has a bad spot that wasn't spotted by its electronics, you won't know until it's read back, and the only way to force that is with an explicit scrub.
For our purposes this is still reasonably good. We've at least checked half of every pool, so right now we definitely have one good copy of all of our data. But it's not quite the same as scrubbing the pools and we definitely don't want to reset all of the 'last scrubbed at time X' markers for the pools to right now.
(If you have three or four way mirrors, as we have had in the past, a resilver doesn't even give you this because it only needs to read each piece of data or metadata from one of your remaining N copies.)
Our plan for handling TRIM'ing our ZFS fileserver SSDs
The versions of ZFS that we're running on our fileservers (both the old and the new) don't support using TRIM on drives in ZFS pools. Support for TRIM has been in FreeBSD ZFS for a while, but it only just landed in the ZFS on Linux development version and it's not in Illumos. Given our general upgrade plans, we're also not likely to get TRIM support over the likely production lifetime of our current ZFS SSDs through upgrading the OS and ZFS versions later. So you might wonder what our plans are to deal with how SSD performance can decrease when they think they're all filled up, if you don't TRIM them or otherwise deallocate blocks every so often.
Honestly, the first part of our plan is to ignore the issue unless we see signs of performance problems. This is not ideal but it is the simplest approach. It's reasonably likely that our ZFS fileservers will be more limited by NFS and networking than by SSD performance, and as far as I understand things, nominally full SSDs mostly suffer from write performance issues, not read performance. Our current view (only somewhat informed by actual data) is that our read volume is significantly higher than our write volume. We certainly aren't currently planning any sort of routine preventative work here, and we wouldn't unless we saw problem signs.
If we do see problem signs and do need to clear SSDs, our plan is
to do the obvious brute force thing in a ZFS setup with redundancy.
Rather than try to
TRIM SSDs in place, we'll entirely spare out
a given SSD so that it has no live data on it, and then completely
clear it, probably using Linux's
blkdiscard. We might do this in place on
a production fileserver, or we might go to the extra precaution of
pulling the SSD out entirely, swapping in a freshly cleared one,
and clearing the old SSD on a separate machine. Doing this swap has
the twin advantages that we're not risking accidentally clearing
the wrong SSD on the fileserver and we don't have to worry about
the effects of an extra-long, extra-slow SATA command on the rest
of the system and the other drives.
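If it helps to make the plan concrete, here's a sketch of the procedure as I might automate it, in Python that only prints the commands instead of running them (the pool and device names are made up, and in real life you'd wait for 'zpool status' to show the resilver finished before discarding anything):

```python
def clear_ssd_plan(pool, old_dev, spare_dev):
    """Return the sequence of commands we'd run to clear one SSD.

    This deliberately just builds the command lines instead of running
    them; the whole point of the exercise is to be careful about which
    device gets blkdiscard'd.
    """
    return [
        # Move all live data off the old SSD onto the spare.
        f"zpool replace {pool} {old_dev} {spare_dev}",
        # (wait here until 'zpool status' shows the resilver is done
        # and the old device has been detached from the pool)
        # Now the old SSD holds no live data and can be fully cleared.
        f"blkdiscard /dev/{old_dev}",
    ]

# Hypothetical pool and device names, purely for illustration.
for cmd in clear_ssd_plan("fs1-pool", "sdc", "sdj"):
    print(cmd)
```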
(This plan, such as it is, is not really new with our current generation Linux fileservers. We've had one OmniOS fileserver that used SSDs for a few special pools, and this was always our plan for dealing with any clear problems due to the SSDs slowing down due to being full up. We haven't had to use it, but then we haven't really gone looking for performance problems with its SSDs. They seem to still run fast enough after four or more years, and so far that's good enough for us.)
Drifting away from OmniOS (CE)
Toward the end of last year (2018), the OmniOS CE people got around to migrating the OmniOS user mailing list from its old home on OmniTI's infrastructure to a new home. When they did this, they opted not to move over the existing list membership; instead, people who were still interested had to actively subscribe themselves to the new mailing list. When I got the notice about this, at first I thought I'd subscribe to the new list. Then I thought about it a bit more and quietly let my subscription to omnios-discuss lapse when the old mailing lists were completely decommissioned at the end of the year.
The reality is that while we still run our OmniOS fileservers, this is only because our migration from them to our next generation of servers is a slow process. We have been quietly drifting away from OmniOS ever since we made the decision to use Linux instead in our next generation, and that has only sped up now that we have new fileservers in production. Our OmniOS machines are now in a de facto 'end of life' maintenance mode; we touch them as little as possible, and if they were to develop problems our response would be to accelerate the migration of filesystems away from them.
(On top of that, my ability to contribute to omnios-discuss has been tenuous in general for some time. Partly this is because we are so far behind in OmniOS versions (we're still on r151014, and yes we know that is well out of support at this point), and partly this is because my OmniOS knowledge is rusting away from disuse. The code for my DTrace scripts is increasingly a foreign land, for example (although I remember how to use them and we still rely on them for diagnostics at times).)
I feel sentimentally sad about this. Although we only ran it for one generation of fileservers, which will amount to five years or so by the time we're done, OmniOS itself was mostly quite good for us and the OmniTI and OmniOS people on omnios-discuss were great. It was a good experience, even though we paid a price for choosing OmniOS, and I'm still reasonably convinced that it was our best choice at the time we made it.
(I'll feel more sentimental when we turn off the first OmniOS ex-production machine, and again when the last one goes out of production, as our Solaris 10 machines eventually did. We'll be lucky if that happens before the end of summer, though.)
A bit more on ZFS's per-pool performance statistics
In my entry on ZFS's per-pool stats, I said:
In terms of Linux disk IO stats, the *time stats are the equivalent of the use stat, and the *lentime stats are the equivalent of the aveq field. There is no equivalent of the Linux wuse fields, ie no field that gives you the total time taken by all completed 'wait' or 'run' IO. I think that there's ways to calculate much of the same information you can get for Linux disk IO from what ZFS (k)stats give you, but that's another entry.
The discussions of the *lentime stats in the manpage and the relevant header are very complicated and abstruse. I am sure they make sense to people for whom the phrase 'a Riemann sum' is perfectly natural, but I am not such a person.
Having ground through a certain amount of arguments with myself and experimentation, I now believe that the ZFS *lentime stats are functionally equivalent to the Linux wuse fields. They are not quite identical, but you can use them to make the same sorts of calculations that you can for Linux. In particular, I believe that an almost completely accurate value for the average service time for ZFS pool IOs is:

avgtime = (rlentime + wlentime) / (reads + writes)
The important difference between the ZFS
*lentime metrics and
wuse is that Linux's times include only
completed IOs, while the ZFS numbers also include the running time
for currently outstanding IOs (which are not counted in reads and
writes). However, much of the time this is only going to be a
small difference and so the 'average service time' you calculate
will be almost completely right. This is especially true if you're
doing this over a relatively long time span compared to the actual
typical service time, and if there's been lots of IO over that time.
When there is an error, you're going to get an average service time that is higher than it really should be. This is not a terribly bad problem; it's at least not hiding issues by appearing too low.
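To make this concrete, here's a little sketch in Python of the calculation (the sample numbers are invented; in ZFS on Linux the kstat times are in nanoseconds, and in practice you'd use deltas between two samples rather than the raw since-boot counters):

```python
def avg_service_time_ms(rlentime, wlentime, reads, writes):
    """Average 'run' + 'wait' time per completed IO, in milliseconds.

    rlentime/wlentime are the cumulative queue-length*time integrals
    from the pool's kstat (in nanoseconds); reads/writes are completed
    IO counts. Feed this deltas between two samples for an interval.
    """
    ios = reads + writes
    if ios == 0:
        return 0.0
    return (rlentime + wlentime) / ios / 1e6  # ns -> ms

# Hypothetical interval: 4.2 seconds of summed run time and 0.3 seconds
# of summed wait time over 15000 reads and 5000 writes.
print(avg_service_time_ms(4.2e9, 0.3e9, 15000, 5000))  # 0.225 ms per IO
```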
'Scanned' versus 'issued' numbers for ZFS scrubs (and resilvers)
Sufficiently recent versions of ZFS have new '
zpool status' output
during scrubs and resilvers. The traditional old output looks like:
  scan: scrub in progress since Sat Feb 9 18:30:40 2019
    125G scanned out of 1.74T at 1.34G/s, 0h20m to go
    0B repaired, 7.02% done
(As you can probably tell from the IO rate, this is a SSD-based pool.)
The new output adds an additional '<X> issued at <RATE>' note in the second line, and in fact you can get some very interesting output in it:
  scan: scrub in progress since Sat Feb 9 18:36:33 2019
    215G scanned at 2.24G/s, 27.6G issued at 294M/s, 215G total
    0B repaired, 12.80% done, 0 days 00:10:54 to go
Or (with just the important line):
271G scanned at 910M/s, 14.5G issued at 48.6M/s, 271G total
In both cases, this claims to have 'scanned' the entire pool but has only 'issued' a much smaller amount of IO. As it turns out, this is a glaring clue as to what is going on, which is that these are the new sequential scrubs in action. Sequential scrubs (and resilvers) split the non-sequential process of scanning the pool into two sides, scanning through metadata to figure out what IOs to issue and then, separately, issuing the IOs after they have been sorted into order (I am pulling this from this presentation, via). A longer discussion of this is in the comment at the start of ZFS on Linux's dsl_scan.c.
This split is what the new 'issued' number is telling you about.
In sequential scrubs and resilvers, 'scanned' is how much metadata
and data ZFS has been able to consider and queue up IO for, while
'issued' is how much IO has been actively queued to vdevs. Note
that it is not physical IO; instead it is progress through what
'zpool list' reports as
ALLOC space, as covered in my entry
on ZFS scrub rates and speeds.
(All of these pools I'm showing output from use mirrored vdevs, so the actual physical IO is twice the 'issued' figures.)
As we can see from these examples, it is possible for ZFS to completely 'scan' your pool before issuing much IO. This is generally going to require that your pool is relatively small and also that you have a reasonable amount of memory, because ZFS limits how much memory it will use for all of those lists of not yet issued IOs that it is sorting into order. Once your pool is fully scanned, the reported scan rate will steadily decay, because it's computed based on the total time the scrub or resilver has been running, not the amount of time that ZFS took to hit 100% scanned.
(In the current ZFS on Linux code, this memory limit appears to be a per-pool one. On the one hand this means that you can scan several pools at once without one pool limiting the others. On the other hand, this means that scanning multiple pools at once may use more memory than you're expecting.)
Sequential scrubs and resilvers are in FreeBSD 12 and will appear in ZFS on Linux 0.8.0 whenever that is released (ZoL is currently at 0.8.0-rc3). It doesn't seem to be in Illumos yet, somewhat to my surprise.
A bit of Sun's history that still lingers on in Illumos
The uname command (and system call) exists to give you various information about the machine you're on. For example, what Unix it runs, which is handy if you have scripts (or programs) that are run on multiple Unixes where you want to do different things depending on the Unix.
The result from 'uname -s', the name of the operating system, is pretty straightforward (unlike some of the other uname options; go ahead, try to guess what 'uname -i' is going to give you on a random Unix). On FreeBSD you get FreeBSD, on OpenBSD you get OpenBSD, on Linux you get Linux or, if you insist with 'uname -o', GNU/Linux. On OmniOS and in fact any Illumos system, well:

	$ uname -s
	SunOS
Once upon a time there was Sun Microsystems, who made some of the first Unix workstations. Their Unix was a version of BSD Unix, and like basically every early Unix company they couldn't actually call it 'Unix' for various reasons. So they called it SunOS, and it had a storied history that is too long to cover here (especially SunOS 3.x and 4.x). It of course identified itself as 'SunOS' in various things, because that was its name.
In the early 1990s, Sun changed the name of their Unix from SunOS
to Solaris at the same time as they replaced the code base with one
based on System V Release 4 (which they had
had a hand in creating). Okay, officially 'SunOS 5' was there as a
component of this Solaris thing, but good luck finding much mention
of that or very many people who considered 'SunOS 5' to be a
continuation of SunOS. However, 'uname -s' (still) reported
'SunOS', possibly because of that marketing decision.
(I'm not sure if SunOS 3 or SunOS 4 had a uname command, since it came from System V. By the way, this history of 'SunOS 5' being the base component of Solaris is probably why 'uname -r' reports the release of Illumos as '5.11' instead of '11'.)
Once the early versions of Solaris reported themselves to be
'SunOS', Sun was stuck with it in the name of backward compatibility.
Scripts and programs that wanted to check for Solaris knew to check
for an OS name of
SunOS and then a '
uname -r' of 5.* (and
as SunOS 4 faded away people stopped bothering with the second
check); changing the reported operating system name would break
them all. No one was going to do that, especially not Sun.
When OpenSolaris spawned from Solaris, of course the 'uname -s'
output had to stay the same. When OpenSolaris became Illumos, the
same thing was true. And so today, our OmniOS machines cheerfully
tell us that they're running SunOS, an operating system name that
is now more than 30 years old. It's a last lingering trace of a
company that changed the world.
(In Illumos, this is hard-coded in uts/common/os/vers.c.)
(I was reminded of all of this recently as I was changing one of
our fileserver management scripts so that it would refuse to run
on anything except our OmniOS fileservers. Checking 'uname -s'
is really the correct way to do this, which caused me to actually
run it on our OmniOS machines for the first time in a while.)
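For illustration, the sort of check involved looks like this (a sketch in Python rather than our actual management script; the function names are mine):

```python
import platform

def is_sunos_5():
    """True if this machine reports itself as SunOS 5.x, i.e. Solaris
    or an Illumos distribution such as OmniOS."""
    return looks_like_sunos_5(platform.system(), platform.release())

# The check itself, factored out so it can be exercised with arbitrary
# uname values. This is the classic two-part test: OS name of SunOS,
# then a release of 5.* (the second half mattered back when SunOS 4
# machines were still around).
def looks_like_sunos_5(sysname, release):
    return sysname == "SunOS" and release.startswith("5.")

print(looks_like_sunos_5("SunOS", "5.11"))    # what Illumos reports
print(looks_like_sunos_5("Linux", "4.15.0"))
```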
The potential risk to ZFS created by the shift in its userbase
The obvious conclusion we can draw from FreeBSD ZFS's shift to being based on ZFS on Linux is that the center of gravity of open source ZFS development has shifted to ZFS on Linux. FreeBSD ZFS is shifting its upstream because ZFS on Linux is increasingly where new development happens. A more imprecise and deeper conclusion is that in some sense, the ZFS userbase as a whole is increasingly shifting to Linux.
(It may be that the total numbers of people using ZFS on FreeBSD and Illumos is higher than the people using ZFS on Linux. But if so, the users of FreeBSD and Illumos ZFS don't seem to drive development in the way that happens with ZFS on Linux. There are many possible explanations for this, of course, because there are many factors involved.)
Unfortunately, I think that this shift creates risk due to a consequence of ZFS on Linux not being GPL-compatible, which is that working with the Linux kernel's ever-changing internal API could become enough of a problem that ZFS on Linux development could end up coming to a halt. If ZFS as a whole is increasingly dependent on ZFS on Linux development, and ZFS on Linux development is partly hostage to the not particularly friendly views of Linux kernel developers, a problem there affects not just ZFS on Linux but everyone.
At least some of the users of ZFS on Linux are OS-agnostic and are using Linux because they feel it is their best overall choice (that's certainly our view for our new fileservers). If ZoL ground to a halt and stopped being viable, they'd switch over to other OSes (FreeBSD for us). But the question there is whether they'd bring the development resources with them, so that people would do and fund more ZFS development on non-Linux platforms, or if some of the work that is currently being done to ZoL would basically just evaporate.
(One of the things I wonder about is if ZFS on Linux's existence outside the kernel has helped drive development of it. If you're interested in a new ZoL feature, you can develop or fund it on your own schedule, then probably deploy it even on old Linux kernels such as those found in the various 'long term support' Linuxes (because ZoL as a whole is careful to support them). In FreeBSD and Illumos, I think you're much more locked to the overall kernel version and thus the OS version; if you fund a new ZFS feature, you could wind up needing to deploy a whole new OS release to get what you want. If nothing else, this is a longer deployment cycle.)
Given this risk, I definitely hope that people keep doing ZFS work (and funding it) outside of ZFS on Linux. But unfortunately I'm not sure how likely that is. It's quite possible that there are plenty of other people like us, making a pragmatic choice that ZoL is currently good enough and means we can use Linux instead of having to build out and maintain another OS.
(Illumos is not viable for us today for various reasons. FreeBSD might have been, but we didn't try to evaluate it since Linux worked well enough and we already run lots of other Linux machines.)
Some things on ZFS's per-pool performance statistics
Somewhat to my surprise, I recently found out that ZFS has had basic
per-pool activity and performance statistics for a while (they're
old enough that they're in our version of OmniOS, which is not
exactly current these days). On sufficiently modern versions of ZFS
(currently only the development version of ZFS on Linux), you can even get a small subset of these
per-pool stats for each separate dataset, which may be useful for
tracking activity. To be clear, these are not the stats that are
made visible through '
zpool iostat'; these are a separate set of
stats that are visible through (k)stats, and which can be picked
up and tracked by at least some performance metrics systems on some platforms.
(Specifically, I know that Prometheus's host agent can collect them from
ZFS on Linux. In theory you could add
collecting them on OmniOS to the agent, perhaps using my Go kstat
package, but someone would
have to do that programming work and I don't know if the Prometheus
people would accept it. I haven't looked at other metrics host
agents to see if they can collect this information on OmniOS or
other Illumos systems. I suspect that no host agent can collect the
more detailed '
zpool iostat' statistics because as far as I know
there's no documented API to obtain them programmatically.)
The best description of the available stats is in the kstat(3kstat) manpage, in the description of 'I/O Statistics' (for ZFS on Linux, you want to see here). Most of the stats are relatively obvious, but there's two important things to note for them. First, as far as I know (and can tell), these are for direct IO to disk, not for user level IO. This shows up in particular for writes if you have any level of vdev redundancy. Since we use mirrors, the amount written is basically twice the user level write rate; if a user process is writing at 100 MB/sec, we see 200 MB/sec of writes. This is honest but a little bit confusing for keeping track of user activity.
The second is that the rtime/wtime and rlentime/wlentime stats are not distinguishing between read and write IO, but between 'run' and 'wait' IO. This is best explained in a comment from the
Illumos kstat manpage:
A large number of I/O subsystems have at least two basic "lists" of transactions they manage: one for transactions that have been accepted for processing but for which processing has yet to begin, and one for transactions which are actively being processed (but not done). For this reason, two cumulative time statistics are defined here: pre-service (wait) time, and service (run) time.
I don't know enough about the ZFS IO queue code to be able to tell you if a ZFS IO being in the 'run' state only happens when it's actively submitted to the disk (or other device), or if ZFS has some internal division. The ZFS code does appear to consider IO 'active' at the same point as it makes it 'running', and based on things I've read about the ZIO scheduler I think this is probably at least close to 'it was issued to the device'.
(On Linux, 'issued to the device' really means 'put in the block IO system'. This may or may not result in it being immediately issued to the device, depending on various factors, including how much IO you're allowing ZFS to push to the device.)
If you have only moderate activity, it's routine to have little or no 'wait' (w) activity or time, with most of the overall time for request handling being in the 'run' time. You will see 'wait' time (and queue sizes) rise as your ZFS pool IO level rises, but before then you can have an interesting and pretty pattern where your average 'run' time is a couple of milliseconds or higher but your average 'wait' time is in the microseconds.
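To show how these stats get used, here's a sketch in Python of the classic iostat-style derivations from deltas of the kstat fields between two samples (the sample numbers are invented; all times are in nanoseconds):

```python
def io_summary(d_rtime, d_wtime, d_rlentime, d_wlentime, d_wall_ns):
    """Derive iostat-style numbers from deltas of a pool's kstat I/O
    statistics between two samples (all times in nanoseconds).
    """
    return {
        # Fraction of the interval when at least one IO was in the
        # 'run' or 'wait' state respectively.
        "run_util": d_rtime / d_wall_ns,
        "wait_util": d_wtime / d_wall_ns,
        # Average number of IOs sitting in the run and wait queues.
        "avg_run_queue": d_rlentime / d_wall_ns,
        "avg_wait_queue": d_wlentime / d_wall_ns,
    }

# Hypothetical ten-second interval: busy 40% of the time with an
# average of 1.2 IOs running, and almost no wait queue (the moderate
# activity pattern described above).
s = io_summary(4e9, 0.1e9, 12e9, 0.2e9, 10e9)
print(s["run_util"], s["avg_run_queue"])
```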
In terms of Linux disk IO stats, the
*time stats are the equivalent of the
use stat, and the
*lentime stats are the equivalent of the
aveq field. There
is no equivalent of the Linux
wuse fields, ie no
field that gives you the total time taken by all completed 'wait'
or 'run' IO. I think that there's ways to calculate much of the
same information you can get for Linux disk IO
from what ZFS (k)stats give you, but that's another entry.
For ZFS datasets, you currently get only reads, writes, nread, and
nwritten. For datasets, the writes appear to be
user-level writes, not low level disk IO writes, so they will track
closely with the amount of data written at the user level (or at
the level of an NFS server). As I write this here in early 2019,
these per-dataset stats aren't in any released version of even ZFS
on Linux, but I expect to see them start showing up in various
places (such as FreeBSD) before too many
years go by.
PS: I regret not knowing that these stats existed some time ago,
because I probably would have hacked together something to look at
them on our OmniOS machines, even though
we never used '
zpool iostat' very much for troubleshooting for
various reasons. In general, if you have multiple ZFS pools it's
always useful to be able to see what the active things are at the moment.
Some notes on ZFS prefetch related stats
For reasons beyond the scope of this entry, I've recently been looking at ARC stats, in part using a handy analysis program by Richard Elling. This has gotten me looking again at ZFS (k)stats related to prefetching, which I touched on before in my entry on some basic ZFS ARC statistics and prefetching. So here are some notes on what I think these mean or might mean.
To be able to successfully prefetch at all, ZFS needs to recognize and predict your access pattern. The extent to which it can do this is visible in the ZFS DMU zfetchstats kstats; zfetchstats:hits is the number of reads that matched a prediction stream, while zfetchstats:misses is the number of reads that did not match one. If zfetchstats:hits is low, there are two possible reasons; you could have a mostly random IO pattern, or you could have too many different sequential streams reading from the same file(s) at once. In theory there is a kstat that counts 'you had too many streams for this file and I couldn't create a new one', zfetchstats:max_streams. In practice this seems to be useless and you can't really tell these cases apart, because as far as I can tell even random access to files creates ZFS prefetch streams.
Every file can have at most
zfetch_max_streams streams (default
8), and even streams that have never matched any reads aren't removed
for zfetch_min_sec_reap seconds (default 2). So when you start
doing random reads to a new file, as far as I can tell your first
8 random reads will immediately create 8 DMU prefetch streams and
then every read after that will still try to create a new one but
fail because you've hit the maximum stream count for the file. Since
the streams are maxed out, each new random read will increment both
zfetchstats:misses (since it doesn't match any existing stream) and
zfetchstats:max_streams (since the file has 8 streams). Every
two seconds, your current streams expire and you get 8 new ones
from the next 8 random reads you do.
(This theory matches the numbers I see when I produce a flood of
random reads to a large file with
ioping. Our ZFS fileservers do
show a slowly growing difference between the two stats.)
As discussed in my previous entry, the ARC 'prefetch hits' statistics count only how many prefetch reads were found in the ARC instead of needing to be read from disk. A high prefetch ARC hit rate means that you're doing sequential reads of files that are already in the ARC and staying there (either because you've read them before or because you recently wrote them). A low prefetch ARC hit rate means that this isn't happening, but there are multiple reasons for this. Obviously, one cause is that your sequential re-reads are collectively too large for your ARC and so at least some of them are being evicted before you re-read them. Another cause is that you're mostly not re-reading things, at least not very soon; most of the time you read a file once and then move on.
If you know or believe that your workload should be in the ARC, a low ARC prefetch hit rate or more exactly a high ARC prefetch miss count is a sign that something is off, since it means that your prefetch reads are not finding things in the ARC that you expect to be there. A low ARC prefetch hit rate is not necessarily otherwise a problem.
I believe that there are situations where you will naturally get a
low ARC prefetch hit rate. For example, if you perform a full backup
of a number of ZFS filesystems with
tar, I would expect a lot of
ARC prefetch misses, since it's unlikely that you can fit all the
data from all of your filesystems into ARC. And this is in fact the
pattern we see on our ZFS fileservers during our Amanda backups.
On the other hand, you should see a lot of ARC demand data hits,
since prefetching itself should be very successful (and this is
also the pattern we see).
FreeBSD ZFS will be changing to be based on ZFS on Linux
The big news today is contained in an email thread called The future of ZFS in FreeBSD (via any number of places), where the crucial summary is:
[...] This state of affairs has led to a general agreement among the stakeholders that I have spoken to that it makes sense to rebase FreeBSD's ZFS on [ZFS on Linux]. [...]
This is not a development that I expected, to put it one way. Allan Jude had some commentary on OpenZFS in light of this that's worth reading for interested parties. There's also a discussion of this with useful information on lobste.rs, including this comment from Joshua M. Clulow mentioning that Illumos developers are likely to be working to upstream desirable ZFS on Linux changes.
(I had also not known that Delphix is apparently moving away from Illumos.)
I've used ZFS on Linux for a fairly long time now, and we settled on it for our next generation of fileservers, but all during that time I was under the vague impression that ZFS on Linux was the red-headed stepchild of ZFS implementations. Finding out that it's actually a significant source of active development is a bit of a surprise. On the one hand, it's a pleasant surprise, and it definitely helps me feel happy about our choice of ZFS on Linux. On the other hand it makes me feel a bit sad about Illumos, because upstream ZFS development in Illumos now appears to be yet another thing that is getting run over by the Linux juggernaut. Even for Linux people, that is not entirely good news.
(And, well, we're part of that Linux juggernaut, since we chose ZFS on Linux instead of OmniOS CE. For good reasons, sadly.)
On the other hand, if this leads to more feature commonality and features becoming available sooner in FreeBSD, Illumos, and ZFS on Linux, it will be a definite win. Even if it simply causes everyone to wind up with an up to date test suite with shared tests for common features that run regularly, that's probably a general win (and certainly a good thing).
(One thing that it may do in the short term is propagate FreeBSD's
support for ZFS sending
TRIM commands to disks into ZFS on Linux,
which would be useful for us in the long run.)
PS: The fundamental problem with ZFS on Linux remains that ZFS on Linux will almost certainly never be included in the Linux kernel. This means that it will always be a secondary filesystem on Linux. As a result I think it's not really a great thing if ZFS on Linux becomes the major source of ZFS development; the whole situation with Linux and ZFS is always going to be somewhat precarious, and things could always go wrong. It's simply healthier for ZFS overall if significant ZFS development continues on a Unix where ZFS is a first class citizen, ideally fully integrated into the kernel and fully supported.