Some basic ZFS ARC statistics and prefetching
I've recently been trying to understand some basic ZFS ARC statistics, partly because of our new shiny thing and partly because of a simple motivating question: how do you know how effective ZFS's prefetching is for your workload?
(Given that ZFS prefetching can still run away with useless IO, this is something that I definitely want to keep an eye on.)
If you look at arcstat's fields or at the raw ARC kstats, you'll very soon notice things like the 'prefetch hits percentage' arcstat field and a prefetch_data_hits kstat. Unfortunately these prefetch-related kstats will not give us what we want, because they mean something different than what you might expect.
The ARC divides up all incoming read requests into four different categories, based on the attributes of the read. First, the read can be for data or for metadata. Second, the read can be a generally synchronous demand read, where something actively needs the data, or it can be a generally asynchronous prefetch read, where ZFS is just reading some things to prefetch them. These prefetch reads can find what they're looking for in the ARC, and this is what the 'prefetch hits percentage' and so on mean. They're not how often the prefetched data was used for regular demand reads, they're how often an attempt to prefetch things found them already in the ARC instead of having to read them from disk.
(If you repeatedly sequentially re-read the same file and it fits into the ARC, ZFS will fire up its smart prefetching every time but every set of prefetching after the first will find that all the data is still in the ARC. That will give you prefetch hits (for both data and metadata), and then later demand hits for the same data as your program reaches that point in the file.)
All of this gives us four combinations of reads; demand data, demand metadata, prefetch data, and prefetch metadata. Some things can be calculated from this and from the related *_miss kstats. In no particular order:
- The ARC demand hit rate (for data and metadata together) is probably
the most important thing for whether the ARC is giving you good
results, although this partly depends on the absolute volume of
demand reads and a few other things. Demand misses mean that
programs are waiting on disk IO.
- The breakdown of *_miss kstats will tell you why ZFS is
reading things from disk. You would generally like this to
be prefetch reads instead of demand reads, because at least
things aren't waiting on prefetch reads.
- The combined hit and miss kstats for each of the four types (compared to the overall ARC hit and miss counts) will tell you what sorts of read IO ZFS is doing in general. Sometimes there may be surprises there, such as a surprisingly high level of metadata reads.
One limitation of all of these kstats is that they count read requests, not the amount of data being read. I believe that you can generally assume that data reads are for larger sizes than metadata reads, and prefetch data reads may be larger than regular data reads, but you don't know for sure.
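To make the arithmetic concrete, here's a small Python sketch of computing these rates. The kstat names are the arcstats ones, but the sample numbers are entirely invented for illustration:

```python
# Hypothetical snapshot of the ARC read kstats (names as in arcstats;
# the numbers here are made up for illustration).
arcstats = {
    "demand_data_hits": 90000, "demand_data_misses": 10000,
    "demand_metadata_hits": 45000, "demand_metadata_misses": 5000,
    "prefetch_data_hits": 20000, "prefetch_data_misses": 30000,
    "prefetch_metadata_hits": 4000, "prefetch_metadata_misses": 1000,
}

def hit_pct(hits, misses):
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

# The demand hit rate, data and metadata together.
demand_hits = arcstats["demand_data_hits"] + arcstats["demand_metadata_hits"]
demand_misses = arcstats["demand_data_misses"] + arcstats["demand_metadata_misses"]
print("demand hit%%: %.1f" % hit_pct(demand_hits, demand_misses))

# The miss breakdown: how much of the disk IO is prefetch vs demand reads.
miss_total = sum(v for k, v in arcstats.items() if k.endswith("_misses"))
prefetch_miss_share = 100.0 * (arcstats["prefetch_data_misses"]
                               + arcstats["prefetch_metadata_misses"]) / miss_total
print("share of disk reads that are prefetch: %.1f%%" % prefetch_miss_share)
```

In real use you'd read the live counters (for example via kstat on Illumos) twice and difference them, since the raw kstats are counts since boot.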
Unfortunately, none of this answers our question about the effectiveness of prefetching. Before we give up entirely, in modern versions of ZFS there are two additional kstats of interest:
- demand_hit_predictive_prefetch counts demand reads that found
data in the ARC from prefetch reads. This sounds exactly like what
we want, but experimentally it doesn't seem to come anywhere near
fully accounting for hits on prefetched data; I see low rates of
it when I am also seeing a 100% demand hit rate for sequentially
read data that was not previously in the ARC.
- sync_wait_for_async counts synchronous reads (usually or always demand reads) that found an asynchronous read in progress for the data they wanted. In some versions this may be called async_upgrade_sync instead. Experimentally, this count is also (too) low.
My ultimate conclusion is that there are two answers to my question about prefetching's effectiveness. If you want to know if prefetching is working to bring in data before you need it, you need to run your workload in a situation where it's not already in the ARC and watch the demand hit percent. If the demand hit percent is low and you're seeing a significant number of demand reads that go to disk, prefetching is not working. If the demand hit rate is high (especially if it is essentially 100%), prefetching is working even if you can't see exactly how in the kstats.
If you want to know if ZFS is over-prefetching and having to throw out prefetched data that has never been touched, unfortunately as far as I can see there is no kstat that will give us the answer. ZFS could keep a count of how many prefetched but never read buffers it has discarded, but currently it doesn't, and without that information we have no idea. Enterprising people can perhaps write DTrace scripts to extract this from the kernel internals, but otherwise the best we can do today is to measure this indirectly by observing the difference in read data rate between reads issued to the disks and reads returned to user level. If you see a major difference, and there is any significant level of prefetch disk reads, you have a relatively smoking gun.
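That indirect measurement can be sketched in a few lines. Both byte counts here are invented; in practice you'd take the disk figure from iostat-style statistics and the user-level figure from your workload's own accounting, over the same interval:

```python
# Bytes read over some measurement interval (invented figures).
disk_read_bytes = 50 * 1024**3   # reads issued to the disks
user_read_bytes = 30 * 1024**3   # reads returned to user level

# The excess is a rough upper bound on data read from disk (often by
# prefetch) that was never handed back to anyone.
excess_pct = 100.0 * (disk_read_bytes - user_read_bytes) / disk_read_bytes
print("disk reads never returned to user level: %.0f%%" % excess_pct)
```

If that percentage is large and the prefetch *_miss kstats show a significant level of prefetch disk reads over the same interval, that's the relatively smoking gun.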
If you want to see how well ZFS thinks it can predict your reads, you want to turn to the zfetchstats kstats, particularly zfetchstats:hits and zfetchstats:misses. These are kstats exposed by dmu_zfetch.c, the core DMU prefetcher. A zfetchstats 'hit' is a read that falls into one of the streams of reads that the DMU prefetcher was predicting, and it causes the DMU prefetcher to issue more prefetches for the stream. A 'miss' is a read that doesn't fall into any current stream, for whatever reason. Zfetchstat hits are a necessary prerequisite for prefetches but they don't guarantee that the prefetches are effective or guard against over-fetching.
One useful metric here is that the zfetchstat hit percentage is how sequential the DMU prefetcher thinks the overall IO pattern on the system is. If the hit percent is low, the DMU prefetcher thinks it has a lot of random or at least unpredictable IO on its hands, and it's certainly not trying to do much prefetching; if the hit percent is high, it's all predictable sequential IO or at least close enough to it for the prefetcher's purposes.
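A quick sketch of that percentage (invented numbers again):

```python
# Hypothetical zfetchstats snapshot; the hit percent is how predictable
# the DMU prefetcher currently finds the system's overall read pattern.
zfetchstats = {"hits": 800000, "misses": 200000}
zfetch_hit_pct = 100.0 * zfetchstats["hits"] / (zfetchstats["hits"]
                                                + zfetchstats["misses"])
print("zfetch hit%%: %.1f" % zfetch_hit_pct)
```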
(For more on ZFS prefetching, see here and the important update on the state of modern ZFS here. As far as I can tell, the prefetching code hasn't changed substantially since it was made significantly more straightforward in late 2015.)
Today I (re-)learned that top's output can be quietly system dependent
I'll start with a story that is the background. A few days ago I tweeted:
Current status: zfs send | zfs recv at 33 Mbytes/sec. This will take a while, and the server with SSDs and 10G networking is rather bored.
(It's not CPU-limited at either end and I don't think it's disk-limited. Maybe too many synchronous reads or something.)
Try adding '-c aes128-gcm@openssh.com' to your SSH invocation.
See also: <pdf link>
(If you care about 10G+ SSH, you want to read that PDF.)
This made a huge difference, giving
me basically 1G wire speeds for my ZFS transfers. But that difference
made me scratch my head, because why was switching SSH ciphers
making a difference when
ssh wasn't CPU-limited in the first
place? I came up with various theories and guesses, until today I
had a sudden terrible suspicion. The result of testing and confirming
that suspicion was another tweet:
Today I learned or re-learned a valuable lesson: in practice, top output is system dependent, in ways that are not necessarily obvious. For instance, CPU % on multi-CPU systems.
(On some systems, CPU % is the percent of a single CPU; on some it's a % of all CPUs.)
You see, the reason that I had confidently known that SSH wasn't CPU-limited on the sending machine, which was one of our OmniOS fileservers, is that I had run top and seen that the ssh process was only using 25% of the CPU. Case closed.
Except that OmniOS
top and Linux's
top report CPU usage percentages
differently. On Linux, CPU percentage is relative to a single CPU,
so 25% is a quarter of one CPU, 100% is all of it, and over 100%
is a multi-threaded program that is using up more than one CPU's
worth of CPU time. On OmniOS, the version of
top we're using comes
from pkgsrc (in what is by now a very
old version), and that version reports CPU percentage relative to
all CPUs in the machine. Our OmniOS fileservers are 4-CPU machines, so that '25% CPU' was actually 'all of one CPU'. In other words, I was completely wrong about the sending ssh not being CPU-limited. Since ssh was CPU-limited after all, it's suddenly no surprise that switching ciphers sped things up to basically wire speed.
(Years ago I established that the old SunSSH that OmniOS was using
back then was rather slow, but then later we
upgraded to OpenSSH and I sort of thought that
I could not worry about SSH speeds any more. Well, I was wrong. Of
course, nothing can beat not doing SSH at all but instead using a plain unencrypted transfer tool, ideally one that can also deliberately limit your transfer bandwidth to leave some room for things like NFS fileservice.)
PS: There are apparently more versions of top out there than you might think. On the FreeBSD
10.4 machine I have access to,
top reports CPU percentage in the
same way Linux does (100% is a single-threaded process using all
of one CPU). Although both the FreeBSD version and our OmniOS version
say they're the William LeFebvre implementation and have similar
version numbers, apparently they diverged significantly at some
point, probably when people had to start figuring out how to make
the original version of
top deal with multi-CPU machines.
Some views on more flexible (Illumos) kernel crash dumps
In my notes about Illumos kernel crash dumps, I mentioned that we've now turned them off on our OmniOS fileservers. One of the reasons for this is that we're running an unsupported version of OmniOS, including the kernel. But even if we were running the latest OmniOS CE and had commercial support, we'd do the same thing (at least by default, outside of special circumstances). The core problem is that our needs conflict with what Illumos crash dumps want to give us right now.
The current implementation of kernel crash dumps basically prioritizes
capturing complete information. There are various manifestations
of this in the implementation, starting with how it assumes that
if crash dumps are configured at all, you have set up enough disk
space to hold the full crash dump level you've set in
dumpadm, so it's sensible not to bother checking whether the dump will fit, and to treat failure to fit as an unusual situation that isn't worth doing much special about. Another manifestation is that there is no overall time limit on how long the crash dump will run, which is perfectly sensible if the most important thing is to capture the crash dump for diagnosis.
But, well, the most important thing is not always to capture complete diagnostic information. Sometimes you need to get things back into service before too long, so what you really want is to capture as much information as possible while still returning to service in a certain amount of time. Sometimes you only have so much disk space available for crash dumps, and you would like to capture whatever information can fit in that disk space, and if not everything fits it would be nice if the most important things were definitely captured.
All of this makes me wish that Illumos kernel crash dumps wrote certain critical information immediately, at the start of the crash dump, and then progressively extended the information in the crash dump until they either ran out of space or ran out of time. What do I consider critical? My first approximation would be, in order, the kernel panic, the recent kernel messages, the kernel stack of the panicking kernel process, and the kernel stacks of all processes. Probably you'd also want anything recent from the kernel fault manager.
The current Illumos crash dump code does have an order for what gets written out, and it does put some of this stuff into the dump header, but as far as I can tell the dump header only gets written at the end. It's possible that you could create a version of this incremental dump approach by simply writing out the incomplete dump header every so often (appropriately marked with how it's incomplete). There's also a 'dump summary' that gets written at the end that appears to contain a bunch of this information; perhaps a preliminary copy could be written at the start of the dump, then overwritten at the end if the dump is complete. Generally what seems to take all the time (and space) with our dumps is the main page writing stuff, not a bunch of preliminary stuff, so I think Illumos could definitely write at least one chunk of useful information before it bogs down. And if this needs extra space in the dump device, I would gladly sacrifice a few megabytes to have such useful information always present.
(It appears that the Illumos kernel already keeps a lot of ZFS data
memory out of kernel crash dumps, both for the ARC and for in-flight
ZFS IO, so I'm not sure what memory the kernel is spending all of
its time dumping in our case. Possibly we have a lot of ZFS metadata,
which apparently does go into crash dumps; see the comments about crash dumps in abd.c. For the 'dump summary', see the dump_messages functions in dumpsubr.c.)
PS: All of this is sort of wishing from the sidelines, since our future is not with Illumos.
Some notes about kernel crash dumps in Illumos
On our OmniOS servers, we should probably turn off writing kernel crash dumps on panics. It takes far too long, it usually doesn't succeed, and even if it did the information isn't useful to us in practice (we're using a very outdated version & we're frozen on it).
We're already only saving kernel pages, which is the minimum setting in dumpadm, but our fileservers still take at least an hour+ to write dumps. On a panic, we need them back in service in minutes (as few as possible).
The resulting Twitter discussion got me to take a look into the
current state of the code for this in Illumos, and I wound up
discovering some potentially interesting things. First off, dump
settings are not auto-loaded or auto-saved by the kernel in some
magical way; instead dumpadm saves all of your configuration in /etc/dumpadm.conf and then sets it during boot through svc:/system/dumpadm:default. The dumpadm manual page will tell you all of this if you read it carefully. Unfortunately, the -z argument to dumpadm is inadequately described in the manual page. The 'crash dump compression' it's talking about is whether savecore writes compressed crash dumps; it has nothing to do with how the kernel writes out the crash dump to your configured dump device. In fact,
dumpadm has no direct control over basically any of that process;
if you want to change things about the kernel dump process, you
need to set kernel variables through
/etc/system (or the equivalent).
The kernel writes crash dumps in multiple steps. If your console
shows the message '
dumping to <something>, offset NNN, contents:
<...>', then you've at least reached the start of writing out the
crash dump. If you see updates of the form '
dumping: MM:SS N%
done', the kernel has reached the main writeout loop and is writing
out pages of memory, perhaps excessively slowly. As far as I can
tell from the code, crash dumps don't abort when they run out of
space on the dump device; they keep processing things and just
throw all of the work away.
As it turns out, the kernel always compresses memory as it writes it out, although this is obscured by the current state of the code. The short version is that unless you set non-default system parameters that you probably don't want to, current Illumos systems will always do single threaded lzjb compression of memory (where the CPU that is writing out the crash dump also compresses the buffers before writing). Although you can change things to do dumps with multi-threaded compression using either lzjb or bzip2, you probably don't want to, because the multi-threaded code has been deliberately disabled and is going to be removed sometime. See Illumos issue 3314 and the related Illumos issue 1369.
(As a corollary of kernel panic dumps always compressing with at least lzjb, you probably should not have compression turned on on your dump zvol (which I believe is the default).)
I'm far from convinced that single threaded lzjb compression can reach and sustain the full write speed of our system SSDs on our relatively slow CPUs, especially during a crash dump (when I believe there's relatively little write buffering going on), although for obvious reasons it's hard to test. People with NVMe drives might have problems even with modern fast hardware.
If you examine the source of dumpsubr.c,
you'll discover a tempting variable
dump_timeout that's set to
120 (seconds) and described as 'timeout for dumping pages'. This
comment is a little bit misleading, as usual; what it really means
is 'timeout for dumping a single set of pages'. There is no limit
on how long the kernel is willing to keep writing out pages for,
provided that it makes enough progress within 120 seconds. In our
case this is unfortunate, since we'd be willing to spend a few
minutes to gather a bit of crash information but not anything like
what a kernel dump appears to take on our machines.
(The good news is that if you run out of space on your dump device,
the dump code is at least smart enough to not spend any more time
trying to compress pages; it just throws them away right away. You
might run out of space because you're taking a panic dump from a
ZFS fileserver with 128 GB of RAM and
putting it on an 8GB dump zvol that is part of a rpool that lives
on 80 GB SSDs, where a full-sized kernel dump almost certainly can't fit.)
PS: To see that the default is still a single-threaded crash dump,
you need to chase through the code to dumphdr.h
and the various
DUMP_PLAT_*_MINCPU definitions, all of which
are set to 0. Due to how the code is structured, this disables
multi-threaded dumps entirely.
Sidebar: The theoretical controls for multi-threaded dumps
If you set
dump_plat_mincpu to something above 0, then if you
have 'sufficiently more' CPUs than this, you will get parallel bzip2
compression; below that you will get parallel lzjb. Since parallel
compression is disabled by default in Illumos, this may or may not
actually still work, even if you don't run into any actual bugs of
the sort that caused it to be disabled in the first place. Note
that bzip2 is not fast.
The actual threshold of 'enough' depends on the claimed maximum
transfer size of your disks. For dumping to zvols, it appears that
this maximum transfer size is always 128 KB, which uses a code path
where the breakpoint between parallel lzjb and parallel bzip2 is just
dump_plat_mincpu; if you have that many CPUs or more, you get
bzip2. This implies that you may want to set
dump_plat_mincpu to a
nice high number so that you get parallel lzjb all the time.
You can sort of use zdb as a substitute for a proper ZFS fsck
One of the things said repeatedly and correctly about ZFS is that
it has no equivalent of
fsck. ZFS scrubs will check that all of
the blocks in your pool checksum correctly and that the pool's
metadata is generally intact, but that's it (as covered yesterday). A ZFS scrub will detect and repair damaged
on-disk data, but it will not do anything about mistakes and accidents
inside ZFS itself, including ACL attributes that are internally
inconsistent. These things are not
supposed to happen but they do, partly because the ZFS code has
bugs (see, for example, Illumos issue 9847).
If you look at a conventional filesystem's
fsck from the right
angle, it does two things; it finds corrupted portions of your
filesystem and tells you about them, then it fixes them for you as
much as it can or at least recovers as much data as possible. ZFS
doesn't have something that does the 'repair' portion of that and
probably never will, but it does have something that does at least
part of the first job, that of scanning your ZFS pool and finding
things wrong with how it is put together. That thing is ZDB.
ZDB started out life as a deliberately undocumented internal (Open)Solaris tool. Back in the days of Solaris 10, the only way to learn how to use it was to either run it to get vague help messages or read the source code (doing both was recommended). I'm not sure when it gained an actual manual page, but it was after we started using Solaris 10 on our first generation of ZFS NFS fileservers. The manpage has apparently been there for a while now; some spelunking suggests that it may have shown up in early 2012 through Illumos issue 2088. By itself that was welcome, because ZDB is really your only tool to introspect the details of any oddities in your ZFS pools.
However, these days it has become more than just an internal debugging
tool. As suggested by the second paragraph of Illumos issue 9847
and also the ZDB manpage itself, ZDB has become the place that
people put at least some (meta)data consistency checking for ZFS
pools. Right now this appears to just be looking for space leaks
under the right circumstances (as part of 'zdb -b' and company). However, in the future it's possible that ZDB may do more consistency checking if asked, because there's at least the camel's nose in the tent and ZDB is not a bad place for it.
When I started writing this entry I was optimistically hoping that I'd find various sorts of consistency checking in ZDB. Unfortunately I'm wrong, although I think that you could use ZDB with some add-on tooling to do things like verify that all directory entries in a filesystem referred to live dnodes (since I believe you can dump all dnodes in a ZFS filesystem, including showing the ZAPs for directories; then you could post-process the dump). Possibly the ZFS developers feel that additional offline tools are the best choice for various reasons.
PS: As far as I know ZDB can't be used to repair space leaks and the like, but if you use it to discover a big one at least you know it's time to back up the pool, destroy it, and start over from scratch.
PPS: I continue to strongly believe that ZFS should have something that at least scans your pool for all sorts of correctness and consistency issues, because things keep happening in ZFS code that result in damaged filesystems. But so far no one considers this a high enough priority to develop tools for it, and I suppose I can't blame them; the large system solution to 'my filesystem is corrupted' is 'restore from last night's backups'. Certainly it would be our solution here.
ZFS scrubs check (much) less than you probably think they do
Several years ago I wrote an entry on the limits of what ZFS scrubs check. In that entry I said:
The simple version of what a ZFS scrub does is that it verifies the checksum for every copy of every (active) block in the ZFS pool. It also explicitly verifies parity blocks for RAIDZ vdevs (which a normal error-free read does not). In the process of doing this verification, the scrub must walk the entire object tree of the pool from the top downwards, which has the side effect of more or less verifying this hierarchy; certainly if there's something like a directory entry that points to an invalid thing, you will get a checksum error somewhere in the process.
(The emphasis is new.)
As I wrote this and as people will read it, I am pretty sure that
this is incorrect, because at the time I did not understand how
ZFS filesystems and pools were really structured and how this made
ZFS scrubs fundamentally different from the way that fsck works.
The straightforward and ordinary way that
fsck programs are written
for conventional filesystems is that they start at the root directory
of the filesystem and follow everything down from there, eventually
looking at every live file and object. In the process they build
up a map of the disk blocks and inodes that are in use and free,
and how many links each inode is supposed to have, and so on, and
they can detect various sorts of inconsistencies in this data.
Because they walk through the entire filesystem directory tree,
they always notice if your directories are corrupt; reading through
your directories is how they figure out what to do next.
ZFS scrubs famously don't verify that various sorts of filesystem
metadata are correct; for example, the ZFS filesystem with bad ACLs
that I mentioned in this entry
passes pool scrubs. But until recently I thought that ZFS scrubs
still traversed your ZFS pool and filesystems in the same way that
fsck did, and in the process they more or less verified the
integrity of your ZFS filesystem directories for the same reason,
because that's how they knew what to visit next. If you had a
corrupt entry that pointed to nothing or to an unallocated dnode
or something, a scrub would either complain or panic (but at least you'd know).
But ZFS filesystems and ZFS pools are not really organized this way, as I found out when I actually did my research. Instead, each ZFS filesystem is in essence an object set of dnodes plus some extra information. Each dnode is self-contained; given only a block pointer to a dnode, you can completely verify the checksums of all of the dnode's data, without really having to know much about what that data actually means. This means that if all you care about is that the checksums of everything in a filesystem is correct, all you have to do is fetch the filesystem's object set and then verify the checksums of every allocated dnode in it. ZFS doesn't have to walk through the filesystem's directory tree to verify all of its checksums, and I am pretty sure that ZFS scrubs and resilvers don't bother to do so.
As a result, provided that all of the block checksums verify, ZFS scrubs are very likely to be splendidly indifferent to things like what is actually in your filesystem directories and what dnode object numbers your files claim to be and so on. Scrubs need to use and thus verify a bit of the dnode structure simply in order to find all of its data blocks through indirect blocks, but they don't need to even look at a lot of other things associated with dnodes (such as the structure of system attributes). It's possible that verifying the block checksums of filesystem directories requires some analysis of their general structure, but that general structure is generic.
(ZFS filesystem directories are ZAP objects, which are a generic ZFS mechanism used to store name/value pairs. You can read through all of the disk blocks of a ZAP object without knowing what the keys and their values mean or if they mean anything, although I think you'll basically verify that the actual hash table structure is correct.)
(What I wrote is potentially technically correct in that there are DSL (Dataset and Snapshot Layer) directories and so on, and scrubs may have to traverse through them to find the object sets of your filesystems (see the discussion in my broad overview of how ZFS is structured on disk). But I didn't even really understand those when I wrote my entry, and I was talking about ZFS filesystem directories.)
How you migrate ZFS filesystems matters
If you want to move a ZFS filesystem around from one host to another,
you have two general approaches; you can use '
zfs send' and '
receive', or you can use a user level copying tool such as rsync (or 'tar -cf - | tar -xf -', or any number of similar options). Until
recently, I had considered these two approaches to be more or less
equivalent apart from their convenience and speed (which generally
tilted in favour of '
zfs send'). It turns out that this is not
necessarily the case and there are situations where you will want
one instead of the other.
We have had two generations of ZFS fileservers so far, the Solaris
ones and the OmniOS ones.
When we moved from the first generation to the second generation,
we migrated filesystems across using '
zfs send', including the
filesystem with my home directory in it (we did this for various
reasons). Recently I discovered
that some old things in my filesystem didn't have file type
information in their directory entries. ZFS
has been adding file type information to directories for a long
time, but not quite as long as my home directory has been on ZFS.
This illustrates an important difference between the 'zfs send' approach and the rsync approach, which is that
zfs send doesn't
update or change at least some ZFS on-disk data structures, in
the way that re-writing them from scratch from user level does.
There are both positives and negatives to this, and a certain amount
of rewriting does happen even in the '
zfs send' case (for example,
all of the block pointers get changed, and ZFS
will re-compress your data as applicable).
I knew that in theory you had to copy things at the user level if
you wanted to make sure that your ZFS filesystem and everything in
it was fully up to date with the latest ZFS features. But I didn't
expect to hit a situation where it mattered in practice until, well,
I did. Now I suspect that old files on our old filesystems may be
partially missing a number of things, and I'm wondering how much
of the various changes in '
zfs upgrade -v' apply even to old data.
(I'd run into this sort of general thing before when I looked into ext3 to ext4 conversion on Linux.)
With all that said, I doubt this will change our plans for migrating our ZFS filesystems in the future (to our third generation fileservers). ZFS sending and receiving is just too convenient, too fast and too reliable to give up. Rsync isn't bad, but it's not the same, and so we only use it when we have to (when we're moving only some of the people in a filesystem instead of all of them, for example).
PS: I was going to try to say something about what '
zfs send' did
and didn't update, but having looked briefly at the code I've
concluded that I need to do more research before running my keyboard
off. In the mean time, you can read the OpenZFS wiki page on ZFS
send and receive,
which has plenty of juicy technical details.
PPS: Since eliminating all-zero blocks is a form of compression, you can turn zero-filled files into sparse files through a ZFS send/receive if the destination has compression enabled. As far as I know, genuine sparse files on the source will stay sparse through a ZFS send/receive even if they're sent to a destination with compression off.
ZFS quietly discards all-zero blocks, but only sometimes
On the ZFS on Linux mailing list, a question came up about whether
ZFS discards writes of all-zero blocks (as you'd get from 'dd if=/dev/zero of=...'), turning them into holes in your files or,
especially, holes in your zvols. This is especially relevant for
zvols, because if ZFS behaves this way it provides you with a way
of returning a zvol to a sparse state from inside a virtual machine
(or other environment using the zvol):
$ dd if=/dev/zero of=fillfile
[... wait for the disk to fill up ...]
$ rm -f fillfile
The answer turns out to be that ZFS does discard all-zero blocks
and turn them into holes, but only if you have some sort of compression
turned on (ie, that you don't have the default 'compression=off').
This isn't implemented as part of ZFS ZLE compression (or other
compression methods); instead, it's an entirely separate check that
looks only for an all-zero block and returns a special marker if
that's what it has. As you'd expect, this check is done before ZFS
tries whatever main compression algorithm you set.
Interestingly, there is a special compression level called 'empty'
ZIO_COMPRESS_EMPTY) that only does this special 'discard
zeros' check. You can't set it from user level with something like 'compression=empty', but it's used internally in the ZFS code for
a few things. For instance, if you turn off metadata compression with the zfs_mdcomp_disable tunable, metadata is still compressed
with this 'empty' compression. Comments in the current ZFS on Linux
source code suggest that ZFS relies on this to do things like discard
blocks in dnode object sets where all the
dnodes in the block are free (which apparently zeroes out the dnode).
There are two consequences of this. The first is that you should
always set at least ZLE compression on zvols, even if their
volblocksize is the same as your pool's
ashift block size and
so they can't otherwise benefit from compression (this would also apply to filesystems
if you set a matching recordsize). The second is that it
reinforces how you should basically always turn compression on on
filesystems, even if you think you have mostly incompressible data.
Not only do you save space at the end of files, but you get to drop any all-zero
sections of sparse or pseudo-sparse files.
I took a quick look back through the history of ZFS's code, and as
far as I could see, this zero-block discarding has always been
there, right back to the beginnings of compression (which I believe
came in with ZFS itself).
ZIO_COMPRESS_EMPTY doesn't quite date
back that far; instead, it was introduced along with
zfs_mdcomp_disable, back in 2006.
(All of this is thanks to Gordan Bobic for raising the question in reply to me when I was confidently wrong, which led to me actually looking it up in the code.)
A little bit of the one-time MacOS version still lingers in ZFS
Once upon a time, Apple came very close to releasing ZFS as part of MacOS. Apple did this work in its own copy of the ZFS source base (as far as I know), but the people in Sun knew about it and it turns out that even today there is one little lingering sign of this hoped-for and perhaps prepared-for ZFS port in the ZFS source code. Well, sort of, because it's not quite in code.
Lurking in the function that reads ZFS directories to turn (ZFS) directory entries into the filesystem independent format that the kernel wants is the following comment:
objnum = ZFS_DIRENT_OBJ(zap.za_first_integer);
/*
 * MacOS X can extract the object type here such as:
 * uint8_t type = ZFS_DIRENT_TYPE(zap.za_first_integer);
 */
(Specifically, this is in
zfs_readdir in zfs_vnops.c.)
ZFS maintains file type information in directories. This information can't be used on Solaris
(and thus Illumos), where the overall kernel doesn't have this in
its filesystem independent directory entry format, but it could
have been on MacOS ('Darwin'), because MacOS is among the Unixes
where directory entries report the file's type through d_type. The comment
itself dates all the way back to this 2007 commit,
which includes the change 'reserve bits in directory entry for file
type', which created the whole setup for this.
I don't know if this file type support was added specifically to help out Apple's MacOS X port of ZFS, but it's certainly possible, and in 2007 it seems likely that this port was at least on the minds of ZFS developers. It's interesting but understandable that FreeBSD didn't seem to have influenced them in the same way, at least as far as comments in the source code go; this file type support is equally useful for FreeBSD, and the FreeBSD ZFS port dates to 2007 too (per this announcement).
Regardless of the exact reason that ZFS picked up maintaining file type information in directory entries, it's quite useful for people on both FreeBSD and Linux that it does so. File type information is useful for any number of things and ZFS filesystems can (and do) provide this information on those Unixes, which helps make ZFS feel like a truly first class filesystem, one that supports all of the expected general system features.
How ZFS maintains file type information in directories
As an aside in yesterday's history of file type information being available in Unix directories, I mentioned that it was possible for a filesystem to support this even though its Unix didn't. By supporting it, I mean that the filesystem maintains this information in its on disk format for directories, even though the rest of the kernel will never ask for it. This is what ZFS does.
(One reason to do this in a filesystem is future-proofing it against a day when your Unix might decide to support this in general; another is if you ever might want the filesystem to be a first class filesystem in another Unix that does support this stuff. In ZFS's case, I suspect that the first motivation was larger than the second one.)
The easiest way to see that ZFS does this is to use
zdb to dump
a directory. I'm going to do this on an OmniOS machine, to make it
more convincing, and it turns out that this has some interesting
results. Since this is OmniOS, we don't have the convenience of
just naming a directory in
zdb, so let's find the root directory
of a filesystem, starting from dnode 1 (as seen before).
# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]
[...]
    microzap: 512 bytes, 4 entries
[...]
    ROOT = 3

# zdb -dddd fs3-corestaff-01/h/281 3
Object  lvl   iblk   dblk  dsize  lsize   %full  type
     3    1    16K     1K     8K     1K  100.00  ZFS directory
[...]
    microzap: 1024 bytes, 8 entries

    RESTORED = 4396504 (type: Directory)
    ckstst = 12017 (type: not specified)
    ckstst3 = 25069 (type: Directory)
    .demo-file = 5832188 (type: Regular File)
    .peergroup = 12590 (type: not specified)
    cks = 5 (type: not specified)
    cksimap1 = 5247832 (type: Directory)
    .diskuse = 12016 (type: not specified)
    ckstst2 = 12535 (type: not specified)
This is actually an old filesystem (it dates from Solaris 10 and
has been transferred around with '
zfs send | zfs recv' since then),
but various home directories for real and test users have been
created in it over time (you can probably guess which one is the
oldest one). Sufficiently old directories and files have no file
type information, but more recent ones do, including
.demo-file, which I made just now so this listing would have
an entry for a regular file with type information.
Once I dug into it, this turned out to be a change introduced (or
activated) in ZFS filesystem version 2, which is described in 'zfs
upgrade -v' as 'enhanced directory entries'. As an actual change
in (Open)Solaris, it dates from mid 2007, although I'm not sure
what Solaris release it made it into. The upshot is that if you
made your ZFS filesystem any time in the last decade, you'll have
this file type information in your directories.
How ZFS stores this file type information is interesting and clever,
especially when it comes to backwards compatibility. I'll start by
quoting the relevant comment from the ZFS code:
/*
 * The directory entry has the type (currently unused on
 * Solaris) in the top 4 bits, and the object number in
 * the low 48 bits.  The "middle" 12 bits are unused.
 */
In yesterday's entry I said that Unix directory entries need to store at least the filename and the inode number of the file. What ZFS is doing here is reusing the 64 bit field used for the 'inode' (the ZFS dnode number) to also store the file type, because it knows that object numbers have only a limited range. This also makes old directory entries compatible, by making type 0 (all 4 bits 0) mean 'not specified'. Since old directory entries only stored the object number and the object number is 48 bits or less, the higher bits are guaranteed to be all zero.
(It seems common to define
DT_UNKNOWN to be 0; both FreeBSD
and Linux do it.)
The reason this needed a new ZFS filesystem version is now clear. If you tried to read directory entries with file type information on a version of ZFS that didn't know about them, the old version would likely see crazy (and non-existent) object numbers and nothing would work. In order to even read a 'file type in directory entries' filesystem, you need to know to only look at the low 48 bits of the object number field in directory entries.
(As before, I consider this a neat hack that cleverly uses some properties of ZFS and the filesystem to its advantage.)