Wandering Thoughts


Some basic ZFS ARC statistics and prefetching

I've recently been trying to understand some basic ZFS ARC statistics, partly because of our new shiny thing and partly because of a simple motivating question: how do you know how effective ZFS's prefetching is for your workload?

(Given that ZFS prefetching can still run away with useless IO, this is something that I definitely want to keep an eye on.)

If you read the arcstat manpage or look at the raw ARC kstats, you'll very soon notice things like the 'prefetch hits percentage' arcstat field and a prefetch_data_hits kstat. Unfortunately these prefetch-related kstats will not give us what we want, because they mean something different than what you might expect.

The ARC divides up all incoming read requests into four different categories, based on the attributes of the read. First, the read can be for data or for metadata. Second, the read can be a generally synchronous demand read, where something actively needs the data, or it can be a generally asynchronous prefetch read, where ZFS is just reading some things to prefetch them. These prefetch reads can find what they're looking for in the ARC, and this is what the 'prefetch hits percentage' and so on mean. They're not how often the prefetched data was used for regular demand reads, they're how often an attempt to prefetch things found them already in the ARC instead of having to read them from disk.

(If you repeatedly sequentially re-read the same file and it fits into the ARC, ZFS will fire up its smart prefetching every time but every set of prefetching after the first will find that all the data is still in the ARC. That will give you prefetch hits (for both data and metadata), and then later demand hits for the same data as your program reaches that point in the file.)

All of this gives us four combinations of reads: demand data, demand metadata, prefetch data, and prefetch metadata. Some things can be calculated from this and from the related *_miss kstats. In no particular order:

  • The ARC demand hit rate (for data and metadata together) is probably the most important thing for whether the ARC is giving you good results, although this partly depends on the absolute volume of demand reads and a few other things. Demand misses mean that programs are waiting on disk IO.

  • The breakdown of *_miss kstats will tell you why ZFS is reading things from disk. You would generally like this to be prefetch reads instead of demand reads, because at least things aren't waiting on prefetch reads.

  • The combined hit and miss kstats for each of the four types (compared to the overall ARC hit and miss counts) will tell you what sorts of read IO ZFS is doing in general. Sometimes there may be surprises there, such as a surprisingly high level of metadata reads.
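As a rough illustration of the three calculations above, here is a minimal Python sketch. The kstat names are the ARC's own hit and miss counters for the four read types; the numbers in `sample` are made up, and on a real system you would have to fill the dictionary in yourself (for example from 'kstat -p zfs:0:arcstats' on Illumos or /proc/spl/kstat/zfs/arcstats on Linux). The helper names are mine, not anything from arcstat.

```python
# Hypothetical counter values for illustration only; real values would
# come from the ARC kstats on your system.
sample = {
    "demand_data_hits": 9000,
    "demand_data_misses": 500,
    "demand_metadata_hits": 4000,
    "demand_metadata_misses": 100,
    "prefetch_data_hits": 300,
    "prefetch_data_misses": 1200,
    "prefetch_metadata_hits": 50,
    "prefetch_metadata_misses": 200,
}

# The four categories of ARC reads.
KINDS = ("demand_data", "demand_metadata", "prefetch_data", "prefetch_metadata")


def demand_hit_rate(stats):
    """Fraction of demand reads (data plus metadata) satisfied from the ARC."""
    hits = stats["demand_data_hits"] + stats["demand_metadata_hits"]
    misses = stats["demand_data_misses"] + stats["demand_metadata_misses"]
    total = hits + misses
    return hits / total if total else 0.0


def miss_breakdown(stats):
    """Share of all ARC misses (ie, disk reads) due to each read category."""
    misses = {k: stats[k + "_misses"] for k in KINDS}
    total = sum(misses.values())
    return {k: (v / total if total else 0.0) for k, v in misses.items()}


def read_mix(stats):
    """Share of all ARC read requests (hits plus misses) in each category."""
    counts = {k: stats[k + "_hits"] + stats[k + "_misses"] for k in KINDS}
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


print(f"demand hit rate: {demand_hit_rate(sample):.1%}")
for kind, share in miss_breakdown(sample).items():
    print(f"{kind:18s} {share:.1%} of misses")
```

These are all ratios of request counts, so the limitation described next applies to every number this produces.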

One limitation of all of these kstats is that they count read requests, not the amount of data being read. I believe that you can generally assume that data reads are for larger sizes than metadata reads, and prefetch data reads may be larger than regular data reads, but you don't know for sure.

Unfortunately, none of this answers our question about the effectiveness of prefetching. Before we give up entirely, in modern versions of ZFS there are two additional kstats of interest:

  • demand_hit_predictive_prefetch counts demand reads that found data in the ARC from prefetch reads. This sounds exactly like what we want, but experimentally it doesn't seem to come anywhere near fully accounting for hits on prefetched data; I see low rates of it when I am also seeing a 100% demand hit rate for sequentially read data that was not previously in the ARC.

  • sync_wait_for_async counts synchronous reads (usually or always demand reads) that found an asynchronous read in progress for the data they wanted. In some versions this may be called async_upgrade_sync instead. Experimentally, this count is also (too) low.

My ultimate conclusion is that there are two answers to my question about prefetching's effectiveness. If you want to know if prefetching is working to bring in data before you need it, you need to run your workload in a situation where it's not already in the ARC and watch the demand hit percent. If the demand hit percent is low and you're seeing a significant number of demand reads that go to disk, prefetching is not working. If the demand hit rate is high (especially if it is essentially 100%), prefetching is working even if you can't see exactly how in the kstats.
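Since the ARC kstats are counters that only ever go up, watching the demand hit percent for a single run of your workload means taking a snapshot before and after and working from the differences. A minimal sketch of that, assuming you have already captured the two snapshots as dictionaries (the function name and the sample numbers are hypothetical):

```python
def demand_hit_percent(before, after):
    """Demand hit percentage for the interval between two kstat snapshots.

    Returns None if there were no demand reads at all in the interval.
    """
    def delta(name):
        return after[name] - before[name]

    hits = delta("demand_data_hits") + delta("demand_metadata_hits")
    misses = delta("demand_data_misses") + delta("demand_metadata_misses")
    total = hits + misses
    return 100.0 * hits / total if total else None


# Made-up snapshots taken around a sequential read of a file that was
# not previously in the ARC.
before = {"demand_data_hits": 1000, "demand_data_misses": 200,
          "demand_metadata_hits": 500, "demand_metadata_misses": 50}
after = {"demand_data_hits": 8900, "demand_data_misses": 250,
         "demand_metadata_hits": 550, "demand_metadata_misses": 50}

print(demand_hit_percent(before, after))
```

With a cold cache, a high number here is the indirect sign that prefetching got the data into the ARC before the program asked for it.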

If you want to know if ZFS is over-prefetching and having to throw out prefetched data that has never been touched, unfortunately as far as I can see there is no kstat that will give us the answer. ZFS could keep a count of how many prefetched but never read buffers it has discarded, but currently it doesn't, and without that information we have no idea. Enterprising people can perhaps write DTrace scripts to extract this from the kernel internals, but otherwise the best we can do today is to measure this indirectly by observing the difference in read data rate between reads issued to the disks and reads returned to user level. If you see a major difference, and there is any significant level of prefetch disk reads, you have a relatively smoking gun.
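The indirect check can be written down as a small heuristic. This is only a sketch of the reasoning above: the two thresholds are arbitrary numbers I picked for illustration, not anything ZFS defines, and you would have to gather the byte counts and miss counts yourself over the same time interval.

```python
def overprefetch_suspect(disk_read_bytes, user_read_bytes,
                         prefetch_misses, total_misses,
                         min_ratio=1.5, min_prefetch_share=0.2):
    """Heuristic: is much more being read from disk than reaches user level,
    at a time when prefetch reads are a significant share of disk reads?

    min_ratio and min_prefetch_share are assumed, illustrative thresholds.
    """
    if user_read_bytes <= 0 or total_misses <= 0:
        return False
    ratio = disk_read_bytes / user_read_bytes
    prefetch_share = prefetch_misses / total_misses
    return ratio >= min_ratio and prefetch_share >= min_prefetch_share


# Disks delivered three times what programs actually read, and most of
# the ARC misses were prefetch reads.
print(overprefetch_suspect(300 * 2**20, 100 * 2**20, 80, 100))
```

Both conditions matter; a large gap between disk reads and user level reads with almost no prefetch misses has to be coming from something else.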

If you want to see how well ZFS thinks it can predict your reads, you want to turn to the zfetchstats kstats, particularly zfetchstats:hits and zfetchstats:misses. These are kstats exposed by dmu_zfetch.c, the core DMU prefetcher. A zfetchstats 'hit' is a read that falls into one of the streams of reads that the DMU prefetcher was predicting, and it causes the DMU prefetcher to issue more prefetches for the stream. A 'miss' is a read that doesn't fall into any current stream, for whatever reason. Zfetchstat hits are a necessary prerequisite for prefetches but they don't guarantee that the prefetches are effective or guard against over-fetching.

One useful metric here is that the zfetchstat hit percentage is how sequential the DMU prefetcher thinks the overall IO pattern on the system is. If the hit percent is low, the DMU prefetcher thinks it has a lot of random or at least unpredictable IO on its hands, and it's certainly not trying to do much prefetching; if the hit percent is high, it's all predictable sequential IO or at least close enough to it for the prefetcher's purposes.
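Computing the zfetchstat hit percentage is simple once you have the two counters. The sketch below assumes a three-column 'name type data' text layout of the sort that ZFS on Linux exposes in /proc/spl/kstat/zfs/zfetchstats; the sample text and its numbers are invented, and on Illumos you would parse 'kstat' output instead.

```python
# Invented sample in an assumed 'name type data' layout.
SAMPLE = """\
name                            type data
hits                            4    180000
misses                          4    20000
"""


def parse_kstat_text(text):
    """Collect 'name type value' lines with a numeric value into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Header lines either have a different field count or a
        # non-numeric third field, so they are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def zfetch_hit_percent(stats):
    """How sequential the DMU prefetcher thinks the IO is, as a percentage."""
    total = stats["hits"] + stats["misses"]
    return 100.0 * stats["hits"] / total if total else None


print(zfetch_hit_percent(parse_kstat_text(SAMPLE)))
```

As with the ARC kstats, these are counters since boot, so for a specific workload you would want the difference between two samples rather than the raw totals.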

(For more on ZFS prefetching, see here and the important update on the state of modern ZFS here. As far as I can tell, the prefetching code hasn't changed substantially since it was made significantly more straightforward in late 2015.)

ZFSARCStatsAndPrefetch written at 23:17:12


Today I (re-)learned that top's output can be quietly system dependent

I'll start with a story that is the background. A few days ago I tweeted:

Current status: zfs send | zfs recv at 33 Mbytes/sec. This will take a while, and the server with SSDs and 10G networking is rather bored.

(It's not CPU-limited at either end and I don't think it's disk-limited. Maybe too many synchronous reads or something.)

I was wrong about this being disk-limited, as it turned out, and then Allan Jude had the winning suggestion:

Try adding '-c aes128-gcm@openssh.com' to your SSH invocation.

See also: <pdf link>

(If you care about 10G+ SSH, you want to read that PDF.)

This made a huge difference, giving me basically 1G wire speeds for my ZFS transfers. But that difference made me scratch my head, because why was switching SSH ciphers making a difference when ssh wasn't CPU-limited in the first place? I came up with various theories and guesses, until today I had a sudden terrible suspicion. The result of testing and confirming that suspicion was another tweet:

Today I learned or re-learned a valuable lesson: in practice, top output is system dependent, in ways that are not necessarily obvious. For instance, CPU % on multi-CPU systems.

(On some systems, CPU % is the percent of a single CPU; on some it's a % of all CPUs.)

You see, the reason that I had confidently known that SSH wasn't CPU-limited on the sending machine, which was one of our OmniOS fileservers, is that I had run top and seen that the ssh process was only using 25% of the CPU. Case closed.

Except that OmniOS top and Linux's top report CPU usage percentages differently. On Linux, CPU percentage is relative to a single CPU, so 25% is a quarter of one CPU, 100% is all of it, and over 100% is a multi-threaded program that is using up more than one CPU's worth of CPU time. On OmniOS, the version of top we're using comes from pkgsrc (in what is by now a very old version), and that version reports CPU percentage relative to all CPUs in the machine. Our OmniOS fileservers are 4-CPU machines, so that '25% CPU' was actually 'all of a single CPU'. In other words, I was completely wrong about the sending ssh not being CPU-limited. Since ssh was CPU limited after all, it's suddenly no surprise why switching ciphers sped things up to basically wire speed.
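The arithmetic for converting between the two conventions is trivial, which is part of why it is so easy to forget that you need to do it. A small sketch (the function names are made up for illustration):

```python
def as_single_cpu_percent(all_cpu_percent, ncpus):
    """Convert '% of all CPUs' (our OmniOS top) to '% of one CPU' (Linux top)."""
    return all_cpu_percent * ncpus


def as_all_cpu_percent(single_cpu_percent, ncpus):
    """Convert '% of one CPU' (Linux top) to '% of all CPUs' (our OmniOS top)."""
    return single_cpu_percent / ncpus


# The ssh process from the story: 25% on a 4-CPU OmniOS fileserver.
print(as_single_cpu_percent(25, 4))
```

Seen the Linux way, that ssh process was at 100% of a single CPU, which is as much as a single-threaded program can ever use.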

(Years ago I established that the old SunSSH that OmniOS was using back then was rather slow, but then later we upgraded to OpenSSH and I sort of thought that I could not worry about SSH speeds any more. Well, I was wrong. Of course, nothing can beat not doing SSH at all but instead using, say, mbuffer. Using mbuffer also means that you can deliberately limit your transfer bandwidth to leave some room for things like NFS fileservice.)

PS: There are apparently more versions than you might think. On the FreeBSD 10.4 machine I have access to, top reports CPU percentage in the same way Linux does (100% is a single-threaded process using all of one CPU). Although both the FreeBSD version and our OmniOS version say they're the William LeFebvre implementation and have similar version numbers, apparently they diverged significantly at some point, probably when people had to start figuring out how to make the original version of top deal with multi-CPU machines.

TopCPUPercentDifference written at 23:01:36


Some views on more flexible (Illumos) kernel crash dumps

In my notes about Illumos kernel crash dumps, I mentioned that we've now turned them off on our OmniOS fileservers. One of the reasons for this is that we're running an unsupported version of OmniOS, including the kernel. But even if we were running the latest OmniOS CE and had commercial support, we'd do the same thing (at least by default, outside of special circumstances). The core problem is that our needs conflict with what Illumos crash dumps want to give us right now.

The current implementation of kernel crash dumps basically prioritizes capturing complete information. There are various manifestations of this in the implementation, starting with how it assumes that if crash dumps are configured at all, you have set up enough disk space to hold the full crash dump level you've set in dumpadm, so it's sensible to not bother checking if the dump will fit and treating failure to fit as an unusual situation that is not worth doing much special about. Another one is the missing feature that there is no overall time limit on how long the crash dump will run, which is perfectly sensible if the most important thing is to capture the crash dump for diagnosis.

But, well, the most important thing is not always to capture complete diagnostic information. Sometimes you need to get things back into service before too long, so what you really want is to capture as much information as possible while still returning to service in a certain amount of time. Sometimes you only have so much disk space available for crash dumps, and you would like to capture whatever information can fit in that disk space, and if not everything fits it would be nice if the most important things were definitely captured.

All of this makes me wish that Illumos kernel crash dumps wrote certain critical information immediately, at the start of the crash dump, and then progressively extended the information in the crash dump until they either ran out of space or ran out of time. What do I consider critical? My first approximation would be, in order, the kernel panic, the recent kernel messages, the kernel stack of the panicking kernel process, and the kernel stacks of all processes. Probably you'd also want anything recent from the kernel fault manager.
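To make the wish concrete, here is a toy Python model of 'write the most important things first, stop when out of space or time'. Nothing here corresponds to real Illumos code; the section names follow the priority order above, and the sizes, budgets, and write rate are invented purely for illustration.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def plan_dump(sections, space_limit, time_limit, write_rate):
    """Return the names of the sections that get fully written.

    sections is a list of (name, size_in_bytes) in priority order;
    write_rate is in bytes per second. Writing stops at the first
    section that would exceed either the space or the time budget.
    """
    written = []
    used = 0
    elapsed = 0.0
    for name, size in sections:
        cost = size / write_rate
        if used + size > space_limit or elapsed + cost > time_limit:
            break
        used += size
        elapsed += cost
        written.append(name)
    return written


# Hypothetical sizes; only the ordering comes from the text.
sections = [
    ("panic message", 4 * KIB),
    ("recent kernel messages", 256 * KIB),
    ("panicking thread stack", 64 * KIB),
    ("all kernel stacks", 64 * MIB),
    ("kernel pages", 40 * GIB),
]

print(plan_dump(sections, space_limit=8 * GIB, time_limit=180,
                write_rate=100 * MIB))
```

In this made-up case the small, high-value sections all land on disk within the first second, and only the bulk page dump is lost when it doesn't fit.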

The current Illumos crash dump code does have an order for what gets written out, and it does put some of this stuff into the dump header, but as far as I can tell the dump header only gets written at the end. It's possible that you could create a version of this incremental dump approach by simply writing out the incomplete dump header every so often (appropriately marked with how it's incomplete). There's also a 'dump summary' that gets written at the end that appears to contain a bunch of this information; perhaps a preliminary copy could be written at the start of the dump, then overwritten at the end if the dump is complete. Generally what seems to take all the time (and space) with our dumps is the main page writing stuff, not a bunch of preliminary stuff, so I think Illumos could definitely write at least one chunk of useful information before it bogs down. And if this needs extra space in the dump device, I would gladly sacrifice a few megabytes to have such useful information always present.

(It appears that the Illumos kernel already keeps a lot of ZFS data memory out of kernel crash dumps, both for the ARC and for in-flight ZFS IO, so I'm not sure what memory the kernel is spending all of its time dumping in our case. Possibly we have a lot of ZFS metadata, which apparently does go into crash dumps. See the comments about crash dumps in abd.c and zio.c. For the 'dump summary', see the dump_summary, dump_ereports, and dump_messages functions in dumpsubr.c.)

PS: All of this is sort of wishing from the sidelines, since our future is not with Illumos.

ImprovingCrashDumps written at 23:46:22


Some notes about kernel crash dumps in Illumos

I tweeted:

On our OmniOS servers, we should probably turn off writing kernel crash dumps on panics. It takes far too long, it usually doesn't succeed, and even if it did the information isn't useful to us in practice (we're using a very outdated version & we're frozen on it).

We're already only saving kernel pages, which is the minimum setting in dumpadm, but our fileservers still take at least an hour+ to write dumps. On a panic, we need them back in service in minutes (as few as possible).

The resulting Twitter discussion got me to take a look into the current state of the code for this in Illumos, and I wound up discovering some potentially interesting things. First off, dump settings are not auto-loaded or auto-saved by the kernel in some magical way; instead dumpadm saves all of your configuration settings in /etc/dumpadm.conf and then sets them during boot through svc:/system/dumpadm:default. The dumpadm manual page will tell you all of this if you read its description of the -u argument.

Next, the -z argument to dumpadm is inadequately described in the manual page. The 'crash dump compression' it's talking about is whether savecore will write compressed dumps; it has nothing to do with how the kernel writes out the crash dump to your configured crash device. In fact, dumpadm has no direct control over basically any of that process; if you want to change things about the kernel dump process, you need to set kernel variables through /etc/system (or 'mdb -k').

The kernel writes crash dumps in multiple steps. If your console shows the message 'dumping to <something>, offset NNN, contents: <...>', then you've at least reached the start of writing out the crash dump. If you see updates of the form 'dumping: MM:SS N% done', the kernel has reached the main writeout loop and is writing out pages of memory, perhaps excessively slowly. As far as I can tell from the code, crash dumps don't abort when they run out of space on the dump device; they keep processing things and just throw all of the work away.

As it turns out, the kernel always compresses memory as it writes it out, although this is obscured by the current state of the code. The short version is that unless you set non-default system parameters that you probably don't want to, current Illumos systems will always do single threaded lzjb compression of memory (where the CPU that is writing out the crash dump also compresses the buffers before writing). Although you can change things to do dumps with multi-threaded compression using either lzjb or bzip2, you probably don't want to, because the multi-threaded code has been deliberately disabled and is going to be removed sometime. See Illumos issue 3314 and the related Illumos issue 1369.

(As a corollary of kernel panic dumps always compressing with at least lzjb, you probably should not have compression turned on on your dump zvol (which I believe is the default).)

I'm far from convinced that single threaded lzjb compression can reach and sustain the full write speed of our system SSDs on our relatively slow CPUs, especially during a crash dump (when I believe there's relatively little write buffering going on), although for obvious reasons it's hard to test. People with NVMe drives might have problems even with modern fast hardware.

If you examine the source of dumpsubr.c, you'll discover a tempting variable dump_timeout that's set to 120 (seconds) and described as 'timeout for dumping pages'. This comment is a little bit misleading, as usual; what it really means is 'timeout for dumping a single set of pages'. There is no limit on how long the kernel is willing to keep writing out pages for, provided that it makes enough progress within 120 seconds. In our case this is unfortunate, since we'd be willing to spend a few minutes to gather a bit of crash information but not anything like what a kernel dump appears to take on our machines.
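The difference between a per-set timeout and an overall time limit is easy to see in a toy simulation. This is my simplified model of the behavior described above, not a transcription of dumpsubr.c; the only number taken from the source is the 120 second default, and the per-set write times are invented.

```python
def simulate_dump(set_seconds, per_set_timeout=120):
    """Model a dump where only each individual set of pages is timed.

    set_seconds lists how long each set of pages takes to write.
    Returns (completed, total_elapsed_seconds).
    """
    elapsed = 0
    for secs in set_seconds:
        if secs > per_set_timeout:
            # This set did not make progress in time; the dump gives up.
            return False, elapsed + per_set_timeout
        elapsed += secs
    return True, elapsed


# Sixty slow sets that each squeak in under the timeout.
print(simulate_dump([100] * 60))
```

In this model a dump whose every set takes 100 seconds never trips the timeout, yet runs for 6000 seconds (100 minutes) in total.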

(The good news is that if you run out of space on your dump device, the dump code is at least smart enough to not spend any more time trying to compress pages; it just throws them away right away. You might run out of space because you're taking a panic dump from a ZFS fileserver with 128 GB of RAM and putting it on an 8GB dump zvol that is part of a rpool that lives on 80 GB SSDs, where a full-sized kernel dump almost certainly can't even be saved by savecore.)

PS: To see that the default is still a single-threaded crash dump, you need to chase through the code to dumphdr.h and the various DUMP_PLAT_*_MINCPU definitions, all of which are set to 0. Due to how the code is structured, this disables multi-threaded dumps entirely.

Sidebar: The theoretical controls for multi-threaded dumps

If you set dump_plat_mincpu to something above 0, then if you have 'sufficiently more' CPUs than this, you will get parallel bzip2 compression; below that you will get parallel lzjb. Since parallel compression is disabled by default in Illumos, this may or may not actually still work, even if you don't run into any actual bugs of the sort that caused it to be disabled in the first place. Note that bzip2 is not fast.

The actual threshold of 'enough' depends on the claimed maximum transfer size of your disks. For dumping to zvols, it appears that this maximum transfer size is always 128 KB, which uses a code path where the breakpoint between parallel lzjb and parallel bzip2 is just dump_plat_mincpu; if you have that many CPUs or more, you get bzip2. This implies that you may want to set dump_plat_mincpu to a nice high number so that you get parallel lzjb all the time.
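Put together, my reading of the selection logic for dumps to zvols looks like the following sketch. It encodes only the description in this entry (a setting of 0 disables multi-threaded dumps, and otherwise dump_plat_mincpu is the breakpoint between parallel lzjb and parallel bzip2); it is not derived line by line from the kernel source, and the function itself is hypothetical.

```python
def zvol_dump_compression(ncpus, dump_plat_mincpu=0):
    """Which compression a crash dump to a zvol would use, as described above."""
    if dump_plat_mincpu <= 0:
        # The default: multi-threaded dumps are disabled entirely.
        return "single-threaded lzjb"
    if ncpus >= dump_plat_mincpu:
        return "parallel bzip2"
    return "parallel lzjb"


for mincpu in (0, 4, 1024):
    print(mincpu, zvol_dump_compression(ncpus=8, dump_plat_mincpu=mincpu))
```

This is why a deliberately huge dump_plat_mincpu is the setting that would give you parallel lzjb on any real machine.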

IllumosCrashDumpNotes written at 01:37:56


You can sort of use zdb as a substitute for a proper ZFS fsck

One of the things said repeatedly and correctly about ZFS is that it has no equivalent of fsck. ZFS scrubs will check that all of the blocks in your pool checksum correctly and that the pool's metadata is generally intact, but that's it (as covered yesterday). A ZFS scrub will detect and repair damaged on-disk data, but it will not do anything about mistakes and accidents inside ZFS itself, including ACL attributes that are internally inconsistent. These things are not supposed to happen but they do, partly because the ZFS code has bugs (see, for example, Illumos issue 9847).

If you look at a conventional filesystem's fsck from the right angle, it does two things; it finds corrupted portions of your filesystem and tells you about them, then it fixes them for you as much as it can or at least recovers as much data as possible. ZFS doesn't have something that does the 'repair' portion of that and probably never will, but it does have something that does at least part of the first job, that of scanning your ZFS pool and finding things wrong with how it is put together. That thing is ZDB.

ZDB started out life as a deliberately undocumented internal (Open)Solaris tool. Back in the days of Solaris 10, the only way to learn how to use it was to either run it to get vague help messages or read the source code (doing both was recommended). I'm not sure exactly when it gained an actual manual page, other than that it was after we started using Solaris 10 on our first generation of ZFS NFS fileservers. The manpage has apparently been there for a while now; some spelunking suggests that it may have shown up in early 2012 through Illumos issue 2088. By itself that was welcome, because ZDB is really your only tool to introspect the details of any oddities in your ZFS pools.

However, these days it has become more than just an internal debugging tool. As suggested by the second paragraph of Illumos issue 9847 and also the ZDB manpage itself, ZDB has become the place that people put at least some (meta)data consistency checking for ZFS pools. Right now this appears to just be looking for space leaks under the right circumstances (as part of 'zdb -b' or 'zdb -c'). However in the future it's possible that ZDB may do more consistency checking if asked, because there's at least the camel's nose in the tent and ZDB is not a bad place for it.

When I started writing this entry I was optimistically hoping that I'd find various sorts of consistency checking in ZDB. Unfortunately I was wrong, although I think that you could use ZDB with some add-on tooling to do things like verify that all directory entries in a filesystem referred to live dnodes (since I believe you can dump all dnodes in a ZFS filesystem, including showing the ZAPs for directories; then you could post-process the dump). Possibly the ZFS developers feel that additional offline tools are the best choice for various reasons.
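The post-processing half of such add-on tooling could be quite small. The sketch below assumes that you have already turned a ZDB dump into two plain Python structures, the set of allocated object numbers and the name to object number entries of each directory; the actual parsing of ZDB's output is not attempted here, and all of the names and sample data are hypothetical.

```python
def dangling_entries(allocated, directories):
    """Find directory entries that point at unallocated object numbers.

    allocated is a set of live dnode object numbers; directories maps a
    directory's object number to a {name: target_object_number} dict.
    Returns a sorted list of (directory_object, name, target_object).
    """
    problems = []
    for dirobj, entries in sorted(directories.items()):
        for name, target in sorted(entries.items()):
            if target not in allocated:
                problems.append((dirobj, name, target))
    return problems


# Invented example data standing in for a parsed dump.
allocated = {4, 7, 8, 9}
directories = {
    4: {"home": 7, "stale": 31},
    7: {"notes.txt": 8, "todo.txt": 9},
}

print(dangling_entries(allocated, directories))
```

A real tool would also want to go the other way and look for allocated dnodes that no directory refers to.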

PS: As far as I know ZDB can't be used to repair space leaks and the like, but if you use it to discover a big one at least you know it's time to back up the pool, destroy it, and start over from scratch.

PPS: I continue to strongly believe that ZFS should have something that at least scans your pool for all sorts of correctness and consistency issues, because things keep happening in ZFS code that result in damaged filesystems. But so far no one considers this a high enough priority to develop tools for it, and I suppose I can't blame them; the large system solution to 'my filesystem is corrupted' is 'restore from last night's backups'. Certainly it would be our solution here.

ZFSZdbAsFsck written at 23:14:28


ZFS scrubs check (much) less than you probably think they do

Several years ago I wrote an entry on the limits of what ZFS scrubs check. In that entry I said:

The simple version of what a ZFS scrub does is that it verifies the checksum for every copy of every (active) block in the ZFS pool. It also explicitly verifies parity blocks for RAIDZ vdevs (which a normal error-free read does not). In the process of doing this verification, the scrub must walk the entire object tree of the pool from the top downwards, which has the side effect of more or less verifying this hierarchy; certainly if there's something like a directory entry that points to an invalid thing, you will get a checksum error somewhere in the process.

(The emphasis is new.)

As I wrote this and as people will read it, I am pretty sure that this is incorrect, because at the time I did not understand how ZFS filesystems and pools were really structured and how this made ZFS scrubs fundamentally different from the way that fsck usually works.

The straightforward and ordinary way that fsck programs are written for conventional filesystems is that they start at the root directory of the filesystem and follow everything down from there, eventually looking at every live file and object. In the process they build up a map of the disk blocks and inodes that are in use and free, and how many links each inode is supposed to have, and so on, and they can detect various sorts of inconsistencies in this data. Because they walk through the entire filesystem directory tree, they always notice if your directories are corrupt; reading through your directories is how they figure out what to do next.

ZFS scrubs famously don't verify that various sorts of filesystem metadata are correct; for example, the ZFS filesystem with bad ACLs that I mentioned in this entry passes pool scrubs. But until recently I thought that ZFS scrubs still traversed your ZFS pool and filesystems in the same way that fsck did, and in the process they more or less verified the integrity of your ZFS filesystem directories for the same reason, because that's how they knew what to visit next. If you had a corrupt entry that pointed to nothing or to an unallocated dnode or something, a scrub would either complain or panic (but at least you'd know).

But ZFS filesystems and ZFS pools are not really organized this way, as I found out when I actually did my research. Instead, each ZFS filesystem is in essence an object set of dnodes plus some extra information. Each dnode is self-contained; given only a block pointer to a dnode, you can completely verify the checksums of all of the dnode's data, without really having to know much about what that data actually means. This means that if all you care about is that the checksums of everything in a filesystem are correct, all you have to do is fetch the filesystem's object set and then verify the checksums of every allocated dnode in it. ZFS doesn't have to walk through the filesystem's directory tree to verify all of its checksums, and I am pretty sure that ZFS scrubs and resilvers don't bother to do so.

As a result, provided that all of the block checksums verify, ZFS scrubs are very likely to be splendidly indifferent to things like what is actually in your filesystem directories and what dnode object numbers your files claim to be and so on. Scrubs need to use and thus verify a bit of the dnode structure simply in order to find all of its data blocks through indirect blocks, but they don't need to even look at a lot of other things associated with dnodes (such as the structure of system attributes). It's possible that verifying the block checksums of filesystem directories requires some analysis of their general structure, but that general structure is generic.
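A toy model may make the difference clearer. In this Python sketch an 'object set' is just a dictionary of numbered objects, each with some data and a checksum; zlib.crc32 stands in for real ZFS checksums and the 'name=number' directory format is invented. It is in no way how ZFS actually stores things, but it shows why checking every object's checksum says nothing about what directory entries point to.

```python
import zlib


def dnode(kind, data):
    """A toy object: its type, its data, and a checksum of that data."""
    return {"kind": kind, "data": data, "checksum": zlib.crc32(data)}


# Object 1 is the root directory. Its 'ghost' entry points to an object
# that does not exist, but the directory's own data checksums fine.
objset = {
    1: dnode("dir", b"notes.txt=2\nghost=99\n"),
    2: dnode("file", b"hello\n"),
}


def scrub(objset):
    """Scrub-like check: verify each object's checksum, ignoring meaning."""
    return [num for num, d in sorted(objset.items())
            if zlib.crc32(d["data"]) != d["checksum"]]


def walk(objset, root=1):
    """fsck-like check: follow directory entries down from the root."""
    problems, todo, seen = [], [root], set()
    while todo:
        num = todo.pop()
        if num in seen:
            continue
        seen.add(num)
        d = objset[num]
        if d["kind"] != "dir":
            continue
        for line in d["data"].decode().splitlines():
            name, target = line.split("=")
            if int(target) not in objset:
                problems.append((num, name, int(target)))
            else:
                todo.append(int(target))
    return problems


print("scrub problems:", scrub(objset))
print("walk problems: ", walk(objset))
```

The scrub-style pass reports nothing wrong, while the directory walk trips over the entry that points to a nonexistent object.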

(ZFS filesystem directories are ZAP objects, which are a generic ZFS thing used to store name/value pairs. You can read through all of the disk blocks of a ZAP object without knowing what the keys and their values mean or if they mean anything, although I think you'll basically verify that the actual hash table structure is correct.)

(What I wrote is potentially technically correct in that there are DSL (Dataset and Snapshot Layer) directories and so on, and scrubs may have to traverse through them to find the object sets of your filesystems (see the discussion in my broad overview of how ZFS is structured on disk). But I didn't even really understand those when I wrote my entry, and I was talking about ZFS filesystem directories.)

ZFSScrubLimitsII written at 23:47:52


How you migrate ZFS filesystems matters

If you want to move a ZFS filesystem around from one host to another, you have two general approaches; you can use 'zfs send' and 'zfs receive', or you can use a user level copying tool such as rsync (or 'tar -cf | tar -xf', or any number of similar options). Until recently, I had considered these two approaches to be more or less equivalent apart from their convenience and speed (which generally tilted in favour of 'zfs send'). It turns out that this is not necessarily the case and there are situations where you will want one instead of the other.

We have had two generations of ZFS fileservers so far, the Solaris ones and the OmniOS ones. When we moved from the first generation to the second generation, we migrated filesystems across using 'zfs send', including the filesystem with my home directory in it (we did this for various reasons). Recently I discovered that some old things in my filesystem didn't have file type information in their directory entries. ZFS has been adding file type information to directories for a long time, but not quite as long as my home directory has been on ZFS.

This illustrates an important difference between the 'zfs send' approach and the rsync approach, which is that zfs send doesn't update or change at least some ZFS on-disk data structures, in the way that re-writing them from scratch from user level does. There are both positives and negatives to this, and a certain amount of rewriting does happen even in the 'zfs send' case (for example, all of the block pointers get changed, and ZFS will re-compress your data as applicable).

I knew that in theory you had to copy things at the user level if you wanted to make sure that your ZFS filesystem and everything in it was fully up to date with the latest ZFS features. But I didn't expect to hit a situation where it mattered in practice until, well, I did. Now I suspect that old files on our old filesystems may be partially missing a number of things, and I'm wondering how much of the various changes in 'zfs upgrade -v' apply even to old data.

(I'd run into this sort of general thing before when I looked into ext3 to ext4 conversion on Linux.)

With all that said, I doubt this will change our plans for migrating our ZFS filesystems in the future (to our third generation fileservers). ZFS sending and receiving is just too convenient, too fast and too reliable to give up. Rsync isn't bad, but it's not the same, and so we only use it when we have to (when we're moving only some of the people in a filesystem instead of all of them, for example).

PS: I was going to try to say something about what 'zfs send' did and didn't update, but having looked briefly at the code I've concluded that I need to do more research before running my keyboard off. In the meantime, you can read the OpenZFS wiki page on ZFS send and receive, which has plenty of juicy technical details.

PPS: Since eliminating all-zero blocks is a form of compression, you can turn zero-filled files into sparse files through a ZFS send/receive if the destination has compression enabled. As far as I know, genuine sparse files on the source will stay sparse through a ZFS send/receive even if they're sent to a destination with compression off.

ZFSSendRecvVsRsync written at 00:17:58


ZFS quietly discards all-zero blocks, but only sometimes

On the ZFS on Linux mailing list, a question came up about whether ZFS discards writes of all-zero blocks (as you'd get from 'dd if=/dev/zero of=...'), turning them into holes in your files or, especially, holes in your zvols. This is especially relevant for zvols, because if ZFS behaves this way it provides you with a way of returning a zvol to a sparse state from inside a virtual machine (or other environment using the zvol):

$ dd if=/dev/zero of=fillfile
[... wait for the disk to fill up ...]
$ rm -f fillfile

The answer turns out to be that ZFS does discard all-zero blocks and turn them into holes, but only if you have some sort of compression turned on (ie, you don't have the default 'compression=off'). This isn't implemented as part of ZFS ZLE compression (or other compression methods); instead, it's an entirely separate check that looks only for an all-zero block and returns a special marker if that's what it has. As you'd expect, this check is done before ZFS tries whatever main compression algorithm you set.
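The order of operations can be sketched in a few lines of Python. This is a simplified model, not the ZFS code: zlib stands in for whatever compression algorithm is actually set, the return values are invented labels, and the rule for when a compressed block is considered worth keeping is reduced to 'it got smaller'.

```python
import zlib


def write_block(data, compression="off"):
    """Model how a block is stored, returning a (kind, payload) tuple."""
    if compression == "off":
        # With compression off there is no zero check at all.
        return ("raw", data)
    if not any(data):
        # The separate all-zero check, done before the main algorithm.
        return ("hole", b"")
    packed = zlib.compress(data)  # stand-in for the configured algorithm
    if len(packed) < len(data):
        return ("compressed", packed)
    return ("raw", data)


zeros = bytes(4096)
print(write_block(zeros, "off")[0])
print(write_block(zeros, "on")[0])
print(write_block(b"abc" * 1000, "on")[0])
```

With compression off the block of zeros is written out in full; with any compression on it never reaches the compressor and becomes a hole.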

Interestingly, there is a special compression level called 'empty' (ZIO_COMPRESS_EMPTY) that only does this special 'discard zeros' check. You can't set it from user level with something like 'compression=empty', but it's used internally in the ZFS code for a few things. For instance, if you turn off metadata compression with the zfs_mdcomp_disable tunable, metadata is still compressed with this 'empty' compression. Comments in the current ZFS on Linux source code suggest that ZFS relies on this to do things like discard blocks in dnode object sets where all the dnodes in the block are free (which apparently zeroes out the dnode).
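
The ordering of these checks can be sketched in a few lines of Python. This is only an illustration, not the ZFS code: the function name write_block and the HOLE marker are made up for the example, zlib stands in for whatever compression algorithm is set, and the 'keep the compressed form only if it is smaller' rule is an assumption of the sketch.

```python
import zlib

HOLE = object()  # hypothetical marker standing in for "store this as a hole"


def write_block(data: bytes, compression: str = "off"):
    """Return what would be stored for one block (illustrative only)."""
    if compression == "off":
        # With compression off, the all-zero check is never made.
        return data
    # The zero check comes first and does not depend on the algorithm.
    if not any(data):
        return HOLE
    if compression == "empty":
        # 'empty' does only the zero check and nothing else.
        return data
    # Any other setting goes on to the main algorithm (zlib as a stand-in).
    compressed = zlib.compress(data)
    return compressed if len(compressed) < len(data) else data
```

In this model an all-zero block written with compression off is stored in full, while the same block written with any compression setting (including the internal 'empty' one) turns into a hole.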

There are two consequences of this. The first is that you should always set at least ZLE compression on zvols, even if their volblocksize is the same as your pool's ashift block size and so they can't otherwise benefit from compression (this would also apply to filesystems if you set an ashift-sized recordsize). The second is that it reinforces how you should basically always turn compression on for filesystems, even if you think you have mostly incompressible data. Not only do you save space at the end of files, but you get to drop any all-zero sections of sparse or pseudo-sparse files.
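
One rough way to check whether space was actually given back is to compare a file's apparent size with the space allocated to it. Here is a minimal Python sketch using only os.stat(); the function names are made up for the example, and st_blocks is only available on Unix.

```python
import os


def allocation(path: str) -> tuple:
    """Return (apparent size, allocated bytes) for a file."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units, whatever the
    # filesystem's real block size is.
    return st.st_size, st.st_blocks * 512


def looks_sparse(path: str) -> bool:
    """True if less space is allocated than the file's length."""
    apparent, allocated = allocation(path)
    return allocated < apparent
```

Ordinary compression also makes the allocated size smaller than the apparent size, so on a compressed filesystem this can't by itself tell holes apart from data that simply compressed well. The allocated figure may also lag behind recent writes.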

(Looking back, Richard Laager mentioned this zero block discarding for zvols back in a comment on this entry of mine, but apparently it didn't stick in my mind. Also, now I know the details.)

I took a quick look back through the history of ZFS's code, and as far as I could see, this zero-block discarding has always been there, right back to the beginnings of compression (which I believe came in with ZFS itself). ZIO_COMPRESS_EMPTY doesn't quite date back that far; instead, it was introduced along with zfs_mdcomp_disable, back in 2006.

(All of this is thanks to Gordan Bobic for raising the question in reply to me when I was confidently wrong, which led to me actually looking it up in the code.)

ZFSZeroBlockDiscarding written at 00:33:46


A little bit of the one-time MacOS version still lingers in ZFS

Once upon a time, Apple came very close to releasing ZFS as part of MacOS. Apple did this work in its own copy of the ZFS source base (as far as I know), but the people in Sun knew about it and it turns out that even today there is one little lingering sign of this hoped-for and perhaps prepared-for ZFS port in the ZFS source code. Well, sort of, because it's not quite in code.

Lurking in the function that reads ZFS directories to turn (ZFS) directory entries into the filesystem independent format that the kernel wants is the following comment:

 objnum = ZFS_DIRENT_OBJ(zap.za_first_integer);
  * MacOS X can extract the object type here such as:
  * uint8_t type = ZFS_DIRENT_TYPE(zap.za_first_integer);

(Specifically, this is in zfs_readdir in zfs_vnops.c.)

ZFS maintains file type information in directories. This information can't be used on Solaris (and thus Illumos), where the overall kernel doesn't have this in its filesystem independent directory entry format, but it could have been on MacOS ('Darwin'), because MacOS is among the Unixes that support d_type. The comment itself dates all the way back to this 2007 commit, which includes the change 'reserve bits in directory entry for file type', which created the whole setup for this.

I don't know if this file type support was added specifically to help out Apple's MacOS X port of ZFS, but it's certainly possible, and in 2007 it seems likely that this port was at least on the minds of ZFS developers. It's interesting but understandable that FreeBSD didn't seem to have influenced them in the same way, at least as far as comments in the source code go; this file type support is equally useful for FreeBSD, and the FreeBSD ZFS port dates to 2007 too (per this announcement).

Regardless of the exact reason that ZFS picked up maintaining file type information in directory entries, it's quite useful for people on both FreeBSD and Linux that it does so. File type information is useful for any number of things and ZFS filesystems can (and do) provide this information on those Unixes, which helps make ZFS feel like a truly first class filesystem, one that supports all of the expected general system features.
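
As one example of what programs get out of this, here is a small Python sketch. It uses only the standard library. os.scandir() is documented as being able to answer is_dir() and is_file() from the directory entry's type information when the filesystem supplies it, instead of making a separate stat() call for every name. The function name classify is made up for the example.

```python
import os


def classify(path: str) -> dict:
    """Map each name in 'path' to 'symlink', 'dir', 'file' or 'other'."""
    kinds = {}
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False asks about the entry itself, which is
            # the question that directory type information can answer.
            if entry.is_symlink():
                kinds[entry.name] = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kinds[entry.name] = "dir"
            elif entry.is_file(follow_symlinks=False):
                kinds[entry.name] = "file"
            else:
                kinds[entry.name] = "other"
    return kinds
```

If the filesystem reports no type for an entry, os.scandir() quietly falls back to a stat() call, so the result is the same either way and only the cost differs.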

ZFSDTypeAndMacOS written at 21:24:29

How ZFS maintains file type information in directories

As an aside in yesterday's history of file type information being available in Unix directories, I mentioned that it was possible for a filesystem to support this even though its Unix didn't. By supporting it, I mean that the filesystem maintains this information in its on disk format for directories, even though the rest of the kernel will never ask for it. This is what ZFS does.

(One reason to do this in a filesystem is future-proofing it against a day when your Unix might decide to support this in general; another is if you ever might want the filesystem to be a first class filesystem in another Unix that does support this stuff. In ZFS's case, I suspect that the first motivation was larger than the second one.)

The easiest way to see that ZFS does this is to use zdb to dump a directory. I'm going to do this on an OmniOS machine, to make it more convincing, and it turns out that this has some interesting results. Since this is OmniOS, we don't have the convenience of just naming a directory in zdb, so let's find the root directory of a filesystem, starting from dnode 1 (as seen before).

# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]
    microzap: 512 bytes, 4 entries
         ROOT = 3 

# zdb -dddd fs3-corestaff-01/h/281 3
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        3    1    16K     1K     8K     1K  100.00  ZFS directory
    microzap: 1024 bytes, 8 entries

         RESTORED = 4396504 (type: Directory)
         ckstst = 12017 (type: not specified)
         ckstst3 = 25069 (type: Directory)
         .demo-file = 5832188 (type: Regular File)
         .peergroup = 12590 (type: not specified)
         cks = 5 (type: not specified)
         cksimap1 = 5247832 (type: Directory)
         .diskuse = 12016 (type: not specified)
         ckstst2 = 12535 (type: not specified)

This is actually an old filesystem (it dates from Solaris 10 and has been transferred around with 'zfs send | zfs recv' since then), but various home directories for real and test users have been created in it over time (you can probably guess which one is the oldest one). Sufficiently old directories and files have no file type information, but more recent ones have this information, including .demo-file, which I made just now so this would have an entry that was a regular file with type information.

Once I dug into it, this turned out to be a change introduced (or activated) in ZFS filesystem version 2, which is described in 'zfs upgrade -v' as 'enhanced directory entries'. As an actual change in (Open)Solaris, it dates from mid 2007, although I'm not sure what Solaris release it made it into. The upshot is that if you made your ZFS filesystem any time in the last decade, you'll have this file type information in your directories.
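
If you want to know how many entries in an old directory predate this change, you could count them from zdb's output. The following is a minimal Python sketch that assumes the entry lines look exactly like the ones in the dump above ('name = number (type: ...)'). zdb's output is not a stable interface, and the function name is made up for the example.

```python
import re
from collections import Counter

# Matches lines such as "  ckstst3 = 25069 (type: Directory)".
ENTRY_RE = re.compile(r"^\s*(\S.*?) = (\d+) \(type: ([^)]+)\)\s*$")


def count_entry_types(zdb_output: str) -> Counter:
    """Count directory entries by the type string zdb printed for them."""
    counts = Counter()
    for line in zdb_output.splitlines():
        m = ENTRY_RE.match(line)
        if m:
            counts[m.group(3)] += 1
    return counts


sample = """\
         RESTORED = 4396504 (type: Directory)
         ckstst = 12017 (type: not specified)
         .demo-file = 5832188 (type: Regular File)
         cks = 5 (type: not specified)
"""
print(count_entry_types(sample))
```

Lines without a '(type: ...)' suffix, such as the 'ROOT = 3' entry in the first dump, are simply skipped.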

How ZFS stores this file type information is interesting and clever, especially when it comes to backwards compatibility. I'll start by quoting the comment from zfs_znode.h:

 * The directory entry has the type (currently unused on
 * Solaris) in the top 4 bits, and the object number in
 * the low 48 bits.  The "middle" 12 bits are unused.

In yesterday's entry I said that Unix directory entries need to store at least the filename and the inode number of the file. What ZFS is doing here is reusing the 64 bit field used for the 'inode' (the ZFS dnode number) to also store the file type, because it knows that object numbers have only a limited range. This also makes old directory entries compatible, by making type 0 (all 4 bits 0) mean 'not specified'. Since old directory entries only stored the object number and the object number is 48 bits or less, the higher bits are guaranteed to be all zero.

(It seems common to define DT_UNKNOWN to be 0; both FreeBSD and Linux do it.)

The reason this needed a new ZFS filesystem version is now clear. If you tried to read directory entries with file type information on a version of ZFS that didn't know about them, the old version would likely see crazy (and non-existent) object numbers and nothing would work. In order to even read a 'file type in directory entries' filesystem, you need to know to only look at the low 48 bits of the object number field in directory entries.
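
The bit layout and the compatibility problem can both be sketched in a few lines of Python. This only illustrates the arithmetic and is not ZFS's code. The helper names are made up, the object numbers are borrowed from the zdb dump above, and the type value 4 for a directory is the DT_DIR value from FreeBSD and Linux, assumed here purely for the example.

```python
OBJ_BITS = 48                    # object number lives in the low 48 bits
TYPE_SHIFT = 60                  # type lives in the top 4 bits
OBJ_MASK = (1 << OBJ_BITS) - 1

DT_UNKNOWN, DT_DIR = 0, 4        # assumed values, as on FreeBSD and Linux


def pack_dirent(objnum: int, dtype: int = DT_UNKNOWN) -> int:
    """Combine an object number and a 4-bit type into one 64-bit value."""
    if not 0 <= objnum <= OBJ_MASK:
        raise ValueError("object number does not fit in 48 bits")
    if not 0 <= dtype <= 0xF:
        raise ValueError("type does not fit in 4 bits")
    return (dtype << TYPE_SHIFT) | objnum


def dirent_obj(value: int) -> int:
    """What a type-aware reader uses as the object number."""
    return value & OBJ_MASK


def dirent_type(value: int) -> int:
    """The file type stored in the top 4 bits (0 means 'not specified')."""
    return (value >> TYPE_SHIFT) & 0xF


old_entry = 12017                       # old style: just the object number
new_entry = pack_dirent(25069, DT_DIR)  # new style: type in the top bits

# A reader that predates the change uses the whole field as-is.
naive_objnum = new_entry
```

A type-aware reader gets sensible answers from both kinds of entry: the old one decodes to object 12017 with type 0, and the new one to object 25069 with type 4. The naive reader instead sees a value far larger than any 48-bit object number, which is why the change needed a new filesystem version.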

(As before, I consider this a neat hack that cleverly uses some properties of ZFS and the filesystem to its advantage.)

ZFSAndDirectoryDType written at 00:43:13
