ZFS quietly discards all-zero blocks, but only sometimes

September 4, 2018

On the ZFS on Linux mailing list, a question came up about whether ZFS discards writes of all-zero blocks (as you'd get from 'dd if=/dev/zero of=...'), turning them into holes in your files or, more importantly, holes in your zvols. This matters most for zvols, because if ZFS behaves this way, it gives you a way of returning a zvol to a sparse state from inside a virtual machine (or other environment using the zvol):

$ dd if=/dev/zero of=fillfile
[... wait for the disk to fill up ...]
$ rm -f fillfile

The answer turns out to be that ZFS does discard all-zero blocks and turn them into holes, but only if you have some sort of compression turned on (that is, you aren't running with the default 'compression=off'). This isn't implemented as part of ZFS ZLE compression (or other compression methods); instead, it's an entirely separate check that looks only for an all-zero block and returns a special marker if that's what it has. As you'd expect, this check is done before ZFS tries whatever main compression algorithm you set.
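If you want to see the effect for yourself, one approach is to compare a file's apparent size against the space actually allocated to it. The following is a minimal sketch that assumes GNU stat; the helper name size_report and the /tank/test path are made up for illustration, and on ZFS the allocated figure may only settle once the write has actually been committed to disk.

```shell
# Print "<apparent bytes> <allocated bytes>" for a file.
# Assumes GNU stat: %s = apparent size, %b = allocated blocks,
# %B = size in bytes of each block counted by %b.
size_report() {
    stat -c '%s %b %B' "$1" | awk '{ print $1, $2 * $3 }'
}

# Hypothetical use on a dataset with compression turned on:
#   dd if=/dev/zero of=/tank/test/zeros bs=128k count=64
#   sync
#   size_report /tank/test/zeros
```

If the all-zero blocks were discarded, the second number should come out far smaller than the first.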

Interestingly, there is a special compression type called 'empty' (ZIO_COMPRESS_EMPTY) that only does this special 'discard zeros' check. You can't set it from user level with something like 'compression=empty', but it's used internally in the ZFS code for a few things. For instance, if you turn off metadata compression with the zfs_mdcomp_disable tunable, metadata is still compressed with this 'empty' compression. Comments in the current ZFS on Linux source code suggest that ZFS relies on this to do things like discard blocks in dnode object sets where all the dnodes in the block are free (which apparently zeroes out the dnode).
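To make the 'all-zero block' idea concrete, here is a rough user-level illustration in shell. It is not how ZFS implements the check; the function name is_zero_block is hypothetical, and all it does is read one block of a file and test whether anything other than zero bytes is in it.

```shell
# Succeed if block number $3 (counting from 0) of file $1, using a
# block size of $2 bytes, contains only zero bytes. A block past the
# end of the file reads as empty and so also counts as "all zero".
is_zero_block() {
    nonzero=$(dd if="$1" bs="$2" skip="$3" count=1 2>/dev/null |
        tr -d '\000' | wc -c | tr -d ' ')
    [ "$nonzero" -eq 0 ]
}

# Hypothetical use: is the first 128 KiB block of a file all zeros?
#   is_zero_block /tank/test/somefile 131072 0 && echo "all zeros"
```

The point of the sketch is that the test is all-or-nothing per block: a single non-zero byte anywhere in the block means it isn't a candidate for becoming a hole.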

There are two consequences of this. The first is that you should always set at least ZLE compression on zvols, even if their volblocksize is the same as your pool's ashift block size and so they can't otherwise benefit from compression (this would also apply to filesystems if you set an ashift-sized recordsize). The second is that it reinforces how you should basically always turn compression on for filesystems, even if you think you have mostly incompressible data. Not only do you save space at the end of files, but you also get to drop any all-zero sections of sparse or pseudo-sparse files.
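As a sketch of the first point, something like the following would switch a zvol or filesystem over to ZLE only if compression is currently off. The function name ensure_compression and the dataset name tank/vm/disk0 are hypothetical; 'zfs get' and 'zfs set' are the standard commands. Changing the compression property only affects blocks written afterward, so existing data is left as it was.

```shell
# Turn on ZLE for a dataset or zvol only if compression is currently off,
# leaving any other existing compression setting alone.
ensure_compression() {
    current=$(zfs get -H -o value compression "$1") || return 1
    if [ "$current" = "off" ]; then
        zfs set compression=zle "$1"
    fi
}

# Hypothetical use:
#   ensure_compression tank/vm/disk0
```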

(Looking back, Richard Laager mentioned this zero block discarding for zvols back in a comment on this entry of mine, but apparently it didn't stick in my mind. Also, now I know the details.)

I took a quick look back through the history of ZFS's code, and as far as I could see, this zero-block discarding has always been there, right back to the beginnings of compression (which I believe came in with ZFS itself). ZIO_COMPRESS_EMPTY doesn't quite date back that far; instead, it was introduced along with zfs_mdcomp_disable, back in 2006.

(All of this is thanks to Gordan Bobic for raising the question in reply to me when I was confidently wrong, which led to me actually looking it up in the code.)


Comments on this page:

By Opk at 2018-09-04 12:39:17:

Any idea whether ZFS passes through TRIMs in the same manner? So would the fstrim command from util-linux do much the same as your dd if you've got some blocks you want to release back from a VM on a zvol?

By Etienne Dechamps at 2018-09-04 13:24:42:

> Any idea whether ZFS passes through TRIMs in the same manner?

Yes, it does, and that's indeed a cleaner way to make zvols sparse than the zero-filling described in this post. For example, one can use blkdiscard(8) to discard (trim) sectors on a zvol. This was implemented years ago by yours truly in https://github.com/zfsonlinux/zfs/pull/553.

(Note: to dispel any confusion, this is about discarding blocks on zvols so that ZFS can reclaim the space for other things. This has nothing to do with ZFS itself discarding blocks on vdevs (e.g. SSDs), which is a completely different story.)

I also implemented the same for sparse files (fallocate(FALLOC_FL_PUNCH_HOLE)), although it was much harder to use because you had to get the system call exactly right (only a very specific set of parameters was supported). However, I believe that might have been fixed since then.

Written on 04 September 2018.

Last modified: Tue Sep 4 00:33:46 2018