Confirming the behavior of file block sizes in ZFS

January 5, 2018

ZFS filesystems have a property called their recordsize, which is usually described as something like the following (from here):

All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.

A while back I wrote about using zdb to peer into how ZFS stores files on disk, where I looked into how ZFS stored a 160 Kb file and specifically if it really did use two 128 Kb blocks to hold it, instead of a 128 Kb block and a 32 Kb block. The answer was yes, with some additional discoveries about ZFS compression and partial blocks.

Today I wound up wondering once again if that informal description of how ZFS behaves was really truly the case. Specifically, I wondered if there were situations where ZFS could wind up with a mixture of block sizes, say a 4 Kb block that was written initially at the start of the file and then a larger block written later after a big hole in the file. If ZFS really always stored sufficiently large files with only recordsize blocks, it would have to go back to rewrite the initial 4 Kb block, which seemed a bit odd to me given ZFS's usual reluctance to rewrite things.

So I did this experiment. We start out with a 4 Kb file, sync it, verify (with zdb) that it's there on disk and looks like we expect, and then extend the file with a giant hole, writing 32 Kb at 512 Kb into the file:

dd if=/dev/urandom of=testfile bs=4k count=1
[wait, check with zdb]
dd if=/dev/urandom of=testfile bs=32k seek=16 count=1 conv=notrunc
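
For anyone who wants to repeat this, here is a minimal sketch of the same two writes as a script. It works out the seek count from the target offset instead of hard-coding it (dd's seek= counts in units of bs) and checks the resulting file sizes. The ZFS_TEST_DIR variable and the temporary file name are hypothetical names of my own, not part of the original experiment. The zdb inspection still has to be done by hand on a real ZFS filesystem.

```shell
# Directory to create the scratch file in; point the (hypothetical)
# ZFS_TEST_DIR variable at a ZFS filesystem to inspect the file with zdb.
dir=${ZFS_TEST_DIR:-${TMPDIR:-/tmp}}
f=$(mktemp "$dir/testfile.XXXXXX")

# Step 1: a 4 Kb file of incompressible data, flushed to disk.
dd if=/dev/urandom of="$f" bs=4k count=1 2>/dev/null
sync
size1=$(( $(wc -c < "$f") ))

# Step 2: a 32 Kb write starting 512 Kb into the file, leaving a hole.
bs=$((32 * 1024))
target=$((512 * 1024))
seek=$((target / bs))
dd if=/dev/urandom of="$f" bs="$bs" seek="$seek" count=1 conv=notrunc 2>/dev/null
sync
size2=$(( $(wc -c < "$f") ))

echo "file=$f seek=$seek size1=$size1 size2=$size2"
```

The second size should come out as 512 Kb plus 32 Kb, which confirms that the write landed at the offset the zdb output is expected to show. The script leaves the file in place so that it can be examined afterward.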

The first write creates a testfile that has a ZFS file block size of 4 Kb (which zdb prints as the dblk field); these are the initial conditions we expect. We can also see a single 4 Kb data block at offset 0:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile
Indirect blocks:
     0 L0 0:204ea46a00:1000 1000L/1000P F=1 B=5401327/5401327

After writing the additional 32 Kb, zdb reports that the file's block size has jumped up to 128 Kb, the standard ZFS dataset recordsize; this again is what we expect. However, it also reports a change in the indirect blocks. They are now:

Indirect blocks:
     0 L1  0:200fdf4200:400 20000L/400P F=2 B=5401362/5401362
     0  L0 0:200fdf2e00:1400 20000L/1400P F=1 B=5401362/5401362
 80000  L0 0:200fdeaa00:8400 20000L/8400P F=1 B=5401362/5401362

The L0 block that starts at file offset 0 has changed (L0 blocks are the actual data blocks, even though zdb lists them under 'Indirect blocks'). It's been rewritten from a 4 Kb logical / 4 Kb physical block to being 128 Kb logical and 5 Kb physical (this is still an ashift=9 pool), and the TXG it was created in (the B= field) is the same as the other blocks.
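
The sizes and offsets in this zdb output are in hexadecimal, which makes them easy to misread. As a small sketch (decode_sizes is a helper name I made up, not something zdb provides), this turns a logical/physical size field into decimal bytes:

```shell
# Convert a zdb size field such as "20000L/1400P" into decimal byte counts.
# The field holds the logical size, then the physical size, both in hex.
decode_sizes() {
    field=$1
    lsize=${field%%L/*}   # hex digits before "L/"
    psize=${field##*/}    # everything after the slash, e.g. "1400P"
    psize=${psize%P}      # drop the trailing "P"
    printf 'logical=%d physical=%d\n' "$((0x$lsize))" "$((0x$psize))"
}

decode_sizes 1000L/1000P    # logical=4096 physical=4096
decode_sizes 20000L/1400P   # logical=131072 physical=5120
decode_sizes 20000L/8400P   # logical=131072 physical=33792
```

The same arithmetic applies to the file offset in the first column: 80000 in hex is 524288 bytes, or 512 Kb.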

So what everyone says about the ZFS recordsize is completely true. ZFS files only ever have one (logical) block size, which starts out as small as it can be and then expands out as the file gets more data (or, more technically, as the maximum offset of data in the file increases). If you push it, ZFS will rewrite existing data you're not touching in order to expand the (logical) block size out to the dataset recordsize.

If you think about it, this rewriting is not substantially different from what happens if you write 4 Kb and then write another 4 Kb after it. Just as here, ZFS will replace your initial 4 Kb data block with an 8 Kb data block; it just feels a bit more expected because both the old and the new data fall within the first full 128 Kb recordsize block of the file.
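
One way to hold this behavior in your head is a toy model of the file's logical block size as a function of where its data ends. This is only a sketch under assumptions of mine: model_blocksize is a made-up name, the default 128 Kb recordsize is the one from this experiment, and the rounding up to a multiple of 512 bytes for small files is an assumption that nothing in this entry tests. It is not ZFS's actual code.

```shell
# Toy model: logical block size of a file whose data ends at byte offset $1,
# on a dataset whose recordsize is $2 (defaults to 128 Kb).
model_blocksize() {
    end=$1
    recordsize=${2:-131072}
    if [ "$end" -ge "$recordsize" ]; then
        # Data reaches or passes one full record: every block is recordsize.
        echo "$recordsize"
    else
        # ASSUMPTION: a single block, rounded up to a multiple of 512 bytes.
        echo $(( (end + 511) / 512 * 512 ))
    fi
}

model_blocksize 4096     # the initial 4 Kb file
model_blocksize 8192     # 4 Kb followed by another 4 Kb
model_blocksize 557056   # 32 Kb written at 512 Kb
```

For the three cases in this entry the model gives 4096, 8192 and 131072, which matches what zdb reported for the first and last and what is described here for the middle one.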

(Apparently, every so often something in ZFS feels sufficiently odd to me that I have to go confirm it for myself, just to be sure and so I can really believe in it without any lingering doubts.)

Written on 05 January 2018.
