Confirming the behavior of file block sizes in ZFS
ZFS filesystems have a property called their recordsize, which is usually described as something like the following (from here):
All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.
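(For what it's worth, recordsize is an ordinary per-dataset property. A quick illustration, using the dataset name from my examples below:

    # show the current recordsize of a dataset
    zfs get recordsize ssddata/homes
    # change it; this only affects blocks written from now on
    zfs set recordsize=128k ssddata/homes

)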
A while back I wrote about using zdb to peer into how ZFS stores files on disk, where I looked into how ZFS stored a 160 Kb file and specifically if it really did use two 128 Kb blocks to hold it, instead of a 128 Kb block and a 32 Kb block. The answer was yes, with some additional discoveries about ZFS compression and partial blocks.
Today I wound up wondering once again if that informal description of how ZFS behaves was really truly the case. Specifically, I wondered if there were situations where ZFS could wind up with a mixture of block sizes, say a 4 Kb block that was written initially at the start of the file and then a larger block written later after a big hole in the file. If ZFS really always stored sufficiently large files with only recordsize blocks, it would have to go back to rewrite the initial 4 Kb block, which seemed a bit odd to me given ZFS's usual reluctance to rewrite things.
So I did this experiment. We start out with a 4 Kb file, sync it, verify (with zdb) that it's there on disk and looks like we expect, and then extend the file with a giant hole, writing 32 Kb at 512 Kb into the file:
    dd if=/dev/urandom of=testfile bs=4k count=1
    sync
    [wait, check with zdb]
    dd if=/dev/urandom of=testfile bs=32k seek=16 count=1 conv=notrunc
    sync
(In the second dd, seek=16 is in units of the 32 Kb block size, so the write lands at the 512 Kb offset.) The first write creates a testfile that has a ZFS file block size of 4 Kb (which zdb prints as the dblk field); these are the initial conditions we expect. We can also see a single 4 Kb data block at offset 0:
    # zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile
    [...]
    Indirect blocks:
         0 L0 0:204ea46a00:1000 1000L/1000P F=1 B=5401327/5401327
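As a rough guide to reading these lines (this is my own annotation of the fields, so take it with a grain of salt):

    0                  file offset of the block, in hex
    L0                 level in the block tree (L0 is an actual data block)
    0:204ea46a00:1000  the DVA, as vdev:offset:allocated size, in hex
    1000L/1000P        logical/physical sizes: 0x1000 is 4 Kb for both
    F=1                fill count
    B=5401327/5401327  the birth TXGs of the block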
After writing the additional 32 Kb, zdb reports that the file's block size has jumped up to 128 Kb, the standard ZFS dataset recordsize; this again is what we expect. However, it also reports a change in the indirect blocks. They are now:
    Indirect blocks:
         0 L1  0:200fdf4200:400 20000L/400P F=2 B=5401362/5401362
         0 L0  0:200fdf2e00:1400 20000L/1400P F=1 B=5401362/5401362
     80000 L0  0:200fdeaa00:8400 20000L/8400P F=1 B=5401362/5401362
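Decoding the hex sizes and offsets (my arithmetic, not zdb's):

    20000L = 0x20000 = 128 Kb logical, the new file block size
    80000  = 0x80000 = 512 Kb, where our 32 Kb write landed
    1400P  = 0x1400  = 5 Kb physical for the rewritten block at offset 0
    8400P  = 0x8400  = 33 Kb physical; presumably the 32 Kb of random
             data plus a little overhead, with the rest of the 128 Kb
             logical block of zeroes compressed away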
The L0 data block that starts at file offset 0 has changed. It's been rewritten from a 4 Kb logical / 4 Kb physical block to being 128 Kb logical and 5 Kb physical (this is still an ashift=9 pool), and the TXG it was created in (the B= field) is the same as the other blocks.
So what everyone says about the ZFS recordsize is completely true. ZFS files only ever have one (logical) block size, which starts out as small as it can be and then expands out as the file gets more data (or, more technically, as the maximum offset of data in the file increases). If you push it, ZFS will rewrite existing data you're not touching in order to expand the (logical) block size out to the dataset recordsize.
If you think about it, this rewriting is not substantially different from what happens if you write 4 Kb and then write another 4 Kb after it. Just as here, ZFS will replace your initial 4 Kb data block with an 8 Kb data block; it just feels a bit more expected because both the old and the new data fall within the first full 128 Kb recordsize block of the file.
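If you want to see that simpler case for yourself, here is a minimal sketch along the lines of my experiment above (the dataset and path are the ones from my examples and will differ on your system):

    dd if=/dev/urandom of=testfile2 bs=4k count=1
    sync
    [check: zdb should report a dblk of 4K and a 1000L/1000P data block]
    dd if=/dev/urandom of=testfile2 bs=4k seek=1 count=1 conv=notrunc
    sync
    zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
    [zdb should now report a dblk of 8K and a single 2000L data block]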
(Apparently, every so often something in ZFS feels sufficiently odd to me that I have to go confirm it for myself, just to be sure and so I can really believe in it without any lingering doubts.)