2018-02-14
Some things about ZFS block allocation and ZFS (file) record sizes
As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. For a file under the recordsize, the block size turns out to be a multiple of 512 bytes, regardless of the pool's ashift or the physical sector size of the drives the pool is using.
Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.
To talk about these sizes, I'll start with some illustrative zdb output for a file data block, as before:
0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]
The first size of the three is the logical block size, before compression. This is the first size= number ('4200L' here, in hex and L for logical). This is what grows in 512-byte units up to the recordsize and so on.
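As a rough sketch of that rule in Python: the function name, the 128 KB default recordsize, and the treatment of an empty file are my own choices for illustration, and this only models the behavior described here, not the actual ZFS code.

```python
SECTOR = 512  # logical block sizes grow in 512-byte units


def logical_block_sizes(file_size, recordsize=128 * 1024):
    """Return the logical block sizes (in bytes) for a file of file_size bytes.

    Hypothetical model: a file up to recordsize gets a single block rounded
    up to a multiple of 512 bytes; a larger file gets only full recordsize
    blocks, including the last one.
    """
    if file_size <= 0:
        # Assumption for this sketch: an empty file has no data blocks.
        return []
    if file_size <= recordsize:
        # Ceiling division, then scale back up to bytes.
        return [-(-file_size // SECTOR) * SECTOR]
    nblocks = -(-file_size // recordsize)
    return [recordsize] * nblocks


# A 16500-byte file rounds up to 0x4200 (16896) bytes, as in the zdb output.
print([hex(size) for size in logical_block_sizes(16500)])
# A 160 KB file takes two full 128 KB logical blocks.
print(logical_block_sizes(160 * 1024))
```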
The second size is the physical size after compression, if any; this is the second size= number ('4200P' here, P for physical). It's a bit weird. If the file can't be compressed, it is the same as the logical size and because the logical size goes in 512-byte units, so does this size, even on ashift=12 pools. However, if compression happens this size appears to go by the ashift, which means it doesn't necessarily go in 512-byte units. On an ashift=9 pool you'll see it go in 512-byte units (so you can have a compressed size of '400P', ie 1 KB), but the same data written in an ashift=12 pool winds up being in 4 KB units (so you wind up with a compressed size of '1000P', ie 4 KB).
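Here is a small Python sketch of what I think I'm observing. The function names and the 600-byte compressed size are made up for illustration, and capping the result at the logical size is my assumption rather than something I've confirmed in the ZFS code.

```python
def round_up(nbytes, unit):
    """Round nbytes up to the next multiple of unit."""
    return -(-nbytes // unit) * unit


def physical_size(logical_size, compressed_bytes, ashift):
    """Hypothetical model of the 'P' size for one block.

    compressed_bytes is None when the block was not compressed; otherwise it
    is the raw size of the compressed data before any rounding.
    """
    if compressed_bytes is None or compressed_bytes >= logical_size:
        # Uncompressed: the physical size simply equals the logical size.
        return logical_size
    # Compressed: round up to the pool's ashift unit (512 bytes for ashift=9,
    # 4 KB for ashift=12). Assumption: never report more than the logical size.
    return min(logical_size, round_up(compressed_bytes, 1 << ashift))


# The same (made-up) 600 bytes of compressed data on two different pools:
print(hex(physical_size(0x4200, 600, ashift=9)))    # 0x400, ie 1 KB
print(hex(physical_size(0x4200, 600, ashift=12)))   # 0x1000, ie 4 KB
# Uncompressed data keeps its 512-byte granularity even on ashift=12:
print(hex(physical_size(0x4200, None, ashift=12)))  # 0x4200
```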
The third size is the actual allocated size on disk, as recorded in the DVA's asize field (which is the third subfield in the DVA[0] portion). This is always in ashift-based units, even if the physical size is not. Thus you can wind up with a 20 KB DVA but a 16.5 KB 'physical' size, as in our example (the DVA is '5000' while the block physical size is '4200').
(I assume this happens because ZFS ensures that the physical size is never larger than the logical size, although the DVA allocated size may be.)
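To make the three sizes concrete, here is a minimal Python sketch that pulls them out of the example zdb line and converts them from hex. The regular expressions are only written against this one line's format (I haven't tried to cover everything zdb can print), and the function name is mine.

```python
import re

LINE = ("0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] "
        "size=4200L/4200P [...]")


def block_sizes(line):
    """Extract logical, physical and allocated sizes (in bytes) from a zdb line."""
    # DVA[0]=<vdev:offset:asize>, all fields in hex.
    dva = re.search(r"DVA\[0\]=<([0-9a-f]+):([0-9a-f]+):([0-9a-f]+)>", line)
    # size=<logical>L/<physical>P, both in hex.
    size = re.search(r"size=([0-9a-f]+)L/([0-9a-f]+)P", line)
    if not dva or not size:
        raise ValueError("not a recognizable zdb block line")
    return {
        "logical": int(size.group(1), 16),
        "physical": int(size.group(2), 16),
        "allocated": int(dva.group(3), 16),
    }


for name, nbytes in block_sizes(LINE).items():
    print(f"{name:9s} {nbytes:6d} bytes = {nbytes / 1024:g} KB")
```

For the example line this reports 16.5 KB for both the logical and the physical size and 20 KB for the allocated size.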
For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.
On the one hand, how much this matters depends on how compressible your data is and much modern data isn't (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.
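The arithmetic for that 160 KB file can be sketched in Python as follows. Note that zlib here is only a stand-in for whatever compression the filesystem actually uses, and the random bytes are a stand-in for incompressible file data; the exact compressed size will differ under ZFS, but the zero tail shrinks to next to nothing either way.

```python
import os
import zlib

RECORDSIZE = 128 * 1024
FILE_SIZE = 160 * 1024

# The second logical block holds the last 32 KB of file data...
tail_data = FILE_SIZE - RECORDSIZE
# ...followed by 96 KB of zeros out to the end of the 128 KB block.
tail_padding = RECORDSIZE - tail_data

# Incompressible stand-in for real file data, then the zero padding.
second_block = os.urandom(tail_data) + bytes(tail_padding)
compressed = zlib.compress(second_block)

print(f"data in second block:    {tail_data // 1024} KB")
print(f"padding in second block: {tail_padding // 1024} KB")
print(f"logical size:            {len(second_block) // 1024} KB")
print(f"compressed size:         about {len(compressed) / 1024:.1f} KB")
```

The compressed size comes out only slightly above the 32 KB of real data, which is the point: the 96 KB of padding costs almost nothing once compression is on.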
PS: I don't know if it's possible to mix vdevs with different ashifts in the same pool. If it is, I don't know how ZFS would decide what ashift to use for the physical block size. The minimum ashift in any vdev? The maximum ashift?
(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)