Some things about ZFS block allocation and ZFS (file) record sizes

February 14, 2018

As I wound up experimentally verifying, in ZFS every file is stored either as a single block of varying size up to the filesystem's recordsize, or as multiple recordsize-sized blocks. For a file under the recordsize, the block size turns out to be a multiple of 512 bytes, regardless of the pool's ashift or the physical sector size of the drives the pool is using.

Well, sort of. While everything I've written is true, it also turns out to be dangerously imprecise (as I've seen before). There are actually three different sizes here and the difference between them matters once we start getting into the fine details.

To talk about these sizes, I'll start with some illustrative zdb output for a file data block, as before:

 0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The first size of the three is the logical block size, before compression. This is the first size= number ('4200L' here; the value is in hex and the L is for logical, so this is 16.5 KB). This is the size that grows in 512-byte units up to the recordsize.

The second size is the physical size after compression, if any; this is the second size= number ('4200P' here, P for physical). It's a bit weird. If the file can't be compressed, it is the same as the logical size, and because the logical size goes in 512-byte units, so does this size, even on ashift=12 pools. However, if compression happens this size appears to go by the ashift, which means it doesn't necessarily go in 512-byte units. On an ashift=9 pool you'll see it go in 512-byte units (so you can have a compressed size of '400P', ie 1 KB), but the same data written in an ashift=12 pool winds up being in 4 KB units (so you wind up with a compressed size of '1000P', ie 4 KB).
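As a rough model of this observed behavior (and not ZFS's actual code), the compressed physical size acts as if it were rounded up to the next multiple of 2^ashift bytes. Here is a minimal Python sketch; the function name and the 700-byte compressed payload are made up for illustration.

```python
def round_up_to_ashift(nbytes, ashift):
    """Round nbytes up to the next multiple of 2**ashift bytes."""
    unit = 1 << ashift
    return (nbytes + unit - 1) // unit * unit

# Hypothetical case: the data compresses down to 700 bytes.
compressed = 700
for ashift in (9, 12):
    psize = round_up_to_ashift(compressed, ashift)
    # Print in hex with a trailing P to match zdb's size= notation.
    print(f"ashift={ashift}: psize={psize:x}P ({psize} bytes)")
```

With these inputs the sketch gives 400P for ashift=9 and 1000P for ashift=12, matching the two compressed sizes mentioned above.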

The third size is the actual allocated size on disk, as recorded in the DVA's asize field (which is the third subfield in the DVA[0] portion). This is always in ashift-based units, even if the physical size is not. Thus you can wind up with a 20 KB DVA but a 16.5 KB 'physical' size, as in our example (the DVA is '5000' while the block physical size is '4200').
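To make the three numbers concrete, here is a small Python sketch that pulls them out of the sample zdb line and converts them from hex. The regular expressions are my assumption based only on this one line of output, not on any documented zdb format, and the function name is made up.

```python
import re

# The sample line from the zdb output shown earlier.
LINE = "0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]"

def parse_block_sizes(line):
    """Return (logical, physical, allocated) sizes in bytes from one zdb block line."""
    dva = re.search(r"DVA\[0\]=<([0-9a-f]+):([0-9a-f]+):([0-9a-f]+)>", line)
    size = re.search(r"size=([0-9a-f]+)L/([0-9a-f]+)P", line)
    if dva is None or size is None:
        raise ValueError("not a recognizable zdb block line")
    asize = int(dva.group(3), 16)   # third DVA subfield: allocated size
    lsize = int(size.group(1), 16)  # logical size, before compression
    psize = int(size.group(2), 16)  # physical size, after compression
    return lsize, psize, asize

lsize, psize, asize = parse_block_sizes(LINE)
for name, value in (("logical", lsize), ("physical", psize), ("allocated", asize)):
    print(f"{name:9s} {value:6d} bytes = {value / 1024:g} KB")
```

For the sample line this reports 16.5 KB for both the logical and physical sizes and 20 KB for the allocated size.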

(I assume this happens because ZFS ensures that the physical size is never larger than the logical size, although the DVA allocated size may be.)

For obvious reasons, it's the actual allocated size on disk (the DVA asize) that matters for things like rounding up raidz allocation to N+1 blocks, fragmentation, and whether you need to use a ZFS gang block. If you write a 128 KB (logical) block that compresses to a 16 KB physical block, it's 16 KB of (contiguous) space that ZFS needs to find on disk, not 128 KB.

On the one hand, how much this matters depends on how compressible your data is, and much modern data isn't compressible (because it's already been compressed in its user-level format). On the other hand, as I found out, 'sparse' space after the logical end of file is very compressible. A 160 KB file on a standard 128 KB recordsize filesystem takes up two 128 KB logical blocks, but the second logical block has 96 KB of nothingness at the end and that compresses down to almost nothing.
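The arithmetic for the logical layout can be sketched in Python. This models only the rules described in this entry (one 512-byte-granular block for files up to the recordsize, otherwise full recordsize blocks); the function name is hypothetical and it says nothing about what compression then does to the padding.

```python
def logical_layout(file_size, recordsize=128 * 1024):
    """Return (block_count, logical_block_size, tail_padding) in bytes.

    Models the rules described in this entry, not ZFS's own code.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if file_size <= recordsize:
        # One block, sized up to the next multiple of 512 bytes.
        block = -(-file_size // 512) * 512
        return 1, block, block - file_size
    # Larger files use full recordsize blocks; the last one is padded.
    count = -(-file_size // recordsize)
    return count, recordsize, count * recordsize - file_size

count, block, padding = logical_layout(160 * 1024)
print(f"{count} blocks of {block // 1024} KB, {padding // 1024} KB of padding")
```

For the 160 KB file this gives 2 blocks of 128 KB with 96 KB of padding in the second block.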

PS: I don't know if it's possible to mix vdevs with different ashifts in the same pool. If it is, I don't know how ZFS would decide what ashift to use for the physical block size. The minimum ashift in any vdev? The maximum ashift?

(This is the second ZFS entry in a row where I thought I knew what was going on and it was simple, and then discovered that I didn't and it isn't.)

Written on 14 February 2018.