A surprise in how ZFS grows a file's record size (at least for me)

February 4, 2018

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. If a file has more than one block, all blocks are recordsize, no more and no less. If a file is a single block, the size of this block is based on how much data has been written to the file (or technically the maximum offset that's been written to the file). However, how the block size grows as you write data to the file turns out to be somewhat surprising (which makes me very glad that I actually did some experiments to verify what I thought I knew before I wrote this entry, because I was very wrong).

Rather than involving the ashift or growing in powers of two, ZFS always grows the (logical) block size in 512-byte chunks until it reaches the filesystem recordsize. The actual physical space allocated on disk is in ashift sized units, as you'd expect, but this is not directly related to the (logical) block size used at the file level. For example, here is a 16896 byte file (of incompressible data) on an ashift=12 pool:

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
4780566    1   128K  16.5K    20K     512  16.5K  100.00  ZFS plain file
0 L0 DVA[0]=<0:444bbc000:5000> [L0 ZFS plain file] [...] size=4200L/4200P [...]

The DVA records an 0x5000 byte allocation (20 Kb), but the logical and physical-logical size are only 0x4200 bytes (16.5 Kb).

In thinking about it, this makes a certain amount of sense because the ashift is really a vdev property, not a pool property, and can vary from vdev to vdev within a single pool. As a result, the actual allocated size of a given block may vary from vdev to vdev (and a block may be written to multiple vdevs if you have copies set to more than 1 or it's metadata). The file's current block size thus can't be based on the ashift, because ZFS doesn't necessarily have a single ashift to base it on; instead ZFS bases it on 512-byte sectors, even if this has to be materialized differently on different vdevs.

Looking back, I've already sort of seen this with ZFS compression. As you'd expect, a file's (logical) block size is based on its uncompressed size, or more exactly on the highest byte offset in the file. You can write something to disk that compresses extremely well, and it will still have a large logical block size. Here's an extreme case:

; dd if=/dev/zero of=testfile bs=128k count=1
# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile

 Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956361    1   128K   128K      0     512   128K    0.00  ZFS plain file

This turns out to have no data blocks allocated at all, because the 128 Kb of zeros can be recorded entirely in magic flags in the dnode. But it still has a 128 Kb logical block size. 128 Kb of the character 'a' does wind up requiring a DVA allocation, but the size difference is drastic:

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
956029    1   128K   128K     1K     512   128K  100.00  ZFS plain file
0 L0 DVA[0]=<0:3bbd1c00:400> [L0 ZFS plain file] [...] size=20000L/400P [...]

We have a compressed size of 1 Kb (and a 1 Kb allocation on disk, as this is an ashift=9 vdev), but once again the file block size is 128 Kb.

(If we wrote 127.5 Kb of 'a' instead, we'd wind up with a file block size of 127.5 Kb. I'll let interested parties do that experiment themselves.)

What this means is that ZFS has much less wasted space than I thought it did for files that are under the recordsize. Since such files grow their logical block size in 512-byte chunks, even with no compression they waste at most almost all of one physical block on disk (if you have a file that is, say, 32 Kb plus one byte, you'll have a physical block on disk with only one byte used). This has some implications for other areas of ZFS, but those are for another entry.

(This is one of those entries that I'm really glad that I decided to write. I set out to write it as a prequel to another entry just to have how ZFS grew the block size of files written down explicitly, but wound up upending my understanding of the whole area. The other lesson for me is that verifying my understanding with experiments is a really good idea, because every so often my folk understanding is drastically wrong.)

Comments on this page:

I really enjoy these posts you're doing on the internals of ZFS, keep 'em up.

I do wonder if this behaviour that you've investigated changes with ashift=12 when you've got disks with 4k physical block size. Is that something you'd consider checking up on?

By cks at 2018-02-13 23:32:23:

Belatedly: I've now done this test with my ashift=12 pool on 4k advance sector disks (as opposed to the ashift=12 pool on 512n drives). It too reports that a 16896 byte file has a dblk record size of 16.5K and so on, the same as the example in the article. So I think this section of ZFS is fully disconnected from the underlying reported disk sector size.

Written on 04 February 2018.
« More notes on using uMatrix in Firefox 56 (in place of NoScript)
I should remember that sometimes C is a perfectly good option »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Feb 4 22:28:55 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.