How changing a ZFS filesystem's recordsize affects existing files
The ZFS '
recordsize' property (on filesystems)
is a famously confusing ZFS property that is more or less the maximum
logical block size of files on the filesystem. However, you're
allowed to change it even after the filesystem has been created,
so this raises the question of what happens with existing files
when you do so. The simple answer is that existing files are
unaffected and continue to use the old recordsize. The technical
answer can be a lot more complicated, and to understand it I'm going
to start from the beginning.
Simplifying slightly, ZFS files (in fact all filesystem objects) are made up of zero or more logical blocks, which are all the same logical size (they may be different physical sizes on disk, for example because of ZFS compression). How big these blocks are is the file's (current) (logical) block size; all files have a logical block size. Normally there are two cases for the block size; either there is one block and it and the logical block size are growing up toward the filesystem's recordsize, or there's more than one logical block and the file's logical block size is frozen; all additional logical blocks added to the file will use the file's logical block size, whatever that is.
Under normal circumstances, a file will get its second logical block (and freeze its logical block size) at the point where its size goes about the filesystem's recordsize, which makes the file's final logical block size be the filesystem's recordsize. Sometimes this will be right away, when the file is created, because you wrote enough data to it on the spot. Sometimes this might be months after it's created, if you're appending a little bit of data to a log file once a day and your filesystem has the default 128 Kbyte recordsize (or a larger one).
So now we have some cases when you change recordsize. The simple one is that all files that have more than one logical block stick at the old recordsize, because that is their final logical block size (assuming you've only changed recordsize once; if you've changed it more than once, you could have a mixture of these sizes).
If a file has only a single logical block, its size (and logical block size) might be either below or above the new recordsize. If the file's single logical block is smaller than the new recordsize, the single block will grow up to that new recordsize and after that create its second logical block and freeze its logical block size. This is just the same as if the file was created from scratch under your new recordsize. This means that if you raise the recordsize, all files that currently have only one block will wind up using your new recordsize.
If the file's single block is larger than your new recordsize, the file will continue growing the logical block size of this single block up until it hits the next power of two size; after that it adds a second logical block and freezes the file's logical block size. So if you started out with a 512k recordsize, wrote a 200k file, set recordsize down to 128k, and continue writing to the file, it will wind up with a 256k logical block size. If you had written exactly 256k before lowering recordsize, the file would not grow its logical block size to 512k, because it would already be at a power of two.
In other words, there is an invariant that ZFS files with more than one block always have a logical block size that is a power of two. This likely exists because it makes it much easier to calculate which logical block a given offset falls into.
This means that if you lower the recordsize, all current files that have only a single block may wind up with an assortment of logical block sizes, depending on their current size and your new recordsize. If you drop recordsize from 128k to 16k, you could wind up with a collection of files that variously have 32k, 64k, and 128k logical block sizes.
(This is another version of an email I wrote to the ZFS mailing list today.)