ZFS's recordsize as an honest way of keeping checksum overhead down
One of the classical tradeoffs of using checksums to verify the integrity of something (as ZFS does) is the choice of how large a chunk of data to cover with a single checksum. A large chunk size keeps the checksum overhead down, but it means that you have to process a large amount of data at once in order to verify or create the checksum. A large size also limits how specific you can be about what piece of data is damaged, which is important if you want to be able to recover some of your data.
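To put rough numbers on the overhead side of this tradeoff, here is a toy calculation (not ZFS code) assuming a 32-byte checksum, which is the size of SHA-256, one of the checksums ZFS supports:

```python
# Checksum space overhead as a function of checksum chunk size,
# assuming a 32-byte (256-bit) checksum per chunk.
CSUM_BYTES = 32

def overhead_ratio(chunk_size: int) -> float:
    """Fraction of extra space consumed by one checksum per chunk."""
    return CSUM_BYTES / chunk_size

for chunk_kib in (4, 16, 128, 1024):
    chunk = chunk_kib * 1024
    print(f"{chunk_kib:5} KiB chunks: {overhead_ratio(chunk):.5%} overhead")
```

The overhead falls linearly with chunk size: about 0.8% at 4 KiB chunks but only about 0.02% at 128 KiB chunks.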
(Recovery has two aspects. One is simply giving you access to as much of the undamaged data as possible. The other is how much data you have to process in order to heal corrupted data using various redundancy schemes. If you checksum over 16 Kbyte chunks and you have a single corrupted byte in a 1 Mbyte file, you can read 1008 Kbytes immediately and you only have to process the span of 16 Kbytes of data to recover from the corruption. If you checksum over 1 Mbyte chunks and have the same corruption, the entire file is unreadable and you're processing the span of 1 Mbyte of data to recover.)
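The arithmetic in the parenthetical above can be written out directly (a toy model, assuming a single corrupt byte damages exactly one checksum chunk):

```python
def recovery_cost(file_size: int, chunk_size: int) -> tuple[int, int]:
    """Given one corrupt byte somewhere in the file, return
    (bytes immediately readable, bytes to process to repair).
    One corrupt byte damages exactly one checksum chunk."""
    return file_size - chunk_size, chunk_size

MIB = 1024 * 1024
print(recovery_cost(MIB, 16 * 1024))  # 16 KiB chunks in a 1 MiB file
print(recovery_cost(MIB, MIB))        # one 1 MiB chunk covering the file
```

With 16 KiB chunks, 1008 KiB of the 1 MiB file stays readable and only 16 KiB has to be processed for repair; with a single 1 MiB chunk, nothing is readable and the whole megabyte is involved in recovery.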
If you're serious about checksums, you have to verify them on read and always create and update them on writes. This means that you have to operate on the entire checksum chunk size for these operations (even on partial chunk updates, depending on the checksum algorithm). Regardless of how the data is stored on disk, you have to have all of the chunk available in memory to compute and recompute the checksum. So if you want to have a relatively large checksum chunk size in order to keep overhead down, you might as well make this your filesystem block size, because you're forced to do a lot of IO in the checksum chunk size no matter what.
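A sketch of why partial writes still involve the whole chunk, using CRC32 purely as a stand-in checksum (ZFS actually uses stronger checksums such as fletcher4 or SHA-256):

```python
import zlib

CHUNK = 128 * 1024  # a hypothetical 128 KiB checksum chunk

def write_partial(chunk: bytearray, offset: int, data: bytes) -> int:
    """Update a few bytes inside a checksummed chunk and return the
    new checksum.  Even for a small write, the checksum has to be
    recomputed over the entire chunk, so the entire chunk must be in
    memory (a read-modify-write if it wasn't already)."""
    chunk[offset:offset + len(data)] = data
    # The checksum covers all CHUNK bytes, not just the bytes written.
    return zlib.crc32(bytes(chunk))

buf = bytearray(CHUNK)
csum = write_partial(buf, 4096, b"new data")
```

An 8-byte write here still drags the full 128 KiB through the checksum function, which is exactly the IO and CPU pattern the paragraph above describes.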
This is effectively what ZFS does for files that have grown to their full recordsize. The checksum chunk size is recordsize and so is the filesystem (logical) block size; ZFS stores one checksum per recordsize chunk of the file (well, per chunk that actually exists). This keeps the overhead of checksums down nicely, and setting the logical filesystem block size to the checksum chunk size is honest about what IO is actually happening (especially in a copy on write filesystem).
If the ZFS logical block size were always recordsize, this could be a serious problem for small files. Ignoring compression, they would allocate far more space than they need, creating huge amounts of inefficiency (you could have a 4 Kbyte file that had to allocate 128 Kbytes of disk space).
So instead ZFS has what is in effect a variable checksum chunk size for small files, and with it a variable logical block size, in order to store such files reasonably efficiently. As we've seen, ZFS works fairly hard to store only the minimum amount of data it has to for small files (which it defines as files below recordsize).
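A simplified model of this variable logical block size (ignoring compression, assuming the default 128 KiB recordsize and 512-byte sectors; not actual ZFS code):

```python
RECORDSIZE = 128 * 1024  # default ZFS recordsize
SECTOR = 512             # assumed sector size (ashift=9)

def logical_block_size(file_size: int) -> int:
    """Simplified model: a file at or above recordsize uses full
    recordsize blocks; a smaller file gets a single block just big
    enough for it, rounded up to a whole number of sectors."""
    if file_size >= RECORDSIZE:
        return RECORDSIZE
    return ((file_size + SECTOR - 1) // SECTOR) * SECTOR

print(logical_block_size(4 * 1024))    # a 4 KiB file: 4 KiB, not 128 KiB
print(logical_block_size(300 * 1024))  # a large file: 128 KiB blocks
```

In this model the 4 Kbyte file from the earlier example gets a 4 Kbyte block (and a checksum covering just those 4 Kbytes) instead of burning 128 Kbytes of disk space.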
(This model of why ZFS recordsize exists and operates the way it does didn't occur to me until I wrote yesterday's entry, but now that it has, I think I may finally have the whole thing sorted out in my head.)