Wandering Thoughts archives


ZFS's recordsize as an honest way of keeping checksum overhead down

One of the classical tradeoffs of using checksums to verify the integrity of something (as ZFS does) is the choice of how large a chunk of data to cover with a single checksum. A large chunk size keeps the checksum overhead down, but it means that you have to process a large amount of data at once in order to verify or create the checksum. A large size also limits how specific you can be about what piece of data is damaged, which is important if you want to be able to recover some of your data.

(Recovery has two aspects. One is simply giving you access to as much of the undamaged data as possible. The other is how much data you have to process in order to heal corrupted data using various redundancy schemes. If you checksum over 16 Kbyte chunks and you have a single corrupted byte in a 1 Mbyte file, you can read 1008 Kbytes immediately and you only have to process the span of 16 Kbytes of data to recover from the corruption. If you checksum over 1 Mbyte chunks and have the same corruption, the entire file is unreadable and you're processing the span of 1 Mbyte of data to recover.)
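The arithmetic in that parenthetical can be sketched directly. This is just an illustration of the tradeoff, assuming one corrupted byte damages exactly one checksum chunk:

```python
# Illustration of the recovery arithmetic above: one corrupted byte
# in a 1 MiB file, under two different checksum chunk sizes.

def recovery_cost(file_size, chunk_size):
    # One corrupted byte makes exactly one checksum chunk unreadable.
    readable = file_size - chunk_size   # bytes readable immediately
    to_process = chunk_size             # span processed to heal the damage
    return readable, to_process

KiB = 1024
print(recovery_cost(1024 * KiB, 16 * KiB))    # 1008 KiB readable, 16 KiB to heal
print(recovery_cost(1024 * KiB, 1024 * KiB))  # nothing readable, 1 MiB to heal
```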

If you're serious about checksums, you have to verify them on reads and always create or update them on writes. This means that you have to operate on the entire checksum chunk for these operations (even on partial chunk updates, depending on the checksum algorithm). Regardless of how the data is stored on disk, you have to have the whole chunk available in memory to compute and recompute the checksum. So if you want a relatively large checksum chunk size in order to keep overhead down, you might as well make it your filesystem block size, because you're forced to do a lot of IO at the checksum chunk size no matter what.
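The read-modify-write cost of a partial update falls out of this. Here is a minimal sketch of the general pattern (not ZFS's actual implementation; the chunk size, in-memory store, and use of SHA-256 are all assumptions for illustration):

```python
import hashlib

CHUNK = 128 * 1024  # assumed checksum chunk / logical block size

def write_partial(chunk_store, checksums, offset, data):
    """A partial write still forces whole-chunk work, because the
    checksum covers the entire chunk."""
    idx = offset // CHUNK
    # Read (or materialize) the whole chunk, even for a tiny write.
    chunk = bytearray(chunk_store.get(idx, bytes(CHUNK)))
    start = offset % CHUNK
    chunk[start:start + len(data)] = data        # modify a few bytes in memory
    chunk_store[idx] = bytes(chunk)
    # Recompute the checksum over the full chunk, not just the new bytes.
    checksums[idx] = hashlib.sha256(chunk_store[idx]).hexdigest()

store, sums = {}, {}
write_partial(store, sums, 4096, b"hello")  # a 5-byte write, 128 KiB of work
```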

This is effectively what ZFS does for files that have grown to their full recordsize. The checksum chunk size is recordsize and so is the filesystem (logical) block size; ZFS stores one checksum for every recordsize chunk of the file (well, every chunk that actually exists). This keeps the overhead of checksums down nicely, and setting the logical filesystem block size to the checksum chunk size is honest about what IO is actually happening (especially in a copy on write filesystem).
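To put a rough number on "keeps the overhead down": assuming a 256-bit (32-byte) checksum per chunk (ZFS checksums actually live in block pointers alongside other metadata, so this counts only the checksum bytes themselves), the space overhead falls quickly as the chunk size grows:

```python
# Rough checksum space overhead per chunk, assuming a 32-byte checksum.
def overhead(chunk_size, checksum_size=32):
    return checksum_size / chunk_size

print(f"{overhead(128 * 1024):.4%}")  # 128 KiB records: about 0.02%
print(f"{overhead(4 * 1024):.4%}")    # 4 KiB blocks: about 0.8%
```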

If the ZFS logical block size were always recordsize, this could be a serious problem for small files. Ignoring compression, they would allocate far more space than they needed, creating huge amounts of inefficiency (you could have a 4 Kbyte file that had to allocate 128 Kbytes of disk space). So instead ZFS has what is in effect a variable checksum chunk size for small files, and with it a variable logical block size, in order to store such files reasonably efficiently. As we've seen, ZFS works fairly hard to only store the minimum amount of data it has to for small files (which it defines as files below recordsize).
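The size rule described above can be sketched as follows. This is a simplification of what ZFS actually does (real allocation granularity depends on ashift, compression, and so on; the 512-byte sector here is an assumption for illustration):

```python
RECORDSIZE = 128 * 1024  # the default recordsize

def logical_block_size(file_size, sector=512):
    """Files below recordsize get a single block sized to the file
    (rounded up to an assumed sector granularity); files at or above
    recordsize use full recordsize blocks."""
    if file_size >= RECORDSIZE:
        return RECORDSIZE
    # round the file size up to the sector granularity
    return max(sector, -(-file_size // sector) * sector)

print(logical_block_size(4 * 1024))    # a 4 KiB file uses a 4 KiB block,
print(logical_block_size(300 * 1024))  # not a full 128 KiB record
```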

(This model of why ZFS recordsize exists and operates the way it does didn't occur to me until I wrote yesterday's entry, but now that it has, I think I may finally have the whole thing sorted out in my head.)

solaris/ZFSRecordsizeAndChecksums written at 02:08:13
