Thinking about why ZFS only does IO in
recordsize blocks, even random IO
As I wound up experimentally verifying,
in ZFS all files are stored as either a single block of varying size
up to the filesystem's
recordsize, or as multiple
recordsize blocks. As is perhaps less well known, a ZFS logical block is the minimum size of IO to a
file, both for reads and especially for writes. Since the default
recordsize is 128 Kb, this means that many files of interest are
stored as 128 Kb blocks and thus all IO to them is done in 128 Kb
units, even if you're only reading or writing a small amount of
data.
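As an illustrative sketch (not real ZFS code; it ignores compression, embedded blocks, and sector-size rounding), the block sizing rule can be written like this:

```python
def zfs_block_layout(file_size, recordsize=128 * 1024):
    # Simplified model of ZFS logical block sizing: a file at or
    # under recordsize is a single block just big enough to hold
    # it; anything larger is stored as full recordsize blocks.
    if file_size <= recordsize:
        return [file_size]
    nblocks = -(-file_size // recordsize)  # ceiling division
    return [recordsize] * nblocks
```

So a 5 Kb file is a single 5 Kb block, while a 300 Kb file is three full 128 Kb blocks, and all IO to the latter happens in 128 Kb units.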
On the one hand, this seems a little bit crazy. The time it takes to transfer 128 Kb over a SATA link is not always something that you can ignore, and on SSDs larger writes can have a real impact. On the other hand, I think that this choice is more or less forced by some decisions that ZFS has made. Specifically, the ZFS checksum covers the entire logical block, and ZFS's data structure for 'where you find things on disk' is also based on logical blocks.
I wrote before about the ZFS DVA, which
is ZFS's equivalent of a block number and tells you where to find
data. ZFS DVAs are embedded into 'block pointers', which you can
find described in spa.h.
One of the fields of the block pointer is the ZFS block checksum.
Since this is part of the block pointer, it is a checksum over all
of the (logical) data in the block, which is up to
recordsize bytes. Once a file reaches
recordsize bytes long,
all of its blocks are the same size, namely the recordsize.
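A toy model makes the consequence concrete (I'm using sha256 here for simplicity; real ZFS defaults to fletcher4, and the real block pointer in spa.h holds far more than a checksum):

```python
import hashlib

def make_block_pointer(logical_block):
    # The checksum lives in the block pointer and covers the
    # entire logical block, not any smaller piece of it.
    return {"checksum": hashlib.sha256(logical_block).digest()}

def read_range(raw_block, block_pointer, offset, length):
    # Verifying the checksum requires the whole logical block,
    # so even a 4 Kb read must fetch all 128 Kb of it first.
    if hashlib.sha256(raw_block).digest() != block_pointer["checksum"]:
        raise IOError("checksum mismatch")
    return raw_block[offset:offset + length]
```

There is simply no checksum over any sub-range of the block, so there is nothing a smaller read could be verified against.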
Since the ZFS checksum is over the entire logical block, ZFS has
to fetch the entire logical block in order to verify the checksum
on reads, even if you're only asking for 4 Kbytes out of it. For
writes, even if ZFS allowed you to have different sized logical
blocks in a file, you'd need to have the original block
available in order to split it, and you'd have to write all of it
back out (both because ZFS never overwrites in place and because
the split creates new logical blocks, which need new checksums).
Since you need to add new logical blocks, you might have a ripple
effect in ZFS's equivalent of indirect blocks, where they must
expand and shuffle things around.
(If you're not splitting the logical block when you write to only a part of it, copy on write means that there's no good way to do this without rewriting the entire block.)
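The copy-on-write constraint can be sketched as follows (a simplification of my own, not ZFS's actual write path):

```python
def cow_partial_write(old_block, offset, new_data):
    # Assemble the full new logical block. Copy-on-write means the
    # whole block is checksummed and written to a new location on
    # disk, even though only len(new_data) bytes actually changed.
    new_block = (old_block[:offset]
                 + new_data
                 + old_block[offset + len(new_data):])
    return new_block
```

A 4 Kb write into a 128 Kb block thus still produces a full 128 Kb block to checksum and write out.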
In fact, the more I think about this, the more it seems that having multiple (logical) block sizes in a single file would be the road to madness. There are so many things that get complicated if you allow variable block sizes. These issues could be tackled, but it's simpler not to. ZFS's innovation is not that it insists that files have a single block size; it's that it allows this block size to vary. Most filesystems simply set the block size to, say, 4 Kbytes, and live with the fact that large files have huge indirect block tables, among other issues.
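Some rough arithmetic (my own, not from the ZFS code) shows the scale of that difference:

```python
def block_pointers_needed(file_size, block_size):
    # One block pointer per logical block; this count is what the
    # indirect-block tree ultimately has to index.
    return -(-file_size // block_size)  # ceiling division

ten_gb = 10 * 1024 ** 3
# At fixed 4 Kb blocks, a 10 GB file needs 2,621,440 block
# pointers; at 128 Kb blocks it needs only 81,920, 32x fewer.
```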
(The one thing that might make ZFS nicer in the face of some access
patterns where this matters is the ability to set the recordsize
on a per-file basis instead of just a per-filesystem basis. But I'm
not sure how important this would be; the kind of environments where
it really matters are probably already doing things like putting
database tables on their own filesystems anyway.)
PS: This feels like an obvious thing once I've written this entry
all the way through, but the ZFS
recordsize issue has been one
of my awkward spots for years, where I didn't really understand why
it all made sense and had to be the way it was.
PPS: All of this implies that if ZFS did split logical blocks when you did a partial write, the only time you'd win would be if you then overwrote what was now a single logical block a second time. For example, if you created a big file, wrote 8 Kb to a spot in it (splitting a 128 Kb block into several new logical blocks, including an 8 Kb one for the write you just did), then later wrote exactly 8 Kb again to exactly that spot (overwriting only your new 8 Kb logical block). This is probably obvious too but I wanted to write it out explicitly, if only to convince myself of the logic.