Thinking about why ZFS only does IO in recordsize blocks, even random IO

April 22, 2018

As I wound up experimentally verifying, in ZFS all files are stored as a single block of varying size up to the filesystem's recordsize, or using multiple recordsize blocks. As is perhaps less well known, a ZFS logical block is the minimum size of IO to a file, both for reads and especially for writes. Since the default recordsize is 128 Kb, this means that many files of interest have recordsize blocks and thus all IO to them is done in 128 Kb units, even if you're only reading or writing a small amount of data.
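To make this concrete, here is a little sketch (not ZFS's actual code; the function name is my own invention) of how a small request expands to a full recordsize block's worth of IO:

```python
RECORDSIZE = 128 * 1024  # the default ZFS recordsize, 128 Kb

def io_span(offset, length, recordsize=RECORDSIZE):
    """Return the (start, end) byte range of the logical blocks that a
    request touching [offset, offset + length) actually covers."""
    start = (offset // recordsize) * recordsize              # round down to a block boundary
    end = -(-(offset + length) // recordsize) * recordsize   # round up to a block boundary
    return start, end

# A 4 Kb read at offset 200000 still spans a full 128 Kb block:
start, end = io_span(200_000, 4096)
print(start, end, end - start)  # 131072 262144 131072
```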

On the one hand, this seems a little bit crazy. The time it takes to transfer 128 Kb over a SATA link is not always something that you can ignore, and on SSDs larger writes can have a real impact. On the other hand, I think that this choice is more or less forced by some decisions that ZFS has made. Specifically, the ZFS checksum covers the entire logical block, and ZFS's data structure for 'where you find things on disk' is also based on logical blocks.

I wrote before about the ZFS DVA, which is ZFS's equivalent of a block number and tells you where to find data. ZFS DVAs are embedded into 'block pointers', which you can find described in spa.h. One of the fields of the block pointer is the ZFS block checksum. Since this is part of the block pointer, it is a checksum over all of the (logical) data in the block, which is up to recordsize. Once a file reaches recordsize bytes in length, all of its blocks are the same size: the recordsize.
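The single-block-then-recordsize-blocks rule can be sketched as follows (again, a simplification under my own names; among other things ZFS actually rounds the small-file block size to sector granularity, which I elide here):

```python
RECORDSIZE = 128 * 1024

def file_blocks(filesize, recordsize=RECORDSIZE):
    """Sketch of ZFS's rule: a file at or below recordsize bytes is a
    single block sized to hold it; beyond that, the file is made up
    entirely of recordsize blocks."""
    if filesize <= recordsize:
        return [filesize]  # one variable-size logical block
    nblocks = -(-filesize // recordsize)  # ceiling division
    return [recordsize] * nblocks

print(file_blocks(20 * 1024))    # [20480] -- one small block
print(file_blocks(300 * 1024))   # [131072, 131072, 131072]
```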

Since the ZFS checksum is over the entire logical block, ZFS has to fetch the entire logical block in order to verify the checksum on reads, even if you're only asking for 4 Kbytes out of it. For writes, even if ZFS allowed you to have different sized logical blocks in a file, you'd need to have the original recordsize block available in order to split it and you'd have to write all of it back out (both because ZFS never overwrites in place and because the split creates new logical blocks, which need new checksums). Since you need to add new logical blocks, you might have a ripple effect in ZFS's equivalent of indirect blocks, where they must expand and shuffle things around.

(If you're not splitting the logical block when you write to only a part of it, copy on write means that there's no good way to do this without rewriting the entire block.)
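Putting the read and write sides together, a partial write under copy-on-write looks roughly like this sketch (the function and the choice of SHA-256 are illustrative; ZFS supports several checksums and this is not its actual code path):

```python
import hashlib

RECORDSIZE = 128 * 1024

def partial_write(block, offset_in_block, data):
    """Sketch of why a small write costs a full-block cycle: the whole
    logical block must be on hand, the patched result gets a new
    checksum computed over all of it, and the whole new block is
    written out to a new location (ZFS never overwrites in place)."""
    assert len(block) == RECORDSIZE
    assert offset_in_block + len(data) <= RECORDSIZE
    new_block = (block[:offset_in_block] + data
                 + block[offset_in_block + len(data):])
    new_checksum = hashlib.sha256(new_block).hexdigest()  # covers the entire logical block
    return new_block, new_checksum

old = bytes(RECORDSIZE)
new, cksum = partial_write(old, 4096, b"x" * 8192)
print(len(new))  # still 131072: an 8 Kb write produced a 128 Kb write
```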

In fact, the more I think about this, the more it seems that having multiple (logical) block sizes in a single file would be the way to madness. There are so many things that get complicated if you allow variable block sizes. These issues can be tackled, but it's simpler not to. ZFS's innovation is not that it insists that files have a single block size, it is that it allows this block size to vary. Most filesystems simply set the block size to, say, 4 Kbytes, and live with how large files have huge indirect block tables and other issues.

(The one thing that might make ZFS nicer in the face of some access patterns where this matters is the ability to set the recordsize on a per-file basis instead of just a per-filesystem basis. But I'm not sure how important this would be; the kind of environments where it really matters are probably already doing things like putting database tables on their own filesystems anyway.)

PS: This feels like an obvious thing once I've written this entry all the way through, but the ZFS recordsize issue has been one of my awkward spots for years, where I didn't really understand why it all made sense and had to be the way it was.

PPS: All of this implies that if ZFS did split logical blocks when you did a partial write, the only time you'd win would be if you then overwrote what was now a single logical block a second time. For example, if you created a big file, wrote 8 Kb to a spot in it (splitting a 128 Kb block into several new logical blocks, including an 8 Kb one for the write you just did), then later wrote exactly 8 Kb again to exactly that spot (overwriting only your new 8 Kb logical block). This is probably obvious too but I wanted to write it out explicitly, if only to convince myself of the logic.
