Revisiting what the ZFS recordsize is and what it does

May 8, 2020

I'm currently reading Jim Salter's ZFS 101—Understanding ZFS storage and performance, and got to the section on ZFS's important recordsize property, where the article attempts to succinctly explain a complicated ZFS specific thing. ZFS recordsize is hard to explain because it's relatively unlike what other filesystems do, and looking back I've never put down a unified view of it in one place.

The simplest description is that ZFS recordsize is the (maximum) logical block size of a filesystem object (a file, a directory, a whatever). Files smaller than recordsize have a single logical block that's however large it needs to be (details here); files of recordsize or larger have some number of recordsize logical blocks. These logical blocks aren't necessarily that large in physical blocks (details here); they may be smaller, or even absent entirely (if you have some sort of compression on and all of the data was zeros), and under some circumstances the physical block can be fragmented (these are 'gang blocks').

(ZFS normally doesn't fragment the physical block that implements your logical block, for various good reasons including that one sequential read or write is generally faster than several of them.)

However, this logical block size has some important consequences because ZFS checksums are for a single logical block. Since ZFS always verifies the checksum when you read data, it must read the entire logical block even if you ask only for a part of it; otherwise it doesn't have all the data it needs to compute the checksum. Similarly, it has to read the entire logical block even when your program is only writing a bit of data to part of it, since it has to update the checksum for the whole block, which requires the rest of the block's data. Since ZFS is a copy on write system, it then rewrites the whole logical block (into however large a physical block it now requires), even if you only updated a little portion of it.

Another consequence is that since ZFS always writes (and reads) a full logical block, it also does its compression at the level of logical blocks (and if you use ZFS deduplication, that also happens on a per logical block basis). This means that a small recordsize will generally limit how much compression you can achieve, especially on disks with 4K sectors.

(Using a smaller maximum logical block size may increase the amount of data that you can deduplicate, but it will almost certainly increase the amount of RAM required to get decent performance from deduplication. ZFS deduplication's memory requirements for good performance are why you should probably avoid it; making them worse is not usually a good idea. Any sort of deduplication is expensive and you should use it only when you're absolutely sure it's worth it for your case.)

Written on 08 May 2020.
« Linux software RAID resync speed limits are too low for SSDs
How big our fileserver environment is (as of May 2020) »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri May 8 23:35:22 2020
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.