An additional small detail of how writes work on ZFS raidzN pools

March 14, 2016

Back in How writes work on ZFS raidzN pools I wrote up how ZFS doesn't always do what's usually called 'full stripe writes', unlike normal RAID-5/6/etc systems. This matters because if you write data in small chunks you can use up more space than you expect, especially on 4k physical sector size disks (apparently zvols with a 4K or 8K record size are especially terrible for this; see eg this ZFS on Linux issue report).

Recently, I was reading Matthew Ahrens' ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ and learned another small but potentially important detail about how ZFS does raidzN writes. It turns out that ZFS requires all allocations on a raidzN vdev to be multiples of N+1 blocks, so it rounds everything up to the nearest N+1 block boundary. This holds regardless of how many disks are in the vdev; on a raidz2 pool, for example, ZFS can allocate 9 blocks or 12 blocks for a single write but never 10 or 11 blocks.

(Note that this is the allocation size including the raidzN parity blocks, not the user level data alone.)
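This rounding is simple enough to sketch in a few lines of Python (the function name is mine, not ZFS's; this just illustrates the arithmetic, not the actual ZFS code):

```python
def raidz_alloc_size(blocks, nparity):
    """Round a raidzN allocation (data plus parity blocks) up to the
    nearest multiple of N+1, as ZFS does regardless of vdev width."""
    unit = nparity + 1
    return -(-blocks // unit) * unit  # ceiling division, then scale back up

# raidz2 (N=2): allocations are always multiples of 3
print(raidz_alloc_size(9, 2))   # 9 is already a multiple of 3, so stays 9
print(raidz_alloc_size(10, 2))  # 10 rounds up to 12
print(raidz_alloc_size(11, 2))  # 11 rounds up to 12
```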

At first this might seem kind of crazy, but as Matthew Ahrens explains, it more or less makes sense. The minimum write size in a raidzN pool is one data block plus N parity blocks, ie N+1 blocks in total. By rounding all allocations up to a multiple of this size, ZFS makes life somewhat easier on itself; any chunk of free space is itself a multiple of N+1 blocks and so is always guaranteed to fit at least one minimum-sized write. No matter how things are allocated and freed, ZFS will never be left with 'runt' free space that is too small to be used.

(This is free space as ZFS sees it, ie free space in a space map, which is what ZFS scans when it wants to allocate space. There will be some amount of irregular space that is 'free' in the sense that it holds no data, because it's padding from rounded-up allocations, but ZFS doesn't have to keep track of it as free space. Instead ZFS just ignores it entirely, or more exactly marks it as used space.)

As with partial stripe writes, this does interact with 4k sector drives to potentially use more space, especially for higher raidzN settings. However, how much extra space gets used is going to be very dependent on what size your writes are.
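To get a rough feel for this, here's a sketch of the arithmetic assuming the model from my earlier entry: a write of D sectors on a raidzN vdev of W disks gets N parity sectors for every row of up to W-N data sectors, and the total is then rounded up to a multiple of N+1. This is my simplified model of the behavior, not the actual ZFS code:

```python
import math

def raidz_write_size(data_sectors, ndisks, nparity):
    """Estimate total sectors allocated for one write on a raidzN vdev:
    nparity parity sectors per row of (ndisks - nparity) data sectors,
    with the grand total rounded up to a multiple of nparity+1."""
    rows = math.ceil(data_sectors / (ndisks - nparity))
    total = data_sectors + rows * nparity
    unit = nparity + 1
    return math.ceil(total / unit) * unit

# A 6-disk raidz2 vdev with 4k sectors:
# a 16 KB write is 4 data sectors: 4 data + 2 parity = 6, already a multiple of 3
print(raidz_write_size(4, 6, 2))  # 6
# an 8 KB write is 2 data sectors: 2 data + 2 parity = 4, rounded up to 6
print(raidz_write_size(2, 6, 2))  # 6
```

Note that in this model the 8 KB write winds up allocating just as many sectors as the 16 KB one, which is exactly the kind of write-size-dependent overhead I mean.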

(The good news is that minimum-sized objects won't experience any extra space usage as a result of this, since they're already one data block plus N parity blocks.)
