A little bit more on ZFS RAIDZ read performance

September 2, 2013

Back in this entry I talked about how all levels of ZFS RAIDZ had an unexpected read performance hit: they can't read less than a full stripe, so instead of the IOPS of N disks you get the IOPS of one disk. Well, it was recently pointed out to me that this is not quite correct. It is true that ZFS reads all of the data chunks in a block's stripe on reads; however, ZFS does not read the parity chunks (unless the block does not checksum correctly and needs to be repaired).

In normal RAIDZ pools the difference between 'all disks' and 'all disks except the parity disks' is small. If the parity for the stripes you're reading is evenly spread over all of the disks, you might get somewhat more than one disk's IOPS in aggregate. Where this can matter is in very small RAIDZ pools, for example a four-disk RAIDZ2 pool. Here half your drives are parity drives for any particular data block and you may get something more like two disks of IOPS.
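To put rough numbers on this, here is a minimal sketch of an idealized model. It assumes that every read busies all of the data disks of its stripe, skips the parity chunks, and that parity is spread perfectly evenly, so on average N / (N - parity) reads can be in flight at once. The function name and the 100 IOPS per-disk figure are my own illustrative assumptions, not anything from ZFS itself, and real pools will not behave this neatly.

```python
def raidz_read_iops_estimate(total_disks, parity_disks, per_disk_iops):
    """Idealized aggregate random-read IOPS for a single RAIDZ vdev.

    Assumes each read touches every data disk of its stripe, never reads
    parity, and that parity placement is evenly spread over all disks.
    """
    if not 0 < parity_disks < total_disks:
        raise ValueError("need 0 < parity_disks < total_disks")
    data_disks = total_disks - parity_disks
    # Each read occupies data_disks out of total_disks drives, so on
    # average total_disks / data_disks reads can proceed in parallel.
    return per_disk_iops * total_disks / data_disks


if __name__ == "__main__":
    # 100 IOPS per disk is an assumed round number for illustration.
    for disks, parity in [(10, 2), (6, 1), (4, 2)]:
        estimate = raidz_read_iops_estimate(disks, parity, per_disk_iops=100)
        print(f"{disks}-disk RAIDZ{parity}: about {estimate:.0f} IOPS")
```

Under these assumptions a ten-disk RAIDZ2 comes out only a little ahead of a single disk, while a four-disk RAIDZ2 comes out at about two disks' worth.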

(A four-disk RAIDZ2 vdev is actually an interesting thing and potentially useful; it's basically a more resilient but potentially slower version of a two-vdev set of mirrors. You lose half of your disk space, as with mirroring, but you can withstand the failure of any two disks (unlike mirroring).)
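The resilience difference can be checked by brute force. The following is a small sketch that enumerates every possible two-disk failure in a four-disk pool and counts how many each layout survives; the disk names and the pairing of disks into mirrors are hypothetical, chosen only for illustration.

```python
from itertools import combinations

DISKS = ["d0", "d1", "d2", "d3"]
# Hypothetical layout: two mirror vdevs of two disks each.
MIRROR_VDEVS = [{"d0", "d1"}, {"d2", "d3"}]


def mirrors_survive(failed):
    """A pool of mirrors survives unless some mirror loses all its disks."""
    return all(not vdev <= failed for vdev in MIRROR_VDEVS)


def raidz2_survives(failed):
    """A RAIDZ2 vdev survives the loss of any two (or fewer) disks."""
    return len(failed) <= 2


def count_survivable(survives, n_failures=2):
    """Count the n_failures-disk failure combinations the layout survives."""
    return sum(
        1 for combo in combinations(DISKS, n_failures) if survives(set(combo))
    )


if __name__ == "__main__":
    total = len(list(combinations(DISKS, 2)))
    print(f"mirrors: {count_survivable(mirrors_survive)} of {total}")
    print(f"raidz2:  {count_survivable(raidz2_survives)} of {total}")
```

Of the six possible two-disk failures, the mirrored layout is lost in the two cases where both halves of the same mirror fail, while the RAIDZ2 vdev survives all six.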

To add some more RAIDZ parity trivia: RAIDZ parity is read and verified during scrubs (and thus likely resilvers), which is what you want. Data block checksums are verified as well, of course, which means that scrub reads genuinely busy all drives.

Sidebar: small write blocks and read IOPS

Another way that you can theoretically get more than one disk's IOPS from a RAIDZ vdev is if the data was written in sufficiently small blocks. As I mentioned in passing here, ZFS doesn't have a fixed 'stripe size' and a small write will only put data (and parity) on fewer than N disks. In turn, reading back this data will need fewer than N (minus parity) disks, meaning that with good luck you can read another small block from the other drives at the same time.

Since 'one sector' is the minimum amount of data to put on a single drive, this is probably much more likely now in the days of disks with 4096-byte sectors than it was on 512-byte sector drives. If you have a ten-disk RAIDZ2 on 4 KB sector disks, for example, it now takes a 32 KB data block to wind up on all 8 possible data drives.

(On 512-byte sector disks it would have only needed a 4 KB data block.)
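The arithmetic here is simple enough to sketch. The helpers below are my own illustration, not ZFS code, and they assume a simplified model where each data disk gets at least one whole sector and things like compression and padding are ignored.

```python
import math


def min_block_for_full_width(sector_bytes, total_disks, parity_disks):
    """Smallest data block that lands on every possible data drive."""
    return (total_disks - parity_disks) * sector_bytes


def data_disks_touched(block_bytes, sector_bytes, total_disks, parity_disks):
    """How many data drives a block of this size is spread over.

    Simplified model: one sector is the smallest unit placed on a drive,
    and a block never uses more than (total - parity) data drives.
    """
    data_disks = total_disks - parity_disks
    sectors = math.ceil(block_bytes / sector_bytes)
    return min(sectors, data_disks)


if __name__ == "__main__":
    # Ten-disk RAIDZ2, as in the example above.
    for sector in (4096, 512):
        size = min_block_for_full_width(sector, total_disks=10, parity_disks=2)
        print(f"{sector}-byte sectors: {size // 1024} KB to span all data drives")
    # An 8 KB block on 4096-byte sector disks only needs two data drives.
    print(data_disks_touched(8192, 4096, total_disks=10, parity_disks=2))
```

In this model an 8 KB block on 4096-byte sector disks touches only two of the eight data drives, which leaves the others free to serve another small read at the same time.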
