A read performance surprise with ZFS's raidz and raidz2

September 14, 2008

Sun's ZFS contains a performance surprise for people using its version of RAID-5 and RAID-6, which ZFS calls raidz and raidz2. To understand what is going on, it is necessary to start with some basic ZFS ideas.

One of the things that ZFS worries about is disk corruption. To deal with this, ZFS famously checksums everything that it writes (both filesystem metadata and file data) and then verifies that checksum when you read things back; specifically, ZFS computes and checks a separate checksum for every data block. One consequence of this is that ZFS must always read a whole data block, even if the user level code only asked for a single byte. (This is pretty typical behavior for filesystems, and generally doesn't matter; modern disks care far more about seeks than about the amount of data being transferred.)
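(To make the 'whole block or nothing' point concrete, here is a minimal sketch in Python. It is not ZFS code; the function names, the 128 KB block size, and the use of SHA-256 as the checksum are all assumptions made for illustration.)

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # hypothetical block size, chosen only for this sketch


def write_block(data):
    """Return the block together with a checksum of its entire contents."""
    return data, hashlib.sha256(data).digest()


def read_byte(block, checksum, offset):
    """Return a single byte, but only after verifying the whole block."""
    # The checksum covers every byte, so the full block has to be in hand
    # before even one byte of it can be trusted.
    if hashlib.sha256(block).digest() != checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return block[offset]


block, checksum = write_block(bytes(range(256)) * (BLOCK_SIZE // 256))
```

Asking `read_byte()` for one byte still hashes all of `block`, which is the in-memory analogue of having to read the whole block off disk.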

(You can read more about this here.)

Now we come to the crucial decision ZFS has made for raidz and raidz2: in raidz and raidz2, the data block is striped across all of the disks. Instead of a model where a parity stripe is a bunch of data blocks, each with an independent checksum, ZFS stripes a single data block (and its parity), with a single checksum, across all the disks (or as many of them as necessary).

This is a rational implementation decision, but when combined with the need to verify checksums, it has an important consequence: in a raidz or raidz2 vdev, reading a data block always involves every disk that the block is striped across (which is generally all of them), because ZFS must always verify the data block's checksum, which requires reading all of the data block. This is unlike normal RAID-5 or RAID-6, in which a small enough read will only touch one drive, and it means that having more disks in a ZFS raidz vdev does not increase how many random reads you can do per second.
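(A rough model of the difference, as a Python sketch. Both functions are hypothetical and deliberately simplified: the RAID-5 one ignores parity rotation and just maps chunks to data disks round-robin, and the raidz one assumes a block is laid out one sector per data disk until it runs out.)

```python
def raid5_disks_touched(offset, length, chunk_size, n_data_disks):
    """Data disks a read touches in a conventional chunk-striped array."""
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return {chunk % n_data_disks for chunk in range(first_chunk, last_chunk + 1)}


def raidz_disks_touched(block_size, sector_size, n_data_disks):
    """Data disks a read touches when one block is striped across the disks.

    The requested offset and length do not matter here: the whole block
    must be read to verify its checksum.
    """
    sectors = -(-block_size // sector_size)  # ceiling division
    return set(range(min(sectors, n_data_disks)))


# A 512-byte read from a 4-data-disk array with 64 KB chunks.
print(raid5_disks_touched(4096, 512, 64 * 1024, 4))
# Any read from a 128 KB block on a 4-data-disk raidz.
print(raidz_disks_touched(128 * 1024, 512, 4))
```

In this model the small RAID-5 read lands on a single disk, while the raidz read needs all four; only a block smaller than one sector per disk gets away with fewer.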

(A normal RAID-5 or RAID-6 array has a (theoretical) random read IO capacity equal to the sum of the random read IOPS of each of the disks in the array, and so adding another disk adds its IOPS to your read capacity. A ZFS raidz or raidz2 vdev instead has a capacity equal to the slowest disk's IOPS, and adding another disk does nothing to help. Effectively a raidz vdev gives you a single disk's random read IOPS.)
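(As arithmetic, this is just a sum versus a minimum. The sketch below uses made-up per-disk numbers and hypothetical function names; it models the theoretical ceilings described here, not measured performance.)

```python
def raid5_random_read_iops(disk_iops):
    """Theoretical ceiling: each small read lands on one disk, so rates add up."""
    return sum(disk_iops)


def raidz_random_read_iops(disk_iops):
    """Every block read waits on all of the disks, so the slowest sets the pace."""
    return min(disk_iops)


disks = [100, 100, 100, 100, 95]  # hypothetical random read IOPS per disk

print(raid5_random_read_iops(disks))  # 495
print(raidz_random_read_iops(disks))  # 95
```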

Assuming that you can afford the disk space loss, you can somewhat improve this situation by creating your pools from several smaller raidz or raidz2 vdevs, instead of from one large vdev that has all of the drives. This doesn't get you the same random read IO rate as a normal array does, but at least it will get you a higher rate than a single drive would. (You effectively get one drive's random read rate per vdev.)
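(Here is the tradeoff worked through for a hypothetical set of twelve drives at 100 random read IOPS each, as a small Python sketch. The function names and numbers are invented for illustration, and the model is the same simplified one: each raidz vdev contributes its slowest disk's rate, and the pool adds up its vdevs.)

```python
def pool_random_read_iops(vdevs):
    """Sum over vdevs of the slowest disk's IOPS in each vdev."""
    return sum(min(vdev) for vdev in vdevs)


def parity_disks(vdevs, parity_per_vdev=1):
    """Disks' worth of space given up to parity (1 per raidz vdev, 2 for raidz2)."""
    return len(vdevs) * parity_per_vdev


DISK_IOPS = 100  # hypothetical per-disk random read rate

one_big_vdev = [[DISK_IOPS] * 12]
three_small_vdevs = [[DISK_IOPS] * 4 for _ in range(3)]

for name, layout in (("1 x 12 disks", one_big_vdev),
                     ("3 x 4 disks", three_small_vdevs)):
    print(name, pool_random_read_iops(layout), "IOPS,",
          parity_disks(layout), "disk(s) of parity")
```

Under these assumptions the single wide vdev gives 100 IOPS for one disk of parity, while three narrower vdevs give 300 IOPS for three disks of parity. Both are well short of the 1200 that the same drives could theoretically manage as a conventional array.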

(Credit where credit is due: I didn't discover this on my own; I gathered it more or less by osmosis from various discussions on the ZFS mailing list.)

