Why ZFS's raidz design decision is sensible (or at least rational)
Given the downsides covered in yesterday's entry, ZFS's decision to stripe data blocks across all of the disks in a raidz probably sounds rather odd. However, it actually is a sensible decision given ZFS's overall design goals. The easiest way to see why is to contemplate the problems posed by an alternate design.
Suppose that ZFS did not turn a single data block into a 'parity stripe', but instead had what I will call a 'block set': a group of N data blocks plus either one or two parity blocks. Then you would only need to read the full block set if you had to reconstruct data; during normal reads you could read a single data block by itself from a single disk and still verify its checksum.
Now consider how this interacts with the twin ZFS design goals of never updating a data block in place, and not having the RAID-5 write hole problem. When a data block changes, it must be rewritten elsewhere, and so this orphans the old data block. However, you cannot reuse that space, because that would invalidate the block set's parity blocks unless you updated them in place, and updating them in place both breaks a ZFS rule and creates the RAID-5 write hole problem.
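To make the parity problem concrete, here is a minimal sketch using simple XOR parity over a made-up block set of three data blocks and one parity block. The function names and block contents are hypothetical, and real raidz parity is more involved than this; the point is only to show what happens when an orphaned block's space is reused while the parity block is left alone.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks (single parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(blocks, parity, lost):
    """Rebuild blocks[lost] from the surviving blocks plus the parity block."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return xor_parity(survivors + [parity])

# A hypothetical block set: three data blocks and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# With everything intact, a lost block can be rebuilt from the others.
assert reconstruct(data, parity, lost=0) == b"AAAA"

# Block 1 is orphaned and its space is reused for unrelated data,
# but the parity block is not touched (no in-place updates allowed).
reused = [data[0], b"ZZZZ", data[2]]

# Losing block 0 now yields garbage instead of the original contents.
rebuilt = reconstruct(reused, parity, lost=0)
assert rebuilt != b"AAAA"
```

The only ways to keep reconstruction working are to leave the dead block's space alone or to rewrite the parity block, and rewriting it in place is exactly what the design goals rule out.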
So the conclusion is that you can only reclaim the space in a block set when all of the data blocks in the block set are orphaned and dead. As a corollary, you must write all of the block set at the same time, since its parity blocks can never be updated after they are first written. This is not fatal, since it is more or less equivalent to having a data block that is as big as the whole block set. But it certainly would complicate ZFS's design, and I think that skipping that complication is a rational choice.
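The reclamation rule can be sketched as a small bookkeeping class. This is purely illustrative; `BlockSet` and its methods are names I made up for the hypothetical design, not anything in ZFS.

```python
class BlockSet:
    """Tracks which data blocks in a hypothetical block set are still live."""

    def __init__(self, n_data):
        self.live = [True] * n_data

    def orphan(self, index):
        """Mark one data block as dead (its contents were rewritten elsewhere)."""
        self.live[index] = False

    def reclaimable(self):
        """The set's space can be freed only once every data block is dead."""
        return not any(self.live)

    def pinned_dead_blocks(self):
        """Dead blocks whose space is still held because the set is not yet free."""
        return 0 if self.reclaimable() else self.live.count(False)


bs = BlockSet(4)
for i in range(3):
    bs.orphan(i)

# Three of four blocks are dead, but none of that space can be reused yet.
assert not bs.reclaimable()
assert bs.pinned_dead_blocks() == 3

# Only when the last block dies does the whole set become free.
bs.orphan(3)
assert bs.reclaimable()
```

A single long-lived data block is enough to pin the space of every dead block that shares its block set.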
(Because the block set approach has more data blocks, it also uses up more space for metadata; instead of one pointer and checksum for the entire parity stripe, you need N pointers and checksums, one for each data block in the block set.)
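As a rough illustration of the metadata difference, the sketch below counts pointer-and-checksum entries for the same amount of data under each scheme. The function names and scheme labels are hypothetical, and the per-entry size is a parameter you would have to supply; the 128 bytes used here is an assumed figure for illustration only.

```python
def metadata_entries(n_data_blocks, scheme):
    """Count pointer+checksum entries needed to cover n_data_blocks of data."""
    if scheme == "parity_stripe":
        # One block striped across the disks: one pointer and one checksum.
        return 1
    if scheme == "block_set":
        # Each data block is independently addressable and checksummed.
        return n_data_blocks
    raise ValueError(f"unknown scheme: {scheme!r}")

def metadata_bytes(n_data_blocks, scheme, entry_size):
    """Total metadata size, given an assumed size per entry in bytes."""
    return metadata_entries(n_data_blocks, scheme) * entry_size

ENTRY_SIZE = 128  # assumed per-entry size in bytes, for illustration only

stripe_cost = metadata_bytes(4, "parity_stripe", ENTRY_SIZE)
block_set_cost = metadata_bytes(4, "block_set", ENTRY_SIZE)
print(stripe_cost, block_set_cost)
```

Whatever the real per-entry size is, the block set approach multiplies it by N for the same data.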