Why losing part of a striped RAID is fatal even on smart filesystems
I wrote recently (here) that losing a chunk of striped or concatenated storage was more or less instantly fatal for any RAID system, smart ones like ZFS included. Once you start thinking about it, this is a bit peculiar for smart systems like ZFS. ZFS is generally self-healing, after all; why can't it at least try to heal from this loss, and why can't it organize itself so that this sort of loss is as unlikely as possible to be unrecoverable?
(In ZFS terms, I'm talking about the total loss of one vdev in the pool. This is a different thing from the failure of a RAID-5 or RAID-6 array when enough disks go bad at once.)
In theory, recovery from a chunk loss seems at least possible. Smart filesystems like ZFS already have a well-developed idea of partial damage, where they can identify that certain files or entire directories are damaged or inaccessible, so in theory they could simply mark every piece of the filesystem that depended on the destroyed chunk as damaged and keep going. Of course this might not work, for two different reasons.
First, the filesystem could have important top level metadata on the lost chunk. If you lose metadata, you lose everything under it; if you lose top-level metadata, that's everything in the filesystem. Second, the filesystem could have placed enough data and metadata on the lost chunk that basically everything in the filesystem is damaged to some degree. The extreme situation is classic striping, where any object over a certain small size is distributed over all chunks and so loss of one chunk damages almost all objects in the filesystem.
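To make the striping situation concrete, here is a toy sketch (not real ZFS or RAID code; all names and parameters are invented for illustration) of why classic striping damages almost every file of any size when one chunk is lost. With round-robin striping, any file larger than a few stripe units necessarily has blocks on every chunk:

```python
# Toy model of classic round-robin striping across storage chunks.
# STRIPE_UNIT and CHUNKS are made-up example values, not real defaults.

STRIPE_UNIT = 128 * 1024   # bytes written to one chunk before moving on
CHUNKS = 4                 # number of chunks the stripe spans

def touched_chunks(file_size, start_chunk=0):
    """Return the set of chunks a file of file_size bytes lands on,
    assuming simple round-robin striping starting at start_chunk."""
    units = (file_size + STRIPE_UNIT - 1) // STRIPE_UNIT
    return {(start_chunk + i) % CHUNKS for i in range(min(units, CHUNKS))}

def damaged_by_chunk_loss(file_size, lost_chunk, start_chunk=0):
    """A file is damaged if any of its stripe units sat on the lost chunk."""
    return lost_chunk in touched_chunks(file_size, start_chunk)
```

With these numbers, any file of 512 KiB or more touches all four chunks, so losing any single chunk damages every such file; only files small enough to fit in a few stripe units can escape.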
(If you are lucky, there is a chain of intact metadata that leads to the object so you can at least recover some of the data. But this is getting obscure.)
So, you say, why not change a filesystem to harden it against this sort of thing? The problem there is what this requires. You can get part of the way by having redundant copies of metadata on different chunks, but this still leaves you losing data from many or all sufficiently large files; since the data is the important thing, this may not really get you all that much. To do a really good job, you need to try to isolate the damage of a lost chunk by deliberately not striping file data across multiple chunks. This costs you performance in various ways.
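As a sketch of what this hardening would look like (again invented code, not any real filesystem's allocator), the policy is: replicate metadata onto every chunk, and confine each file's data to a single chunk so a lost chunk takes out only the files that lived on it:

```python
# Sketch of a damage-isolating allocation policy: no data striping,
# metadata replicated everywhere.  All names here are hypothetical.

CHUNKS = 4

def place_file_data(file_id):
    """Put all of one file's data blocks on a single chunk (chosen here
    by trivial hashing), so chunk loss is contained to that chunk's files."""
    return file_id % CHUNKS

def place_metadata():
    """Write metadata redundantly to every chunk, so any surviving chunk
    can still describe the whole filesystem."""
    return set(range(CHUNKS))

def files_lost(file_ids, lost_chunk):
    """With this policy, losing one chunk loses only the files placed on it,
    roughly 1/CHUNKS of them, instead of nearly everything."""
    return [f for f in file_ids if place_file_data(f) == lost_chunk]
```

The performance cost is visible right in `place_file_data`: a single large file can no longer use the combined bandwidth of all the chunks, because all of its I/O goes to one chunk.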
In practice, no one wants their filesystems to do this. After all, if they want this sort of hardening and are willing to live with the performance impact, the simple approach is not to stripe the storage and just make separate filesystems.
(With that said, current smart filesystems could do better. ZFS makes redundant copies of metadata by default, but I believe it still simply gives up if a vdev fails rather than at least trying to let you read what's still there. This is sadly typical of the ZFS approach to problems.)