An important way to get ZFS metadata corruption
In yesterday's entry on how to lose ZFS pools, I wrote:
The third way is to have corrupted metadata at the top of the pool. There are a number of ways that this can happen, but probably the most common one is running into a ZFS bug that causes it to write incorrect or bad data to disk [...]
I'm pretty sure I'm wrong about that and that there is a much more common way to get corrupt metadata: disk systems that lie to you plus inopportune crashes or powerdowns.
In the ZFS metadata update sequence, ZFS writes new metadata to disk and then finishes by updating the uberblock to activate that metadata. For this to be safe, ZFS needs to be sure that the metadata has actually been written to disk before the uberblock is written. If the uberblock is instead written first, you have a time window where the current uberblock points to 'invalid' metadata, in that it actually points to whatever random old garbage was in the disk sectors that the metadata is about to be written to.
(You also have similar ordering issues with ZFS label updates. And you need all of the metadata to get written out before the uberblock; if any piece of it isn't quite there yet, even something that's just pointed to by other metadata, you have the same problem.)
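To make the ordering issue concrete, here is a toy sketch in Python. It is not ZFS code and nothing in it reflects the real on-disk format; the sector names, the transaction group labels, and the helper functions are all made up for illustration. It simply crashes the 'disk' after every possible number of completed writes and checks whether the uberblock points at the metadata it claims to.

```python
def state_after_crash(order, writes_completed):
    """Return the toy disk's contents if power is lost after
    `writes_completed` of the writes in `order` have reached the disk."""
    # Hypothetical starting point: txg 1 is live, and sector 11 holds
    # leftover junk that the new metadata is about to overwrite.
    disk = {
        "uberblock": ("txg 1", "sector 10"),
        "sector 10": "metadata for txg 1",
        "sector 11": "old garbage",
    }
    new_writes = {
        "metadata": ("sector 11", "metadata for txg 2"),
        "uberblock": ("uberblock", ("txg 2", "sector 11")),
    }
    for name in order[:writes_completed]:
        location, value = new_writes[name]
        disk[location] = value
    return disk


def is_consistent(disk):
    """True if the uberblock points at metadata for its own txg."""
    txg, sector = disk["uberblock"]
    return disk[sector] == "metadata for " + txg


def crash_results(order):
    """Check consistency for a crash after 0, 1, ... len(order) writes."""
    return [is_consistent(state_after_crash(order, n))
            for n in range(len(order) + 1)]


print(crash_results(["metadata", "uberblock"]))  # [True, True, True]
print(crash_results(["uberblock", "metadata"]))  # [True, False, True]
```

With the metadata written first, every crash point leaves a usable pool (either the old state or the new one). With the uberblock written first, there is exactly one crash point where the uberblock points at the old garbage, which is the window described above.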
In theory ZFS uses write barriers and so on to forcefully flush the metadata to disk before it writes the uberblocks (and flushes the uberblocks to disk before it rewrites the disk labels), so the ordering works out right. Also in theory disks don't lie about having actually written things to that spinning rust (or flash memory or whatever). In practice, all sorts of disk systems lie in all sorts of circumstances, and sometimes they are caught out in that lie. When they get caught at exactly the wrong time, you get ZFS metadata corruption.
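Here is a small Python model of how a lying disk defeats the flushes. Everything in it is an assumption made for illustration: the `Disk` class, its `honors_flush` switch, the sector numbers, and the `commit` function are hypothetical stand-ins, not how ZFS or any real disk firmware is written. The point is only that the same careful write, flush, write, flush sequence gives different results depending on whether a flush really makes data durable.

```python
UBERBLOCK = 0  # hypothetical sector number for the toy uberblock


class Disk:
    """Toy disk with a volatile write cache (not a model of real firmware)."""

    def __init__(self, honors_flush):
        self.honors_flush = honors_flush
        self.platter = {}  # durable sector contents
        self.cache = {}    # writes acknowledged but not yet durable

    def write(self, sector, data):
        self.cache[sector] = data

    def flush(self):
        # An honest disk makes every cached write durable before returning.
        # A lying disk reports success and does nothing.
        if self.honors_flush:
            self.platter.update(self.cache)
            self.cache.clear()

    def crash(self, survivors):
        # Power loss: only the cached sectors listed in `survivors`
        # happened to reach the platter; everything else is lost.
        for sector in survivors:
            if sector in self.cache:
                self.platter[sector] = self.cache[sector]
        self.cache.clear()


def commit(disk, txg, meta_sector):
    """Write metadata, flush, then write the uberblock and flush again."""
    disk.write(meta_sector, ("metadata", txg))
    disk.flush()  # barrier: metadata must be durable before the uberblock
    disk.write(UBERBLOCK, ("uberblock", txg, meta_sector))
    disk.flush()


def pool_is_consistent(disk):
    """True if the durable uberblock points at durable metadata for its txg."""
    _, txg, sector = disk.platter[UBERBLOCK]
    return disk.platter.get(sector) == ("metadata", txg)


def simulate(honors_flush):
    disk = Disk(honors_flush)
    # Durable starting state: txg 1 is live, sector 11 holds old junk.
    disk.platter[UBERBLOCK] = ("uberblock", 1, 10)
    disk.platter[10] = ("metadata", 1)
    disk.platter[11] = "old garbage"
    commit(disk, txg=2, meta_sector=11)
    # Inopportune power loss where only the uberblock write made it out
    # of the cache (an honest disk has nothing left in its cache here).
    disk.crash(survivors=[UBERBLOCK])
    return pool_is_consistent(disk)


print("honest disk:", simulate(honors_flush=True))   # honest disk: True
print("lying disk:", simulate(honors_flush=False))   # lying disk: False
```

The code that does the commit is identical in both runs. The only difference is whether the disk told the truth about the flush, which is why this sort of corruption can show up without any bug in ZFS itself.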
Given the prevalence of lying disk subsystems, I strongly suspect that this (and not ZFS code bugs) is by far the most common cause of pool loss from damaged ZFS metadata.
(Note that there are all sorts of ways and reasons for your disks to be wrong about this. For example, your expensive RAID card with its battery-backed NVRAM might not have spotted that the battery is pretty close to dead.)