2011-05-10
An important way to get ZFS metadata corruption
In yesterday's entry on how to lose ZFS pools, I wrote:
The third way is to have corrupted metadata at the top of the pool. There are a number of ways that this can happen, but probably the most common one is running into a ZFS bug that causes it to write incorrect or bad data to disk [...]
I'm pretty sure I'm wrong about that and that there is a much more common way to get corrupt metadata: disk systems that lie to you plus inopportune crashes or powerdowns.
In the ZFS metadata update sequence, when ZFS writes metadata to disk and finishes by updating the uberblock to activate the new metadata, it needs to be sure that the metadata has been written to disk before the uberblock is written. If the uberblock is instead written first, you have a time window where the current uberblock points to 'invalid' metadata, in that it actually points to whatever random old garbage is in the disk sectors that the metadata is about to be written to.
(You also have similar ordering issues with ZFS label updates. And you need all of the metadata to get written out before the uberblock; if any piece of it isn't quite there yet, even something that's just pointed to by other metadata, you have the same problem.)
In theory ZFS uses write barriers and so on to forcefully flush the metadata to disk before it writes the uberblocks (and flushes the uberblocks to disk before it rewrites the disk labels), so the ordering works out right. Also in theory disks don't lie about having actually written things to that spinning rust (or flash memory or whatever). In practice, all sorts of disk systems lie in all sorts of circumstances, and sometimes they are caught out in that lie. When they get caught at exactly the wrong time, you get ZFS metadata corruption.
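To make the required ordering concrete, here's a minimal sketch of a ZFS-style commit sequence. The disk_write() and disk_flush_cache() functions are hypothetical stand-ins for real I/O and cache-flush commands, not anything from the actual ZFS code; the point is only where the flushes have to go.

```c
/* A minimal sketch of the write ordering a ZFS-style commit needs.
 * All names here (disk_write, disk_flush_cache) are hypothetical
 * stand-ins, not the real ZFS functions. */
#include <stdio.h>

/* Pretend primitives; a real implementation issues actual I/O and
 * SCSI/SATA cache-flush commands. */
static void disk_write(const char *what) { printf("write: %s\n", what); }
static void disk_flush_cache(void)       { printf("flush: cache -> media\n"); }

int main(void) {
    /* 1. Write all the new copy-on-write metadata blocks. */
    disk_write("new metadata tree (root MOS and below)");
    /* 2. Barrier: the metadata must be on stable storage before the
     *    uberblock that points to it becomes visible. */
    disk_flush_cache();
    /* 3. Only now is it safe to write the new uberblock. */
    disk_write("uberblock for this txg");
    /* 4. Barrier again before anything that depends on the uberblock
     *    being durable, such as the label updates mentioned above. */
    disk_flush_cache();
    disk_write("updated disk labels");
    return 0;
}
```

If a disk acknowledges the flush in step 2 without actually having written the metadata and then loses power, you get exactly the window described above: a durable uberblock pointing at garbage.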
Given the prevalence of lying disk subsystems, I strongly suspect that this (and not ZFS code bugs) is by far the most common cause of pool loss from damaged ZFS metadata.
(Note that there are all sorts of ways and reasons for your disks to be wrong about this. For example, your expensive RAID card with its battery backed NVRAM might not have spotted that the battery is pretty close to dead.)
A brief summary of how ZFS updates (top-level) metadata
As is common in filesystems, a ZFS pool's metadata and data live in what is essentially a tree; at the top of the tree are the ZFS uberblock and the actual root metaobject set (which is pointed to by the uberblock). Because ZFS is a copy-on-write filesystem, none of this metadata is overwritten in place. Instead, all metadata is written to a new location, all the way up to the uberblock. This is simple for everything except the uberblock: you write the new version of the metadata to some suitable bit of free space, then update its parent to point to the new location. However, the uberblock is the root, with no parent to update and no ability to rove randomly around the free space.
(I mentioned this in passing before, but I never described the details.)
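As an illustration of the general copy-on-write update pattern, here is a toy sketch of the technique (my own simplification, not ZFS's actual code): changing a leaf means writing a new leaf and then rewriting every ancestor up to the root so that it points at the new copy, leaving the old tree intact.

```c
/* A toy illustration of copy-on-write tree updates: nothing is
 * modified in place; a changed node gets a fresh copy and so does
 * every ancestor, all the way up. Not ZFS's actual code. */
#include <stdlib.h>
#include <stdio.h>

struct node {
    struct node *child;   /* a single child keeps the sketch small */
    int payload;
};

/* Return a *new* node that is a copy of n pointing at new_child. */
static struct node *cow_rewrite(const struct node *n, struct node *new_child) {
    struct node *copy = malloc(sizeof(*copy));
    *copy = *n;              /* "write to a fresh location" */
    copy->child = new_child;
    return copy;
}

int main(void) {
    struct node leaf = { NULL,  1 };
    struct node mid  = { &leaf, 2 };
    struct node root = { &mid,  3 };

    /* Change the leaf: new leaf, then new copies of every ancestor. */
    struct node new_leaf  = { NULL, 42 };
    struct node *new_mid  = cow_rewrite(&mid,  &new_leaf);
    struct node *new_root = cow_rewrite(&root, new_mid);

    printf("old leaf: %d, new leaf: %d\n",
           root.child->child->payload, new_root->child->child->payload);
    free(new_mid);
    free(new_root);
    return 0;
}
```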
ZFS does metadata updates, or at least uberblock updates, in what it calls a 'transaction group', which is often abbreviated as 'txg' in ZFS lingo. Each transaction group is numbered in an increasing sequence (which I believe is strictly monotonic, but I don't know for sure), and the ZFS uberblock has the transaction group number as well as a pointer to the current root metaobject set (well, to the transaction group's root metaobject set; it becomes the current root MOS when the transaction group commits).
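For reference, here is a rough C approximation of the uberblock fields involved, based on the ZFS on-disk format documentation. The real structure is uberblock_t in the ZFS source and has more to it than this; in particular the real block pointer is a 128-byte blkptr_t, which I've reduced to a stub here.

```c
/* An approximation of the on-disk uberblock, based on the ZFS
 * on-disk format documentation; the real uberblock_t has more
 * fields and a full 128-byte blkptr_t. */
#include <stdint.h>

typedef struct blkptr_stub {
    uint64_t words[16];        /* stand-in for the real blkptr_t */
} blkptr_stub_t;

typedef struct uberblock_sketch {
    uint64_t ub_magic;         /* 0x00bab10c ("oo-ba-bloc") marks a slot */
    uint64_t ub_version;       /* on-disk format version */
    uint64_t ub_txg;           /* transaction group number of this commit */
    uint64_t ub_guid_sum;      /* sum of vdev GUIDs, used to validate the
                                * pool configuration */
    uint64_t ub_timestamp;     /* when this txg was written */
    blkptr_stub_t ub_rootbp;   /* block pointer to the root metaobject set */
} uberblock_sketch_t;
```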
To get copy-on-write behavior for uberblocks, they are stored not in a single spot but in an array of 128 uberblock slots. Each successive transaction group writes its uberblock to the next slot (wrapping around at the end). When ZFS is bringing up a pool, it locates the current uberblock by scanning all 128 slots to find the valid uberblock with the highest transaction group number, ignoring invalid uberblocks entirely (uberblocks have both magic numbers and checksums).
('Scanning all 128 slots' is a little imprecise as a description, since uberblocks are highly replicated. Each disk in the pool has four copies of this 128 slot array, and I believe that bringing up a pool scans all possible uberblock copies and picks the best one. As I discovered earlier, it's valid for uberblock copies on some devices to be out of date.)
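Putting that together, the selection logic amounts to something like the following sketch. The names are mine, not ZFS's, and I've elided the actual checksum verification; the real selection happens in ZFS's pool import code.

```c
/* A sketch of picking the active uberblock: scan every slot, skip
 * anything with a bad magic number or checksum, keep the valid slot
 * with the highest txg. Names and checks are illustrative only. */
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define UB_SLOTS 128

struct ub {
    uint64_t magic;
    uint64_t txg;
    /* ... checksum and other fields elided ... */
};

/* Hypothetical validity check: magic number plus (elided) checksum. */
static bool ub_valid(const struct ub *u) {
    return u->magic == 0x00bab10cULL /* && checksum_ok(u) */;
}

/* Return the best (highest-txg valid) uberblock, or NULL if none. */
static const struct ub *find_active_ub(const struct ub slots[UB_SLOTS]) {
    const struct ub *best = NULL;
    for (size_t i = 0; i < UB_SLOTS; i++) {
        if (!ub_valid(&slots[i]))
            continue;                  /* garbage or torn slot: skip it */
        if (best == NULL || slots[i].txg > best->txg)
            best = &slots[i];
    }
    return best;
}

int main(void) {
    static struct ub slots[UB_SLOTS];  /* zeroed, so all invalid */
    slots[7].magic = 0x00bab10cULL; slots[7].txg = 100;
    slots[8].magic = 0x00bab10cULL; slots[8].txg = 101;
    const struct ub *best = find_active_ub(slots);
    printf("active txg: %llu\n",
           best ? (unsigned long long)best->txg : 0ULL);
    return 0;
}
```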
(This is all reasonably well known in the heavy-duty ZFS community, but I wanted to write it down in one easy to find spot for my future reference.)
Sidebar: the uberblock versus the root metaobject set
You might wonder why the uberblock is separate from the root metaobject set, since the two are intimately tied together. My understanding is that the split exists so that the uberblock can be a straightforward fixed-size structure while the root MOS can be variable-sized and flexible. Uberblocks are also quite small; the official size is 1 Kbyte (although not all of that space is used). I believe that a root MOS is often much larger than that.
(This 1 Kb size is somewhat unfortunate given that disks are moving to a 4 Kb sector size. You'd like there to be only one uberblock in a given sector, because disks only really write full sectors. With 1k uberblocks on a 4k-sector disk, a write that goes wrong could destroy not just the uberblock you're writing but up to three more, depending on where in the 4k physical sector your write lands.)
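The arithmetic is simple enough to show directly; this little sketch just computes which slots share a physical sector with the one being written:

```c
/* The 1k-uberblock-on-4k-sector problem: with 1 KB slots and 4 KB
 * physical sectors, four consecutive slots share one sector, so a
 * torn write to slot N can take its three sector-mates with it. */
#include <stdio.h>

int main(void) {
    const int slot_size = 1024, sector_size = 4096;
    const int slots_per_sector = sector_size / slot_size;  /* = 4 */

    int slot = 42;                       /* the slot being written */
    int first = (slot / slots_per_sector) * slots_per_sector;
    printf("writing slot %d rewrites the sector holding slots %d..%d\n",
           slot, first, first + slots_per_sector - 1);
    return 0;
}
```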
Sidebar: some useful references for this stuff
- Eric Schrock on ZFS pool import, which winds up talking a bit about how ZFS pool metadata is stored.
- Matthew Ahrens on snapshots and uberblocks.
- a version of the documentation for the ZFS on-disk format [PDF] (also).
- A two-part series on watching uberblock updates with DTrace, part 1 and part 2.
- Max Bruning's ZFS On-Disk Data Walk (or: Where's my Data?) [PDF] (also).
The different ways that you can lose a ZFS pool
There are at least three different general ways that you can lose a ZFS pool.
The straightforward way that everyone knows about is to lose a top-level vdev, i.e. to lose a non-redundant disk, all of the disks in a mirror set, or enough disks in a raidzN (two disks for a raidz1, three for a raidz2, and so on). Losing a chunk of striped or concatenated storage is essentially instant death for basically any RAID system, and ZFS is no exception here.
(I don't know and haven't tested if you can recover your pool should you return enough of the missing disks to service, or if ZFS immediately 'poisons' the remaining good disks the moment it notices the problem. I would hope the former, but ZFS has let me down before.)
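Expressed as code, the survival rule for a single top-level vdev looks something like this sketch (my own simplification; real ZFS tracks vdev state in far more detail):

```c
/* A sketch of per-vdev survival: a raidzN vdev dies once more than N
 * disks are gone, a mirror dies only when every disk is gone, and a
 * plain disk dies with itself. Losing any one top-level vdev loses
 * the whole (striped) pool. Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

enum vdev_kind { VDEV_DISK, VDEV_MIRROR, VDEV_RAIDZ };

/* Does a single top-level vdev survive 'nfailed' dead disks? */
static bool vdev_alive(enum vdev_kind kind, int ndisks, int nfailed,
                       int parity /* the N in raidzN */) {
    switch (kind) {
    case VDEV_DISK:   return nfailed == 0;
    case VDEV_MIRROR: return nfailed < ndisks;   /* any surviving copy works */
    case VDEV_RAIDZ:  return nfailed <= parity;  /* raidzN tolerates N losses */
    }
    return false;
}

int main(void) {
    /* A five-disk raidz1 dies with the second lost disk. */
    printf("raidz1, 1 failed: %s\n",
           vdev_alive(VDEV_RAIDZ, 5, 1, 1) ? "alive" : "dead");
    printf("raidz1, 2 failed: %s\n",
           vdev_alive(VDEV_RAIDZ, 5, 2, 1) ? "alive" : "dead");
    return 0;
}
```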
The second way, which may have been fixed by now, is to lose a vdev that was a separate ZIL log device (and, I think, perhaps an L2ARC device) and then reboot or export the pool before removing the dead vdev from the pool configuration. This failure mode comes from how ZFS validates that it has found the full and correct pool configuration without storing a copy of the pool configuration in the uberblock. Basically, each vdev in the pool has a ZFS GUID and the uberblock has a checksum of all of them together. If you try to assemble a pool with an incomplete set of vdevs, the checksum of their GUIDs will not match the checksum recorded in the uberblock, and the ZFS code rejects the attempted pool configuration. This is all well and good until you lose a vdev that holds no pool data (such as a ZIL log device) but that ZFS still includes in the uberblock's vdev GUID checksum.
(One unfortunate aspect of this design decision is that ZFS doesn't necessarily know which pieces of your pool are missing. All it knows is that you have an incomplete configuration because the GUID checksums don't match.)
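Here's a sketch of that check. In the real uberblock the recorded value is ub_guid_sum (a running sum of vdev GUIDs); everything else here, including the names and numbers, is illustrative.

```c
/* A sketch of the configuration check described above: the uberblock
 * records a sum over the vdev GUIDs, and an assembled configuration
 * is rejected when the sums don't match. Illustrative only; the real
 * recorded value is the uberblock's ub_guid_sum. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Add up the GUIDs of every vdev we managed to find. */
static uint64_t guid_sum(const uint64_t *guids, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += guids[i];    /* unsigned overflow wraps, which is fine */
    return sum;
}

int main(void) {
    /* Pool recorded with three vdevs: two data disks and a log device. */
    uint64_t all[] = { 0x1111, 0x2222, 0x3333 };
    uint64_t recorded = guid_sum(all, 3);

    /* Attempted import with the (dataless) log device missing: */
    uint64_t found[] = { 0x1111, 0x2222 };
    bool ok = (guid_sum(found, 2) == recorded);

    /* The check fails, but it can't say *which* vdev is missing. */
    printf("config %s\n", ok ? "accepted" : "rejected: GUID sum mismatch");
    return 0;
}
```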
The third way is to have corrupted metadata at the top of the pool. There are a number of ways that this can happen, but probably the most common one is running into a ZFS bug that causes it to write incorrect or bad data to disk (you can also accidentally misuse ZFS). I believe that ZFS can recover from a certain amount of damaged metadata that is relatively low down the ZFS metadata and filesystem tree; you'll lose access to some of your files, but the pool will stay intact. However, if there's damage to something sufficiently close to the root of the ZFS pool metadata, that's it; ZFS throws up its hands (and sometimes panics your machine) despite most of your data being intact and often relatively findable.
(Roughly speaking there are two sorts of metadata damage, destroyed metadata and corrupted metadata. Destroyed metadata has a bad ZFS block checksum; corrupted metadata checksums correctly but has contents that ZFS chokes on, often with kernel panics.)
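In code terms, the distinction looks something like this sketch, where the checks are hypothetical stand-ins for ZFS's real validation:

```c
/* "Destroyed" metadata fails its block checksum outright; "corrupted"
 * metadata checksums fine but has contents the consumer chokes on.
 * Both checks here are hypothetical stand-ins. */
#include <stdbool.h>
#include <stdio.h>

enum md_state { MD_OK, MD_DESTROYED, MD_CORRUPTED };

struct md_block {
    bool checksum_ok;     /* stand-in for real checksum verification */
    bool contents_sane;   /* stand-in for structural sanity checks */
};

static enum md_state classify(const struct md_block *b) {
    if (!b->checksum_ok)
        return MD_DESTROYED;   /* bad checksum: the block itself is gone */
    if (!b->contents_sane)
        return MD_CORRUPTED;   /* checksums fine, contents are poison */
    return MD_OK;
}

int main(void) {
    struct md_block destroyed = { false, false };
    struct md_block corrupted = { true,  false };
    printf("destroyed -> %d, corrupted -> %d\n",
           classify(&destroyed), classify(&corrupted));
    return 0;
}
```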
Update: what I said here about the leading causes of corrupted metadata is probably wrong. See ZFSLosingPoolsWaysII.
These days, ZFS has a recovery method for certain sorts of metadata corruption; how it works and what its limitations are is beyond the scope of this entry.