The different ways that you can lose a ZFS pool

May 10, 2011

There are at least three different general ways that you can lose a ZFS pool.

The straightforward way that everyone knows about is for you to lose a top level vdev, ie to lose a non-redundant disk, or all of the disks in a mirror set, or enough disks in a raidzN (two disks for raidz1, three for a raidz2, etc). Losing a chunk of striped or concatenated storage is essentially instant death for basically any RAID system and ZFS is no exception here.

(I don't know and haven't tested if you can recover your pool should you return enough of the missing disks to service, or if ZFS immediately 'poisons' the remaining good disks the moment it notices the problem. I would hope the former, but ZFS has let me down before.)
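
To make the first case concrete, here are some illustrative pool layouts (the pool and device names are placeholders) and what it takes to kill each of them:

   # a plain two-disk stripe: no redundancy, losing either disk loses the pool
   zpool create tank c0t0d0 c0t1d0

   # a two-way mirror: the pool survives until both disks are gone
   zpool create tank mirror c0t0d0 c0t1d0

   # raidz1 over three disks: one disk can die, a second loss kills the pool
   zpool create tank raidz1 c0t0d0 c0t1d0 c0t2d0

   # two mirrors striped together: losing both halves of either mirror kills
   # the whole pool, even though the other mirror vdev is perfectly healthy
   zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0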

The second way, which may have been fixed by now, is to lose a vdev that was a separate ZIL log device (and I think perhaps an L2ARC device) and then reboot or export the pool before removing the dead vdev from the pool configuration. This failure mode is caused by how ZFS validates that it has found the full and correct pool configuration without storing a copy of the pool configuration in the uberblock. Basically, each vdev in the pool has a ZFS GUID and the uberblock has a checksum of all of them together. If you try to assemble a pool with an incomplete set of vdevs, the checksum of their GUIDs will not match the checksum recorded in the uberblock and the ZFS code rejects the attempted pool configuration. This is all well and good until you lose a vdev that holds no actual pool data (such as a ZIL log device) but that ZFS still includes in the uberblock's vdev GUID checksum.

(One unfortunate aspect of this design decision is that ZFS doesn't necessarily know which pieces of your pool are missing. All it knows is that you have an incomplete configuration because the GUID checksums don't match.)
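
To sketch the sequence that bites you here (again with placeholder device names):

   # add a separate ZIL log device to an existing pool
   zpool add tank log c0t4d0
   # ... the log device then dies outright ...
   # export the pool (or reboot) without removing the dead log vdev first
   zpool export tank
   # on import, ZFS checksums the GUIDs of the vdevs it can find; with
   # c0t4d0 gone that checksum no longer matches the one in the uberblock,
   # so the import is refused even though every data vdev is present
   zpool import tank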

The third way is to have corrupted metadata at the top of the pool. There are a number of ways that this can happen, but probably the most common one is running into a ZFS bug that causes it to write incorrect or bad data to disk (you can also accidentally misuse ZFS). I believe that ZFS can recover from a certain amount of damaged metadata that is relatively low down in the ZFS metadata and filesystem tree; you'll lose access to some of your files, but the pool will stay intact. However, if there's damage to something sufficiently close to the root of the ZFS pool metadata, that's it; ZFS throws up its hands (and sometimes panics your machine) despite most of your data being intact and often relatively findable.

(Roughly speaking there are two sorts of metadata damage, destroyed metadata and corrupted metadata. Destroyed metadata has a bad ZFS block checksum; corrupted metadata checksums correctly but has contents that ZFS chokes on, often with kernel panics.)
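
(If the damage is low enough in the tree that the pool stays alive, a scrub plus 'zpool status -v' will at least tell you which files have permanent errors; for example:)

   # look for checksum errors, then list what has permanent (uncorrectable) damage
   zpool scrub tank
   zpool status -v tank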

Update: what I said here about the leading causes of corrupted metadata is probably wrong. See ZFSLosingPoolsWaysII.

These days, ZFS has a recovery method for certain sorts of metadata corruption; how it works and what its limitations are is beyond the scope of this entry.
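
(In brief, and glossing over the limitations: on versions of Solaris and OpenSolaris that have it, this recovery is the -F flag to 'zpool import', which tries to wind the pool back to a slightly older but intact state. A rough sketch:)

   # check whether recovery would succeed without actually modifying the pool
   zpool import -F -n tank
   # do the real recovery, discarding the last few transactions
   zpool import -F tank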


Comments on this page:

From 70.26.88.153 at 2011-05-10 09:54:01:

Starting in ZFSv19 (which is available in Solaris 10 9/10 ("Update 9")) you can remove separate log devices ('slog'):

  • In a mirrored log configuration, you can always detach (unmirror) devices, but as mentioned above, you cannot remove your last unmirrored log device prior to pool version 19.
  • Log devices can be unreplicated or mirrored, but RAIDZ is not supported for log devices.
  • Mirroring the log device is recommended. Prior to pool version 19, if you have an unmirrored log device that fails, your whole pool might be lost or you might lose several seconds of unplayed writes, depending on the failure scenario.
  • In current releases, if an unmirrored log device fails during operation, the system reverts to the default behavior, using blocks from the main storage pool for the ZIL, just as if the log device had been gracefully removed via the "zpool remove" command.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices
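
For example (placeholder device names), on pool version 19 or later:

   # detach one side of a mirrored log device (always possible)
   zpool detach tank c0t5d0
   # remove an unmirrored log device outright (needs pool version 19+)
   zpool remove tank c0t4d0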

You can either replace the missing device or do a "zpool online/clear":

   http://download.oracle.com/docs/cd/E19253-01/819-5461/ghbxs/
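
That is, roughly (with placeholder device names):

   # bring the repaired log device back online and clear the pool's errors
   zpool online tank c0t4d0
   zpool clear tank
   # or substitute a replacement device for the dead one
   zpool replace tank c0t4d0 c0t5d0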

This functionality was added in build 125 in Nevada/OpenSolaris:

   http://hub.opensolaris.org/bin/view/Community+Group+zfs/19

Another useful feature added in S10u9 was pool recovery:

   http://download.oracle.com/docs/cd/E19253-01/819-5461/gjhef/
   http://blogs.oracle.com/video/entry/oracle_solaris_zfs_pool_recovery

The three failure modes are in the official Oracle documentation FWIW:

   http://download.oracle.com/docs/cd/E19253-01/819-5461/gavwg/