How ZFS resilvering saved us

May 8, 2013

I've said nasty things about ZFS before and I'll undoubtedly say some in the future, but today, for various reasons, I want to take the positive side and talk about how ZFS has saved us. While there are a number of ways that ZFS routinely saves us in the small, there's been one big near miss that stands out.

Our fundamental environment is ZFS pools with vdevs of mirror pairs of disks. This setup costs space but, among other things, it's safe from multi-disk failures unless you lose both sides of a single mirror pair (at which point you've lost a vdev and thus the entire pool). One day we came very close to that: one side of a mirror pair died more or less completely and then, as we were resilvering on to a spare disk, the other side of the mirror started developing read errors. This was especially bad because read errors generally had the effect of locking up this particular fileserver (for reasons we don't understand). It was made worse because in Solaris 10 update 8, rebooting a locked-up fileserver causes the pool resilver to lose all progress to date and start again from scratch.

ZFS resilver saved us here in two ways. The obvious way is that it didn't give up on the vdev when the second disk had some read errors. Many RAID systems would have shrugged their shoulders, declared the second disk bad too, and killed the RAID array (losing all data on it). ZFS was both able and willing to be selective, declaring only specific bits bad instead of ejecting the whole disk and destroying the pool.

(We were lucky in that no metadata was damaged, only file contents, and we had all of the damaged files in backups.)

The subtle way is how ZFS let us solve the problem of successfully resilvering the pool despite the fileserver's 'eventually lock up after enough read errors' behavior. Because ZFS told us what the corrupt files were when it found them and because ZFS only resilvers active data, we could watch the pool's status during the resilver, see what files were reported as having unrepairable problems, and then immediately delete them; this effectively fenced the bad spots on the disk off from the fileserver so that it wouldn't trip over them and explode (again). With a traditional RAID system and a whole-device resync it would have been basically impossible to fence the RAID resync away from the bad disk blocks. At a minimum this would have made the resync take much, much longer.
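(As an illustration of the 'watch the pool's status' part, here is a minimal Python sketch of pulling the damaged file list out of `zpool status -v` output. It is not what we actually ran at the time. It assumes the usual 'errors: Permanent errors have been detected in the following files:' section followed by indented paths; the pool name and file paths in the sample text are made up, and the function names are mine. It deliberately only lists files and leaves deleting them to a human.)

```python
import subprocess


def read_status(pool):
    """Run `zpool status -v POOL` and return its output as text."""
    result = subprocess.run(
        ["zpool", "status", "-v", pool],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_permanent_errors(status_text):
    """Return the absolute file paths listed in the 'Permanent errors' section.

    Assumption: entries are indented lines after the 'errors:' header.
    Entries that are not plain paths (for example metadata objects or
    files that only exist in snapshots) are skipped, because there is
    nothing in the live filesystem to delete for them.
    """
    paths = []
    in_errors = False
    for line in status_text.splitlines():
        entry = line.strip()
        if line.startswith("errors:"):
            # Only the 'Permanent errors' form is followed by a file list.
            in_errors = "Permanent errors" in line
            continue
        if entry.startswith("pool:"):
            # The next pool's report begins; the file list is over.
            in_errors = False
            continue
        if in_errors and entry.startswith("/"):
            paths.append(entry)
    return paths


# Illustrative sample only: hypothetical pool name and file paths.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
errors: Permanent errors have been detected in the following files:

        /tank/home/user1/data.bin
        /tank/home/user2/notes.txt
"""

HEALTHY_STATUS = """\
  pool: tank
 state: ONLINE
errors: No known data errors
"""
```

(On a live system you would feed it `read_status("tank")`, with your own pool name, every so often during the resilver and look at whatever new paths turn up.)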

The whole experience was very nerve-wracking, because we knew we were only one glitch away from ZFS destroying a very large pool. But in the end ZFS got us through and we were able to tell users that we had very strong assurances that no other data had been damaged by the disk problems.

Comments on this page:

From at 2013-05-10 04:27:36:

btw: if you have a localized area on a disk which is bad and you can't read it, then take a snapshot first and then remove the files. That way ZFS won't actually deallocate the affected blocks, so even if your apps write new data ZFS won't reuse the affected blocks. Once you replace the bad disk, delete the snapshot.

By cks at 2013-05-10 09:41:51:

I don't believe that snapshots are a good idea in general here. Any resilver also resilvers and thus reads any snapshot, including bad blocks captured by files in that snapshot. In that situation, removing the affected files does no good except half-hiding them from the user.

(Often rewriting the bad sectors will cure them for various reasons; if not, you have a really sick disk.)

Written on 08 May 2013.

Last modified: Wed May 8 00:15:12 2013