2013-05-27
Our situation with ZFS and 4 Kb physical sector disks
While I wrote up the general state of affairs with ZFS and 'advanced format' disks, I've never described how this affects us specifically. The short version is that we are not in as much trouble as we might otherwise be, because we're running ancient and somewhat under-functional software. You are in the maximal amount of trouble if your version of ZFS refuses to add 4K sector disks to old pools and you have no way to lie to ZFS (or the kernel in general) about what the physical sector size of your disks is. Our situation is mostly the reverse of this.
First, our version of Solaris (Solaris 10 update 8 plus some patches) turns out to be so old that it doesn't even know about physical sector size as distinct from logical sector size. This is good in that it won't even notice that it's mixing dissimilar disks, but bad in that we now have no way of creating new pools or vdevs with ashift=12. Second, our iSCSI target software doesn't export information about the physical sector size of the disks it's making visible, so even if our version of Solaris were aware of 4K disks, it wouldn't see any.
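(As a side note, if you want to know what ashift an existing pool's vdevs wound up with, 'zdb -C <pool>' will show it. Here is a rough sketch of pulling that out programmatically; the default pool name 'tank' is just a placeholder and the parsing simply assumes zdb's usual 'ashift: N' lines.)

    import subprocess
    import sys

    def pool_ashifts(pool):
        # 'zdb -C <pool>' dumps the cached pool configuration, which
        # includes an 'ashift: N' line for each top-level vdev.
        out = subprocess.check_output(["zdb", "-C", pool]).decode()
        ashifts = []
        for line in out.splitlines():
            line = line.strip()
            if line.startswith("ashift:"):
                ashifts.append(int(line.split(":", 1)[1]))
        return ashifts

    if __name__ == "__main__":
        # 'tank' is purely an example pool name.
        pool = sys.argv[1] if len(sys.argv) > 1 else "tank"
        for a in pool_ashifts(pool):
            # ashift is log2 of the sector size: 9 = 512 bytes, 12 = 4 KB.
            print("ashift=%d (%d-byte sectors)" % (a, 2 ** a))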
The upshot of this is that we can freely add 4K disks to our existing pools. The performance impact of this is not currently clear to me, partly because our environment is somewhat peculiar in ways that make me think we'll experience less impact than normal people in this situation. The bad news is that my initial testing on streaming IO shows a visible difference in write performance, although not a huge one (I need to put together a good random write test before I can have opinions on that).
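(A streaming write test doesn't have to be anything fancy. The following is only a rough sketch of the sort of sequential write probe I mean, not our actual test; the target path is made up, and all-zero data will give misleadingly good numbers on a pool with compression turned on.)

    import os
    import sys
    import time

    def stream_write(path, total_mb=2048, chunk_mb=1):
        # Write total_mb of data in chunk_mb-sized writes, fsync at the
        # end, and return the apparent throughput in MB/s.
        chunk = b"\0" * (chunk_mb * 1024 * 1024)
        start = time.time()
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            for _ in range(total_mb // chunk_mb):
                os.write(fd, chunk)
            os.fsync(fd)
        finally:
            os.close(fd)
        return total_mb / (time.time() - start)

    if __name__ == "__main__":
        # The default path is purely an example; point it at a
        # filesystem in the pool you want to test.
        path = sys.argv[1] if len(sys.argv) > 1 else "/tank/scratch/streamtest.dat"
        print("%.1f MB/s" % stream_write(path))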
In the short term, we'll survive if we have to replace 512b disks with 4K disks; some things may run slower but so far it doesn't look like they will be catastrophically slow. In the long term we need to replace the entire fileserver infrastructure and migrate all of the data to new pools created with ashift=12. We'd like to do it before we have to buy too many 4K disks as replacement disks for existing pools.
(We always knew we had to replace the existing hardware and update the software someday, but it used to be less urgent and we expected that we could keep the pools intact and thus do the upgrade with minimal user impact. Our current vague timeline is now to do this sometime in 2014, depending on when we can get money for hardware and so on.)
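(If we wind up moving the data with ZFS itself, the obvious mechanism is 'zfs send' piped into 'zfs recv'. Here is a minimal sketch of that, with made-up pool and filesystem names; a real migration would also need incremental sends to catch up with changes made while the initial copy ran.)

    import subprocess

    def migrate(src_snap, dst_fs):
        # zfs send | zfs recv, with -F so the destination is rolled
        # back to match the incoming stream if necessary.
        send = subprocess.Popen(["zfs", "send", src_snap],
                                stdout=subprocess.PIPE)
        recv = subprocess.Popen(["zfs", "recv", "-F", dst_fs],
                                stdin=send.stdout)
        send.stdout.close()   # so recv sees EOF when send exits
        recv.communicate()
        if send.wait() != 0 or recv.returncode != 0:
            raise RuntimeError("migration of %s to %s failed" % (src_snap, dst_fs))

    if __name__ == "__main__":
        # The names here are made up for illustration.
        subprocess.check_call(["zfs", "snapshot", "oldtank/fs@migrate"])
        migrate("oldtank/fs@migrate", "newtank/fs")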
PS: ZFS continues to be our best option for a replacement fileserver infrastructure, although we don't know what OS it'll be running on. Linux btrfs is the only other possible competitor and it's nowhere near ready yet. Our budget is unlikely to allow us to purchase any canned appliance-like solution.
2013-05-08
Thoughts on when to replace disks in a ZFS pool
One of the morals that you can draw from our near miss that I described in yesterday's entry, where we might have lost a large pool if things had gone a bit differently, is that the right time to replace a disk with read errors is TODAY. Do not wait. Do not put it off because things are going okay and you see no ZFS-level errors after the dust settles. Replace it today because you never know what is going to happen to another disk tomorrow.
Well, maybe. Clearly the maximally cautious approach is to replace a disk any time it reports a hard read error (ie one that is seen at the ZFS layer) or SMART reports an error. But the problem with this for us is that we'd be replacing a lot of disks and at least some of them may be good (or at least perfectly workable). For read errors, our experience is that some but not all of them are transient, in that they don't happen again if you do something like (re)scrub the pool. And SMART error reports seem relatively uncorrelated with actual errors reported by the backend kernels or seen by ZFS.
In theory you could replace these potentially questionable disks, test them thoroughly, and return them to your spares pool if they pass your tests. In practice this would add more and more questionable disks to your spares pool and, well, do you really trust them completely? I wouldn't. This leaves either demoting them to some less important role (if you have one that can use a potentially significant number of disks, and maybe you do) or trying to return them to the vendor for a warranty claim (and I don't know if the vendor will take them back under that circumstance).
I don't have a good answer to this. Our current (new) approach is to replace disks that have persistent read errors. On the first read error we clear the error and schedule a pool scrub; if the disk then reports more read errors (during the scrub, before the scrub, or in the next while after the scrub), it gets replaced.
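(If you wanted to mechanize this policy, it doesn't take much. The following is only a simplistic sketch, not something we actually run; the pool, disk, and spare names are placeholders and the 'zpool status' parsing is deliberately crude.)

    import subprocess

    def read_errors(pool, disk):
        # 'zpool status' device lines look like:
        #   c4t2d0  ONLINE  0  0  0
        # with the READ, WRITE and CKSUM error counts as the last
        # columns.  (Large counts are shown as things like '1.2K',
        # which this simplistic parsing doesn't handle.)
        out = subprocess.check_output(["zpool", "status", pool]).decode()
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[0] == disk:
                return int(fields[2])
        raise ValueError("%s not found in pool %s" % (disk, pool))

    def first_read_error(pool, disk):
        # First offence: clear the error and schedule a scrub.
        subprocess.check_call(["zpool", "clear", pool, disk])
        subprocess.check_call(["zpool", "scrub", pool])

    def repeat_read_error(pool, disk, spare):
        # Second offence: the disk gets replaced with a spare.
        subprocess.check_call(["zpool", "replace", pool, disk, spare])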
(This updates some of our past thinking on when to replace disks. The general discussion there is still valid.)
2013-05-07
How ZFS resilvering saved us
I've said nasty things about ZFS before and I'll undoubtedly say some in the future, but today, for various reasons, I want to take the positive side and talk about how ZFS has saved us. While there are a number of ways that ZFS routinely saves us in the small, there's been one big near miss that stands out.
Our fundamental environment is ZFS pools with vdevs of mirror pairs of disks. This setup costs space but, among other things, it's safe from multi-disk failures unless you lose both sides of a single mirror pair (at which point you've lost a vdev and thus the entire pool). One day we came very close to that: one side of a mirror pair died more or less completely and then, as we were resilvering on to a spare disk, the other side of the mirror started developing read errors. This was especially bad because read errors generally had the effect of locking up this particular fileserver (for reasons we don't understand), and worse, in Solaris 10 update 8 rebooting a locked-up fileserver causes the pool resilver to lose all progress to date and start again from scratch.
ZFS resilvering saved us here in two ways. The obvious way is that it didn't give up on the vdev when the second disk had some read errors. Many RAID systems would have shrugged their shoulders, declared the second disk bad too, and killed the RAID array (losing all data on it). ZFS was both able and willing to be selective, declaring only specific bits bad instead of ejecting the whole disk and destroying the pool.
(We were lucky in that no metadata was damaged, only file contents, and we had all of the damaged files in backups.)
The subtle way is how ZFS let us solve the problem of successfully resilvering the pool despite the fileserver's 'eventually lock up after enough read errors' behavior. Because ZFS told us what the corrupt files were when it found them and because ZFS only resilvers active data, we could watch the pool's status during the resilver, see what files were reported as having unrepairable problems, and then immediately delete them; this effectively fenced the bad spots on the disk off from the fileserver so that it wouldn't trip over them and explode (again). With a traditional RAID system and a whole-device resync it would have been basically impossible to fence the RAID resync away from the bad disk blocks. At a minimum this would have made the resync take much, much longer.
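(We did this watching and deleting by hand, but for illustration, here is a rough sketch of how you could watch for the files that 'zpool status -v' reports as having permanent errors. The pool name is a placeholder, the parsing is simplistic, and anything that deleted files automatically would want far more safety checks than this.)

    import subprocess
    import time

    def damaged_files(pool):
        # File paths with permanent errors are listed at the end of
        # 'zpool status -v' output, after the 'errors:' line.
        out = subprocess.check_output(["zpool", "status", "-v", pool]).decode()
        files = []
        in_errors = False
        for line in out.splitlines():
            if line.startswith("errors:"):
                in_errors = True
                continue
            if in_errors and line.strip().startswith("/"):
                files.append(line.strip())
        return files

    if __name__ == "__main__":
        # 'tank' is an example pool name; poll once a minute and report
        # any newly-listed damaged files so they can be dealt with.
        seen = set()
        while True:
            for path in damaged_files("tank"):
                if path not in seen:
                    seen.add(path)
                    print("newly damaged: %s" % path)
            time.sleep(60)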
The whole experience was very nerve-wracking, because we knew we were only one glitch away from ZFS destroying a very large pool. But in the end ZFS got us through and we were able to tell users that we had very strong assurances that no other data had been damaged by the disk problems.