2012-04-28
ZFS and various sorts of read errors
After I wrote about our experience of transient checksum errors in ZFS here, a commentator wrote (quoting me):
Our experience so far is that checksum errors are always transient and don't reappear after scrubs, so for us they've been a sign of (presumed) software weirdness instead of slowly failing disk drives.
Or there was some bit rot and it was fixed by copying good data from another mirrored drive (or re-creating it via RAIDZ) and replacing the bad data. Isn't that the whole point of checksums and scrubs: go over all the bits to make sure things match?
My view is that the ice is dangerously thin here and that it's safer for us to assume that the checksum failures are not from disk bit rot.
As far as ZFS is concerned there are two sorts of read errors, hard read errors (where the underlying device or storage system reports an error and returns no data) and checksum errors (where the underlying storage claims to succeed but returns data that ZFS can see is incorrect). ZFS covers up both sorts of errors using whatever redundancy the pool (well, the vdev) has, but otherwise it treats them differently; it never attempts to repair read errors (although it's willing to try the read again later) while it immediately repairs bad checksums by rewriting the data in place.
My understanding of modern disks is that on-disk bit rot rarely goes undetected, since the actual on-disk data is protected by pretty good ECC checks (although they're not as strong as ZFS's checksums). When a disk detects a failed ECC (and cannot repair the damage), it returns a hard read error for that sector. You can still have various forms of in-flight corruption (sometimes as the data is being written, which means that the on-disk data is bad but probably passes the drive's ECC); all of these (broadly construed) read errors will result in nominally successful reads but ZFS checksum errors, which ZFS will then fix.
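To make this concrete, here's roughly where the two sorts of errors show up from the command line; 'tank' is just a stand-in pool name, not one of ours:

    # Hard read errors count against the READ column and checksum
    # failures against the CKSUM column; -v also lists any files with
    # errors that could not be repaired.
    zpool status -v tank

    # Once you've looked into the errors, reset the counters so that
    # anything new stands out.
    zpool clear tank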
So the important question is: how many of the checksum errors that one sees are actually real read errors that were not recognized as such, either on-disk bit rot that still passed the drive's ECC checks or in-flight corruption inside the drive, and how many of them are from something else?
I don't know the answer to this, which is why I think the ice is thin. Right now my default assumption is that most or all of the actual drive bit rot is being detected as hard read errors; I make this assumption partly because it's the safer one (since it means that we don't understand the causes of our checksum failures).
PS: ZFS's treatment of read errors means that in some ways you would be better off if you could tell your storage system to lie about them, so that instead of returning an actual error it would just log it and return random data. This would force a checksum error, causing ZFS to rewrite the data, which would force the sector to be rewritten and perhaps spared out.
(Yes, this is kind of a crazy idea.)
Sidebar: the purpose of scrubs
Scrubs do three things: they uncover hard read errors, they find and repair any checksum errors, and at a high level they verify that your data is actually redundant and tell you if it isn't. Because ZFS never rewrites hard read errors, scrubs do not necessarily restore full redundancy. But at least you know (via read errors that persist over repeated scrubs) that you have a potential problem that you need to do something about (ie you need to replace the disk with read errors).
(Because a ZFS scrub only reads live data, you know that any read error is in a spot that is actually being used for current data.)
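Kicking off scrubs and then checking on their results is simple; here is the basic idea (not our actual scripts):

    # Scrub every pool on the system.
    for pool in $(zpool list -H -o name); do
        zpool scrub "$pool"
    done

    # Later, once the scrubs have finished, show only pools that still
    # have problems; errors that persist scrub after scrub are the ones
    # to worry about.
    zpool status -xv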
Sidebar: the redundancy effects of read errors
If your vdevs are only single-redundant, a read error means that that particular piece of data is not redundant at all. If you have multi-way redundancy, eg from raidz2, and you have read errors on multiple disks I don't know if there's any way to know how much redundancy any particular piece of data has left. Note that ZFS does not always write a piece of data to the same offset on all disks, although it usually does.
(If you have multi-way redundancy and read errors on only a single disk, all of your data is still redundant although some of it is more exposed than it used to be.)
2012-04-26
When we replace disks in our ZFS fileserver environment
Recently, someone came here (well, here) as the result of a Google search for [zfs chksum non zero when to replace disk]. As it happens this is an issue that we've faced repeatedly so I can give you our answer. I don't claim that it's the right one but it's mostly worked for us.
First off, we have yet to replace a disk due to ZFS checksum errors. Our experience so far is that checksum errors are always transient and don't reappear after scrubs, so for us they've been a sign of (presumed) software weirdness instead of slowly failing disk drives. If we ever have a disk that repeatedly gets checksum errors we might consider it a sign of slow failure and preemptively replace the disk, but that hasn't happened so far.
The usual sign of a problematic disk here has been one or more persistent read errors. The cautious thing to do when this happens is to immediately replace the disk; for various reasons we don't usually do this if there are only a handful of read errors. Instead we mostly wait until one of three things happens: there are more than a handful of read errors, the read error count is increasing, or it seems that handling the read errors is causing performance issues. For us, this balances the disruption of disk replacement (and the cost of disks) against the risk of serious data loss (and it hasn't blown up in our faces yet).
(Because ZFS doesn't make any attempt to rewrite read errors (although I wish it would), they are basically permanent when they crop up. We do check reported read errors to see if the iSCSI backends are also reporting hard read errors, or if things look like transient problems.)
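How we check is nothing special. On the Solaris side, iostat's error statistics give per-device error counts; on the Linux iSCSI backends we look at what the physical drive itself is reporting (the device name below is a made-up example):

    # On the fileserver (Solaris): soft/hard/transport error counts for
    # every device, including the iSCSI ones.
    iostat -En

    # On the iSCSI backend (Linux): the drive's own view of its health,
    # plus whatever the kernel has logged about it.
    smartctl -a /dev/sdX
    grep sdX /var/log/messages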
So that's my answer: don't replace on ZFS checksum errors unless there's something unusual or persistent about them and only replace on small numbers of read errors if you're cautious (and even then you should check to make sure that the actual disks are reporting persistent read errors). If we ever have hard write errors I expect that we'll replace the disk right away, but that hasn't happened yet.
(Based on our lack of write errors, you can probably guess that we have yet to have a disk die completely on us.)
We never reuse disks that we've pulled and replaced, even if they only had a few read errors. They are always either returned under the warranty or discarded. Yes, in theory they might be fine once those few bad sectors were remapped by being rewritten, but in practice the risk is not worth it.
Sidebar: why disk replacement is disruptive for us
Replacing disks is disruptive both to the sysadmins and to some degree to our users. Partly this is because our pools resilver slowly and with visible IO impact (note that ZFS resilvering is effectively seek limited in many cases and affects the whole pool). In our environment, replacing a physical disk the fully safe way can require up to six resilvers; if we restrict ourselves to one resilver at a time to keep the IO load down, that by itself can easily take all day. Another part of this is because pulling and replacing a disk is a manual procedure that takes a bunch of care and attention; for instance you need to make absolutely sure that you have matched up the iSCSI disk name with the disk that is reporting real errors on the iSCSI backend (despite a confusing mess of Linux names for disks) and then correctly mapped it to a physical disk slot and disk. This is not work that can be delegated (or scripted), so one of the core sysadmins is going to wind up babysitting any disk replacement.
(I'm sure that more upscale environments can just tell the software to turn on the fault light on the right disk drive enclosure and then send a minion to do a swap.)
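For what it's worth, the ZFS side of an actual replacement is the easy part; it's identifying the right disk and waiting out the resilver that costs us. A minimal sketch, with made-up device names standing in for our iSCSI disk names:

    # Tell ZFS to replace the failing disk with the new one, which
    # starts the resilver.
    zpool replace tank c4t12d0 c4t13d0

    # Keep an eye on resilver progress (and the IO impact).
    zpool status tank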
2012-04-06
Why we haven't taken to DTrace
Recently I read Barriers to entry for DTrace adoption (via Twitter). As it happens I have an opinion on this, since we use Solaris and I have done a modest amount of work with DTrace. My belief is that DTrace has between two and three problems, depending on how you look at it.
(Part of our non-use of DTrace is that I once had a bad experience where starting to use DTrace on a production fileserver had immediate and significant bad effects. I've seen DTrace work okay since then but the uncertainty lingers, especially for writing my own DTrace scripts. But that's only a relatively modest part of it.)
First is that it's pretty hard to really use DTrace if you're not familiar with Solaris kernel internals. This issue takes some explanation (unless you've tried to use DTrace, in which case you're probably awfully familiar with it). What it boils down to is that there are really two DTraces, one for extracting subsystem information from the kernel and one for debugging the kernel, and the first one is incomplete.
In theory, DTrace lets you tap into all sorts of documented trace points that Solaris has put into the kernel, extracting a wide variety of interesting state from each of them (you can read the coverage of the various providers in the DTrace documentation). In practice, the Solaris kernel developers have never provided enough trace points with enough state information to be really useful by themselves. Instead they leave you to fall back on the 'kernel debugging' side of DTrace, where you can intercept and trace almost any function and extract random information from kernel memory provided that you know what you're looking for and what it means.
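To illustrate the split, compare a documented, stable provider with the fbt 'kernel debugging' provider. The first one-liner needs no kernel knowledge at all; the second only works because you already know (probably from reading kernel source) that a function called zfs_read exists and is the thing you care about:

    # The documented-provider side: which programs are generating disk IO.
    dtrace -n 'io:::start { @[execname] = count(); }'

    # The kernel-debugging side: hook an arbitrary kernel function by
    # name via fbt, and hope you know what its arguments mean.
    dtrace -n 'fbt::zfs_read:entry { @[execname] = count(); }'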
There are two problems with this (at least from my perspective). The first is that most of the really interesting uses of DTrace require using the kernel debugging DTrace and using the kernel debugging DTrace requires understanding the internals of the kernel. Ideally you need the code, which has always made things a little bit interesting (even before Solaris went closed source, OpenSolaris source did not exactly match Solaris (cf)). The second is that the DTrace documentation has never tried to address this split, instead throwing everything together in one big pile that (the last time I read it) was probably more oriented towards the person doing a deep dive into the kernel than a sysadmin trying to cleverly extract useful information from what trace points there are.
(One sign of the documentation quality is that there is a plethora of blog entries and web sites that try to explain clever DTrace tricks and how to use it to get interesting results. Personally I would like to see the documentation split into at least two parts, one for sysadmins and one for people debugging the kernel.)
Second (or third, depending on how you view the documentation problem) is that the DTrace scripting language has plenty of annoying awkwardness and pointless artificial limitations. These are situations where DTrace can do what you want but it forces you to jump through all sorts of hoops with no assistance; one example I've already mentioned is pulling information from user space. Many of these issues could be fixed with things like macros and other high level language features (or specific support for various higher level operations), but the DTrace authors seem to have deliberately chosen to keep much of the language at a low level. This is a virtue in a system language but DTrace isn't a system language, it's a way of specifying what information you want to extract from the system and when.
(One unkind way to put this is that the DTrace scripting language is mostly oriented around the needs of the people writing the kernel DTrace components instead of the people who are trying to use DTrace. It's easy to see how this happened but it doesn't make it right.)
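As a small example of the hoops around user space: you can't just print a string argument to a system call, you have to explicitly haul it into the kernel yourself with copyinstr() (and then live with the cases where that fails because the page isn't mapped in yet):

    # Print who is opening what; arg0 is a user-space pointer, so it has
    # to be copied in by hand before DTrace will treat it as a string.
    dtrace -n 'syscall::open:entry { printf("%s %s", execname, copyinstr(arg0)); }'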
These issues don't make DTrace impossible to use, and as a demonstration of that lots of people have written lots of very interesting and useful DTrace scripts. But they do significantly raise the barriers to entry for using DTrace; for most serious and interesting uses, you have to be prepared to learn kernel internals and slog through a certain amount of annoyance and make-work. It should not be any surprise that plenty of people haven't had problems that are sufficiently urgent and intractable to cause them to do this.
(It is not just that this stuff has to be learned. It's also that the learning simply takes time, probably significant time, and many people may not have that much time if they're dealing with a non-urgent problem.)
2012-04-02
The problem of ZFS pool and filesystem version numbers
ZFS pools and filesystems have version numbers for a straightforward reason: they let ZFS augment or (carefully) change the on-disk storage format to add new features. Old versions of ZFS will know that they shouldn't touch a pool with a new version because they don't understand all of its metadata; new versions of ZFS will know that some pools can't have new metadata written to them, and so on. All of this is very conventional.
In light of my previous entry on the several OS options for getting ZFS, it's occurred to me that this nice scheme has a little problem. To put it simply: if you have a ZFS pool at version 55, is that the Solaris ZFS version 55, the Illumos version 55, or the FreeBSD version 55?
Right now it is always the Solaris version N, because both Illumos and FreeBSD stopped at the last OpenSolaris ZFS pool version. But this situation may not last forever; someday the Illumos people may well want to make a pool change that is not in Solaris, and they may also not want to reimplement some changes that created new Solaris pool version numbers. In fact the Illumos people may not be able to reimplement some Solaris changes; since Solaris is closed source they don't have source code, and Oracle may not release full documentation for the disk format and so on (or the changes may involve patented technology).
To make the problem worse, ZFS version numbers are a sequence where support for version N implies support for everything in version N-1, N-2, and so on. This means that even if Oracle was feeling friendly it can't just allocate a ZFS pool version for some Illumos change, because it would mean that when Oracle wanted to use version N+1 for its next change it would need to support the Illumos version N change.
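You can see both halves of this on a live system: what pool versions your ZFS code understands (and what each one added), and what version a given pool is actually at ('tank' is again a stand-in name):

    # Every pool version this system's ZFS supports, with a one-line
    # description of what each version added.
    zpool upgrade -v

    # The version that a particular pool is at right now.
    zpool get version tank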
The root cause of this issue is that when Sun designed ZFS version numbers, they intended there to be a single authority for them, ie Sun itself (now Oracle), and a single sequence of features. This single authority and sequence is viable only so long as there is only one version of ZFS, Oracle's. But once ZFS forks, which is what it has effectively done, there is no single authority any more and all of this explodes.
Sidebar: the problem for Illumos
In theory Illumos can half-solve this problem by defining a new ZFS property for the Illumos ZFS version; Illumos pools would then have a base ZFS version number of something or other (possibly set at the last official ZFS version that Illumos supports) plus their own Illumos version number. However, the problem with this is stopping Solaris systems from improperly importing Illumos ZFS pools, because after all Solaris doesn't know anything about the new Illumos version property.
I think that the only way out for Illumos is for them to create their own Illumos pool version property and then set the base ZFS version to some implausibly high value, one that Solaris should never reach. Solaris systems will give the wrong error report, but there's only so much you can do.
(Illumos systems would always report the Illumos version number as the pool version number.)