2009-02-04
An alarming ZFS status message and what is usually going on with it
Suppose that you have a ZFS pool with redundancy (mirroring or ZFS's
version of RAID 5 or RAID 6), and that someday you run 'zpool status'
and see the alarming output:
status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
(This has been re-linewrapped for my convenience.)
The rest of the zpool status output should have one or more disks with
non-zero CKSUM fields and a final line that reports 'errors: No known
data errors'.
What this really means is usually something like this:
ZFS has detected repairable checksum errors and has repaired them by
rewriting the affected disk blocks. If the errors are from a slowly
failing disk, replace the disk with 'zpool replace'; if they are instead
from temporary problems in the storage system, clear this message and
the error counts with 'zpool clear'. You may wish to check this pool
for other latent errors with 'zpool scrub'.
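In concrete terms, the choice looks something like this (the pool and disk names here are hypothetical stand-ins):

    # if the CKSUM errors point to a disk that really is failing, swap it out:
    zpool replace tank c2t3d0 c5t0d0
    # if they were just a transient glitch in the storage path, reset the counts:
    zpool clear tank c2t3d0
    # either way, it doesn't hurt to look for other latent errors:
    zpool scrub tank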
(I have to admit that Sun's own error explanation page for this is pretty good, too. This is unfortunately somewhat novel, which explains why I didn't look at it before now.)
I assume that ZFS throws up this alarming status message even though it automatically handled the issue because it doesn't want to hide from you that a problem happened. While the problem might just be a temporary glitch (we've seen this a few times on our iSCSI based fileservers), it might instead be an indication of a more serious issue that you should look into, so at the very least you need to know that something happened.
(And even temporary glitches shouldn't happen all that often, or ideally at all; if they do, you have a problem somewhere.)
Sidebar: Our experience with these errors
We've seen a few of these temporary glitches with our iSCSI based
fileservers. So far our procedure to deal with
this is to note down at least which disk had the checksum errors
(sometimes we save the full 'zpool status' output for the pool),
'zpool clear' the errors on that specific disk, and then 'zpool
scrub' the pool. This should normally turn up a clean bill of health;
if it doesn't, I would re-clear and re-scrub and then panic if the
second scrub did not come back clean. (Okay, I wouldn't panic, but I
would replace the disk as fast as possible.)
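In commands, this procedure looks roughly like the following; the pool and disk names are hypothetical stand-ins for whatever 'zpool status' actually reports:

    zpool status -v fs1-tank        # note which disk has the non-zero CKSUM count
    zpool clear fs1-tank c4t12d0    # clear the errors on that specific disk
    zpool scrub fs1-tank            # scrub the pool
    zpool status fs1-tank           # once the scrub finishes, this should
                                    # report 'errors: No known data errors'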
On our fileservers, my suspicions are on the hardware or driver for the
onboard nVidia Ethernet ports. The fileservers periodically report that
they lost and then immediately regained the link on nge0, which is one
of the iSCSI networks, and usually report vhci_scsi_reset warnings
at the same time. Unfortunately, the ever so verbose Solaris fault
manager system does not log when the ZFS checksum errors are detected,
so we can't correlate them to nge0 link resets.
(In contributing evidence, the Linux iSCSI backends, running on very similar hardware, also had problems with their onboard nVidia Ethernet ports under sufficient load.)
2009-02-02
A grumpy remark about Solaris's scalability
For an operating system that is theoretically all about being ready to run on enterprise-sized systems (ie big ones), Solaris 10 software has an awfully bad habit of not dealing well with lots of (iSCSI) disks. I wouldn't be so put out about a single tool having this flaw, but Solaris programmers seem to make this sort of scaling mistake over and over.
The first case I ran into was the version of iscsiadm from the current
version of patch 119091 (for x86), where the programmers made 'iscsiadm
list target -S' stat() every file in /dev/ for every iSCSI LUN.
Since our systems have around 140 LUNs defined and roughly 16,000
non-directory entries in /dev, this works about as well as you'd
expect.
(Fortunately the previous version of 119091, 119091-31, does not have this problem and the -32 version doesn't add any bugfixes that we care about, so we reverted. And yes, this bug has been reported to Sun. Four months ago.)
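If you want to see the stat() storm for yourself, one way on Solaris is to have truss count system calls; this is just an illustration, not something we routinely do:

    # count all system calls made by the slow command; the summary goes to the -o file
    truss -c -o /tmp/iscsiadm-counts iscsiadm list target -S
    # with ~140 LUNs and ~16,000 entries in /dev, the stat/stat64 counts in
    # the summary should be on the order of 140 * 16,000, ie over two million calls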
Today's offender was Solaris Live Upgrade, where lucreate does
something mysterious that causes prtconf to loop repeatedly examining
lots of nodes in /dev. The result is a total stall when attempting to
create a new Live Upgrade boot environment (I let it sit for at least 45
minutes without any apparent progress).
It is possible that the Live Upgrade problem is specific to having lots
of iSCSI targets, but still, didn't it occur to any programmer at Sun
that repeatedly doing any operation to all of /dev or 'all disks in
/dev/' might not be the greatest idea?
Actually, I can answer that: I suspect that they never had the issue occur to them because they're using an abstraction layer and the underside of that abstraction layer has an unfortunate implementation, one that makes sense if you call it once or twice but not if you call it lots. Then the problem is that Sun programmers do not routinely test their programs on systems that are big enough to expose issues like this.
(Alternately, the problem is that they continue to turn out low-level implementations of abstractions that behave catastrophically badly if used repeatedly on big systems. If your sales pitch is 'enterprise ready', you should think about such scalability issues as a matter of course.)
2009-02-01
Understanding ZFS cachefiles in Solaris 10 update 6
Solaris 10 update 6 introduced the new ZFS pool property cachefile,
and with it the idea of ZFS cachefiles. I misunderstood what these were before the S10U6 release, so I feel
like writing down what they are and how you can use them in a failover
environment.
To be able to quickly import a pool without scanning all of the devices
on your system, ZFS keeps a cache of information about the pool and
the devices it's found on. Before S10U6 there was only one such cache on your
system, /etc/zfs/zpool.cache, and ZFS made this serve double duty as
the list of pools to automatically import when the system booted. In
S10U6, things were changed so that pools can specify an alternate ZFS
cachefile instead of the system default one.
(Note that ZFS cachefiles don't contain information about filesystems inside the pools, so they don't change very often.)
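Setting a pool's cachefile is just a property change; a minimal sketch with a hypothetical pool name:

    # move the pool out of the default /etc/zfs/zpool.cache and into its own cachefile
    zpool set cachefile=/var/local/zfs/fs1.cache tank
    # check where a pool's cache information currently lives
    zpool get cachefile tank
    # 'cachefile=none' keeps the pool out of any cachefile at all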
Using an alternate ZFS cachefile has several effects:
- any pool not using the system default cachefile is not automatically imported on boot.
- if you have the cachefile for a pool, you can rapidly import it even if an ordinary 'zpool import' would be achingly slow.
- you can easily (and rapidly) import all pools in a cachefile (with 'zpool import -c cachefile -a').
One tricky note: the cachefile that zpool import uses does not
have to be the same file named by the pool's cachefile property. The
cachefile property only gives the file that is updated when you
change various pool configuration things. Crucially this includes
zpool export; if you export a pool, the pool is removed from its
cachefile.
(This is really annoying if you want to use ZFS cachefiles to speed up importing ZFS pools.)
Cachefiles can be copied from system to system, at least if the systems are x86 ones. (We have no Solaris 10 SPARC systems, so I can't test if it works cross-architecture.)
So one way to set up a failover environment goes like this:
- group pools together, for example all of the pools for a given virtual fileserver, and give them all the same non-default ZFS cachefile, for example /var/local/zfs/fsN.cache.
- replicate every group's ZFS cachefile to every physical fileserver you have; rsync will do. (Remember to explicitly resync after you make a pool configuration change, such as adding devices.)
- when you have to bring up a virtual fileserver on another machine, get all the pools up (and fast) by running 'zpool import -a' on the appropriate cachefile (in addition to higher level failover tasks like bringing up an IP alias).
- on boot, use some external mechanism to decide what virtual fileservers a physical machine owns and then invoke 'zpool import -a' on the appropriate cachefile or cachefiles.
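As a rough sketch in commands (all of the pool names, hostnames, and paths here are made up for illustration):

    # on the fileserver that normally owns virtual fileserver fs1:
    zpool set cachefile=/var/local/zfs/fs1.cache fs1-pool1
    zpool set cachefile=/var/local/zfs/fs1.cache fs1-pool2
    # replicate fs1's cachefile to the other physical fileservers:
    rsync -a /var/local/zfs/fs1.cache otherserver:/var/local/zfs/
    # on whichever machine has to take over fs1, import all of its pools at once:
    zpool import -c /var/local/zfs/fs1.cache -a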
The one gotcha is that because of the effects of zpool export,
bringing down a virtual fileserver in an orderly way can't really
involve exporting its pools, or at least requires tricking ZFS a lot. (I
think that you would want to copy the pre-shutdown ZFS cachefile
somewhere before all of the exports, then copy it back afterwards.)
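For example, the copy-aside trick might look like this (hypothetical names again):

    # preserve the cachefile before 'zpool export' starts removing pools from it
    cp -p /var/local/zfs/fs1.cache /var/local/zfs/fs1.cache.saved
    zpool export fs1-pool1
    zpool export fs1-pool2
    # put the pre-shutdown cachefile back so it can still be used for a fast import later
    cp -p /var/local/zfs/fs1.cache.saved /var/local/zfs/fs1.cache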
If you just want fast pool imports for emergency failover and the
only ZFS pools you have are on shared storage, you don't even need
to set up alternate ZFS cachefiles for your ZFS pools; it's enough
to make sure that every system has a copy of every other system's
/etc/zfs/zpool.cache file under some convenient name.
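A minimal version of this, assuming the default cachefiles and made-up hostnames:

    # on fileserver fs1, give fs2 a copy of fs1's default cachefile
    rsync -a /etc/zfs/zpool.cache fs2:/var/local/zfs/fs1-zpool.cache
    # later, on fs2, if fs1 dies, bring up fs1's pools quickly from that copy
    zpool import -c /var/local/zfs/fs1-zpool.cache -a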
(Once we upgrade to S10U6 on all of our fileservers, we will probably do at least this, just as a general precaution.)