Why I am really unhappy with ZFS right now: a ZFS import failure

May 28, 2010

We almost lost a ZFS pool today. More accurately, we did lose a ZFS pool, and then we were able to get it back because we were lucky enough to have a Solaris 10 update 8 test machine handy. But except for that bit of luck, we would be in the very uncomfortable position of telling an important research group that for the first time ever we'd just lost nearly a terabyte of data, and it wasn't even because of a hardware failure; it was because of a software fault. And it wasn't caused by any mistake we committed, not unless doing 'zpool export' on a working pool with all vdev devices intact and working is a mistake. Which apparently it is, sometimes.

(Oh sure, we have backups for most of it. One day out of date, and do you know how long it would take to restore almost a terabyte when it involves multiple levels of incrementals? And perhaps Sun (now Oracle) support would have been able to get it back for us if the research group could have waited a week or two or more to get their home directories and email back. Hint: no.)

That ZFS almost ate a terabyte because it had a snit is only half of why I am really unhappy with ZFS right now. The other half is that ZFS is the perfect example of the new model of Unix systems and system administration, and this new model is busy screwing us.

The new model is non-transparent and tools-less. In the new model of systems there is no level between 'sysadmin friendly' tools that don't really tell you anything (such as ordinary zpool) and going all of the way down into low-level debuggers (such as zdb) plus reading the fine source code (where available). There is no intermediate level in the new model, no way to get ZFS to tell you what it is doing, what it is seeing, and just why something is wrong. Instead you have your choice of 'something is wrong' or going in head first with developer-level debuggers. The new model either is too complicated to even have intermediate layers as such or just doesn't bother to tell you about them.

(There are a lot of immediately complicated systems in modern Unixes; it's not just ZFS and Solaris.)

This stands in drastic contrast to the old Unix model for systems, where things came in multiple onion layers and you could peel back more and more layers to get more and more detail. The old model gave you progressive investigation and progressive learning; you could move up, step by step, to a deeper diagnosis and a deeper understanding of the system. The resulting learning curve was a slope, not a cliff.

(Sometimes these layers were implemented as separate programs and sometimes just as one program that gave you progressively more information).

The new model works okay when everything works or when all you have is monkeys who couldn't diagnose a problem anyways. But it fails utterly when you have real people (not monkeys) with a real problem, because it leaves us high and dry with nothing to do except call vendor support or try increasingly desperate hacks where we don't understand why they work or don't work because, of course, we're not getting anything from that new model black box except a green or a red light.

(Of course, vendor support often has no better tools or knowledge than we do. If anything they have less, because people with developer-level knowledge get stolen from support in order to be made into actual developers.)

Sidebar: the details of what happened

Our production fileservers are Solaris 10 update 6 plus some patches. One of them had a faulted spares situation, so we scheduled a downtime to fix it by exporting and re-importing every pool. When we exported the first pool, it refused to re-import on the fileserver.

(This was despite the fact that the pool was working fine before being exported, and in fact the fileserver had brought it back up during a reboot not eight hours earlier, after an unplanned power outage due to a UPS failure. Note that since we have a SAN, the UPS power outage didn't touch any of the actual disks the fileserver was using.)

Import attempts reported 'one or more devices is currently unavailable'. Running plain zpool import showed a pool configuration that claimed two of the six mirror vdevs had one side faulted with corrupted data, and listed no spares (although it reported that additional missing devices were known to be in the pool configuration). We knew and could verify that all devices listed in the pool configuration were visible on the fileserver.
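
For illustration, here is roughly the sequence involved. The pool name is made up and the annotations paraphrase what we saw; this is not a verbatim transcript:

    zpool export tank    # on the S10U6 fileserver: the export itself succeeded
    zpool import         # scan for importable pools; the configuration it printed
                         #   showed two mirror vdevs with one side marked FAULTED
                         #   with corrupted data, and no spares listed
    zpool import tank    # the actual import attempt; this is what failed with
                         #   'one or more devices is currently unavailable'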

Importing the pool on our test Solaris 10 update 8 machine worked perfectly; all devices in pool vdevs and all spares were present (and now healthy). When we exported it from the test machine and tried to import it on the production fileserver, we had the exact same import error all over again; our S10U6 machine just refused to touch it, despite having been perfectly happy with it less than an hour and a half earlier.

We were very fortunate in that we'd already done enough testing to decide that S10U8 was viable in production (with the machine's current patch set) and that the test machine was more or less production ready. Left with no choice, we abruptly promoted the S10U8 machine to production status, migrated the entire (virtual) fileserver to it, slapped plaster over various remaining holes, and now get to hope that nothing explodes tomorrow.


Comments on this page:

From 77.243.128.133 at 2010-05-28 04:31:40:

You are absolutely right - it's either a dumb tool or a debugger (which is a pain).

Not even a dtrace provider to shed some light on what is going on (and no - the fbt provider does not count since ZFS is very difficult to dtrace without full understanding of both the ZFS internals and the source code ... which you don't have access to on Solaris 10).

We've had a few unpleasant surprises with ZFS as well.

Deleting a 19 TB zvol live-locked the box for 24+ hours with NO WAY to stop zfs destroy (ctrl-c, kill, reboot, etc.) - it just kept on going.

Oh, and once it had finished it did not free up the space, so 19 TB vanished into thin air - our support case with Sun has now dragged on for 1½ months!!

(zdb consistently failed on the box in question, so there was no way around Sun support)

Delete operations in ZFS have always been a pain - if you browse through bugs.opensolaris.org you'll find all sorts of nasty surprises (dataset deletions that don't free up space, live-locked boxes, panics when deleting dedup-enabled stuff, etc.).

Another box panicked when writing to a zvol via iSCSI - repeated scrubbing reported one checksum error. It turned out to be a side effect of bug 6911391, and the only solution according to Sun support was to delete that zvol. (Oh, I am sorry Mr. Customer - I had to delete your backups because our vendor told me to ....)

We are actively moving away from S10 towards OpenSolaris - there we at least have the source code to help us out. With all the crap that Oracle has pulled WRT the OpenSolaris support subscriptions this also means that we now have to buy Oracle HW ... which is a pain ...

From 98.240.197.174 at 2010-05-28 07:49:21:

We had a four day system outage caused by a ZFS bug that mucked up Oracle archive logs after a crash/reboot.

Sol 10 u8 contains many ZFS bug fixes, but from what we can tell, u9 has at least one more critical data-loss-related bug fix.

From 201.95.160.158 at 2010-05-28 10:40:29:

You've been bitten by a software bug in a very specific situation, so what? If it worked in Sol10u8 then you should have upgraded first, or at least simulated the situation in a test Sol10u6 environment.

Since you're running Solaris you probably have a support contract. Drop the bomb in Oracle's lap.

ZFS has probably saved you from more problems than other filesystems would have. I don't buy this story of a "new model". That's bullshit.

By cks at 2010-05-28 11:39:17:

Ahem, we're neither crazy nor stupid. We have done the export-import dance before in both testing and production, with the exact same software version that we're running now. In fact we did it before on this fileserver to fix a previous occurrence of the same problem (and we haven't changed the system since then). We had less evidence that pool export and import works on S10U8 than we had that it works in our S10U6 environment; the only reason we tried it at all is that we were desperate.

As for upgrading: upgrading production systems is a quite involved process, especially when you cannot trust the vendor not to screw things up (because they have before, repeatedly). It is also significantly more risky than redoing a proven procedure.

(As it happens we had already been planning a S10U8 upgrade, which is why we had a test S10U8 machine and migration procedures in place. But when we planned the downtime, we quite rationally decided that doing an upgrade to fix our faulted spares problem would be a lot more risky than repeating a proven procedure.)

Sun support was both slow and essentially useless the last time around. I have no reason to assume that they've improved since then.

You don't have to buy or not buy a story about a new model. The new model is sitting right in front of you; try to get zpool import to describe just what is wrong with a damaged pool configuration or a device it labels as 'FAULTED corrupted data'. You can't. It's a black box.

From 80.47.246.214 at 2010-05-29 07:07:39:

All complex software has bugs - even products like VxVM/VxFS or UFS, which have been on the market much longer, have bugs, and sooner or later someone is bound to be hit by them. ZFS is no exception here, but it works pretty well for most people. In your specific case, due to the lack of details it is hard to tell whether you were hit by a bug in ZFS or some other system component, or whether your issue was due to misconfiguration. I suggest raising the issue with Oracle support so that if there is a bug it will get fixed and won't happen again for you or others.

Re black box - to some degree you are right, but when you really look at other volume managers and filesystems, they are not much better in practice in this respect. However, maybe ZFS could be a little more verbose (when asked) in some failure situations.

From 69.113.211.148 at 2010-05-31 12:27:52:

I'll take dumb tools and debuggers and having to occasionally restore from backup any day over the old way of running two-week-long fscks on 10 TB filesystems and waiting with bated breath for two days while an entire RAID array resynchronizes from scratch instead of performing a ten-second resilver.

The beauty of file-level backups, at least in a reasonably automated system like TSM/CommVault/NetBackup/whatever, is that you don't have to consider the restore an all-or-nothing operation. You can restore the small and important stuff like email first, then restore active research data for whatever people are immediately working on, then restore the near-line archival stuff when you get around to it. If you're running on a system with a small autoloader that can't simultaneously fit your fulls and incrementals, you may want to consider splitting your backup jobs to help your RTO for critical data in this kind of scenario.

--Jeff
