Why I am really unhappy with ZFS right now: a ZFS import failure

May 28, 2010

We almost lost a ZFS pool today. More accurately, we did lose a ZFS pool, and then we were able to get it back because we were lucky enough to have a Solaris 10 update 8 test machine handy. But except for that bit of luck, we would be in the very uncomfortable position of telling an important research group that for the first time ever we'd just lost nearly a terabyte of data, and it wasn't even because of a hardware failure, it was because of a software fault. And it wasn't caused by any mistake we committed, not unless doing 'zpool export' on a working pool with all vdev devices intact and working is a mistake. Which apparently it is, sometimes.

(Oh sure, we have backups for most of it. One day out of date, and do you know how long it would take to restore almost a terabyte when it involves multiple levels of incrementals? And perhaps Sun Oracle support would have been able to get it back for us if the research group could have waited a week or two or more to get their home directories and email back. Hint: no.)

That ZFS almost ate a terabyte because it had a snit is only half of why I am really unhappy with ZFS right now. The other half is that ZFS is the perfect example of the new model of Unix systems and system administration, and this new model is busy screwing us.

The new model is non-transparent and tools-less. In the new model of systems there is no level between 'sysadmin friendly' tools that don't really tell you anything (such as ordinary zpool) and going all of the way down into low-level debuggers (such as zdb) plus reading the fine source code (where available). There is no intermediate level in the new model, no way to get ZFS to tell you what it is doing, what it is seeing, and just why something is wrong. Instead you have your choice of 'something is wrong' or going in head first with developer-level debuggers. The new model either is too complicated to even have intermediate layers as such or just doesn't bother to tell you about them.

(There are a lot of immensely complicated systems in modern Unixes; it's not just ZFS and Solaris.)

This stands in drastic contrast to the old Unix model for systems, where things came in multiple onion layers and you could peel back more and more layers to get more and more detail. The old model gave you progressive investigation and progressive learning; you could move up, step by step, to a deeper diagnosis and a deeper understanding of the system. The resulting learning curve was a slope, not a cliff.

(Sometimes these layers were implemented as separate programs and sometimes just as one program that gave you progressively more information.)

The new model works okay when everything works or when all you have is monkeys who couldn't diagnose a problem anyways. But it fails utterly when you have real people (not monkeys) with a real problem, because it leaves us high and dry with nothing to do except call vendor support or try increasingly desperate hacks where we don't understand why they work or don't work because, of course, we're not getting anything from that new model black box except a green or a red light.

(Of course, vendor support often has no better tools or knowledge than we do. If anything they have less, because people with developer-level knowledge get stolen from support in order to be made into actual developers.)

Sidebar: the details of what happened

Our production fileservers are Solaris 10 update 6 plus some patches. One of them had a faulted spares situation, so we scheduled a downtime in order to fix the situation by exporting and re-importing every pool. When we exported the first pool, it refused to re-import on the fileserver.

(This was despite the fact that the pool was working fine before being exported, and in fact the fileserver had brought it back up during a reboot not eight hours earlier, after an unplanned power outage due to a UPS failure. Note that since we have a SAN, the UPS power outage didn't touch any of the actual disks the fileserver was using.)

Import attempts reported 'one or more devices is currently unavailable'. Running plain zpool import showed a pool configuration that claimed two of the six mirror vdevs had one side faulted with corrupted data, and listed no spares (although it reported that additional missing devices were known to be in the pool configuration). We knew and could verify that all devices listed in the pool configuration were visible on the fileserver.
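For readers who haven't done this, the sequence of commands involved looks roughly like the sketch below. The pool name 'tank' is made up (it is not our pool's name), and the run wrapper with its DRY_RUN switch is purely something added here so the steps can be printed without touching a real pool; only zpool export and zpool import themselves are actual commands.

```shell
#!/bin/sh
# Sketch only: "tank" is a hypothetical pool name, and the DRY_RUN
# wrapper is an illustration device, not part of zpool itself.

run() {
    # With DRY_RUN set, print the command instead of executing it.
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reimport_pool() {
    pool="$1"

    # Export the pool; stop here if the export itself fails.
    run zpool export "$pool" || return 1

    # With no pool name, 'zpool import' only lists the pools it can
    # find and the state of their devices; it does not import anything.
    run zpool import

    # The actual import attempt. This is the step that failed for us.
    run zpool import "$pool"
}

# Example (prints the three commands without running them):
#   DRY_RUN=1 reimport_pool tank
```

Note that nothing in this sequence gives you more detail than the one-line error and the pool configuration listing, which is the whole problem.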

Importing the pool on our test Solaris 10 update 8 machine worked perfectly; all devices in pool vdevs and all spares were present (and now healthy). When we exported it from the test machine and tried to import it on the production fileserver, we had the exact same import error all over again; our S10U6 machine just refused to touch it, despite having been perfectly happy with it less than an hour and a half earlier.

We were very fortunate in that we'd already done enough testing to decide that S10U8 was viable in production (with the machine's current patch set) and that the test machine was in a state where it was more or less production ready. Left with no choice, we were forced to abruptly promote the S10U8 machine to production status and migrate the entire (virtual) fileserver to it, slap plaster over various remaining holes, and hope that nothing explodes tomorrow.

Last modified: Fri May 28 02:00:51 2010