Why I am really unhappy with ZFS right now: a ZFS import failure
We almost lost a ZFS pool today. More accurately, we did lose a ZFS
pool, and then we were able to get it back because we were lucky enough
to have a Solaris 10 update 8 test machine handy. But except for that
bit of luck, we would be in the very uncomfortable position of telling
an important research group that for the first time ever we'd just lost
nearly a terabyte of data, and it wasn't even because of a hardware
failure, it was because of a software fault. And it wasn't caused by any
mistake we committed, not unless doing '
(Oh sure, we have backups for most of it. One day out of date, and do
you know how long it would take to restore almost a terabyte when it
involves multiple levels of incrementals? And perhaps
That ZFS almost ate a terabyte because it had a snit is only half of why I am really unhappy with ZFS right now. The other half is that ZFS is the perfect example of the new model of Unix systems and system administration, and this new model is busy screwing us.
The new model is non-transparent and tools-less. In the new model of
systems there is no level between 'sysadmin friendly' tools that don't
really tell you anything (such as ordinary
(There are a lot of immediately complicated systems in modern Unixes; it's not just ZFS and Solaris.)
This stands in drastic contrast to the old Unix model for systems, where things came in multiple onion layers and you could peel back more and more layers to get more and more detail. The old model gave you progressive investigation and progressive learning; you could move up, step by step, to a deeper diagnosis and a deeper understanding of the system. The resulting learning curve was a slope, not a cliff.
(Sometimes these layers were implemented as separate programs and sometimes just as one program that gave you progressively more information).
The new model works okay when everything works, or when all you have is monkeys who couldn't diagnose a problem anyways. But it fails utterly when you have real people (not monkeys) with a real problem, because it leaves us high and dry with nothing to do except call vendor support or try increasingly desperate hacks. We don't understand why those hacks work or don't work because, of course, we're getting nothing from the new model's black box except a green or a red light.
(Of course, vendor support often has no better tools or knowledge than we do. If anything they have less, because people with developer level knowledge get stolen from support in order to be made into actual developers.)
Sidebar: the details of what happened
Our production fileservers are Solaris 10 update 6 plus some patches. One of them had a faulted spares situation, so we scheduled a downtime in order to fix the situation by exporting and re-importing every pool. When we exported the first pool, it refused to re-import on the fileserver.
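For context, the repair we were attempting amounts to an export/import cycle on each pool; a minimal sketch of that procedure (with a hypothetical pool name, `tank`) looks like:

```shell
# Check pool health first; faulted spares show up in 'zpool status'.
zpool status tank

# Detach the pool from the system. This releases its devices; the
# pool's data stays intact on disk.
zpool export tank

# Re-import it. ZFS rescans the devices and rebuilds its view of the
# pool, which is what should clear the stale faulted-spare state.
zpool import tank

# Confirm that the spares now show as AVAIL rather than FAULTED.
zpool status tank
```

(These must be run as root, and an export requires that nothing on the machine is using the pool's filesystems, which is why we scheduled a downtime for it.)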
(This was despite the fact that the pool was working fine before being exported, and in fact the fileserver had brought it back up during a reboot not eight hours earlier, after an unplanned power outage due to a UPS failure. Note that since we have a SAN, the UPS power outage didn't touch any of the actual disks the fileserver was using.)
Import attempts reported 'one or more devices is currently unavailable'.
Importing the pool on our test Solaris 10 update 8 machine worked perfectly; all devices in pool vdevs and all spares were present (and now healthy). When we exported it from the test machine and tried to import it on the production fileserver, we had the exact same import error all over again; our S10U6 machine just refused to touch it, despite having been perfectly happy with it less than an hour and a half earlier.
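For anyone facing a similar situation: running `zpool import` with no arguments asks ZFS to scan for importable pools and report the state of each device it finds, which is about as much diagnostic detail as the tool will give you. A sketch of what we were doing (pool name hypothetical):

```shell
# List pools that are visible but not yet imported. This prints each
# pool's vdevs and their states (ONLINE, FAULTED, UNAVAIL, ...).
zpool import

# Attempt the import by name. The -f flag forces it if the pool looks
# like it is still attached to another host, which can happen after
# moving SAN-based pools between machines.
zpool import -f tank
```

Since the pool's metadata lives on the disks themselves, a different machine with access to the same SAN LUNs can attempt the very same import, which is how our S10U8 test machine came into the picture.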
We were very fortunate in that we'd already done enough testing to decide that S10U8 was viable in production (with the machine's current patch set), and that the test machine was in a more or less production-ready state. Left with no choice, we abruptly promoted the S10U8 machine to production status, migrated the entire (virtual) fileserver to it, slapped plaster over various remaining holes, and now hope that nothing explodes tomorrow.