Why I am really unhappy with ZFS right now: a ZFS import failure
We almost lost a ZFS pool today. More accurately, we did lose a ZFS
pool, and then we were able to get it back because we were lucky enough
to have a Solaris 10 update 8 test machine handy. But except for that
bit of luck, we would be in the very uncomfortable position of telling
an important research group that for the first time ever we'd just lost
nearly a terabyte of data, and it wasn't even because of a hardware
failure, it was because of a software fault. And it wasn't caused by any
mistake we committed, not unless doing 'zpool export' on a working
pool with all vdev devices intact and working is a mistake. Which
apparently it is, sometimes.
(Oh sure, we have backups for most of it. One day out of date, and do
you know how long it would take to restore almost a terabyte when it
involves multiple levels of incrementals? And perhaps Sun (now
Oracle) support would have been able to get it back for us if the
research group could have waited a week or two or more to get their home
directories and email back. Hint: no.)
That ZFS almost ate a terabyte because it had a snit is only half of why I am really unhappy with ZFS right now. The other half is that ZFS is the perfect example of the new model of Unix systems and system administration, and this new model is busy screwing us.
The new model is non-transparent and tools-less. In the new model of
systems there is no level between 'sysadmin friendly' tools that don't
really tell you anything (such as ordinary 'zpool') and going all of the
way down into low-level debuggers (such as 'zdb') plus reading the fine
source code (where available). There is no intermediate level in the
new model, no way to get ZFS to tell you what it is doing, what it is
seeing, and just why something is wrong. Instead you have your choice
of 'something is wrong' or going in head first with developer-level
debuggers. The new model either is too complicated to even have
intermediate layers as such or just doesn't bother to tell you about
them.
(There are a lot of immensely complicated systems in modern Unixes; it's not just ZFS and Solaris.)
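To make the gap concrete, here is more or less the entire range of inspection that ZFS gives you; a minimal sketch, with 'tank' and the device path invented for illustration:

    # the sysadmin-friendly layer: terse one-word verdicts, no reasons
    zpool status -v tank      # ONLINE, DEGRADED, FAULTED, and little more
    zpool import              # list importable pools with a bare-bones config

    # the only other layer: the developer debugger, raw on-disk structures
    zdb -l /dev/dsk/c1t0d0s0  # dump the raw vdev labels from one device

There is nothing between the first two commands and the last one.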
This stands in drastic contrast to the old Unix model for systems, where things came in multiple onion layers and you could peel back more and more layers to get more and more detail. The old model gave you progressive investigation and progressive learning; you could move up, step by step, to a deeper diagnosis and a deeper understanding of the system. The resulting learning curve was a slope, not a cliff.
(Sometimes these layers were implemented as separate programs and sometimes just as one program that gave you progressively more information.)
The new model works okay when everything works, or when all you have is monkeys who couldn't diagnose a problem anyway. But it fails utterly when you have real people (not monkeys) with a real problem, because it leaves us high and dry with nothing to do except call vendor support or try increasingly desperate hacks where we don't understand why they work or don't work, because of course we're not getting anything from that new-model black box except a green or a red light.
(Of course, vendor support often has no better tools or knowledge than we do. If anything they have less, because people with developer level knowledge get stolen from support in order to be made into actual developers.)
Sidebar: the details of what happened
Our production fileservers are Solaris 10 update 6 plus some patches. One of them had a faulted spares situation, so we scheduled a downtime to fix it by exporting and re-importing every pool. When we exported the first pool, it refused to re-import on the fileserver.
(This was despite the fact that the pool was working fine before being exported, and in fact the fileserver had brought it back up during a reboot not eight hours earlier, after an unplanned power outage due to a UPS failure. Note that since we have a SAN, the UPS power outage didn't touch any of the actual disks the fileserver was using.)
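For the record, the procedure itself was nothing exotic; per pool it was more or less the following, with 'tank' standing in for our real pool names:

    zpool export tank    # cleanly release the pool from this host
    zpool import tank    # bring it back in; this is the step that failed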
Import attempts reported 'one or more devices is currently unavailable'.
Running plain 'zpool import' showed a pool configuration that claimed
two of the six mirror vdevs had one side faulted with corrupted data,
and listed no spares (although it reported that additional missing
devices were known to be in the pool configuration). We knew and could
verify that all devices listed in the pool configuration were visible on
the fileserver.
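About the only way to do that verification, given the tooling situation above, is to drop all the way down to zdb and read the vdev labels off each device; roughly something like the following, with a made-up device path:

    # dump the (four) vdev labels stored on the device; a readable label
    # carrying the pool's name and GUID means the device is present and
    # still claims membership in the pool
    zdb -l /dev/dsk/c1t0d0s0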
Importing the pool on our test Solaris 10 update 8 machine worked perfectly; all devices in pool vdevs and all spares were present (and now healthy). When we exported it from the test machine and tried to import it on the production fileserver, we had the exact same import error all over again; our S10U6 machine just refused to touch it, despite having been perfectly happy with it less than an hour and a half earlier.
We were very fortunate in that we'd already done enough testing to decide that S10U8 was viable in production (with the machine's current patch set) and that the test machine was in a state where it was more or less production ready. Left with no choice, we were forced to abruptly promote the S10U8 machine to production status and migrate the entire (virtual) fileserver to it, slap plaster over various remaining holes, and hope that nothing explodes tomorrow.