Wandering Thoughts archives

2010-05-28

Why I am really unhappy with ZFS right now: a ZFS import failure

We almost lost a ZFS pool today. More accurately, we did lose a ZFS pool, and then we were able to get it back because we were lucky enough to have a Solaris 10 update 8 test machine handy. Without that bit of luck, we would now be in the very uncomfortable position of telling an important research group that for the first time ever we'd just lost nearly a terabyte of data, and not even because of a hardware failure; it was because of a software fault. And it wasn't caused by any mistake we committed, not unless doing 'zpool export' on a working pool with all vdev devices intact and working is a mistake. Which apparently it is, sometimes.

(Oh sure, we have backups for most of it. They're one day out of date, and do you know how long it would take to restore almost a terabyte when it involves multiple levels of incrementals? And perhaps Sun (now Oracle) support would have been able to get it back for us, if the research group could have waited a week or two or more to get their home directories and email back. Hint: no.)

That ZFS almost ate a terabyte because it had a snit is only half of why I am really unhappy with ZFS right now. The other half is that ZFS is the perfect example of the new model of Unix systems and system administration, and this new model is busy screwing us.

The new model is non-transparent and tool-less. In the new model of systems there is no level between 'sysadmin-friendly' tools that don't really tell you anything (such as ordinary zpool) and going all of the way down into low-level debuggers (such as zdb) plus reading the fine source code (where available). There is no intermediate level in the new model, no way to get ZFS to tell you what it is doing, what it is seeing, and just why something is wrong. Instead you have your choice of 'something is wrong' or going in head first with developer-level debuggers. Either the new model is too complicated to even have intermediate layers as such, or it just doesn't bother to tell you about them.

(There are a lot of such complicated systems in modern Unixes; it's not just ZFS and Solaris.)

This stands in drastic contrast to the old Unix model for systems, where things came in multiple onion layers and you could peel back more and more layers to get more and more detail. The old model gave you progressive investigation and progressive learning; you could move up, step by step, to a deeper diagnosis and a deeper understanding of the system. The resulting learning curve was a slope, not a cliff.

(Sometimes these layers were implemented as separate programs, and sometimes just as one program that gave you progressively more information.)

The new model works okay when everything works, or when all you have is monkeys who couldn't diagnose a problem anyway. But it fails utterly when you have real people (not monkeys) with a real problem, because it leaves us high and dry with nothing to do except call vendor support or try increasingly desperate hacks. And we don't understand why those hacks work or don't work because, of course, we're not getting anything from that new-model black box except a green or a red light.

(Of course, vendor support often has no better tools or knowledge than we do. If anything they have less, because people with developer level knowledge get stolen from support in order to be made into actual developers.)

Sidebar: the details of what happened

Our production fileservers are Solaris 10 update 6 plus some patches. One of them had a faulted spares situation, so we scheduled a downtime to fix it by exporting and re-importing every pool. When we exported the first pool, it refused to re-import on the fileserver.
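
What we ran for each pool was the obvious sequence; a minimal sketch, with a hypothetical pool name:

    # Export the pool, then immediately re-import it, forcing ZFS to
    # re-read the pool configuration (and, we hoped, clearing the
    # faulted spares).
    zpool export tank
    zpool import tank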

(This was despite the fact that the pool was working fine before being exported; in fact the fileserver had brought it back up during a reboot not eight hours earlier, after an unplanned power outage due to a UPS failure. Note that since we have a SAN, the UPS power outage didn't touch any of the actual disks the fileserver was using.)

Import attempts reported 'one or more devices is currently unavailable'. Running plain zpool import showed a pool configuration that claimed two of the six mirror vdevs had one side faulted with corrupted data, and listed no spares (although it reported that additional missing devices were known to be in the pool configuration). We knew and could verify that all devices listed in the pool configuration were visible on the fileserver.
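
For reference: running 'zpool import' with no pool name doesn't import anything; it scans the visible devices and prints the configuration and claimed status of any pools it finds. What we were doing was roughly this (the pool name is hypothetical):

    # List importable pools and their claimed device status, without
    # actually importing anything.
    zpool import
    # Attempt the actual import by pool name.
    zpool import tank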

Importing the pool on our test Solaris 10 update 8 machine worked perfectly; all devices in pool vdevs and all spares were present (and now healthy). When we exported it from the test machine and tried to import it on the production fileserver, we had the exact same import error all over again; our S10U6 machine just refused to touch it, despite having been perfectly happy with it less than an hour and a half earlier.

We were very fortunate in that we'd already done enough testing to decide that S10U8 was viable in production (with the machine's current patch set) and that the test machine was more or less production ready. Left with no real choice, we abruptly promoted the S10U8 machine to production status, migrated the entire (virtual) fileserver to it, slapped plaster over various remaining holes, and now get to hope that nothing explodes tomorrow.

ZFSImportFailure written at 02:00:51

2010-05-06

Oracle's future for Sun's hardware and OS business is now clear

The alternate title for this entry is 'how to persuade us to never buy your hardware again'.

The old Sun had both a general server business and a general OS business, and people used both; they bought Sun servers to run lots of operating systems and they ran Solaris on lots of non-Sun hardware. It is now clear that Oracle is nothing like this. Solaris now exists only to run on Oracle hardware, and Oracle hardware exists only to run Solaris and a few other Oracle-supported operating systems.

Why do I say this? Well, it's due to the latest bit of Oracle news, to wit that Oracle has restricted access to firmware updates (via Slashdot). In order to get firmware updates, your hardware either has to be under its one-year warranty or you have to have an Oracle support contract for it. Older than a year and without a support contract? You lose. This policy change was introduced abruptly and with no advance warning; it appears that even (ex-)Sun and Oracle support people may not understand it yet.

(Note that you must have a support contract that includes hardware support. A Solaris software support contract is not good enough, as I have verified.)

This is much more important than it looks from the outside. For most systems, server firmware updates are relatively unimportant; few people ever apply BIOS updates. But Sun servers have integrated lights-out management (ILOM) processors, which are network accessible under some circumstances and which have had security vulnerabilities. These security vulnerabilities are fixed with, you guessed it, firmware updates.

As far as I am concerned, this makes access to firmware updates somewhere between very important and vital for running production Sun servers, especially since their excellent ILOMs were much of the reason to prefer them in the first place.

But wait, it gets better: you cannot buy hardware support without buying Oracle software support (at least for new support contracts), and the software support is twice as expensive as the hardware support. Software support costs 8% of the net hardware purchase cost per year, and adding hardware support costs an additional 4% per year (per here, found via Hacker News). Oracle explicitly won't sell hardware support by itself, and has said so clearly.
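
To put hypothetical concrete numbers on this, take a server with a $10,000 net purchase price:

    software support:   8% of $10,000 =   $800/year  (mandatory)
    hardware support:  +4% of $10,000 =   $400/year  (add-on only)
    total:             12% of $10,000 = $1,200/year

You cannot pay just the $400 hardware portion; the $800 software portion always comes along.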

This makes it very clear that Oracle intends their hardware almost exclusively for running Oracle-supported operating systems, since if you run a non-supported OS on Oracle hardware, you are completely wasting the 8% a year Oracle software support fee.

(If you do not get hardware support, you are gambling on there not ever being an ILOM security vulnerability that affects you. Since the ILOM is accessible from the server itself under some circumstances, this is not a bet that I would want to take.)

There are two immediate corollaries to this firmware access policy change. First, if you still have systems under hardware warranty (or hardware support contract), get the latest firmware updates now while you still can, even if you don't plan to apply them. Second, smart people buying second-hand Sun servers are likely to either demand that they be at the latest firmware version or require a potentially significant price discount, or both.

(Hence one reason to get the latest firmware updates even if you never plan on applying them yourself.)

OracleSunFuture written at 21:20:07

2010-05-05

The right way to fix ZFS disk glitches (at least for us)

Every so often in our environment of ZFS pools with mirrored vdevs, we will have an iSCSI disk drop out temporarily. When this happens, ZFS winds up faulting the disk with read and write errors, and you get to fix this after the disk is back.

In theory, this is fixed with just 'zpool clear <pool> <disk>'. In practice, our experience is that this will sometimes leave the disk with latent checksum errors (I presume from writes that somehow got lost on the way to the disk without anything noticing), so in order to completely fix up the situation we must then 'zpool scrub' the pool, possibly repeatedly, until there are no errors being reported.
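
The full cleanup sequence looks something like this, with hypothetical pool and disk names:

    # Clear the fault so that ZFS puts the disk back into service.
    zpool clear tank c3t6d0
    # Scrub to find and repair any latent checksum errors, then check
    # the results; repeat the scrub if errors keep turning up.
    zpool scrub tank
    zpool status -v tank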

This is kind of annoying, plus it puts an IO load on the entire pool (and can take ages on a big pool). So our alternate, simpler procedure has been to 'zpool detach' and then 'zpool attach' the glitched disk; once the resilver is done, the disk's contents are guaranteed to be fully correct. Also, the IO load is much more controllable, since we are effectively only 'scrubbing' one disk instead of all the disks in the pool at once.

(You might think that this is crazy, but the logic is that we can't trust the glitched disk since we're assuming that it has missed writes; until it's repaired, the vdev is not truly redundant regardless of what ZFS thinks.)
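
For contrast, a sketch of the detach and re-attach dance, again with hypothetical names; the glitched disk is re-attached as a mirror of its surviving partner and then resilvered from scratch:

    # Drop the glitched disk out of its mirror vdev entirely.
    zpool detach tank c3t6d0
    # Re-attach it as a mirror of the surviving disk; ZFS resilvers
    # its entire contents.
    zpool attach tank c2t6d0 c3t6d0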

In retrospect, there is a strong (and obvious) reason to prefer the zpool clear approach, even if it takes longer and is more annoying. Even though we can't completely trust the data on the glitched disk, in most situations most of it is still intact and good. The moment we do 'zpool detach', we discard all of that good data. If the vdev is only a two-way mirror, we go from a situation where we were non-redundant on only the missing writes to a situation where we are non-redundant on an entire disk's worth of data (and where ZFS has a much worse potential failure mode).

(How much good data is left on the glitched disk depends on how fast data turns over in the pool and how long the disk was out for.)

In a multi-way mirror that's still fully redundant even without the glitched disk, we might as well use the simpler approach. But with a two-way mirror, we really do want to use the longer, more annoying approach in situations where it's feasible.

(This is the kind of entry that I write to convince myself that I have the logic nailed down, so I can explain it to other people.)

PS: note that our experience is that there are potentially significant IO load differences between scrubbing and resilvering that may affect this choice. Scrubbing is almost entirely reads across all pool devices; resilvering is write-heavy to the new disk and in theory only read-heavy on the other mirror(s) in that particular vdev. I believe that resilvering IO may also be given higher priority than scrub IO. Both scrubbing and resilvering are at least somewhat random IO, not strictly sequential, for reasons that do not fit within the margins of this entry.

ZFSClearVsReplace written at 04:23:33
