Wandering Thoughts archives

2015-11-14

ZFS pool import needs much better error messages

One of the frustrating things about dealing with sufficiently damaged ZFS pools is that 'zpool import' and friends do not generate very detailed error messages. There are a lot of things that can go wrong with a ZFS pool that will make it not importable, but 'zpool import' has clear explanations for only some of them. For many others all you get is a generic error in 'zpool import' status reporting of, say:

The pool cannot be imported due to damaged devices or data.

(Here I'm talking about the results of just running 'zpool import' to see available pools and their states and configuration, not trying to actually import a pool. Here zpool has lots of room to write explicit and detailed messages about what seems to be wrong with your pool's configuration.)

This isn't just an issue of annoying and frustrating people with opaque, generic error messages. Because the error messages are generic, it's quite easy for people to focus on the obvious problems that zpool import reports, even if those problems aren't actually the reason the pool can't be imported. As it happens I have a great example of this in action, in this SuperUser question. When you read this question, can you figure out what's wrong? Neither the SuperUser ZFS community nor the ZFS on Linux mailing list could.

(I believe that everything you need to figure out what's going on is actually in the information in the question and the code behind 'zpool import' actually knows what the problem is. This assumes that my diagnosis is correct, of course.)

Perhaps zpool import should not be fully verbose by default, as there's a certain amount of information that may only make sense to people who know a fair bit about how ZFS works. But it certainly should be possible to get this information with, eg, a verbose switch instead of having to reverse engineer it from zdb output. If nothing else, this means that you can get a verbose report and show it to ZFS experts in the hope that they can tell you what's wrong.
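(For the record, the reverse engineering I'm talking about looks roughly like the following sketch. The device path and pool name here are hypothetical stand-ins; zdb's -l and -e options are real, but exactly what they print varies by platform and version.)

```shell
# Dump the on-disk ZFS labels (which embed the pool configuration,
# including vdev structure and per-device GUIDs) from a disk that
# should be part of the pool. /dev/sdb1 is a hypothetical device.
zdb -l /dev/sdb1

# Ask zdb to assemble and display the configuration of an exported
# (not currently imported) pool, here hypothetically named 'tank',
# by scanning devices itself:
zdb -e tank
```

Comparing the labels across all of a pool's devices is often the only way to see which device the import code is actually unhappy about.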

On a purely pragmatic level I think that zpool import should be really verbose and detailed when a pool can't be imported. 'My pool won't import' is one of the most stressful experiences you can have with ZFS; to get unclear, generic errors at this point is extremely frustrating and does not help one's mood in the least. This is exactly the time when large amounts of detail are really, really appreciated, even if they're telling you exactly how far up the creek you are.

(This means that I would very much like a 'zpool import -v <pool>' option that describes exactly what the import is doing or trying to do and then covers all of the problems that it detected with the pool configuration, all the things the kernel said to it, and so on. A report of 'I am asking the kernel to import a pool made up of the following devices in the following vdev structure' is not too verbose.)

PS: while this example is from ZFS on Linux and FreeBSD, I've looked at the current Illumos code for zpool and libzfs, and as far as I can see it would have exactly the same problem here.

(Part of the issue is that zpool import and libzfs have what you could call less than ideal reporting if a pool is marked as active on some other system and also has configuration problems. But even if it reported multiple errors, I think the real problem here would remain obscure; the current 'zpool import' code appears to deliberately suppress printing out parts of the necessary information.)

ZFSImportBetterErrors written at 00:35:51

2015-11-13

We killed off our SunSolve email contact address yesterday

Back in the days when Sun was Sun, Sun's patch access and support system was imaginatively called SunSolve. If you had a support contract with Sun (which often was only about the ability to get patches and file bug reports), you had a SunSolve account. We had one, of course (we have been using Solaris for longer than it's been Solaris). In the very beginning we made a classic mistake and had it in the name and email of a specific sysadmin (who then moved on), but in the early days of our Solaris 10 fileservers we switched this to a generic email address, cleverly named sunsolve.

Yesterday, we removed that address.

Our Solaris machines have all been out of commission for a while now, but we left the address in place mostly because of inertia. What pushed me to remove it is the usual reason; we just couldn't get Oracle to stop mailing things to it. I don't think Oracle spammed it (unlike some people), but they did keep sending us information about patch clusters and quarterly updates and this and that, all of which is irrelevant to us these days.

(I managed to get Oracle to mostly knock it off, but the other day they decided that they had an update that was so urgent that they just had to mail it to us. Never mind that we don't have any of the software at issue, that Oracle had our email address was good enough for them.)

At one level this is an unimportant little bit of cleanup that we should have done long ago. With our Solaris machines gone and our grandfathered support contract let run down, the email address had no point; it was just another lingering bit of clutter, and we should get rid of that kind of thing while we remember what it is and why we can remove it.

(If you wait long enough on this sort of thing, you can easily forget whether or not there's some special, non-obvious reason that you're keeping these old oddities around. So it's best to strike while everything is fresh in your mind.)

At another level, the sunsolve email address was one of the last lingering traces of what was (after all) a very long association with Sun and Solaris. Just as with other things, letting it go is yet another line drawn under all of that history, even if SunSolve itself stopped existing years ago.

(Oracle decommissioned SunSolve and folded the functionality into their own support system not long after they bought Sun. The conversion was not entirely pleasant for support customers.)

PS: Since I just looked, it warms my heart a little bit that PCA is still trucking along. Oracle may have killed some very useful customer-done things but at least they left PCA alone. If we still had to deal with the mess that is Solaris patches, we'd be very thankful for that.

SunSolveEnding written at 00:27:47
