2015-11-14
ZFS pool import needs much better error messages
One of the frustrating things about dealing with sufficiently damaged
ZFS pools is that 'zpool import' and friends do not generate very
detailed error messages. There are a lot of things that can go wrong
with a ZFS pool that will make it not importable, but 'zpool
import' has clear explanations for only some of them. For many others
all you get is a generic error in 'zpool import' status reporting
of, say:
The pool cannot be imported due to damaged devices or data.
(Here I'm talking about the results of just running 'zpool import'
to see available pools and their states and configuration, not
trying to actually import a pool. Here zpool has lots of room to
write explicit and detailed messages about what seems to be wrong
with your pool's configuration.)
This isn't just an issue of annoying and frustrating people with
opaque, generic error messages. Given that the error messages are
generic, it's quite easy for people to focus only on the obvious
problems that zpool import reports, even if those problems may
not be the reason the pool can't be imported. As it happens I have
a great example of this in action, in this SuperUser question.
When you read this question, can you figure out what's wrong? Both
the SuperUser ZFS community and the ZFS on Linux mailing list
couldn't.
(I believe that everything you need to figure out what's going on
is in the information in the question, and that the code behind
'zpool import' actually knows what the problem is. This assumes
that my diagnosis is correct, of course.)
Perhaps zpool import should not be fully verbose by default, as
there's a certain amount of information that may only make sense
to people who know a fair bit about how ZFS works. But it certainly
should be possible to get this information with, eg, a verbose
switch instead of having to reverse engineer it from zdb output.
If nothing else, this means that you can get a verbose report and
show it to ZFS experts in the hope that they can tell you what's
wrong.
On a purely pragmatic level I think that zpool import should be
really verbose and detailed when a pool can't be imported. 'My pool
won't import' is one of the most stressful experiences you can have
with ZFS; to get unclear, generic errors at this point is extremely
frustrating and does not help one's mood in the least. This is
exactly the time when large amounts of detail are really, really
appreciated, even if they're telling you exactly how far up the
creek you are.
(This means that I would very much like a 'zpool import -v <pool>'
option that describes exactly what the import is doing or trying
to do and then covers all of the problems that it detected with the
pool configuration, all the things the kernel said to it, and so
on. A report of 'I am asking the kernel to import a pool made up
of the following devices in the following vdev structure' is not
too verbose.)
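To make the shape of such a report concrete, here is a minimal sketch in Python. It is purely illustrative: it is not how zpool or libzfs work, and the `Vdev` class, the `import_report()` function, the report wording, and the sample pool are all made up for this example. All it shows is the idea of printing the vdev structure being handed to the kernel, then every problem found in it, then whatever the kernel said back.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vdev:
    """Hypothetical stand-in for one node of a pool's vdev tree."""
    kind: str                      # e.g. "mirror" or "disk"
    name: str
    problem: Optional[str] = None  # None means nothing wrong was detected
    children: List["Vdev"] = field(default_factory=list)


def describe(vdev: Vdev, depth: int = 0) -> List[str]:
    """Return one indented line per vdev, depth first."""
    status = f"PROBLEM: {vdev.problem}" if vdev.problem else "ok"
    lines = [f"{'  ' * depth}{vdev.kind} {vdev.name}: {status}"]
    for child in vdev.children:
        lines.extend(describe(child, depth + 1))
    return lines


def collect_problems(vdev: Vdev) -> List[str]:
    """Gather every detected problem in the tree, not just the first one."""
    found = [f"{vdev.name}: {vdev.problem}"] if vdev.problem else []
    for child in vdev.children:
        found.extend(collect_problems(child))
    return found


def import_report(pool: str, top_level: List[Vdev],
                  kernel_messages: List[str]) -> str:
    """Build the full verbose report as a single string."""
    lines = [f"asking the kernel to import pool '{pool}' "
             "with this vdev structure:"]
    problems: List[str] = []
    for vdev in top_level:
        lines.extend(describe(vdev, 1))
        problems.extend(collect_problems(vdev))
    lines.append("problems detected in the pool configuration:")
    lines.extend(f"  {p}" for p in problems) if problems else lines.append("  none")
    lines.append("what the kernel said:")
    lines.extend(f"  {m}" for m in kernel_messages) if kernel_messages else lines.append("  nothing")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, only to show the report layout.
    mirror = Vdev("mirror", "mirror-0", children=[
        Vdev("disk", "diskA"),
        Vdev("disk", "diskB", problem="made-up problem"),
    ])
    print(import_report("tank", [mirror], ["made-up kernel message"]))
```

The point of the sketch is that nothing is suppressed: every vdev is listed with its state, and all of the problems and kernel messages are reported instead of collapsing everything into one generic error.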
PS: while this example is from ZFS on Linux and FreeBSD, I've looked at the current Illumos code for zpool and libzfs, and as far as I can see it would have exactly the same problem here.
(Part of the issue is that zpool import and libzfs have what you
could call less than ideal reporting if a pool is marked as active
on some other system and also has configuration problems. But even
if it reported multiple errors I think that the real problem here
would remain obscure; the current 'zpool import' code appears to
deliberately suppress printing out parts of the information necessary.)
2015-11-13
We killed off our SunSolve email contact address yesterday
Back in the days when Sun was Sun, Sun's patch access and support
system was imaginatively called SunSolve. If you had a support
contract with Sun (which often was only about the ability to get
patches and file bug reports), you had a SunSolve account. We had
one, of course (we have been using Solaris for longer than it's been
Solaris). In the very beginning we made a classic mistake and had
it in the name and email of a specific sysadmin (who then moved on),
but in the early days of our Solaris 10 fileservers we switched this
to a generic email address, cleverly named sunsolve.
Yesterday, we removed that address.
Our Solaris machines have all been out of commission for a while now, but we left the address in place mostly because of inertia. What pushed me to remove it is the usual reason: we just couldn't get Oracle to stop mailing things to it. I don't think Oracle spammed it (unlike some people), but they did keep sending us information about patch clusters and quarterly updates and this and that, all of which is irrelevant to us these days.
(I managed to get Oracle to mostly knock it off, but the other day they decided that they had an update that was so urgent that they just had to mail it to us. Never mind that we don't have any of the software at issue; that Oracle had our email address was good enough for them.)
At one level this is an unimportant little bit of cleanup that we should have done long ago. With our Solaris machines gone and our grandfathered support contract let run down, the email address had no point; it was just another lingering bit of clutter, and we should get rid of that kind of thing while we remember what it is and why we can remove it.
(If you wait long enough on this sort of thing, you can easily forget whether or not there's some special, inobvious reason that you're keeping these old oddities around. So it's best to strike while everything is fresh in your mind.)
At another level, the sunsolve email address was one of the last
lingering traces of what was (after all) a very long association
with Sun and Solaris. Just as with other things, letting
it go is yet another line drawn under all of that history, even if
SunSolve itself stopped existing years ago.
(Oracle decommissioned SunSolve and folded the functionality into their own support system not long after they bought Sun. The conversion was not entirely pleasant for support customers.)
PS: Since I just looked, it warms my heart a little bit that PCA is still trucking along. Oracle may have killed some very useful customer-done things but at least they left PCA alone. If we still had to deal with the mess that is Solaris patches, we'd be very thankful for that.