2006-11-20
A DiskSuite annoyance: metastat
In the traditional illustrated form:
; metastat fs5/d40
metastat: sandiego.cs: fs5: set name is inconsistent
Almost all DiskSuite commands that deal with disk sets will accept
fsN/dNN to specify the set and the metadevice at once, and metastat
itself produces its output in this form. But metastat refuses to accept
it as input; instead you have to use 'metastat -s fs5 d40'.
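(To be explicit, the form it will take is the separated one:

; metastat -s fs5 d40

I've left the output out here, since it's just the usual metastat report for d40.)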
This isn't just a cosmetic issue. The fsN/dNN format is both easier to
use and safer; the set and the metadevice are bundled together, so they
can be cut and pasted as a unit, and you can use the form consistently
even when the set specification is redundant (eg 'metadetach fs5/d40
fs5/d806'). Plus, it's the normal output format, and whenever you can
you should accept your normal output format as input, so that people
have to fiddle with it as little as possible.
(In general, every time you make someone fiddle with your output you increase the chance that a mistake will get made. You get bonus points for making it 'easier' for them to retype it than to copy and paste it, because the chances for errors jump again.)
This isn't metastat's only problem, but it's a typical one. If I had to
summarize the root problem, it would be that metastat seems to have been
designed purely as a reporting frontend, where it was felt that it
didn't matter how peculiar its output was or whether its command line
arguments were consistent with the rest of the DiskSuite commands.
2006-11-16
An annoying omission in the Solaris 8 DiskSuite toolset
We had one of our SAN RAID controllers die today (just the controller; the disks and the data were fine), and as a result I ran across an annoying omission in the DiskSuite toolset.
When the controller went kaboom, all of its logical drives stopped responding and DiskSuite marked all of the submirrors involved as needing maintenance. When the controller was replaced, all of the logical drives came back again, but the problem is that DiskSuite has no direct way to clear the 'needs maintenance' state on the affected submirrors.
For mirror devices with more than one submirror, the DiskSuite approach
is the same as the Linux approach: you remove the failed submirror and
then re-add it, with metadetach and metattach. I'd prefer a command
that just cleared the error status (and started the necessary resync),
but the whole process can be done while the filesystem is live and is
not too onerous.
(Since we had 28 mirrors with this problem, we wrote some stuff to automate it.)
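(As a sketch of the per-mirror fix, with made-up metadevice names: assume fs5/d40 is a mirror whose submirror fs5/d41 is the one marked as needing maintenance. Then the sequence is roughly:

; metadetach fs5/d40 fs5/d41
; metattach fs5/d40 fs5/d41

metattach starts the resync, and you can watch its progress with metastat; you may need metadetach's -f flag if it objects to detaching an errored submirror. Our automation was basically a loop over the (mirror, submirror) pairs that metastat reported as needing maintenance, doing exactly this.)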
The real fun and irritation comes in for mirrors that have only one submirror. To clear what is effectively a status flag, you must tear down the top-level mirror device and recreate it. Since this cannot be done with the mirror in use, you must unmount it beforehand (and remount it afterwards). This is an especially irritating omission because DiskSuite itself is still perfectly happy to do IO to the nominally failed submirror, so it really is just a harmless status flag (unlike the multiple submirror case, where DiskSuite needs to actively do work to fix things up).
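(Again as a hedged sketch, with invented names: say d30 is a one-way mirror of the single submirror d31, holding a filesystem mounted on /data, and the metadevices are local rather than in a disk set. The dance is approximately:

; umount /data
; metaclear d30
; metainit d30 -m d31
; mount /data

metaclear here removes only the top-level mirror, leaving the submirror d31 and its data alone, and metainit then recreates d30 as a one-way mirror of d31. For metadevices in a disk set you'd add the set with -s or the setname/dNN form.)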
I can see leaving the status marker present until explicitly cleared, so you can scan a system and see which devices had problems and which didn't after an incident. But DiskSuite should provide you with a direct way to acknowledge and clear the warning flag, especially if it's going to be willing to do IO to the 'failed' device anyways.
Given that we could work around the issue, this may seem like a petty
complaint. But most of our Solaris servers have their root and /usr
filesystems in DiskSuite mirrors, and it could be an interesting comedy
hour if we ever have a temporary controller glitch on those drives.