Wandering Thoughts archives

2006-11-16

An annoying omission in the Solaris 8 DiskSuite toolset

We had one of our SAN RAID controllers die today (just the controller; the disks and the data were fine), and as a result I ran across an annoying omission in the DiskSuite toolset.

When the controller went kaboom, all of its logical drives stopped responding and DiskSuite marked all of the submirrors involved as needing maintenance. When the controller was replaced, all of the logical drives came back again, but the problem is that DiskSuite has no direct way to clear the 'needs maintenance' state on the affected submirrors.

For mirror devices with more than one submirror, the DiskSuite approach is the same as the Linux approach: you remove the failed submirror and then re-add it, with metadetach and metattach. I'd prefer a command that just cleared the error status (and started the necessary resync), but the whole process can be done while the filesystem is live and is not too onerous.

(Since we had 28 mirrors with this problem, we wrote some stuff to automate it.)
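As an illustration only (this is not the automation mentioned above), here is a minimal Python sketch of the detach-and-reattach cycle. It assumes you already have a hand-made list of (mirror, failed submirror) pairs; the device names `d10`, `d12` and so on are hypothetical, and the function name `reattach_commands` is made up for this example. It only builds the command lines and prints them, so they can be reviewed before anything is run as root. Whether `metadetach` needs extra options for a submirror in the 'needs maintenance' state is not covered here.

```python
def reattach_commands(pairs):
    """Build metadetach/metattach command lines for each (mirror, submirror) pair.

    Nothing is executed here; the caller decides whether to run or just print them.
    """
    commands = []
    for mirror, submirror in pairs:
        # Remove the failed submirror from the mirror...
        commands.append(["metadetach", mirror, submirror])
        # ...then add it back, which triggers a resync of that submirror.
        commands.append(["metattach", mirror, submirror])
    return commands


if __name__ == "__main__":
    # Hypothetical device names; substitute the real mirror/submirror pairs.
    failed = [("d10", "d12"), ("d20", "d22")]
    for cmd in reattach_commands(failed):
        print(" ".join(cmd))
```

Printing rather than executing is a deliberate choice: with a couple of dozen mirrors you probably want to eyeball the list, and perhaps stagger the resyncs, before letting anything loose.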

The real fun and irritation comes in for mirrors that have only one submirror. To clear what is effectively a status flag, you must tear down the top-level mirror device and recreate it. Since this cannot be done with the mirror in use, you must unmount it beforehand (and remount it afterwards). This is an especially irritating omission because DiskSuite itself is still perfectly happy to do IO to the nominally failed submirror, so it really is just a harmless status flag (unlike the multiple submirror case, where DiskSuite needs to actively do work to fix things up).
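The post does not name the commands for this teardown, so the following Python sketch is a guess at the usual sequence: `umount`, then `metaclear` on the mirror alone (which leaves the submirror metadevice in place), then `metainit` with `-m` to recreate a one-way mirror, then `mount`. The device names, the mount point, and the function name `rebuild_one_way_mirror` are hypothetical, and the bare `mount` assumes the filesystem has an `/etc/vfstab` entry. As before, it only builds the command lines.

```python
def rebuild_one_way_mirror(mirror, submirror, mount_point):
    """Build the commands to tear down and recreate a single-submirror mirror.

    Assumed sequence, not taken from the original post; nothing is executed.
    """
    return [
        # The mirror cannot be cleared while it is in use.
        ["umount", mount_point],
        # Clear only the top-level mirror; the submirror itself is left alone.
        ["metaclear", mirror],
        # Recreate the mirror as a one-way mirror on the same submirror.
        ["metainit", mirror, "-m", submirror],
        # Remount, relying on the filesystem's /etc/vfstab entry.
        ["mount", mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical names; substitute real ones before using any of this.
    for cmd in rebuild_one_way_mirror("d30", "d31", "/export/data"):
        print(" ".join(cmd))
```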

I can see leaving the status marker present until explicitly cleared, so that after an incident you can scan a system and see which devices had problems and which didn't. But DiskSuite should provide you with a direct way to acknowledge and clear the warning flag, especially if it's going to be willing to do IO to the 'failed' device anyway.

Given that we could work around the issue, this may seem like a petty complaint. But most of our Solaris servers have their root and /usr filesystems in DiskSuite mirrors, and it could be an interesting comedy hour if we ever have a temporary controller glitch on those drives.

solaris/DiskSuiteGlitchRepair written at 23:20:51

