How I got a corrupted metadb replica that panicked Solaris 10 x86
Since I got asked this in a comment on my entry about clearing metadb replicas, here is what I remember of how I managed to get a metadb replica so corrupted that it panicked Solaris 10u3 x86.
- I wanted to experiment with metasets on my test machine, so I
needed a local metadb replica. Because I
didn't know about this I didn't have a spare partition, and because
I didn't know any better I put the local metadb replica in that
tempting slice 8.
(Since I was only really interested in metasets, I didn't do any local DiskSuite stuff, although I did make and delete metasets and so on.)
- sometime later I rebooted the system and it didn't even make it as
far as starting GRUB; I believe it gave some initial GRUB message
and then hung.
(I had been crashing the system repeatedly due to some interesting tests so I did not think too much of this at the time.)
- I booted the machine with a Fedora Core 7 live CD and poked around, verifying that the filesystems were still there.
- after a while I found the installgrub command, booted the Solaris install CD rescue environment, and ran it to get the machine back to a bootable state.
(I believe I may have also rebuilt the boot archive at this point on general principle, since I was getting used to it breaking if I sneezed on the system.)
- the test Solaris install would then boot but panic, which led me to finding out how you boot Solaris 10 x86 in really single user mode.
- turning off the metainit service let the system boot, but the moment I typed metadb or metainit it would panic.
- because I was in a hurry and needed the system for other tests, I ignorantly tried to recover the system by erasing the metadb replica by dd'ing zeroes all over slice 8. This destroyed the system completely, since it wiped out the slice partitioning.
(If I had been really clever I would have saved a dd image of slice 8 before doing this, but I was very irritated with Solaris 10u3 x86 at this point.)
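The dd image I should have saved first would have been cheap to take. What follows is only a minimal sketch of that, not what I actually ran: the backup_slice function name is made up, and the device and image paths in the usage comment are assumed examples, not the real names from that machine.

```shell
#!/bin/sh
# Sketch: save a raw image of a slice before overwriting it.
# backup_slice is a hypothetical helper; SRC is the raw slice device
# and DST is the image file, ideally on a different disk or machine.

backup_slice() {
    src=$1
    dst=$2
    # Refuse to clobber an image that already exists.
    if [ -f "$dst" ]; then
        echo "refusing to overwrite $dst" >&2
        return 1
    fi
    # Copy the slice block for block.
    dd if="$src" of="$dst" bs=512 2>/dev/null || return 1
    # Read both back and compare, so a bad copy is noticed now, not later.
    cmp -s "$src" "$dst"
}

# Assumed example usage (device name and path are illustrative only):
#   backup_slice /dev/rdsk/c0d0s8 /net/otherhost/var/tmp/slice8.img
```

With an image like this on hand, the zeroed slice could have been written back with dd in the other direction, and the corrupted replica would still have been available to look at afterwards.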
On the whole it was a very educational experience and led me to look into a number of useful things so I would be better prepared for a future emergency on any production machines we wind up with.
I have one captured panic message from the system, and I still have the system disk (which has more panic messages in syslog; it would be possible to extract them if I could reconstruct the necessary slice partitioning). I have since tried a bit to reproduce this in a VMware Solaris image but haven't been successful, so it is not a simple, easy-to-reproduce issue.
(The Solaris 10u3 install I was using was current on all recommended patches and on all released patches that applied to a number of areas of interest to us, including ZFS, iSCSI, and DiskSuite.)