How I got a corrupted metadb replica that panicked Solaris 10 x86
Since I was asked about this in a comment on my entry about clearing metadb
replicas, here is what I remember of how I managed to
get a metadb replica so corrupted that it panicked Solaris 10u3 x86.
- I wanted to experiment with metasets on my test machine, so I
needed a local metadb replica. Because I hadn't known about this
requirement in advance I didn't have a spare partition, and because
I didn't know any better I put the local metadb replica in that
tempting slice 8.
(Since I was only really interested in metasets, I didn't do any
local DiskSuite stuff, although I did make and delete metasets
and so on.)
- sometime later I rebooted the system and it didn't even make it as
far as starting GRUB; I believe it gave some initial GRUB message
and then hung.
(I had been crashing the system repeatedly due to some interesting
tests so I did not think too much of this at
the time.)
- I booted the machine with a Fedora Core 7 live CD and poked around,
verifying that the filesystems were still there.
- after a while I found the
installgrub command, booted the Solaris
install CD rescue environment, and ran it to get the machine back
to a bootable state. (I believe I may have also rebuilt the boot
archive at this point on general principle, since I was getting
used to it breaking if I sneezed on the system.)
- the test Solaris install would then boot but panic, which led me
to find out how you boot Solaris 10 x86 in really single
user mode.
- turning off the metainit service let the system boot, but the
moment I typed
metadb or metainit it would panic.
- because I was in a hurry and needed the system for other tests,
I ignorantly tried to recover the system by erasing the metadb
replica by
dd'ing zeroes all over slice 8. This destroyed the
system completely, since it wiped out the slice partitioning.
(If I had been really clever I would have saved a dd image of
slice 8 before doing this, but I was very irritated with Solaris
10u3 x86 at this point.)
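
As a rough sketch of the "save a dd image first" step I skipped, something
like the following would do. The function name, the device path in the
usage comment, and the image location are all made up for illustration;
the only real pieces are plain dd and cmp. It copies the source to an
image file, refuses to overwrite an existing image, and only reports
success if the copy compares identical to the source.

```shell
#!/bin/sh
# save_slice_image SRC DEST
# Hypothetical helper: copy SRC (a slice device or a plain file) to DEST
# with dd, then verify the copy byte for byte with cmp.
save_slice_image() {
    src=$1
    dest=$2

    # Never clobber an image saved earlier; it may be the only good copy.
    if [ -f "$dest" ]; then
        echo "refusing to overwrite existing image $dest" >&2
        return 1
    fi

    # Copy everything up to end of input. dd's transfer statistics go to
    # stderr, so they are discarded here.
    dd if="$src" of="$dest" bs=1024k 2>/dev/null || return 1

    # Succeed only if the image matches the source exactly.
    cmp -s "$src" "$dest"
}

# Example use (the device name is an assumption; substitute your own):
#   save_slice_image /dev/rdsk/c0d0s8 /var/tmp/slice8.img && echo saved
```

Only after that reports success would it be reasonable to start dd'ing
zeroes anywhere, and the image should live on a different disk than the
one being experimented on.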
On the whole it was a very educational experience and led me to look into a number of useful things so I would
be better prepared for a future emergency on any production machines
we wind up with.
I have one captured panic message from the system, plus the system disk
(which has more in syslog; it would be possible to extract them if
I could reconstruct the necessary slice partitioning). I have since
tried a bit to reproduce this in a VMware Solaris image but haven't been
successful, so it is not a simple, easy-to-reproduce issue.
(The Solaris 10u3 install I was using was current on all recommended
patches and on all released patches that applied to a number of areas
of interest to us, including ZFS, iSCSI, and DiskSuite.)