How I got a corrupted metadb replica that panicked Solaris 10 x86

October 21, 2007

Since I got asked this in a comment on my entry about clearing metadb replicas, here is what I remember of how I managed to get a metadb replica so corrupted that it panicked Solaris 10u3 x86.

  • I wanted to experiment with metasets on my test machine, so I needed a local metadb replica. Because I hadn't known about this requirement I didn't have a spare partition, and because I didn't know any better I put the local metadb replica in that tempting slice 8.

    (Since I was only really interested in metasets, I didn't do any local DiskSuite stuff, although I did make and delete metasets and so on.)

  • some time later I rebooted the system and it didn't even make it as far as starting GRUB; I believe it gave some initial GRUB message and then hung.

    (I had been crashing the system repeatedly due to some interesting tests so I did not think too much of this at the time.)

  • I booted the machine with a Fedora Core 7 live CD and poked around, verifying that the filesystems were still there.
  • after a while I found the installgrub command, booted the Solaris install CD rescue environment, and ran it to get the machine back to a bootable state. (I believe I may have also rebuilt the boot archive at this point on general principle, since I was getting used to it breaking if I sneezed on the system.)

  • the test Solaris install would then boot but panic, which led me to finding out how you boot Solaris 10 x86 in really single user mode.
  • turning off the metainit service let the system boot, but the moment I typed metadb or metainit it would panic.

  • because I was in a hurry and needed the system for other tests, I ignorantly tried to recover the system by erasing the metadb replica by dd'ing zeroes all over slice 8. This destroyed the system completely, since it wiped out the slice partitioning.

    (If I had been really clever I would have saved a dd image of slice 8 before doing this, but I was very irritated with Solaris 10u3 x86 at this point.)
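
As a rough illustration of the "save an image first" step, here is a minimal Python sketch that copies a slice to an image file and returns a checksum, much as a plain dd copy followed by a checksum would. It is not what I ran at the time; the function names are made up for this example, and the device path in the usage comment is hypothetical and would need to match the real slice.

```python
import hashlib

CHUNK = 1024 * 1024  # copy in 1 MiB pieces


def save_image(source_path, image_path):
    """Copy source_path to image_path; return the SHA-256 of the bytes copied."""
    digest = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def file_digest(path):
    """Return the SHA-256 of a file, read back from disk in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage (the device path is an assumption, not from my system):
#   copied = save_image("/dev/rdsk/c0d0s8", "/var/tmp/slice8.img")
#   assert copied == file_digest("/var/tmp/slice8.img")
```

Checking that the digest of the saved image matches the digest of what was read is a cheap way to confirm the copy is intact before anything destructive is done to the original.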

On the whole it was a very educational experience and led me to look into a number of useful things so I would be better prepared for a future emergency on any production machines we wind up with.

I have one captured panic message from the system, plus the system disk itself (which has more panic messages in syslog; it would be possible to extract them if I could reconstruct the necessary slice partitioning). I have since tried a bit to reproduce this in a VMware Solaris image but haven't been successful, so it is not a simple, easy-to-reproduce issue.

(The Solaris 10u3 install I was using was current on all recommended patches and on all released patches that applied to a number of areas of interest to us, including ZFS, iSCSI, and DiskSuite.)

Written on 21 October 2007.

Last modified: Sun Oct 21 21:19:06 2007