Diagnosing an install problem: a case study in indirect failures

August 23, 2005

Today I tried to Kickstart-install two SATA-based machines we have in for evaluation, booting them from USB memory sticks. Unfortunately it didn't work; something aborted at about the point where our customization stage took over. (I didn't get to see what, because the machine promptly rebooted.)

One of my longer-term irritations with Anaconda is that you only get a binary choice of 'always reboot afterwards' or 'never reboot afterwards'; there is no option for 'reboot if all went well, otherwise sit there to let me look at diagnostics'. This lack somewhat slows down troubleshooting, partly because you first have to notice that something went wrong. (Since the machine came up thinking it was called localhost.localdomain, that was fortunately easy.)

Just as I wrote this, I logged in to my test machine and discovered that it too had come up as localhost.localdomain during a test reinstall run. This was good news, because being able to reproduce a problem is always good news. However, this gave me a puzzle: before, I had assumed that the broken machines came up without networking (the usual reason for coming up with such a bad name). But I was logging in over the network; how had the machine come up with a broken hostname but still on the network?

First hypothesis: maybe the test machine had gotten the wrong nameserver somehow. I looked at its /etc/resolv.conf, but it was listing our usual caching DNS server on the mailer machine (email is the most DNS intensive thing we do, so it's the best place for the cache).

At this point it's relevant to mention that electrical work in the building with our primary machine room caused us to shut down and then restart most of our servers.

Second hypothesis: 'oh my god, did the caching nameserver daemon fail to restart when we rebooted?' Survey says: whoops, yes it did. Bad me for not noticing this for more than 24 hours; clearly we need better monitoring software.

Looking at the logfile showed that it was failing to start because it couldn't read /var/named/acl.conf. This was because /var/named was owned by the wrong group, and that was because I had corrected a mis-numbered named group in the course of preparing for our upgrade to Fedora Core 4 but had not changed the ownership on all of the systems. (And I had made the changes back in May or June.)
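A check for this kind of drift between a group's number and the ownership of its files can be sketched in a few lines of Python. This is only an illustration, not something we run: the function names are made up, and it assumes the directory tree and group are both simply called 'named'.

```python
import grp
import os


def find_gid_mismatches(root, expected_gid):
    """Return root and any paths under it whose group ID is not expected_gid."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    # lstat so that symlinks are checked themselves, not their targets.
    return [p for p in paths if os.lstat(p).st_gid != expected_gid]


def check_named_tree(root="/var/named", group="named"):
    """Hypothetical wrapper: compare a tree against the group's current GID.

    grp.getgrnam raises KeyError if the group does not exist at all.
    """
    expected_gid = grp.getgrnam(group).gr_gid
    return find_gid_mismatches(root, expected_gid)
```

Running something like check_named_tree() on every machine right after renumbering the group would have listed /var/named as still carrying the old number, long before a reboot made it matter.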

Our systems don't normally reboot and I didn't reboot when I fixed the 'named' group to have the right number, so the existing nameserver daemon on our mailer machine had kept on running (using the old numbers that matched the actual directory ownership).

We use multiple caching nameservers for redundancy, and the daemon had started on at least one of the fallback machines, so our existing systems could still do DNS lookups after we powered all the servers back on. But when run from a USB memory stick, the Kickstart install process only uses a single nameserver. That nameserver wasn't answering, so Kickstart called the machine localhost.localdomain.
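Because the normal resolver quietly falls through to the next nameserver, a dead first entry is invisible unless each server is asked on its own. Here is a rough sketch of doing that in Python with a hand-built DNS query; the function names, the default query name, and the timeout are all assumptions for illustration, and a real monitoring check would want to look at the reply more carefully than this does.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID so the reply can be matched to the query


def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(name):
    """Build a minimal DNS query packet asking for the A record of name."""
    # Header: ID, flags (recursion desired), one question, no other records.
    header = struct.pack("!HHHHHH", QUERY_ID, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def nameserver_answers(server, name="localhost", timeout=2.0):
    """Return True if this one server sends back any reply to our query."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable/refused errors
            return False
    return reply[:2] == struct.pack("!H", QUERY_ID)
```

Feeding the contents of /etc/resolv.conf to parse_nameservers() and calling nameserver_answers() on each result reports every server separately, which is the view of the world that the installer had.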

Our customization process keys a number of things off the subdomain that the machine is in. 'localdomain' is not a recognized subdomain, so our customization process immediately aborted; in turn this more or less immediately rebooted the machine.
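To make the failure mode concrete, here is a minimal sketch of a subdomain-keyed lookup that at least says why it is giving up. Our real customization scripts are not shown here and do not necessarily look like this; the subdomain table, the profile names, and the function names are all invented for the example.

```python
# Hypothetical table mapping recognized subdomains to a customization profile.
KNOWN_SUBDOMAINS = {
    "cs.example.com": "cs-profile",
    "math.example.com": "math-profile",
}


class UnknownSubdomain(Exception):
    """Raised when a hostname is not in a subdomain we know how to set up."""


def subdomain_of(hostname):
    """Return everything after the first label of hostname ('' if none)."""
    _, _, rest = hostname.partition(".")
    return rest


def profile_for(hostname):
    """Look up the customization profile for hostname, failing loudly."""
    sub = subdomain_of(hostname)
    try:
        return KNOWN_SUBDOMAINS[sub]
    except KeyError:
        raise UnknownSubdomain(
            f"{hostname!r}: subdomain {sub!r} is not recognized; "
            "was the hostname actually set from DNS?"
        ) from None
```

With this, 'localhost.localdomain' produces an error message that points straight at DNS, instead of an abort whose reason scrolls away in the reboot.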

Since I found the root cause of this problem in the process of writing up a grumble about it and another problem I also hit, this may be a successful example of Rubber Duck debugging (sometimes called rubber ducky debugging instead).

Written on 23 August 2005.

Last modified: Tue Aug 23 01:24:47 2005