Diagnosing an install problem: a case study in indirect failures
Today I tried to Kickstart-install two SATA-based machines we have in for evaluation, booting them from USB memory sticks. Unfortunately it didn't work; something aborted at about the point where our customization stage took over. (I didn't get to see what, because the machine promptly rebooted.)
One of my longer-term irritations with Anaconda is that you only get a
binary choice of 'always reboot afterwards' or 'never reboot
afterwards'; there is no option for 'reboot if all went well,
otherwise sit there to let me look at diagnostics'. This lack somewhat
slows down troubleshooting, partly because you first have to notice
that something went wrong. (Since the machine came up thinking it was
called localhost.localdomain, that was fortunately easy.)
Just as I wrote this, I logged in to my test machine and discovered
that it too had come up as localhost.localdomain during a test
reinstall run. This was good news, because being able to reproduce a
problem is always good news. However, this gave me a puzzle: before, I
had assumed that the broken machines came up without networking (the
usual reason for coming up with such a bad name). But I was logging in
over the network; how had the machine come up with a broken hostname
but still on the network?
First hypothesis: maybe the test machine had gotten the wrong
nameserver somehow. I looked at its /etc/resolv.conf, but it was
listing our usual caching DNS server on the mailer machine (email
is the most DNS intensive thing we do, so it's the best place for
the cache).
At this point it's relevant to mention that electrical work in the building with our primary machine room caused us to shut down and then restart most of our servers.
Second hypothesis: 'oh my god, did the caching nameserver daemon fail to restart when we rebooted?' Survey says: whoops, yes it did. Bad me for not noticing this for more than 24 hours; clearly we need better monitoring software.
Looking at the logfile showed that it was failing to start because it
couldn't read /var/named/acl.conf. This was because /var/named was
owned by the wrong group, and that was because I had corrected a
mis-numbered 'named' group in the course of preparing for our upgrade
to Fedora Core 4 but had not changed the ownership on all of the
systems. (And I had made the changes back in May or June.)
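A check along these lines would have flagged the stale ownership before the next reboot did. This is only a minimal sketch: the function names are made up for illustration, and it assumes the only thing that matters is whether a path's group ID matches the group's current number.

```python
import grp
import os


def gid_mismatches(paths, expected_gid):
    """Return (path, actual_gid) for each path whose group ID differs."""
    mismatched = []
    for path in paths:
        actual_gid = os.stat(path).st_gid
        if actual_gid != expected_gid:
            mismatched.append((path, actual_gid))
    return mismatched


def check_group(paths, group_name):
    """Look up group_name's current GID and report paths that do not match it."""
    return gid_mismatches(paths, grp.getgrnam(group_name).gr_gid)
```

On an affected machine, something like check_group(["/var/named"], "named") would return the directory along with the old group number it is still carrying; an empty list means the ownership matches.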
Our systems don't normally reboot and I didn't reboot when I fixed the 'named' group to have the right number, so the existing nameserver daemon on our mailer machine had kept on running (using the old numbers that matched the actual directory ownership).
We use multiple caching nameservers for redundancy, and it had started
on at least one of the fallback machines, which meant that our
existing systems could still do DNS lookups after we powered all the
servers back on. But when run from a USB memory stick, the Kickstart
install process only uses a single nameserver. That nameserver wasn't
answering, so Kickstart called the machine localhost.localdomain.
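The difference between the two cases can be sketched as follows. This is an illustration, not how Kickstart or the system resolver is actually implemented: the addresses are documentation placeholders, and is_up is a hypothetical stand-in for 'send a query and see if anything answers'.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order resolv.conf lists them."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def first_answering(servers, is_up):
    """Return the first server that is_up() accepts, or None if none do."""
    for server in servers:
        if is_up(server):
            return server
    return None


# Placeholder configuration: primary cache first, then a fallback.
EXAMPLE = "nameserver 192.0.2.1\nnameserver 192.0.2.2\n"
servers = parse_nameservers(EXAMPLE)
primary_down = lambda server: server != "192.0.2.1"

normal_host = first_answering(servers, primary_down)      # falls back
installer = first_answering(servers[:1], primary_down)    # single server only
```

With the primary cache down, normal_host ends up on the fallback server while installer is None, which is the situation the installing machine was in.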
Our customization process keys a number of things off the subdomain
that the machine is in. 'localdomain' is not a recognized subdomain,
so our customization process immediately aborted; in turn this more or
less immediately rebooted the machine.
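To make the failure concrete, here is a rough sketch of keying a choice off the subdomain. Everything in it is hypothetical: the subdomains, the profile names, and the function names are invented for illustration and are not our actual customization code. The one thing it does differently is say why it is giving up.

```python
# Hypothetical table; the real subdomains and what they select are not shown here.
PROFILES = {
    "lab.example.org": "lab-workstation",
    "srv.example.org": "server",
}


def subdomain_of(hostname):
    """Return everything after the first dot, or "" if there is no dot."""
    return hostname.partition(".")[2]


def choose_profile(hostname, profiles=PROFILES):
    """Return the profile for hostname's subdomain, or raise with a clear reason."""
    domain = subdomain_of(hostname)
    if domain not in profiles:
        raise LookupError(
            f"{hostname!r}: subdomain {domain!r} is not recognized; "
            "was the hostname set from DNS?"
        )
    return profiles[domain]
```

Calling choose_profile("localhost.localdomain") raises LookupError, since 'localdomain' is not in the table; a message like that, left on the screen instead of an immediate reboot, would have pointed at DNS straight away.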
Since I found the root cause of this problem in the process of writing up a grumble about it and another problem I also hit, this may be a successful example of Rubber Duck debugging (sometimes called rubber ducky debugging instead).