== Diagnosing an install problem: a case study in indirect failures

Today I tried to Kickstart-install two SATA-based machines we have in for evaluation, booting them from USB memory sticks. Unfortunately it didn't work; something aborted at about the point where our customization stage took over. (I didn't get to see what, because it promptly rebooted.)

One of my longer-term irritations with Anaconda is that you only get a binary choice of 'always reboot afterwards' or 'never reboot afterwards'; there is no option for 'reboot if all went well, otherwise sit there to let me look at diagnostics'. This lack somewhat slows down troubleshooting, partly because you first have to notice that something went wrong. (Since the machine came up thinking it was called _localhost.localdomain_, that was fortunately easy.)

Just as I wrote this, I logged in to my test machine and discovered that it too had come up as _localhost.localdomain_ during a test reinstall run. This was good news, because being able to reproduce a problem is always good news.

However, this gave me a puzzle: before, I had assumed that the broken machines came up without networking (the usual reason for coming up with such a bad name). But I was logging in over the network; how had the machine come up with a broken hostname but still on the network?

First hypothesis: maybe the test machine had gotten the wrong nameserver somehow. I looked at its _/etc/resolv.conf_, but it was listing our usual caching DNS server on the mailer machine (email is the most DNS-intensive thing we do, so it's the best place for the cache).

At this point it's relevant to mention that electrical work in the building with our primary machine room caused us to shut down and then restart most of our servers.

Second hypothesis: 'oh my god, did the caching nameserver daemon fail to restart when we rebooted?' Survey says: whoops, yes it did. Bad me for not noticing this for more than 24 hours; clearly we need better monitoring software.
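As an illustration of the kind of check that could catch a dead caching nameserver sooner, here is a minimal Python sketch. It is not what we actually run; the function names and the _example.com_ test name are made up for the example, it only handles IPv4 nameserver addresses, and it treats any reply with a matching query ID as 'the daemon is answering'.

```python
import socket
import struct


def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":  # blank line or comment
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def nameserver_answers(server, name="example.com", timeout=2.0):
    """Return True if `server` sends back a reply carrying our query ID."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeout, unreachable host, bad address, ...
            return False
    return reply[:2] == query[:2]
```

Probing each server separately matters here: a client that only ever asks the first listed nameserver fails outright when that one is down, even if the fallbacks are fine.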
Looking at the logfile showed that it was failing to start because it couldn't read _/var/named/acl.conf_. This was because _/var/named_ was owned by the wrong group, and that was because I had corrected a mis-numbered _named_ group in the course of preparing for our upgrade to Fedora Core 4 but had not changed the ownership on all of the systems. (And I had made the changes back in May or June.)

Our systems don't normally reboot, and I didn't reboot when I fixed the _named_ group to have the right number, so the existing nameserver daemon on our mailer machine had kept on running (using the old numbers that matched the actual directory ownership).

We use multiple caching nameservers for redundancy, and the daemon had started on at least one of the fallback machines, which meant that our existing systems could still do DNS lookups after we powered all the servers back on. But when run from a USB memory stick, the Kickstart install process only uses a single nameserver. That nameserver wasn't there, so Kickstart called the machine _localhost.localdomain_.

Our customization process keys a number of things off the subdomain that the machine is in. '_localdomain_' is not a recognized subdomain, so our customization process immediately aborted; in turn this more or less immediately rebooted the machine.

Since I found the root cause of this problem in the process of writing up a grumble about it and another problem I also hit, this may be a successful example of [[Rubber Duck debugging|http://lists.ethernal.org/cantlug-0211/msg00174.html]] (sometimes called [[rubber ducky debugging|http://lists.ethernal.org/cantlug-0211/msg00174.html]] instead).
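The root cause was a directory whose group ID no longer matched the renumbered group, which nothing noticed until the daemon had to restart. A minimal Python sketch of a check for that kind of drift follows, assuming a Unix system. The function names are hypothetical; _/var/named_ and the _named_ group are the ones from this story.

```python
import grp
import os


def stale_group_paths(paths, expected_gid):
    """Return (path, actual_gid) for every path whose group is not expected_gid."""
    stale = []
    for path in paths:
        actual_gid = os.lstat(path).st_gid  # lstat: don't follow symlinks
        if actual_gid != expected_gid:
            stale.append((path, actual_gid))
    return stale


def check_named_tree(top="/var/named", group="named"):
    """Report entries under `top` whose group differs from `group`'s current GID."""
    # Raises KeyError if the group does not exist on this system.
    expected_gid = grp.getgrnam(group).gr_gid
    paths = [top]
    for dirpath, dirnames, filenames in os.walk(top):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    return stale_group_paths(paths, expected_gid)
```

Run right after renumbering a group, something like `check_named_tree()` would list the files still carrying the old number, instead of leaving them to surface at the next reboot.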