It's always DNS (a story of our circular dependency)

April 5, 2019

Our building and in fact much of the University of Toronto downtown campus had a major power failure tonight. When power came back on I wasn't really expecting our Ubuntu servers to come back online, but to my surprise they started pinging (which meant not just that the actual servers were booting but that the routers, the firewall, the switches, and so on had come back). However when I started ssh'ing in, our servers were not in a good state. For a start, I didn't have a home directory, and in fact none of our NFS filesystems were mounted and the machines were only part-way through boot, stalled trying to NFS mount our central administrative filesystem.

My first thought was that our fileservers had failed to boot up, either our new Linux ones or our old faithful OmniOS ones, but when I checked they were mostly up. Well, that's getting ahead of things, because what actually happened when I started to check is that the system I was logged in to reported something like 'cannot resolve host <X>'. That was a serious problem.

(I could resolve our hostnames from an outside machine, which turned out to be very handy since I needed some way to get their IPs so I could log into them.)

We have a pair of recursive OpenBSD-based resolvers; they had booted and could resolve external names, but they couldn't resolve any of our own names. Our configuration uses Unbound backed by NSD, where the NSD on each resolver is supposed to hold a cached copy of our local zones that is refreshed from our private master. In past power shutdowns, this has allowed the resolvers to boot and serve DNS data from our zones even without the private master being up, but this time around it didn't; both NSDs returned SERVFAIL when queried and in 'nsd-control zonestatus' reported things like:

zone: <our-zone>
      state: refreshing
      served-serial: none
      commit-serial: none
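
(If you wanted to check for this state automatically, a rough sketch might look like the following. It assumes the 'key: value' layout shown above is all there is to the zonestatus output, which I haven't verified across NSD versions, and the zone names in the sample are made up for illustration.)

```python
# Sketch: find zones that NSD is not serving, based on text in the
# 'nsd-control zonestatus' layout quoted above. The layout is an
# assumption taken from that excerpt; zone names below are hypothetical.

def parse_zonestatus(text):
    """Return {zone_name: {field: value}} from zonestatus-style text."""
    zones = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, since values may contain colons.
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip().strip('"')
        if key == "zone":
            current = value
            zones[current] = {}
        elif current is not None:
            zones[current][key] = value
    return zones


def unserved_zones(text):
    """Names of zones with no served serial, i.e. nothing to answer from."""
    return [
        name
        for name, fields in parse_zonestatus(text).items()
        if fields.get("served-serial", "none") == "none"
    ]


SAMPLE = """\
zone: example.internal
      state: refreshing
      served-serial: none
      commit-serial: none
zone: other.internal
      state: ok
      served-serial: "2019040501 since 2019-04-05T01:00:00"
      commit-serial: "2019040501 since 2019-04-05T01:00:00"
"""

if __name__ == "__main__":
    for name in unserved_zones(SAMPLE):
        print("not serving:", name)
```

(In real use you would feed this the actual command output instead of SAMPLE.)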

Our private master was up, but like everything else it was stalled trying to NFS mount our central administrative filesystem. Since this central filesystem is where our nameserver data lives, this was a hard dependency. This NFS mount turned out to be stalled for two reasons. The obvious and easy to deal with one was that the private master couldn't resolve the hostname of the NFS fileserver. When I tried to mount by IP address, I found the second one; the fileserver itself was refusing mounts because, without working DNS, it couldn't map IP addresses to names to verify NFS mount permission.

(To break this dependency I wound up adding NFS export permission for the IP address of the private master, then manually mounting the filesystem from the fileserver's IP on the private master. This let the boot continue, our private master's nameserver started, our local resolvers could refresh their zones from it, and suddenly internal DNS resolution started working for everyone. Shortly afterward, everyone could at least get the central administrative filesystem mounted.)
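
(The loop here is easier to see when written down as a graph. The following is only an illustrative sketch; the node names are my own labels for the pieces described above, not anything from our real configuration, and the edges encode how things behaved tonight with the resolvers holding no cached zones.)

```python
# Sketch: model the boot-time dependencies as a graph and look for a cycle.
# Node names are illustrative labels, not real hosts or services.

def find_cycle(deps):
    """Return a list of nodes forming a dependency cycle, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: that is the cycle.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# "X": ["Y"] means X cannot come up until Y is working.
AS_BOOTED = {
    "internal DNS": ["resolver zones"],
    "resolver zones": ["private master nameserver"],
    "private master nameserver": ["private master NFS mount"],
    "private master NFS mount": ["internal DNS", "fileserver export check"],
    "fileserver export check": ["internal DNS"],
}

# After exporting to the private master's IP and mounting from the
# fileserver's IP, that one mount no longer needs DNS at all.
AFTER_FIX = dict(AS_BOOTED)
AFTER_FIX["private master NFS mount"] = []

if __name__ == "__main__":
    print("as booted:", find_cycle(AS_BOOTED))
    print("after fix:", find_cycle(AFTER_FIX))
```

(The manual fix removes only the private master's edges into DNS; every other machine still depends on DNS for its mounts, which is fine once DNS itself can come up.)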

So, apparently it really always is DNS, even when you think it won't be and you've tried to engineer things so that your DNS will always work (and when it's worked right in the past).

Written on 05 April 2019.

Last modified: Fri Apr 5 01:42:40 2019