We're going to be separating our redundant resolving DNS servers
We have a number of OpenBSD machines in various roles; they're our
firewalls, our resolving DNS servers as well as our public authoritative
DNS server, and so on. For pretty much all of these, we actually
have two identical servers per role in a hot spare setup, so that
we can rapidly recover from various sorts of failures. For our
firewalls, switching from one to another takes manual action (we
have to change which one is plugged into the live network, although
their firewall state is synchronized with pfsync so that a switch is low
impact). For our DNS resolvers, we have both on the network and
list both addresses in our /etc/resolv.conf, because this works
perfectly fine with DNS servers.
(All of our machines list the same resolver first, which we consider a feature for reasons beyond the scope of this entry. Our routing firewalls don't use CARP for various reasons, some of them historical, but in practice it doesn't matter, as we haven't had a firewall hardware failure. When we have switched firewalls, it's been for software reasons.)
All of this sounds great, except for the bit that I haven't mentioned: these redundant resolving DNS servers are racked next to each other (one on top of the other), plugged into the same rack PDU, and connected to the same leaf switch. We have great protection against server failure, which is what we designed for, but after we discovered that switches can wind up in weird states after power failures, that no longer feels quite so sufficient, since working DNS is a crucial component of our environment (as we found out in an earlier power failure).
(Most of our paired redundant servers are racked up this way because it's the most convenient option. They're installed at the same time, generally worked on at the same time, and they need the same network connections. For firewalls, in fact, you need to switch their network cables back and forth to change which is the live one.)
So, as the title of this entry says, we're now going to be separating our resolving DNS servers, both physically and for their network connection, so that the failure of a single rack PDU or leaf switch can't take both of them offline. Unfortunately we can't put one DNS server directly on the same switch as our fileservers; the fileserver switch is a 10G-T switch with a very limited supply of ports.
(Now that I write this entry, the obvious question is whether all of our fileservers should be on the same 10G-T switch. Probably it's harmless, because our entire environment will already grind to a halt if even a single fileserver drops off the network, so losing all of them at once to a switch failure isn't much worse.)
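As a rough illustration of the separation goal above, here is a minimal Python sketch that reports which failure domains (rack PDU, leaf switch) a set of hosts has in common. The hostnames, PDU and switch labels, and the shape of the inventory are all made up for this example; they aren't our real names or any real tool's format.

```python
# Hypothetical inventory: every hostname and label below is invented
# for illustration and does not describe a real environment.
PLACEMENT = {
    "dns1": {"pdu": "rack3-pdu", "switch": "leaf-a"},
    "dns2": {"pdu": "rack3-pdu", "switch": "leaf-a"},
}


def shared_failure_domains(placement, hosts):
    """Return sorted (kind, value) pairs that all of `hosts` have in common.

    An empty list means no single PDU or switch listed in the inventory
    can take every one of the hosts offline at once.
    """
    # Only compare the domain kinds that every host actually records.
    kinds = set.intersection(*(set(placement[h]) for h in hosts))
    shared = []
    for kind in sorted(kinds):
        values = {placement[h][kind] for h in hosts}
        if len(values) == 1:
            shared.append((kind, values.pop()))
    return shared


if __name__ == "__main__":
    for kind, value in shared_failure_domains(PLACEMENT, ["dns1", "dns2"]):
        print(f"dns1 and dns2 share {kind} {value}")
```

With the pair racked together as described, this reports both the PDU and the switch. Once one server moves to a different rack and leaf switch, the list should come back empty.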
PS: I suspect that our resolving DNS servers are the only redundant pair that is really important to separate this way, but it's clearly something we should think about for the others. We could at least add some extra redundancy for our VPN servers by separating the pairs, and that might be important during a serious problem.