2023-09-25
Splitting our local DNS resolvers apart to serve different audiences
We operate a collection of our own servers, and also a bunch of internal networks for other people's machines. As part of this we operate our own DNS infrastructure, including local DNS recursive resolvers (which are necessary to handle things like our split horizon DNS setup). Historically we have used one set of local DNS resolvers to handle everything; both DNS lookups from our own servers and DNS lookups from other people's internal machines go to the same DNS resolvers. After all, why not? It's simpler that way. Well, until things go wrong, which they have now done more than once.
It's a sad reality of modern life that you cannot count on arbitrary machines (or pieces of software) being sensible DNS clients. Every so often you're going to have a machine or a piece of software that freaks out or has something go wrong such that it sends your DNS resolvers a flood of queries, sometimes for DNS names that don't exist or don't currently resolve; this can happen if, for example, there's a program that rapidly retries failed DNS lookups (or DNS lookups that merely didn't get answered fast enough). Therefore, DNS resolvers that handle traffic from arbitrary clients are very likely to get hammered every so often, and if you're unlucky they'll be sufficiently badly affected that other clients start having their queries fail.
We've recently come to accept this sad truth, along with its corollary: it's a bit problematic to use the same local DNS resolvers for both third party machines not under our control and our own servers, or at least the important and carefully controlled ones like our fileservers or our Prometheus based monitoring (parts of which need regular DNS lookups). Machines such as our SLURM based compute servers are rather more like third party machines, because they run arbitrary user programs and arbitrary user programs can have arbitrary DNS behavior (especially in a Computer Science research environment; a corporate Unix environment is probably less unpredictable).
What we've decided to do is to make some of our machines use another internal DNS resolver as their normal default resolver (currently by listing it first in /etc/resolv.conf; our other DNS resolvers remain listed, acting as fallbacks). This new resolver is exactly the same as our current DNS resolvers; it's just not used by other people's machines. We hope that this new 'private' resolver will be less likely to have surprise problems, so critical core systems will be more likely to keep working during DNS problems.
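To make the resolv.conf ordering concrete, here is a minimal sketch of generating such a file with the 'private' resolver first and the shared resolvers after it as fallbacks. This is an illustration, not our actual tooling: the function name, the search domain, and all addresses (taken from the 192.0.2.0/24 documentation range) are made up. The one real constraint it reflects is that glibc's resolver only uses the first three nameserver lines, so putting the private resolver first leaves room for at most two fallbacks.

```python
# Sketch: build /etc/resolv.conf text with a 'private' resolver listed first.
# The function name and all addresses are hypothetical (documentation range).

MAX_NAMESERVERS = 3  # glibc's resolver only uses the first three nameserver lines


def render_resolv_conf(private, shared, search=None):
    """Return resolv.conf text: private resolver first, shared ones as fallbacks."""
    servers = []
    for addr in [private, *shared]:
        if addr not in servers:  # drop duplicates while keeping order
            servers.append(addr)
    lines = []
    if search:
        lines.append("search " + " ".join(search))
    # Anything past the third nameserver would be ignored, so don't emit it.
    lines.extend(f"nameserver {addr}" for addr in servers[:MAX_NAMESERVERS])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_resolv_conf("192.0.2.10", ["192.0.2.53", "192.0.2.54"],
                             search=["example.org"]), end="")
```

Machines that stay on the shared resolvers would simply get the same file without the private entry, which is part of why listing order is a convenient place to make the split.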
(DNS problems are still problems, because we need to provide working DNS to people. But there's a difference between having problems and having our fileservers start refusing NFS access because they can't map IPs to names. If NFS breaks, everything breaks.)