Maybe a local, on-machine caching DNS resolver should be standard (for us)

April 26, 2021

We have traditionally configured our Ubuntu servers to have an /etc/resolv.conf that points at our central recursive DNS resolvers. People in research group sandbox networks have generally done likewise, partly because it's usually been the easiest thing to do. Machines have to consult our local resolvers in order to correctly look up other local machines, and once you're doing that you might as well not add any extra layers (which have generally taken extra work to add). But there's a downside to this configuration.

Every so often someone either writes or runs a program that does a lot of hostname lookups. Often this is part of making a lot of connections, for example to fetch a bunch of external resources. Very few programming languages and standard libraries cache the results of those lookups even if they are all of the same hostname (and for good reason, especially in a world where the IP associated with a hostname can change rapidly). But in our environment, this results in a flood of requests to our local resolvers, a flood that would be drastically reduced by even a little bit of local caching. Local caching would also make the responses faster, since even on the same network, an over-the-network DNS query is slower than querying a daemon on your own machine.
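To illustrate how little caching it takes to absorb such a flood, here is a sketch of a minimal TTL-bounded cache wrapped around Python's socket.getaddrinfo(). The class name and the five-second TTL are my own choices for illustration, not anything from our actual setup:

```python
import socket
import time

class CachingResolver:
    """A tiny TTL-bounded lookup cache (a sketch, not production code)."""

    def __init__(self, ttl=5.0, resolve=socket.getaddrinfo):
        self.ttl = ttl          # seconds to keep an answer around
        self.resolve = resolve  # underlying lookup function
        self._cache = {}        # (host, port) -> (when cached, answer)

    def getaddrinfo(self, host, port):
        key = (host, port)
        hit = self._cache.get(key)
        now = time.monotonic()
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]       # still fresh: no network query at all
        answer = self.resolve(host, port)
        self._cache[key] = (now, answer)
        return answer
```

A program fetching a thousand resources from the same host through something like this would make one DNS query every five seconds instead of a thousand queries, which is the effect a local caching daemon gives you without every program having to implement it.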

Adding an extra layer of DNS caching does create some operational issues, especially if it caches negative answers. These issues can be reduced if DNS answers are only cached for a very short amount of time, but that generally takes extra configuration (if it's even possible). It's also traditionally taken an extra setup step and extra configuration in general, which is part of our bias against doing it. However, systemd is on its way to changing that with systemd-resolved, although there are plenty of questions about how it will work in an environment like ours and whether Ubuntu will ever adopt it as a standard part of server installs.
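For what it's worth, systemd-resolved can at least be told not to cache negative answers; Cache= in resolved.conf accepts "no-negative" as well as a boolean. Whether that is enough control for an environment like ours is a separate question. An illustrative fragment:

```
# /etc/systemd/resolved.conf (illustrative)
[Resolve]
# Cache positive answers (per their TTLs) but never negative ones.
Cache=no-negative
```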

So far, we've been aggressive about disabling systemd-resolved in our install system (and haven't set up any other local caching resolver). However, I'm starting to wonder if we should change that, especially if Ubuntu switches to normally wanting systemd-resolved on (so that, for example, netplan is unhappy with you if resolved isn't running).

(To really answer this question we should probably get fine grained query statistics from our DNS servers, or at least packets per second statistics. But that's a longer term project for various reasons.)


Comments on this page:

By Twirrim at 2021-04-26 14:10:25:

I favour having a local DNS cache just because of the "free" speed boost it provides. No matter how close you are to a DNS server, it's hard to beat the response time of a caching resolver on your own machine. It's remarkable just how much software hits DNS.

By Danny Thomas at 2021-04-27 03:53:37:

Even though our monitoring systems were often configured with IP addresses for remote tests, we ended up running nameservers on the monitoring hosts which secondaried some important server zones, in case our main nameservers became difficult to reach. Like many parts of our monitoring, this was driven by real problems we encountered. Things weren't helped by the default timeout/retry in RHEL's resolv.conf (particularly in the early days of IPv6). The server group was keen on search suffixes to avoid typing FQDNs, but for a bad DNS lookup, or when the main nameservers weren't answering, each suffix involved an additional timeout. This is going back a lot of years, but I had thought a system-wide configurable caching resolver sitting between the OS and applications could be useful, because even if libc implemented caching, that would apply to only one process (unless there was a shared cache backend).
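The per-suffix timeouts Danny describes come from how the stub resolver walks resolv.conf: each search suffix is tried against each nameserver, with a timeout for every attempt. The usual knobs are the timeout and attempts options plus keeping the search list short. A purely illustrative example (these are not anyone's real servers or domains):

```
# /etc/resolv.conf (illustrative values)
search cs.example.edu example.edu
nameserver 127.0.0.1
nameserver 192.0.2.53
# fail over faster than the traditional 5-second timeout
options timeout:1 attempts:2
```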

These instances of bind forwarded queries to our main nameservers and cached results for maybe 5 seconds. While TTLs for delegation records were long, most records in the secondaried (server) zones were much shorter in case a quick change was needed.
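A minimal sketch of that arrangement in named.conf terms might look like the following; the addresses, zone name, and exact TTL caps are invented, but max-cache-ttl and max-ncache-ttl are the relevant BIND options for keeping cached answers (positive and negative respectively) very short-lived:

```
options {
    // forward ordinary queries to the main nameservers ...
    forwarders { 192.0.2.10; 192.0.2.11; };
    // ... but cap how long any answer lingers in the local cache.
    max-cache-ttl 5;
    max-ncache-ttl 5;
};

// secondary the important server zones locally, so lookups in them
// keep working even when the main nameservers are hard to reach.
zone "example.com" {
    type slave;
    masters { 192.0.2.10; };
    file "sec/example.com";
};
```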

Adding nameservers does involve extra configuration including their own tests, though we already had tests that would check

 * all nameservers were authoritative for their zones
 * SOA serial number matched that of the DNS master (with some slop for
   zones with lots of dynamic updates; ideally forward/reverse zones
   with DDNS aren't too big)

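The "some slop" serial comparison above can be expressed with DNS serial number arithmetic, which wraps around at 2^32 (RFC 1982). A sketch, with the function names being my own:

```python
def serial_behind_by(master, secondary):
    """How far a secondary's SOA serial lags the master's, using
    32-bit wrap-around DNS serial arithmetic (RFC 1982)."""
    return (master - secondary) % (2 ** 32)

def serial_ok(master, secondary, slop=0):
    """True if the secondary is within `slop` increments of the master."""
    return serial_behind_by(master, secondary) <= slop
```

The modulus keeps the check correct even when a serial wraps past 2^32 - 1 back to small numbers, which a naive subtraction would flag as hugely out of date.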
By Miksa at 2021-04-27 06:54:03:

The idea of a local cache is nice, but I would need a reliable one. My most notable experience with systemd-resolved was when we had a short DNS outage at work and resolved somehow got jammed and couldn't resolve anything afterwards. Restarting resolved didn't fix it, so I resorted to rebooting my Ubuntu virtual machine, although supposedly restarting networking might have been enough.

By George at 2021-04-29 09:03:01:

I believe DNS has a negative caching time in SOA records, so the owner of the zone can control the negative cache time.
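Right; per RFC 2308, the negative caching time is the lesser of the SOA record's own TTL and its minimum field, the last numeric field of the SOA. In an illustrative zone file:

```
; the final SOA field caps how long resolvers may cache
; negative answers (NXDOMAIN, NODATA) for this zone
example.com.  3600  IN  SOA  ns1.example.com. hostmaster.example.com. (
                  2021042601  ; serial
                  7200        ; refresh
                  900         ; retry
                  1209600     ; expire
                  300 )       ; minimum / negative-cache TTL (RFC 2308)
```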

This is something I never used to be concerned with (I vaguely assumed that the resolver and/or client libs did some caching) until I watched the DNS traffic with tcpdump on one of our app servers. Turns out every request to something outside the local host (e.g. a database or LDAP query - those IPs never change) generated a lookup each time. After that, I started enabling dnsmasq within NetworkManager on CentOS. Across our estate, this should cumulatively add up to a fair amount of saved lookups, network trips and millisecond waits, and gives us a bit of resilience against unavailable name servers. (It also better balances queries across both the configured servers.)
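Enabling that is a one-line change; with dns=dnsmasq in its [main] section, NetworkManager spawns a dnsmasq instance as a local caching forwarder on 127.0.0.1 and points resolv.conf at it:

```
# /etc/NetworkManager/NetworkManager.conf
[main]
# run a local dnsmasq cache and use it as the system resolver
dns=dnsmasq
```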

From 193.219.181.219 at 2021-05-06 04:21:04:

I wish systemd-resolved were a little more reliable... My preference for DNS is Unbound – had a few poor experiences with dnsmasq (not to mention its ugly configuration language).

But in addition to that, glibc itself comes with the nscd daemon, which isn't a DNS cache but rather an nsswitch cache. It'll cache file and LDAP lookups for getpwnam(), NIS lookups for innetgr(), DNS lookups for gethostbyname(), and so forth.

All of the nsswitch calls always look for the nscd socket first, so there's not much in the way of installation; the cache is used as soon as nscd is running.
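The hosts cache is controlled per-database in /etc/nscd.conf; a sketch with made-up TTLs (note that nscd uses these fixed lifetimes rather than honoring DNS record TTLs, which is one of the traditional criticisms of it as a DNS cache):

```
# /etc/nscd.conf fragment (illustrative TTLs, in seconds)
enable-cache            hosts   yes
positive-time-to-live   hosts   60
negative-time-to-live   hosts   5
```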

Of course, it won't help with DNS queries that are done using a separate resolver library bypassing nsswitch (e.g. when querying MX or SSHFP, or just when programs use -lresolv), but those are likely in the minority.


Last modified: Mon Apr 26 00:26:23 2021