2023-06-10
A potential issue with outstanding query limits in your DNS resolver
A while back we had a little incident where our internal forwarding DNS resolvers stopped resolving internal names. This started happening very shortly after the university's general Internet connectivity was interrupted, and while I believe we're still not entirely sure about the root cause, we have a theory that feels plausible. That theory is that we ran into limits on outstanding queries, and rate limits in general, in the DNS resolver software that our internal forwarding DNS servers were running.
It's routine for a lot of systems inside our network to be looking up external DNS names (and these days many of those names have low TTLs for various reasons). Some number of the systems and programs doing these lookups have fast retries and low backoff periods. When our Internet connection goes away, the DNS lookups that require external queries pile up on our DNS resolvers. The resolver receives the query and fires off the external query, that query vanishes into the darkness of our Internet connection being down, and the DNS resolver has to wait for a generous timeout before it tells the source that there's been a temporary DNS failure. In the meantime the source may time out itself and retry, and certainly other sources are going to make other queries that also send out their own external queries.
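To get a feel for how fast this pile-up can grow, here is a back-of-the-envelope sketch using Little's law (average outstanding work is arrival rate times the time each item is held). The query rate and both timings are made-up illustrative numbers, not measurements from our resolvers.

```python
def outstanding_queries(arrival_rate_per_sec: float, hold_time_sec: float) -> float:
    """Little's law: average number outstanding = arrival rate * time held."""
    return arrival_rate_per_sec * hold_time_sec


# Hypothetical numbers: 200 external lookups a second arriving at the resolver.
QUERIES_PER_SEC = 200

# Normal operation: assume an upstream answer comes back in about 50 ms.
normal = outstanding_queries(QUERIES_PER_SEC, 0.05)

# Outage: assume every external query is held for a 10 second timeout.
outage = outstanding_queries(QUERIES_PER_SEC, 10.0)

print(f"normal: about {normal:.0f} outstanding queries")
print(f"outage: about {outage:.0f} outstanding queries")
```

With these assumed numbers the same query load goes from roughly ten outstanding queries to roughly two thousand, and that's before any clients start retrying and pushing the arrival rate up.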
It seems entirely plausible that at some point our DNS resolver declares that enough is enough; it has too many outstanding queries to start any new ones. But once you hit this limit, it doesn't just affect new external queries; it also affects new internal queries that might need to be routed to our internal authoritative DNS servers. Our metrics system doesn't have insight into DNS server metrics on our DNS resolvers, but it does give us network packets per second rates, and over the course of the incident these behaved more or less as you'd expect from something like this (with elevated incoming packet rates and bursts of high incoming and outgoing packets).
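The theory can be illustrated with a toy model. This is only a sketch of the idea, assuming a resolver with a single shared cap on outstanding queries; `ToyResolver`, its limit of five, and its behavior are all hypothetical and aren't taken from any real DNS server.

```python
from dataclasses import dataclass


@dataclass
class ToyResolver:
    """Hypothetical resolver with one shared pool of outstanding-query slots."""
    max_outstanding: int
    outstanding: int = 0

    def query(self, upstream_reachable: bool) -> str:
        # The limit is checked before we know or care where the query
        # would be sent, so internal and external queries share it.
        if self.outstanding >= self.max_outstanding:
            return "refused"
        if upstream_reachable:
            # The upstream answers promptly, so no slot stays occupied.
            return "answered"
        # The upstream query vanished; the slot is held until a timeout.
        self.outstanding += 1
        return "pending"

    def expire(self, count: int) -> None:
        """Release slots whose upstream queries have finally timed out."""
        self.outstanding = max(0, self.outstanding - count)


def simulate_outage():
    resolver = ToyResolver(max_outstanding=5)
    # Internet is down: external queries each tie up a slot.
    external = [resolver.query(upstream_reachable=False) for _ in range(5)]
    # Our internal authoritative servers are fine, but no slots are left.
    internal_during = resolver.query(upstream_reachable=True)
    # Once some external queries time out, internal lookups work again.
    resolver.expire(2)
    internal_after = resolver.query(upstream_reachable=True)
    return external, internal_during, internal_after


if __name__ == "__main__":
    print(simulate_outage())
```

In this model the internal query is refused even though the server it would go to is perfectly reachable, simply because external queries got to the shared slots first.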
One of the interesting aspects of this for me is that it points out a subtle benefit of DNS server software that can be both a recursive resolver and a zone secondary. Unlike a plain recursive resolver, a zone secondary can always answer queries for any name in your zone (including nonexistent ones) from its own local copy of the zone, with no need to make what it sees as 'external' queries. Modern DNS servers tend to split this into two programs, one that will do recursive resolving and one that provides 'authoritative' answers for zones, but that makes it harder to ensure that some queries can (almost) always be answered even under unusual circumstances.