DNS queries to external sources do fail every so often out of the blue
It's tempting to think that DNS is a reliable environment in general practice, where if you're using a good resolving DNS server (including public ones run by e.g. Google and Cloudflare) and querying for major domains that have well-run DNS servers, you won't see failures. After all, if you ask Google's 8.8.8.8 for the DNS A record (IP address) of amazon.com, you would expect it to always work unless something terrible has gone wrong.
For our own reasons, we use Prometheus's Blackbox to make "black box" probes of various endpoints. Included among the probes we make are a variety of DNS probes to a variety of DNS servers. We started out checking to see if our own domains were resolvable, but then we extended this to querying other domains as a cross-check. And since this is part of our Prometheus and Grafana setup, we store all the results and show them on some dashboards. The result is, in some sense, depressing.
Individual DNS queries regularly fail. It doesn't happen very often, but it happens often enough that if we're looking at a one-hour dashboard, we can expect to see at least one failure. In perhaps unsurprising news, queries fail more often to external DNS servers than to internal ones (even when looking up external names), and it happens both when querying public resolvers and when querying primary DNS servers for data they hold.
In typical use of DNS these failures are masked, because most resolvers and I believe most clients will automatically retry at least once or twice. Blackbox is an exception; although it's not documented, it makes only a single DNS query attempt and reports the result of that one attempt. In the default configuration where you're making a UDP-based DNS query, that will be a single DNS UDP query packet, so all of the usual UDP things can happen to it and to the reply (on top of the DNS server you're querying just not answering you).
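To make the difference between a single attempt and a retrying client concrete, here is a minimal Python sketch. It is not how Blackbox is implemented; the function names (`build_query`, `query_once`, `query_with_retries`), the fixed query ID, and the two-second timeout are all assumptions made up for illustration. It hand-builds one A record query, sends it as a single UDP packet, and treats a lost packet, a timeout, or a non-zero response code as a failure.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for an A record (type 1, class IN)."""
    # Header: ID, flags (only 'recursion desired' set), one question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)


def query_once(server: str, name: str, timeout: float = 2.0) -> bool:
    """Send exactly one UDP query; True only if a good reply arrives in time."""
    query_id = 0x1234  # fixed for illustration; real clients randomize this
    packet = build_query(name, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts (a dropped query or reply) and network errors.
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    is_response = (flags & 0x8000) != 0
    rcode = flags & 0x000F
    return reply_id == query_id and is_response and rcode == 0


def query_with_retries(attempt, tries: int = 3):
    """Call attempt() up to 'tries' times; return (succeeded, attempts_used)."""
    for n in range(1, tries + 1):
        if attempt():
            return True, n
    return False, tries
```

Calling `query_once("8.8.8.8", "amazon.com")` is the single-shot case, where one dropped packet is a visible failure. Wrapping it as `query_with_retries(lambda: query_once("8.8.8.8", "amazon.com"))` is roughly what a retrying client does, and it is why the same dropped packet normally goes unnoticed.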
In a way this shouldn't really surprise me. I know that the general Internet is a broadly unreliable place, where packets can and do get dropped on a regular basis. But usually everything works well enough that we can ignore that and assume that even UDP and ICMP packets are just going to get through. Not always, though, as this demonstrates.
PS: In a way it's especially surprising to see this with big public resolvers like the ones Google and Cloudflare run, because both have points of presence here in Toronto, so anycast routing means our network path to both of them is fairly short. Right now, both Google and Cloudflare appear to be directly connected to the Toronto Internet Exchange.