A Prometheus Blackbox gotcha: (UDP) DNS replies have a low size limit

June 22, 2024

For reasons beyond the scope of this entry, we use our Prometheus setup to monitor whether we can resolve certain external host names, by doing Blackbox probes to various DNS servers, both our internal resolvers and external ones. Ever since we added this check, we've had weird issues where one of our internal resolvers would periodically fail the check for one particular host name for tens of minutes. This host name involves a long chain of CNAME records and ends with some A records. According to more detailed Blackbox information, the query wasn't failing; it just wasn't returning all of the information, omitting the A records that we needed. We came up with all sorts of theories about why our DNS server might not be able to fully resolve the CNAME chain, but couldn't find a smoking gun or a firm fix.

Then the other day I was looking at debug output and noticed this:

[...] level=info msg="Got response" response=";; [...] \n;; flags: qr tc rd ra; [...]

(This is part of a very long line that puts 'dig' style output for the entire answer in the message. This whole collection of diagnostic information isn't normally logged as such; it's merely visible for a while in the Blackbox web interface.)

Did you notice that 'tc' in the flags? That's the flag that is set to indicate a DNS response that has been truncated because it doesn't fit within the size limit. This truncation is what was actually going wrong in our DNS check. This particular DNS name has a chain of CNAMEs and the providers involved change the CNAMEs relatively rapidly. Some of the time, the CNAMEs in use were long enough that they pushed the A records our module was looking for out of the truncated DNS reply from our internal DNS resolvers.
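
To make the 'tc' flag concrete, here is a minimal Python sketch (not anything Blackbox itself does) that decodes the flag bits from the 12-byte header of a raw DNS message. The bit positions come from RFC 1035; the function names and the hand-built sample header are hypothetical and only there for illustration.

```python
import struct

# Flag bits in the second 16-bit word of the DNS header (RFC 1035, section 4.1.1).
FLAG_BITS = {
    "qr": 0x8000,  # this message is a response
    "aa": 0x0400,  # authoritative answer
    "tc": 0x0200,  # truncated: the reply did not fit within the size limit
    "rd": 0x0100,  # recursion desired
    "ra": 0x0080,  # recursion available
}


def header_flags(message: bytes) -> list:
    """Return the names of the header flags set in a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS message starts with a 12-byte header")
    _msg_id, flags = struct.unpack("!HH", message[:4])
    return [name for name, bit in FLAG_BITS.items() if flags & bit]


def is_truncated(message: bytes) -> bool:
    """True if the message has the 'tc' flag set."""
    return "tc" in header_flags(message)


# Hypothetical header with the same flags as the log line above; all of the
# record counts are zero because only the flags matter here.
sample = struct.pack("!HHHHHH", 0x1234, 0x8380, 0, 0, 0, 0)
print(" ".join(header_flags(sample)))  # prints "qr tc rd ra"
```

A check like is_truncated() is the kind of information that would have pointed straight at the problem, instead of leaving it to be spotted by eye in the debug output.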

As of Blackbox 0.25.0, the Blackbox DNS prober defaults to using UDP, doesn't set any EDNS options to increase the allowed reply size, and doesn't fall back to retrying queries over TCP if a UDP query is truncated. This means Blackbox has the old default UDP DNS reply size limit of 512 bytes, which can easily be exceeded with a large enough CNAME chain, among other things. Unfortunately, there is currently no probe metric that will tell you this has happened.

(If you are sure you know how many answer, authority, and additional DNS RRs will be returned by the query, you can check those metrics, but that won't distinguish between a truncated reply and the DNS server doing something odd.)
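
For illustration, this is roughly what "setting an EDNS option to increase the allowed reply size" means on the wire. It is a Python sketch under stated assumptions, not what Blackbox or github.com/miekg/dns does: the function names are hypothetical, the query ID is fixed for simplicity, and the 1232 byte size is just a commonly used value rather than anything from this entry.

```python
import struct
from typing import Optional


def encode_name(name: str) -> bytes:
    """Encode a host name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"bad label in {name!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = 1, edns_size: Optional[int] = None,
                msg_id: int = 0x1234) -> bytes:
    """Build a recursive query; if edns_size is set, append an EDNS OPT record."""
    arcount = 1 if edns_size is not None else 0
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT.
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    if edns_size is None:
        return header + question
    # OPT pseudo-record (RFC 6891): root name, type 41, the CLASS field carries
    # the UDP payload size we will accept, the TTL field is zero here, and
    # there are no options (RDLENGTH=0).
    opt = b"\x00" + struct.pack("!HHIH", 41, edns_size, 0, 0)
    return header + question + opt


plain = build_query("www.example.org")                  # no EDNS
edns = build_query("www.example.org", edns_size=1232)   # advertises 1232 bytes
print(len(plain), len(edns))
```

With edns_size left as None the query carries no OPT record, so a server is expected to keep its UDP reply within 512 bytes and set 'tc' if it can't, which is the situation described above.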

The current Blackbox workaround is to change your Blackbox module to use TCP instead of UDP, which doesn't have this sort of size limit. Unfortunately not all DNS servers we care about accept TCP connections (they're not ours, don't ask), so in practice we had to duplicate our Blackbox module to get a TCP version of it, and then switch the checks of our internal DNS servers over to the new TCP query module.

I think this behavior has some uses; for example, you may want to know if your DNS replies are now too big for non-EDNS UDP clients. However, I think that Blackbox should definitely let you find out if the DNS reply was truncated (i.e., had the 'tc' flag set). I also wouldn't mind if a more friendly and modern DNS query process was the Blackbox default, and you had to specifically request a limited version. I suspect that there are various people using Blackbox who don't know just how minimal their DNS probes currently are.
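
As a sketch of what a friendlier query process could look like, the following Python tries UDP first and retries over TCP only if the UDP reply has the 'tc' flag set. This is not how Blackbox works today, and all of the names here are hypothetical. It also makes simplifying assumptions: IPv4 only, port 53, and no check that the reply ID matches the query.

```python
import socket
import struct

TC_BIT = 0x0200  # the "truncated" bit in the DNS header flags word


def is_truncated(reply: bytes) -> bool:
    """True if a raw DNS reply has the 'tc' flag set."""
    return len(reply) >= 4 and bool(struct.unpack("!H", reply[2:4])[0] & TC_BIT)


def udp_exchange(query: bytes, server: str, timeout: float = 5.0) -> bytes:
    """Send one query over UDP and return the raw reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(query, (server, 53))
        reply, _addr = s.recvfrom(4096)
        return reply


def recv_exactly(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed mid-message")
        buf += chunk
    return buf


def tcp_exchange(query: bytes, server: str, timeout: float = 5.0) -> bytes:
    """Send one query over TCP, where each message has a 2-byte length prefix."""
    with socket.create_connection((server, 53), timeout=timeout) as s:
        s.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", recv_exactly(s, 2))
        return recv_exactly(s, length)


def resolve(query: bytes, server: str, udp=udp_exchange, tcp=tcp_exchange) -> bytes:
    """Try UDP first; retry over TCP only if the UDP reply was truncated."""
    reply = udp(query, server)
    if is_truncated(reply):
        reply = tcp(query, server)
    return reply
```

The transports are passed in as parameters so the fallback logic can be exercised with fake functions, without touching the network. Against a server that refuses TCP, a truncated UDP reply would turn into a connection error here, which is the same problem that forced us into two separate modules.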

(All of this behavior comes about not directly through Blackbox but through Blackbox doing its DNS queries with github.com/miekg/dns, which documents its behavior in Client.Exchange(). I've filed Blackbox issue #1258 and issue #1259 about this overall situation, so maybe someday we'll be able to see the truncation status in probe metrics and set the EDNS option for a larger message size.)


Comments on this page:

Unfortunately not all DNS servers we care about accept TCP connections (they're not ours, don't ask)

Since you care about those DNS servers, you could consider asking their operators nicely to enable TCP support.

The "best current practice" "requires" TCP support for queries since at least March 2022 (RFC 9210, DNS Transport over TCP - Operational Requirements). The "standards" "require" TCP support in DNS server implementations not only for zone transfers, but also for queries, since at least August 2010 (RFC 5966, DNS Transport over TCP - Implementation Requirements).

I agree, Erik Auerswald. Waving an Internet standard in someone's face like it matters is how IPv6 went on to replace IPv4.

If I ever implement a DNS server and client, I never intend to support TCP, because it's downright painful compared to UDP.

Someone mentioned this as a focused alternative: https://github.com/tykling/dns_exporter/

I have had success with convincing a third party DNS server operator to allow queries via TCP by pointing out that this was the expected behavior at the time. Previously, blocking TCP queries was seen as normal, and they were not aware that this had changed.

Since every DNS server software that can be exposed to the Internet without falling over already supports queries via TCP, they just had to open TCP port 53 in the firewall.

Some providers do care about following standards. I view providing a reference to the relevant current standards as mandatory, not as "waving it in somebody's face".

Well, Erik Auerswald, what's the status on getting this third party to return useless results in some cases, as prescribed in IETF RFC 8482, which exists solely to benefit Cloudflare and no one else?

Some providers do care about following standards. I view providing a reference to the relevant current standards as mandatory, not as "waving it in somebody's face".

The standards increasingly exist to benefit individual corporations which stand to benefit from changing the rules. We see this most strongly with DNS and HTTP, the latter repeatedly redefined by Google for its gain at the expense of everyone else.

Written on 22 June 2024.

Last modified: Sat Jun 22 21:52:13 2024