2024-06-22
A Prometheus Blackbox gotcha: (UDP) DNS replies have a low size limit
For reasons beyond the scope of this entry, we use our Prometheus setup to monitor if we can resolve certain external host names, by doing Blackbox probes to various DNS servers, both our internal resolvers and external ones. Ever since we added this check, we've had weird issues where one of our internal resolvers would periodically fail the check for one particular host name for tens of minutes. This host name involves a long chain of CNAME records and ends with some A records. According to more detailed Blackbox information, the query wasn't failing, it was just not returning all of the information, omitting the A records that we needed. We came up with all sorts of theories about why our DNS server might not be able to fully resolve the CNAME chain, but couldn't find a smoking gun or a firm fix.
Then the other day I was looking at debug output and noticed this:
[...] level=info msg="Got response" response=";; [...] \n;; flags: qr tc rd ra; [...]
(This is in a very long line that puts 'dig' style output for the entire answer in the message, and this whole collection of diagnostic log information is not normally logged as such, merely visible for a while in the Blackbox web interface.)
Did you notice that 'tc' in the flags? That's the flag that is set to indicate a DNS response that has been truncated because it doesn't fit within the size limit. This truncation is what was actually going wrong in our DNS check. This particular DNS name has a chain of CNAMEs, and the providers involved change the CNAMEs relatively rapidly, and some of the time the CNAMEs used were long enough that they pushed the A records our module was looking for out of the truncated DNS reply from our internal DNS resolvers.
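(To make the 'tc' flag concrete, here is a minimal Python sketch of reading the flag bits out of a raw DNS message header. The bit positions come from the DNS header layout in RFC 1035; the function names header_flags() and is_truncated() are made up for illustration, and this is not how Blackbox itself does it.)

```python
import struct

# Bit masks within the 16-bit flags word of the DNS header (RFC 1035, section 4.1.1).
FLAG_BITS = {
    "qr": 0x8000,  # this message is a response
    "aa": 0x0400,  # authoritative answer
    "tc": 0x0200,  # truncated: the reply did not fit in the size limit
    "rd": 0x0100,  # recursion desired
    "ra": 0x0080,  # recursion available
}


def header_flags(message):
    """Return the names of the flags set in a raw DNS message, dig style."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    # The header starts with the 16-bit query ID followed by the 16-bit flags word.
    _ident, flags = struct.unpack("!HH", message[:4])
    return [name for name, bit in FLAG_BITS.items() if flags & bit]


def is_truncated(message):
    """True if the message has the 'tc' flag set."""
    return "tc" in header_flags(message)


# A hand-built header with the same flags as the log line above.
example = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 3, 0, 0)
print(" ".join(header_flags(example)))
```

A client that sees 'tc' in a UDP reply is normally expected to retry the query over TCP, which is the step that is missing here.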
As of Blackbox 0.25.0, the Blackbox DNS prober defaults to using UDP, doesn't set any EDNS options to increase the allowed reply size, and doesn't fall back to retrying queries over TCP if a UDP query is truncated. This means Blackbox has the old default UDP DNS reply size limit of 512 bytes, which can easily be exceeded with a large enough CNAME chain, among other things. Unfortunately, there is currently no probe metric that will tell you this has happened.
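(For illustration, this is roughly what 'setting an EDNS option to increase the allowed reply size' means on the wire. The sketch below builds a plain query and the same query with an EDNS OPT pseudo-record, per RFC 6891, that advertises a larger UDP payload size. The names encode_name() and build_query() are hypothetical, and the 1232-byte size is only an example value I picked, not anything Blackbox uses.)

```python
import struct


def encode_name(name):
    """Encode a host name as length-prefixed DNS labels ending in the root label."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=1, edns_udp_size=None, ident=0x1234):
    """Build a raw DNS query; qtype 1 is an A record query.

    Without edns_udp_size, a server must assume the classic 512-byte UDP limit.
    With it, an OPT pseudo-record tells the server how large a UDP reply we accept.
    """
    arcount = 1 if edns_udp_size else 0
    # Header: ID, flags (only 'rd' set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT.
    header = struct.pack("!HHHHHH", ident, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    message = header + question
    if edns_udp_size:
        # OPT RR: root name, TYPE 41, CLASS reused as the UDP payload size,
        # TTL reused for extended rcode/version/flags (all zero), empty RDATA.
        message += b"\x00" + struct.pack("!HHIH", 41, edns_udp_size, 0, 0)
    return message


plain = build_query("example.org")
with_edns = build_query("example.org", edns_udp_size=1232)
print(len(plain), len(with_edns))
```

The OPT record only costs eleven extra bytes in the query, but without it the server has no way to know that the client could accept more than 512 bytes over UDP.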
(If you are sure you know how many answer, authority, and additional DNS RRs will be returned by the query, you can check those metrics, but that won't distinguish between a truncated reply and the DNS server doing something odd.)
The current Blackbox workaround is to change your Blackbox module to use TCP instead of UDP, since DNS over TCP doesn't have this sort of size limit. Unfortunately not all of the DNS servers we care about accept TCP connections (they're not ours, don't ask), so in practice we had to duplicate our Blackbox module to get a TCP version of it, and then switch the checks of our internal DNS servers to use the new TCP query module.
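(The reason TCP escapes the 512-byte limit is that DNS over TCP prefixes every message with a two-byte length, so a single message can be up to 65535 bytes. Here is a minimal Python sketch of that framing, assuming you already have a raw query message in hand. The function names are made up for illustration, and in Blackbox itself you would switch the module's transport rather than write anything like this.)

```python
import socket
import struct


def frame_for_tcp(message):
    """Prefix a raw DNS message with its 16-bit length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message


def read_exact(sock, count):
    """Read exactly count bytes from a stream socket, or raise if it closes early."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("connection closed in the middle of a DNS message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_framed(sock):
    """Read one length-prefixed DNS message from a stream socket."""
    (length,) = struct.unpack("!H", read_exact(sock, 2))
    return read_exact(sock, length)


def query_over_tcp(server, message, port=53, timeout=5.0):
    """Send one raw DNS query over TCP and return the raw reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(frame_for_tcp(message))
        return read_framed(sock)
```

Calling query_over_tcp() against a resolver that refuses TCP will simply fail to connect or time out, which is the situation we have with some of the external servers.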
I think this behavior has some uses; for example, you may want to know if your DNS replies are now too big for non-EDNS UDP clients. However, I think that Blackbox should definitely let you find out if the DNS reply was truncated (ie, had the 'tc' flag set). I also wouldn't mind if a more friendly and modern DNS query process was the Blackbox default, and you had to specifically request a limited version. I suspect that there are various people using Blackbox who don't know just how minimal their DNS probes currently are.
(All of this behavior comes about not directly through Blackbox but through Blackbox doing its DNS queries with github.com/miekg/dns, which documents its behavior in Client.Exchange(). I've filed Blackbox issue #1258 and issue #1259 about this overall situation, so maybe someday we'll be able to see the truncation status in probe metrics and set the EDNS option for a larger message size.)