Prometheus Blackbox 0.23.0 has added a nice improvement to its DNS checks
Over on the Fediverse, I said:
Small Prometheus thing I am happy about: the latest version of the Blackbox exporter now has a 'probe_dns_query_succeeded' metric, so you can tell the difference between 'the DNS server isn't talking to me' and 'the DNS server sent me an answer without the data I was looking for' (or just 'an answer that was fail/go away').
Now I get to (probably) change and update some of our alerts, which can now be more specific.
Our Prometheus system does a bunch of DNS lookups through Blackbox for various purposes. We check that our own DNS servers correctly resolve various things from our domain (and harvest the SOA values in the process so we can verify that everything is in sync), we check that our secondaries properly have our stuff, we check that our internal forwarding resolvers can resolve outside domains, and we check that some outside public DNS servers can resolve things in our domain as a sanity check.
Before the recently released Blackbox 0.23.0, if one of these checks failed we couldn't reliably tell the difference between the DNS server not answering us at all and the DNS server giving us an answer without the right information. For our own DNS servers, this difference generally didn't matter because we have a lot of other monitoring of them. For our secondaries and for public DNS servers, not so much, especially the ones located outside the university network (where all we could do was a combination of pinging them and seeing if they would resolve other queries, including queries they should always be able to answer).
The new 'did your DNS query succeed' metric is quite useful for telling the difference. If the probe_success metric is 0 but probe_dns_query_succeeded is 1, you got a DNS answer but it didn't contain whatever you expected (what that is depends on your dns probe configuration; I believe the minimum is generally going to be a NOERROR rcode). In our case, we insist on at least the minimal answer RRs that we expect (for example, a SOA response for our domain in a SOA query). To some extent you could already do this with metrics like probe_dns_answer_rrs; if you had answer RRs, clearly you had received a DNS answer. But this wasn't a sure-fire thing; a DNS server that gave you a 'REFUSED' answer would have 0 for all of answer, additional, and authority RRs returned.
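As a rough illustration of the logic above, here is a minimal Python sketch that sorts one DNS probe result into three cases from the two metrics. The metric names are the real ones discussed here; the function names, the category labels, and the sample scrape text are all made up for illustration, and the parser only handles plain 'name value' lines rather than the full Prometheus exposition format.

```python
def parse_metrics(text):
    """Parse plain 'name value' lines, skipping comments and blank lines.

    This is a simplification: it does not handle timestamps or
    label values that contain spaces.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


def classify_dns_probe(metrics):
    """Return a hypothetical category label for one DNS probe result."""
    success = metrics.get("probe_success", 0.0) == 1.0
    answered = metrics.get("probe_dns_query_succeeded", 0.0) == 1.0
    if success:
        return "ok"
    if answered:
        # We got a DNS answer, but it failed the probe's validation
        # (wrong rcode, missing expected RRs, and so on).
        return "answered-but-unexpected"
    # The DNS server did not give us any answer at all.
    return "no-answer"


# Invented sample of what a failing probe's metrics might look like.
SAMPLE = """\
# illustrative values only
probe_dns_answer_rrs 0
probe_dns_query_succeeded 1
probe_success 0
"""

print(classify_dns_probe(parse_metrics(SAMPLE)))
```

In practice you would express the same split in your alert rules rather than in a script, but the decision is the same: probe_success tells you whether everything was fine, and probe_dns_query_succeeded tells you which kind of failure you have.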
(I sort of wish that Blackbox gave you the DNS rcode, but I think it's a sensible tradeoff to not do so. There is an endless series of 'what went wrong' metrics you could potentially generate for Blackbox checks, but at some point you need to decide that enough is enough. Detailed troubleshooting is not really Blackbox's domain.)