Prometheus Blackbox 0.23.0 has added a nice improvement to its DNS checks

December 11, 2022

Over on the Fediverse, I said:

Small Prometheus thing I am happy about: the latest version of the Blackbox exporter now has a 'probe_dns_query_succeeded' metric, so you can tell the difference between 'the DNS server isn't talking to me' and 'the DNS server sent me an answer without the data I was looking for' (or just 'an answer that was fail/go away').

Now I get to (probably) change and update some of our alerts, which can now be more specific.

Our Prometheus system does a bunch of DNS lookups through Blackbox for various purposes. We check that our own DNS servers correctly resolve various things from our domain (and harvest the SOA values in the process so we can verify that everything is in sync), we check that our secondaries properly have our stuff, we check that our internal forwarding resolvers can resolve outside domains, and we check that some outside public DNS servers can resolve things in our domain as a sanity check.

Before the recently released Blackbox 0.23.0, if one of these checks failed we couldn't reliably tell the difference between the DNS server not answering us at all and the DNS server giving us an answer without the right information. For our own DNS servers, this difference generally didn't matter because we have a lot of other monitoring of them. For our secondaries and for public DNS servers, not so much, especially the ones located outside the university network (where all we could do was a combination of pinging them and seeing if they would resolve other queries, including queries they should always be able to answer).

The new 'did your DNS query succeed' metric is quite useful for telling the difference. If the probe_success metric is 0 but probe_dns_query_succeeded is 1, you got a DNS answer but it didn't contain whatever you expected (what that is depends on your dns probe configuration; I believe the minimum is generally going to be a NOERROR rcode). In our case, we insist on at least the minimal answer RRs that we expect (for example, a SOA response for our domain in a SOA query). To some extent you could already do this with metrics like probe_dns_answer_rrs; if you had answer RRs, clearly you had received a DNS answer. But this wasn't a sure-fire thing, because the reverse didn't hold; a DNS server that gave you a 'REFUSED' answer would have 0 for all of answer, additional, and authority RRs returned, which looked the same as no answer at all.
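To make the logic concrete, here is a minimal sketch in Python of how the two metrics combine, along with the older RR-count heuristic for comparison. The function and enum names are hypothetical illustrations, not anything Blackbox or Prometheus provides; only the metric names (probe_success, probe_dns_query_succeeded, and the probe_dns_*_rrs counts) are real. In practice you would express this as alert rule conditions rather than code.

```python
from enum import Enum
from typing import Optional


class DnsProbeOutcome(Enum):
    """Hypothetical labels for the three situations discussed above."""
    OK = "probe succeeded"
    BAD_ANSWER = "server answered, but the answer failed validation"
    NO_ANSWER = "no DNS response from the server"


def classify_dns_probe(probe_success: int, dns_query_succeeded: int) -> DnsProbeOutcome:
    """Combine probe_success and probe_dns_query_succeeded (Blackbox 0.23.0+)."""
    if probe_success == 1:
        return DnsProbeOutcome.OK
    if dns_query_succeeded == 1:
        # We got a DNS response, but it didn't satisfy the probe's checks
        # (a bad rcode, missing answer RRs, and so on).
        return DnsProbeOutcome.BAD_ANSWER
    return DnsProbeOutcome.NO_ANSWER


def old_heuristic_got_answer(answer_rrs: int, authority_rrs: int,
                             additional_rrs: int) -> Optional[bool]:
    """Pre-0.23.0 approximation from the probe_dns_*_rrs metrics.

    Returns True if any RRs came back (so there clearly was a response),
    and None ("can't tell") otherwise, since a REFUSED response and no
    response at all both show up as zero RRs.
    """
    if answer_rrs > 0 or authority_rrs > 0 or additional_rrs > 0:
        return True
    return None


# A REFUSED response: the probe failed, the query itself succeeded, zero RRs.
print(classify_dns_probe(probe_success=0, dns_query_succeeded=1).name)
print(old_heuristic_got_answer(0, 0, 0))
```

For the REFUSED case, the new metric lets the first function report BAD_ANSWER, while the old heuristic can only return None.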

(I sort of wish that Blackbox gave you the DNS rcode, but I think it's a sensible tradeoff to not do so. There is an endless series of 'what went wrong' metrics you could potentially generate for Blackbox checks, but at some point you need to decide that enough is enough. Detailed troubleshooting is not really Blackbox's domain.)


Last modified: Sun Dec 11 22:51:24 2022