2024-03-15
The problem of using basic Prometheus to monitor DNS query results
Suppose that you want to make sure that your DNS servers are working correctly, for both your own zones and for outside DNS names that are important to you. If you have your own zones, you may also care that other people can properly resolve them, perhaps both people elsewhere within the organization and genuine outsiders using public DNS servers. The traditional answer to this is the Blackbox exporter, which can send the DNS queries of your choice to the DNS servers of your choice and validate the result. Well, more or less.
What you specifically do with the Blackbox exporter is configure some modules and then provide those modules with targets to check (through your Prometheus configuration). When you're probing DNS, the module's configuration specifies all of the parameters of the DNS query and its validation. This means that if you are checking N different DNS names to see if they give you a SOA record (or an A record or an MX record), you need N different modules. Quite reasonably, the metrics Blackbox generates when you check a target don't (currently) include the actual DNS name or query type of the query that you're making. Why this matters is that it makes it difficult to write a generic alert that will create a specific message that says 'asking for the X type of record for host Y failed'.
You can somewhat get around this by encoding this information into the names of your Blackbox modules and then doing various creative things in your Prometheus configuration. However, you still have to write all of the modules out, even though many of them may be basically cut and paste versions of each other with only the DNS names changed. This has a number of issues, including that it's a disincentive to doing relatively comprehensive cross checks. (I speak from experience with our Prometheus setup.)
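One way to take some of the tedium out of the cut and paste is to generate the modules from a list of checks. The following is only a sketch under assumptions: the CHECKS list and the 'dns_<type>_<name>' module naming scheme are made up for illustration, and the only module settings used are the basic Blackbox DNS ones (prober, query_name, query_type), with no validation of the answers. It prints JSON, which is normally acceptable to a YAML parser, but that is an assumption worth checking against your Blackbox version.

```python
import json

# Hypothetical list of (DNS name, query type) pairs that we want checked.
CHECKS = [
    ("example.org", "SOA"),
    ("example.org", "MX"),
    ("www.example.org", "A"),
]


def module_name(qname, qtype):
    """Encode the query type and DNS name into a module name (made-up scheme)."""
    safe = qname.replace(".", "_").replace("-", "_")
    return "dns_{}_{}".format(qtype.lower(), safe)


def build_modules(checks):
    """Return a dict shaped like the 'modules' section of a Blackbox config."""
    modules = {}
    for qname, qtype in checks:
        modules[module_name(qname, qtype)] = {
            "prober": "dns",
            "dns": {"query_name": qname, "query_type": qtype},
        }
    return {"modules": modules}


if __name__ == "__main__":
    # Emit the generated modules so they can be merged into a real config.
    print(json.dumps(build_modules(CHECKS), indent=2))
```

This removes the hand-written duplication but not the underlying issue; you still end up with one module per DNS name and query type, and the name and type still only reach your alerts through the module name.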
There is a third-party dns_exporter that can be set up in a more flexible way where all parts of the DNS check can be provided by Prometheus (although it exposes some metrics that risk label cardinality explosions). However, this still leaves you to list in your Prometheus configuration a cross-matrix of every DNS name you want to query and every DNS server you want to query against. What you'll avoid is needing to configure a bunch of Blackbox modules (although what you lose is the ability to verify that the queries returned specific results).
To do better, I think we'd need to write a custom program (perhaps run through the script exporter) that contained at least some of this knowledge, such as what DNS servers to check. Then our Prometheus configuration could just say 'check this DNS name against the usual servers' and the script would know the rest. Unfortunately, you probably can't reuse any of the current Blackbox code for this, even if you wrote the core of this script in Go.
(You could make such a program relatively generic by having it take the list of DNS servers to query from a configuration file. You might want to make it support multiple lists of DNS servers, each of them named, and perhaps set various flags on each server, and you can get quite elaborate here if you want to.)
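As a rough illustration of the configuration side of such a program, here is a minimal sketch of named DNS server lists with per-server flags, and of expanding 'check this DNS name against the usual servers' into one check per server. Everything in it is hypothetical: the JSON layout, the 'tcp' and 'recursive' flag names, and the function names are invented for the example, and the addresses come from the reserved documentation ranges. It deliberately stops short of sending any DNS queries.

```python
import json

# Hypothetical configuration: named lists of DNS servers, each server
# optionally carrying flags. The layout and flag names are invented.
EXAMPLE_CONFIG = """
{
  "default": "internal",
  "lists": {
    "internal": [
      {"address": "192.0.2.53"},
      {"address": "192.0.2.54", "tcp": true}
    ],
    "public": [
      {"address": "198.51.100.1", "recursive": true}
    ]
  }
}
"""


def load_config(text):
    """Parse the configuration text into a dict."""
    return json.loads(text)


def expand_checks(config, qname, qtype, list_name=None):
    """Expand one request into one check per server in the chosen list.

    If no list name is given, the configuration's default list is used,
    which is the 'usual servers' case.
    """
    name = list_name or config["default"]
    if name not in config["lists"]:
        raise KeyError("unknown DNS server list: {}".format(name))
    checks = []
    for server in config["lists"][name]:
        # Everything other than the address is treated as a flag.
        flags = {k: v for k, v in server.items() if k != "address"}
        checks.append({
            "name": qname,
            "type": qtype,
            "server": server["address"],
            "flags": flags,
        })
    return checks


if __name__ == "__main__":
    cfg = load_config(EXAMPLE_CONFIG)
    for check in expand_checks(cfg, "www.example.org", "A"):
        print(check)
```

With something like this, the Prometheus side only has to supply the DNS name, the query type, and perhaps the name of a server list; the cross-matrix of names and servers is built by the program instead of being written out by hand.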
(This elaborates on a Fediverse post of mine.)