Wandering Thoughts archives

2023-07-29

Prometheus Blackbox probes and DNS lookups

A while back I discovered that Prometheus will make persistent connections to its scrape targets. Julien Pivotto added the important additional note that Prometheus only does a DNS lookup when it makes a new connection; once the connection is established, it doesn't re-check DNS. One of the consequences of this is that these persistent connections will stay up even if your DNS resolution falls over (which can be useful in trying to understand such a failure). However, there is an important qualification on this. The general version is that scrape targets may well do DNS queries from scratch every time they're scraped by Prometheus. The specific version is that if your DNS stops working, all your name-based Blackbox probes will probably fail, regardless of the health of the targets.

While Prometheus maintains a persistent connection to your Blackbox instance (or possibly more than one, depending on your configuration), your individual Blackbox probes (such as pinging a specific machine) are not persistent in this way, even though they look like the kind of Prometheus scrape targets that would get such persistent connections. Prometheus and Blackbox don't have a persistent ping job running, with the target's DNS cached; instead, Prometheus uses its persistent connection to send an HTTP request to Blackbox that boils down to 'ping this particular name'. When Blackbox gets the request, it will look up the IP address of the name and go ping it. If the DNS lookup for the target fails, the probe will fail.
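To make the per-probe flow concrete, here is a rough Python sketch of it. This is not Blackbox's actual code (Blackbox is written in Go); the helper names, the 'icmp' module name, and the Blackbox address in the usage note are all assumptions for illustration. The /probe URL with 'module' and 'target' parameters is how Blackbox is normally queried.

```python
import socket
from urllib.parse import urlencode


def probe_url(blackbox, module, target):
    """Build the kind of URL Prometheus requests from Blackbox for one probe.

    'blackbox' is a hypothetical host:port for your Blackbox instance and
    'module' is whatever module name your Blackbox configuration defines.
    """
    return "http://%s/probe?%s" % (
        blackbox, urlencode({"module": module, "target": target}))


def resolve_for_probe(name, resolver=socket.getaddrinfo):
    """Look the target up from scratch, the way each probe does.

    Nothing is cached between calls. Returns the first IP address found,
    or None if the lookup fails, in which case the probe as a whole fails
    no matter how healthy the target itself is.
    """
    try:
        infos = resolver(name, None)
    except OSError:
        # socket.gaierror (a failed DNS lookup) is a subclass of OSError.
        return None
    if not infos:
        return None
    # Each getaddrinfo() entry ends with a sockaddr tuple; its first
    # element is the IP address as a string.
    return infos[0][4][0]
```

With this, probe_url("127.0.0.1:9115", "icmp", "host.example.org") gives the request that travels over Prometheus's persistent connection, while resolve_for_probe() stands in for the fresh DNS lookup that happens on the Blackbox side every time that request arrives.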

(Currently I believe there's no general way to detect this in the metrics that Blackbox returns for probes, although you can get the time the DNS lookup took. Possibly a future Blackbox version will expose a 'the initial DNS lookup failed' metric.)

At one level this is mostly what you want. If the DNS data changes for something you're checking through Blackbox, you almost certainly want the check to go to the new IP, not the old IP. And if a DNS entry for some target goes away, you probably don't want the check to keep succeeding even if the target's old IP address is still responding. If you want DNS caching, the place to configure it is on the machine running Blackbox, not in Blackbox.

At another level this means all of your name-based Blackbox probes will fail if your local DNS falls over, even if the machines themselves are still there and working perfectly. If Prometheus is also scraping targets on a machine over already established connections, you can have the odd-looking situation where your Blackbox ping and connection checks for the machine are failing but the machine's metrics are still flowing fine. This immediate and global failure of Blackbox checks will also complicate your life if you're trying to monitor what your local DNS resolvers are doing and are specifying DNS probes to them using their name (like 'resolver1.example.org:53').

The corollary to this is that if you're checking the health of your local DNS resolvers, you want to make at least some checks by IP address, not (just) by name. If your DNS resolvers are sufficiently unhealthy, Blackbox won't be able to resolve their names (or the names of any external DNS resolvers you may be cross-checking), and so all DNS probes will fail before they even send any queries to the DNS servers you're trying to monitor.
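One way to check for this ahead of time is to go through your DNS probe targets and see which ones can only work if name resolution works. Here is a minimal Python sketch of that; the function names and the target list in the usage note are made up for illustration, and it assumes targets are written as 'host', 'host:port', or '[ipv6]:port'.

```python
import ipaddress


def host_part(target):
    """Strip an optional ':port' (and IPv6 brackets) from a probe target."""
    if target.startswith("["):
        # '[2001:db8::53]:53' style; raises ValueError if there is no ']'.
        return target[1:target.index("]")]
    if target.count(":") == 1:
        # 'host:port' or 'ipv4:port'.
        return target.rsplit(":", 1)[0]
    # A bare name, bare IPv4 address, or bare IPv6 address.
    return target


def needs_dns(target):
    """True if probing this target requires a working DNS lookup first."""
    try:
        ipaddress.ip_address(host_part(target))
    except ValueError:
        # Not an IP address literal, so it has to be resolved as a name.
        return True
    return False


def has_ip_based_check(targets):
    """True if at least one target would still be probed with DNS down."""
    return any(not needs_dns(t) for t in targets)
```

For example, has_ip_based_check(["resolver1.example.org:53"]) is false, which is the situation where every DNS probe fails together, while adding something like "192.0.2.53:53" to the list makes it true.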

(These days Blackbox can tell you if a DNS query didn't get an answer at all, so you can at least distinguish between 'our local DNS resolver doesn't seem to be there at all' and 'our local DNS resolver is not resolving things for some reason'. This may give you valuable clues as to what went wrong.)

sysadmin/PrometheusBlackboxAndDNS written at 22:12:50

