2023-11-28
Why we scrape Prometheus Blackbox's metrics endpoint
The Prometheus Blackbox exporter is how you do
many external checks on machines and services ('endpoints' in
Blackbox's jargon), ranging from ping checks up through making HTTPS
requests and checking the results. The Blackbox exporter's usage is
somewhat confusing; unlike most
exporters, you don't so much scrape it as scrape things through it,
using probes against targets. As part of this, each combination
of probe and target is a separate Prometheus scrape, each of which
generates an 'up' metric for that particular scrape. Unlike with regular
Prometheus exporters, these per-scrape 'up' metrics aren't all
that useful because all they tell you is that your Prometheus server
could talk to that Blackbox exporter.
Actual success or failure of your check is communicated through the
'probe_success' metric, which will be 1 if the check succeeded and 0
if it failed for some reason.
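To make the difference between 'up' and 'probe_success' concrete, here is a minimal sketch of a probe scrape job together with an alert rule. The scrape job follows the stock example from the Blackbox exporter's README; the job name, module, target, exporter address, alert name, and 'for' duration are all illustrative assumptions rather than anyone's real configuration.

```yaml
# prometheus.yml fragment: scrape a target *through* Blackbox.
scrape_configs:
  - job_name: blackbox_http          # assumed job name
    metrics_path: /probe
    params:
      module: [http_2xx]             # a module defined in blackbox.yml
    static_configs:
      - targets:
          - https://example.org      # placeholder target
    relabel_configs:
      # Pass the listed target to Blackbox as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the target visible as the 'instance' label.
      - source_labels: [__param_target]
        target_label: instance
      # Actually send the scrape to the Blackbox exporter (assumed address).
      - target_label: __address__
        replacement: 127.0.0.1:9115
---
# Separate rules file: alert on the check result, not on 'up'.
groups:
  - name: blackbox_probes
    rules:
      - alert: ProbeFailed
        expr: probe_success{job="blackbox_http"} == 0
        for: 5m
```

With this setup, 'up' for the job only goes to 0 when Prometheus can't get an answer from the Blackbox exporter at all, while a failing HTTPS check shows up as 'probe_success' being 0 with 'up' still at 1.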
The Blackbox exporter also has its own /metrics endpoint that gives
you metrics for Blackbox itself, which are a combination of general
Go and Prometheus exporter metrics with some Blackbox specific ones.
One of the reasons to monitor this metrics endpoint is that it will
tell you if Blackbox has been unable to successfully reload its
configuration for a while, which is something that saved us with
the main Prometheus daemon. However,
another reason that we monitor the Blackbox metrics endpoint is that
scraping Blackbox's own metrics gives us a simple check of whether
or not it's up, with its own 'up' metric that's convenient to
alert on.
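As a sketch of what this looks like, the fragment below scrapes Blackbox's own /metrics endpoint and alerts on both conditions. The job name, exporter address, alert names, and 'for' durations are assumptions for illustration. The reload metric name is believed to match what current Blackbox versions expose, but check it against your own /metrics output before relying on it.

```yaml
# prometheus.yml fragment: scrape Blackbox's own metrics directly.
scrape_configs:
  - job_name: blackbox_exporter      # assumed job name
    # metrics_path is left at its default of /metrics
    static_configs:
      - targets:
          - 127.0.0.1:9115           # assumed exporter address
---
# Separate rules file: health alerts for Blackbox itself.
groups:
  - name: blackbox_health
    rules:
      # Prometheus could not scrape the exporter at all.
      - alert: BlackboxDown
        expr: up{job="blackbox_exporter"} == 0
        for: 5m
      # The exporter is up but its last configuration reload failed.
      - alert: BlackboxConfigReloadFailed
        expr: blackbox_exporter_config_last_reload_successful{job="blackbox_exporter"} == 0
        for: 15m
```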
Of course, you can use the 'up' metrics you get from scraping
targets through Blackbox, but if you do you have some decisions to
make. Do you pick a single probe and target combination that you
expect to always be present in your configuration and alert if its
'up' is 0? Do you alert if a sufficient number or percentage of
'up' metrics for Blackbox probes go to zero? If you're using more
than one Blackbox exporter for whatever reason, do you have labels
set that will tell your alerting rule what Blackbox exporter was
used for a particular scrape?
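The first two choices might be sketched as the alert rules below. The job name, the 'instance' value, the 50% threshold, and the alert names are all hypothetical; they assume probe scrapes that set 'instance' to the probed target, as the stock Blackbox relabeling does.

```yaml
groups:
  - name: blackbox_via_probe_up
    rules:
      # Choice 1: one probe and target you expect to always exist.
      - alert: BlackboxDownSingleProbe
        expr: up{job="blackbox_http", instance="https://example.org"} == 0
        for: 5m
      # Choice 2: more than half of the probe scrapes failed to
      # reach Blackbox (avg of 0/1 values is the fraction that are up).
      - alert: BlackboxDownManyProbes
        expr: avg(up{job="blackbox_http"}) < 0.5
        for: 5m
```

Both rules have sharp edges. The first silently stops working if that target is ever removed from the configuration, since there is then no 'up' series left to compare against 0, and the second needs a threshold that you have to pick and then revisit as the set of probes changes.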
(It turns out that our Blackbox label rewriting doesn't pass through
this information. It's not normally important, which is probably
why the stock example doesn't preserve it, but it becomes potentially
quite relevant if you're using the 'up' metrics from Blackbox
checks as a health check on Blackbox itself.)
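One way to pass this information through is an extra relabeling step at the end of the stock rules, sketched below. The label name 'blackbox_exporter' is a hypothetical choice, and the job name, module, target, and exporter address are placeholders.

```yaml
scrape_configs:
  - job_name: blackbox_http          # assumed job name
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.org      # placeholder target
    relabel_configs:
      # The stock relabeling: target goes into ?target= and 'instance',
      # then the scrape address is pointed at the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # assumed exporter address
      # Added step: rules run in order, so __address__ now holds the
      # exporter's address; copy it into a label that survives the scrape.
      - source_labels: [__address__]
        target_label: blackbox_exporter   # hypothetical label name
```

With a label like this in place, an alert rule can group by it, for example 'avg by (blackbox_exporter) (up{job="blackbox_http"}) < 0.5', so that one dead exporter among several is distinguishable from the rest.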
Adding a separate scrape of the Blackbox /metrics endpoint is the
simple way out. It gives you a scrape that doesn't depend on what
things you're checking through Blackbox, that scrape will definitely
have labels that tell you which Blackbox you're talking to, and the
extra Blackbox health metrics are potentially useful.