2023-12-28
The various phases of Prometheus Blackbox's HTTP probe
Prometheus's Blackbox exporter (Blackbox) is the Prometheus component that you usually use to make external checks on services, such as whether a HTTP URL is responding the way it should. Blackbox's various ways of checking things are called probes (or probers), and they can report various sorts of metrics; for instance, any Blackbox check involving TLS will provide you with some TLS expiry metrics. One of the HTTP metrics that Blackbox provides is probe_http_duration_seconds, which reports on how long various phases of an HTTP (or HTTPS) check took to complete. However, Blackbox doesn't currently document what those phases are or what they cover, and for reasons beyond the scope of this entry I recently looked them up in the source code, cross referenced to the underlying tracing code in net/http/httptrace.
At the moment, these phases are:
resolve
- Performing DNS resolution; this isn't necessarily present if
your check uses an IP address instead of a name.
connect
- How long it took to make the connection. I'm not sure how this interacts with any connection pooling that Blackbox may be doing.
tls
- How long it took to perform the TLS handshake, if this is a HTTPS
request. It's not present on HTTP requests.
processing
- How long it took between completing the connection and
receiving the first byte of response. This appears to include the time
to transmit the request and any headers included.
transfer
- How long it took between receiving the first byte and the last byte of the connection.
If your request followed a chain of redirections, I believe that these numbers are summed across all of the requests for each phase. You can partially detect redirections with by looking for a non-zero probe_http_redirects metric (sometimes in combination with probe_http_status_code not being a 30x status code, because if you have a check that's specifically for if there's a redirection that doesn't follow the redirection, probe_http_redirects will still be 1).
If we assume that the time to transmit the request is negligible
and the server generated the entire response before it began
transmitting any bytes (which isn't uncommon in various situations),
then the server's internal response time is in the 'processing
'
phase. If the server streamed out the response as it was being
generated, the server's internal response time is some combination
of 'processing
' and part or almost all of 'transfer
'.
Slowness in 'connect
' and 'tls
' can be because of network issues
or because the host is heavily loaded. The TCP 'connect' phase is
normally mostly or entirely handled in the kernel, but TLS requires
various user level code to run and consume CPU for signing things,
so an overloaded web server might show load spikes here. I believe
that a very slow 'connect
' is likely a sign that the web server
is sufficiently overloaded that its (Unix) kernel has run out of
its 'listen backlog' and is now
waiting for the web server to accept some more connections (see
also).
Blackbox also has a probe_duration_seconds general metric that
gives you the total time of the probe, covering everything done in
it. For HTTP probes, it's not necessarily the case that the time
for the 'processing
' phase is most of this total duration and
dominates the other phases; it depends on how fast the web server
is (and how little work you're asking it to do), in addition to
whether or not it's streaming a response as it generates it (which
is clearly the case for one server we check, based on the 'transfer'
times). Also, the sum of the time taken for all phases doesn't
necessarily match up to the probe_duration_seconds total; for
our checks the sum can be only around 90% of the total, although
the absolute differences are small.
My overall conclusion is that if you want to track how long processing requests takes on your web servers, you probably don't want to rely on Blackbox checks to give you a clear answer. You're better off instrumenting things in the web server or your application; fortunately Apache makes it relatively easy to log server side numbers if you want.