The various phases of Prometheus Blackbox's HTTP probe

December 28, 2023

Prometheus's Blackbox exporter (Blackbox) is the Prometheus component that you usually use to make external checks on services, such as whether a HTTP URL is responding the way it should. Blackbox's various ways of checking things are called probes (or probers), and they can report various sorts of metrics; for instance, any Blackbox check involving TLS will provide you with some TLS expiry metrics. One of the HTTP metrics that Blackbox provides is probe_http_duration_seconds, which reports on how long various phases of an HTTP (or HTTPS) check took to complete. However, Blackbox doesn't currently document what those phases are or what they cover, and for reasons beyond the scope of this entry I recently looked them up in the source code, cross referenced to the underlying tracing code in net/http/httptrace.

At the moment, these phases are:

Performing DNS resolution; this isn't necessarily present if your check uses an IP address instead of a name.

How long it took to make the connection. I'm not sure how this interacts with any connection pooling that Blackbox may be doing.
How long it took to perform the TLS handshake, if this is a HTTPS request. It's not present on HTTP requests.

How long it took between completing the connection and receiving the first byte of response. This appears to include the time to transmit the request and any headers included.

How long it took between receiving the first byte and the last byte of the connection.

If your request followed a chain of redirections, I believe that these numbers are summed across all of the requests for each phase. You can partially detect redirections with by looking for a non-zero probe_http_redirects metric (sometimes in combination with probe_http_status_code not being a 30x status code, because if you have a check that's specifically for if there's a redirection that doesn't follow the redirection, probe_http_redirects will still be 1).

If we assume that the time to transmit the request is negligible and the server generated the entire response before it began transmitting any bytes (which isn't uncommon in various situations), then the server's internal response time is in the 'processing' phase. If the server streamed out the response as it was being generated, the server's internal response time is some combination of 'processing' and part or almost all of 'transfer'.

Slowness in 'connect' and 'tls' can be because of network issues or because the host is heavily loaded. The TCP 'connect' phase is normally mostly or entirely handled in the kernel, but TLS requires various user level code to run and consume CPU for signing things, so an overloaded web server might show load spikes here. I believe that a very slow 'connect' is likely a sign that the web server is sufficiently overloaded that its (Unix) kernel has run out of its 'listen backlog' and is now waiting for the web server to accept some more connections (see also).

Blackbox also has a probe_duration_seconds general metric that gives you the total time of the probe, covering everything done in it. For HTTP probes, it's not necessarily the case that the time for the 'processing' phase is most of this total duration and dominates the other phases; it depends on how fast the web server is (and how little work you're asking it to do), in addition to whether or not it's streaming a response as it generates it (which is clearly the case for one server we check, based on the 'transfer' times). Also, the sum of the time taken for all phases doesn't necessarily match up to the probe_duration_seconds total; for our checks the sum can be only around 90% of the total, although the absolute differences are small.

My overall conclusion is that if you want to track how long processing requests takes on your web servers, you probably don't want to rely on Blackbox checks to give you a clear answer. You're better off instrumenting things in the web server or your application; fortunately Apache makes it relatively easy to log server side numbers if you want.

Written on 28 December 2023.
« Web CGI programs aren't particularly slow these days
Your kernel panics in ZFS on Linux probably aren't actual kernel panics »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Thu Dec 28 22:03:16 2023
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.