What Prometheus Blackbox's TLS certificate expiry metrics are checking
One of the things that the Prometheus Blackbox exporter can do is connect to services that use TLS and harvest enough certificate information from them to let you monitor and alert on soon-to-expire TLS certificates. Traditionally, this was a single metric, probe_ssl_earliest_cert_expiry, but in the 0.17.0 release a second one was added, probe_ssl_last_chain_expiry_timestamp_seconds. TLS certificate expiry issues have been on my mind because of the mess from the AddTrust root expiry, and recently I read a pair of articles by Ryan Sleevi on TLS certificate path building and verifying (part 2), which taught me that this issue isn't at all simple. After all this, I wound up wondering exactly what these two Blackbox exporter metrics were checking.
When you connect to a TLS server, it sends one or more certificates to you, generally at least two, in what is commonly called a certificate chain. These server-sent certificates don't include the Certificate Authority's root certificate, because you need to already have that, and they don't actually have to form a single chain or even be related to each other. Normally they should be a chain (and be in a specific order), but people make all sorts of configuration errors and decisions in the certificates that they send. The Blackbox exporter's probe_ssl_earliest_cert_expiry metric is the earliest expiry of any of these server-sent certificates. I'm not certain if invalid certificates are filtered out, but it definitely doesn't exclude self-signed certificates.
(Specifically, it is the earliest expiry of a certificate in the
When you actually verify a TLS certificate (if you do it correctly) you wind up with one or more valid paths between the server certificate and some roots of trust; we can call these the verified chains. These chains will use the server certificate that the server sent you, but they may not use all or even any of the other certificates. To work out the probe_ssl_last_chain_expiry_timestamp_seconds metric, the Blackbox exporter first goes over every verified chain to find the earliest expiry time of any certificate in it, and then picks the latest such chain expiry time. These verified chains do include the CA root certificates, which don't necessarily expire regardless of their nominal expiry time. If there are no verified chains at all, such as if you're dealing with a self-signed certificate, the Blackbox exporter currently makes this metric be an extremely large and useless negative number.
(The verified chains come from the Go crypto/tls ConnectionState's
VerifiedChains. If there are no verified chains, the metric is
the zero value of Go's time.Time
turned into seconds since the Unix epoch. Since this zero value is more
than a thousand years before January 1st 1970 UTC, it winds up very
negative. This is potentially a bug and may change someday.)
Normally there will always be a verified chain, because otherwise the Blackbox TLS probe would fail entirely. You have to specifically set insecure_skip_verify to true in the Blackbox configuration in order to accept self-signed certificates or other chain problems.
So what do these metrics mean, beyond their technical details? If the earliest certificate expiry is soon, it doesn't necessarily mean that your TLS server certificate itself is about to expire, but it does mean that some TLS certificate your server is providing to people is about to. Either you're serving an unnecessary intermediate TLS certificate, or some number of your users are about to have a problem. Either is an issue that you should fix, especially since an expired certificate that's not necessary may still make many TLS libraries fail to verify your server certificate.
(This is part of what happened with the AddTrust expiry. A surprisingly large number of TLS libraries had to be patched to just skip it.)
The last chain expiry is the point at which you definitely will have problems, because no one at all will be able to build a verified chain for your server certificate. A last chain expiry that's well into the future is not a guarantee that you'll be free of problems until then, unless you know that there's only one valid chain that can be formed from your server certificate. If there are multiple chains, not all clients may be able to use all chains, so some of them could be stuck on chains that expire earlier. The Blackbox exporter doesn't currently have a metric for the earliest expiring verified chain, but perhaps it should.
(Normally all verified certificate chains will have the same expiry time, because the shortest lifetime certificate on them should be the server's certificate itself. If there are multiple chains and there's a difference between the latest and the earliest chain expiry time, you may be about to have an exciting time (although it's not your fault).)