How Prometheus Blackbox's TLS certificate metrics would have reacted to AddTrust's root expiry
The last time around I talked about what Blackbox's TLS certificate expiry metrics are checking, but it was all somewhat abstract. The recent AddTrust root expiry provides a great example to make it concrete. As a quick summary, the Blackbox exporter provides two metrics, probe_ssl_earliest_cert_expiry for the earliest expiring certificate and probe_ssl_last_chain_expiry_timestamp_seconds for the latest expiring verified chain of certificates.
If your TLS server included the expiring AddTrust root certificate as one of the chain certificates it was providing to clients, the probe_ssl_earliest_cert_expiry metric would have counted down and your alarms would have gone off, despite the fact that your server certificate itself wasn't necessarily expiring. This would have happened even if the AddTrust certificate wasn't used any more and its inclusion was just a vestige of past practices (for example if you had a 'standard certificate chain set' that everything served). In this case this would have raised a useful alarm, because the mere presence of the AddTrust certificate in your server's provided chain caused problems in some (or many) TLS libraries and clients.
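To make the first metric concrete, here is a minimal sketch of the idea, not Blackbox's actual code. It assumes the metric is simply the earliest notAfter time among the certificates the server sends. The certificate names and expiry dates are made up for illustration, and only their ordering matters.

```python
from datetime import datetime, timezone


def earliest_cert_expiry(served_not_after):
    """Return the earliest notAfter among the certificates the server sends."""
    return min(served_not_after.values())


# Hypothetical served chain; the dates are illustrative, not real certificate data.
served = {
    "server certificate": datetime(2021, 3, 1, tzinfo=timezone.utc),
    "intermediate": datetime(2028, 12, 31, tzinfo=timezone.utc),
    "AddTrust root (vestigial)": datetime(2020, 5, 30, tzinfo=timezone.utc),
}

earliest = earliest_cert_expiry(served)

# Drop the vestigial certificate from what the server sends and the
# metric moves out to the server certificate's own expiry.
without_addtrust = {
    name: not_after
    for name, not_after in served.items()
    if "AddTrust" not in name
}
earliest_without = earliest_cert_expiry(without_addtrust)

print(earliest.isoformat(), earliest_without.isoformat())
```

In this model the vestigial certificate sets the value all by itself, which is why the alarm goes off even though nothing you actively rely on is expiring.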
(Browsers were fine, though.)
Even if your TLS server included the AddTrust certificate in its chain and your server certificate could use it for some verified chains, the probe_ssl_last_chain_expiry_timestamp_seconds metric would not normally have counted down. Most or perhaps all current server certificates could normally be verified through another chain that expired later, which is what matters here. If probe_ssl_last_chain_expiry_timestamp_seconds had counted down too, it would mean that your server certificate could only be verified through the AddTrust certificate for some reason.
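As a rough sketch of the reasoning here (my reading of the metric, not Blackbox's actual code): a verified chain stops being valid when its earliest-expiring certificate does, and the metric reports the latest such time across all verified chains. The chains and dates below are hypothetical and chosen only to show the ordering.

```python
from datetime import datetime, timezone


def last_chain_expiry(verified_chains):
    """Latest, across all verified chains, of each chain's earliest notAfter."""
    return max(min(chain) for chain in verified_chains)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical notAfter times; illustrative only.
server_cert = utc(2021, 3, 1)
intermediate = utc(2028, 12, 31)
addtrust_root = utc(2020, 5, 30)
other_root = utc(2038, 1, 18)

# Two ways the prober could verify the same server certificate.
via_addtrust = [server_cert, intermediate, addtrust_root]
via_other_root = [server_cert, intermediate, other_root]

# With an alternative chain, the AddTrust expiry does not show up at all.
both_chains = last_chain_expiry([via_addtrust, via_other_root])

# If the AddTrust chain were the only verified chain, the metric would
# count down to the AddTrust expiry.
only_addtrust = last_chain_expiry([via_addtrust])

print(both_chains.isoformat(), only_addtrust.isoformat())
```

With both chains available the result is the server certificate's own expiry, so nothing about AddTrust is visible in this metric. Only the single-chain case follows the AddTrust certificate down.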
Neither metric would have told you if the AddTrust certificate was actually being used by your server certificate through some verified chain of certificates, or if it was now completely unnecessary. Blackbox's TLS metrics don't currently provide any way of knowing that, so if you need to monitor the state of your server certificate chains you'll need another tool.
(There's a third party SSL exporter, but I don't think it does much assessment of chain health, or give you enough metrics to know if a server provided chain certificate is unnecessary.)
If you weren't serving the AddTrust root certificate and had a verified chain that didn't use it, but some clients required it to verify your server certificate, neither Blackbox metric would have warned you about this. Because you weren't serving the certificate, probe_ssl_earliest_cert_expiry would not have counted down; it includes only TLS certificates you actually serve, not all of the TLS certificates required to verify all of your currently valid certificate chains. And probe_ssl_last_chain_expiry_timestamp_seconds wouldn't have counted down because there was an additional verified chain besides the one that used the AddTrust root certificate.
(In general it's very difficult to know if some client is going to have a problem with your certificate chains, because there are many variables, including outright programming bugs, which were part of the problem with AddTrust. If you want to be worried, read Ryan Sleevi's Path Building vs Path Verifying: Implementation Showdown.)