How Prometheus Blackbox's TLS certificate metrics would have reacted to AddTrust's root expiry

June 29, 2020

The last time around I talked about what Blackbox's TLS certificate expiry metrics are checking, but it was all somewhat abstract. The recent AddTrust root expiry provides a great example to make it concrete. As a quick summary, the Blackbox exporter provides two metrics: probe_ssl_earliest_cert_expiry, for the earliest expiring certificate among those the server sends, and probe_ssl_last_chain_expiry_timestamp_seconds, for the latest expiring verified chain of certificates.
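As a rough sketch of how the two metrics relate, the following Python models each certificate as a (name, notAfter) pair. This is not Blackbox's actual code; the function names, certificate names and all dates are made up for illustration.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc)


def earliest_cert_expiry(served_certs):
    """Earliest notAfter among the certificates the server sent."""
    return min(not_after for _name, not_after in served_certs)


def last_chain_expiry(verified_chains):
    """A chain expires with its earliest certificate; report the chain that lasts longest."""
    return max(
        min(not_after for _name, not_after in chain)
        for chain in verified_chains
    )


# Hypothetical certificates; none of these dates come from a real server.
leaf = ("leaf", utc(2021, 3, 1))
intermediate = ("intermediate", utc(2024, 1, 1))
old_root = ("old-root", utc(2020, 5, 30))
new_root = ("new-root", utc(2030, 1, 1))

served = [leaf, intermediate, old_root]
chains = [
    [leaf, intermediate, old_root],
    [leaf, intermediate, new_root],
]

earliest = earliest_cert_expiry(served)
last_chain = last_chain_expiry(chains)
print(earliest.date(), last_chain.date())
```

Here the earliest certificate expiry is set by the old root, while the last chain expiry is set by the leaf certificate in the chain that ends at the new root.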

If your TLS server included the expiring AddTrust root certificate as one of the chain certificates it was providing to clients, the probe_ssl_earliest_cert_expiry metric would have counted down and your alarms would have gone off, despite the fact that your server certificate itself wasn't necessarily expiring. This would have happened even if the AddTrust certificate wasn't used any more and its inclusion was just a vestige of past practices (for example if you had a 'standard certificate chain set' that everything served). In this case this would have raised a useful alarm, because the mere presence of the AddTrust certificate in your server's provided chain caused problems in some (or many) TLS libraries and clients.

(Browsers were fine, though.)

Even if your TLS server included the AddTrust certificate in its chain and your server certificate could use it for some verified chains, the probe_ssl_last_chain_expiry_timestamp_seconds would not normally have counted down. Most or perhaps all current server certificates could normally be verified through another chain that expired later, which is what matters here. If probe_ssl_last_chain_expiry_timestamp_seconds had counted down too, it would mean that your server certificate could only be verified through the AddTrust certificate for some reason.
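To make this scenario concrete, here is a minimal Python sketch of a server that still serves the AddTrust root while also having a second verified chain. The 14-day alert window, the certificate names and every date except AddTrust's 2020-05-30 expiry are assumptions, and a single intermediate stands in for both paths to keep the model small.

```python
from datetime import datetime, timezone

DAY = 86400
ALERT_WINDOW = 14 * DAY  # hypothetical alert threshold, not from the article


def ts(year, month, day):
    """Unix timestamp for midnight UTC, the unit both metrics report."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# notAfter per certificate; only the AddTrust date is real.
not_after = {
    "leaf": ts(2021, 3, 1),
    "intermediate": ts(2024, 1, 1),
    "addtrust-root": ts(2020, 5, 30),
    "newer-root": ts(2030, 1, 1),
}

# What the server sends, and the chains a verifier could build.
served = ["leaf", "intermediate", "addtrust-root"]
verified_chains = [
    ["leaf", "intermediate", "addtrust-root"],
    ["leaf", "intermediate", "newer-root"],
]

earliest = min(not_after[c] for c in served)
last_chain = max(min(not_after[c] for c in chain) for chain in verified_chains)


def would_alert(expiry, now, window=ALERT_WINDOW):
    """True when the expiry time is less than one window away."""
    return expiry - now < window


now = ts(2020, 5, 20)  # ten days before the AddTrust expiry
earliest_alerts = would_alert(earliest, now)
last_chain_alerts = would_alert(last_chain, now)
print(earliest_alerts, last_chain_alerts)
```

With these inputs the earliest-certificate value falls inside the alert window but the last-chain value does not, which matches the behaviour described above.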

Neither metric would have told you if the AddTrust certificate was actually being used by your server certificate through some verified chain of certificates, or if it was now completely unnecessary. Blackbox's TLS metrics don't currently provide any way of knowing that, so if you need to monitor the state of your server certificate chains you'll need another tool.
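The missing check could look something like the following Python sketch, which is not something Blackbox does. It assumes you already have the served certificates and the verified chains as lists of identifiers (in practice these would be fingerprints); the function name and the certificate names are hypothetical. Which chains exist depends on the trust store of whatever is doing the verification.

```python
def unused_served_certs(served, verified_chains):
    """Return served certificates that appear in no verified chain."""
    used = {cert for chain in verified_chains for cert in chain}
    return [cert for cert in served if cert not in used]


# Hypothetical data: the server still sends the AddTrust root, but the
# only chain this verifier builds ends at a newer root.
served = ["leaf", "intermediate", "addtrust-root"]
verified_chains = [["leaf", "intermediate", "newer-root"]]

unused = unused_served_certs(served, verified_chains)
print(unused)
```

A different verifier with a different set of trusted roots could build other chains, so a certificate that is unused from one vantage point may still matter to some clients.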

(There's a third party SSL exporter, but I don't think it does much assessment of chain health, or give you enough metrics to know if a server provided chain certificate is unnecessary.)

If you weren't serving the AddTrust root certificate and had a verified chain that didn't use it, but some clients required it to verify your server certificate, neither Blackbox metric would have warned you about this. Because you weren't serving the certificate, probe_ssl_earliest_cert_expiry would not have counted down; it includes only TLS certificates you actually serve, not all of the TLS certificates required to verify all of your currently valid certificate chains. And probe_ssl_last_chain_expiry_timestamp_seconds wouldn't have counted down because there was an additional verified chain besides the one that used the AddTrust root certificate.
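The blind spot can be sketched in Python as follows. The certificate names and all dates other than AddTrust's 2020-05-30 expiry are assumptions, and the "old client" is a hypothetical client that trusts only the AddTrust root.

```python
from datetime import datetime, timezone


def utc(year, month, day):
    """Midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# notAfter per certificate; only the AddTrust date is real.
not_after = {
    "leaf": utc(2021, 3, 1),
    "intermediate": utc(2024, 1, 1),
    "addtrust-root": utc(2020, 5, 30),
    "newer-root": utc(2030, 1, 1),
}


def chain_expiry(chain):
    """A chain stops verifying when its earliest certificate expires."""
    return min(not_after[cert] for cert in chain)


# The server does not send the AddTrust root.
served = ["leaf", "intermediate"]

# Chains the prober can build from its own trust store.
prober_chains = [
    ["leaf", "intermediate", "addtrust-root"],
    ["leaf", "intermediate", "newer-root"],
]

# The only chain available to the hypothetical old client.
old_client_chain = ["leaf", "intermediate", "addtrust-root"]

earliest = min(not_after[cert] for cert in served)
last_chain = max(chain_expiry(chain) for chain in prober_chains)
old_client_expiry = chain_expiry(old_client_chain)
print(earliest.date(), last_chain.date(), old_client_expiry.date())
```

Both metric values track the leaf certificate's expiry, while the old client's only chain stops working on the AddTrust date.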

(In general it's very difficult to know if some client is going to have a problem with your certificate chains, because there are many variables, including outright programming bugs, which were part of the problem with AddTrust. If you want to be worried, read Ryan Sleevi's Path Building vs Path Verifying: Implementation Showdown.)


Comments on this page:

By Guus at 2020-07-01 03:02:07:

Even if your TLS server included the AddTrust certificate in its chain and your server certificate could use it for some verified chains, the probe_ssl_last_chain_expiry_timestamp_seconds would not normally have counted down. Most or perhaps all current server certificates could normally be verified through another chain that expired later, which is what matters here.

That actually seems like a strange decision for a monitoring system. In this case, the TLS server would be (also) serving an (almost) expired certificate, but it's ok, because there are also other chains (workarounds) that will work.

IMHO the monitoring system should warn about this. Not because it's an immediate problem, but because the server is actively serving something that you're looking for.

TLS (and its certificates) is complex, but this actually makes quite a nice case for always including the root cert(s) in your chain. Especially when there are more root certs about to expire.

Although I'm not sure what overhead this would bring. It does bring some more maintenance (another detail to monitor/maintain), but perhaps preferable to waiting till things explode?

By cks at 2020-07-07 00:25:40:

Given how you get multiple TLS certificate chains in practice, I think that Prometheus's current monitoring is doing reasonably okay. In practice you're pretty unlikely to have a chain expiring that doesn't involve an intermediate certificate that you're serving, so probe_ssl_earliest_cert_expiry will count down and trigger alerts. If you want to know how severe the issue is, you can look at probe_ssl_last_chain_expiry_timestamp_seconds; if it's also counting down, you have a serious problem. If it's not counting down, you have one chain expiring early but another one that's still good.

The only case where you'd want a separate metric for the earliest expiring certificate chain is if you have multiple chains and the Certificate Authority root certificate is expiring earlier than the intermediate certificate that it signed. CAs should not normally allow this; they should limit the lifetime of the intermediate certificates to the lifetime of the root certificate. This was the case with the AddTrust root certificate that expired, as the intermediate certificate that chained to it also expired at the same time.

(There have been CAs in the past that issued intermediate and server certificates that were expiring later than their root certificates. But I think that this is no longer allowed by the required CA standards, which these days are enforced by the browsers.)

Written on 29 June 2020.

Last modified: Mon Jun 29 22:53:14 2020