Let's Encrypt's interesting certificate issuance error
On June 15th (2023), Let's Encrypt paused issuing certificates for about an hour (their status issue). Later, Andrew Ayer wrote up the outside details of what happened in The Story Behind Last Week's Let's Encrypt Downtime, and Let's Encrypt's Aaron Gable explained the technical details in the Mozilla issue about it. The reasons for what happened are interesting, at least to me, and make a lot of sense even if the result is unfortunate.
What was wrong is connected to Certificate Transparency. When a Certificate Authority issues a TLS certificate, it gets Signed Certificate Timestamps (SCTs) for a precertificate version of the certificate from some CT logs and includes them in the TLS certificate it issues. When TLS clients interact with Certificate Transparency, they verify that a TLS certificate has the required SCTs from acceptable logs. However, the SCTs aren't for the actual issued TLS certificate but instead for the precertificate, which is deliberately poisoned so that it can't be used as a real TLS certificate. So in order to verify that the SCTs are for this TLS certificate, the browser has to reconstruct the precertificate version of the certificate. In order for this to be possible, the precertificate and the issued certificate have to be identical apart from the poison extension and the SCTs (allowing the browser to accurately reconstruct the precertificate so it can verify that the SCTs are for it).
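To make the reconstruction concrete, here is a toy sketch (my own illustration, not real ASN.1/DER handling or any actual browser or library code) of the check a CT-verifying client effectively performs. The OIDs are the real ones from RFC 6962; everything else, including modeling certificates as plain dicts, is a simplifying assumption.

```python
# OIDs defined in RFC 6962: the precertificate "poison" extension and
# the SCT list extension embedded in the final issued certificate.
POISON_OID = "1.3.6.1.4.1.11129.2.4.3"
SCT_LIST_OID = "1.3.6.1.4.1.11129.2.4.2"

def strip_extension(cert: dict, oid: str) -> dict:
    """Return a copy of the (toy) certificate without the given extension."""
    return {
        **cert,
        "extensions": [e for e in cert["extensions"] if e["oid"] != oid],
    }

def reconstruct_precert_body(final_cert: dict) -> dict:
    # The client removes the embedded SCT list from the issued
    # certificate; the result must match the precertificate with its
    # poison extension removed, or the SCT signatures won't verify.
    return strip_extension(final_cert, SCT_LIST_OID)

# The invariant that was violated during the incident: apart from the
# poison extension and the SCTs, the two certificates must be identical.
precert = {"subject": "example.com",
           "extensions": [{"oid": POISON_OID, "value": None}]}
final_cert = {"subject": "example.com",
              "extensions": [{"oid": SCT_LIST_OID, "value": "sct-list"}]}

assert reconstruct_precert_body(final_cert) == strip_extension(precert, POISON_OID)
```

If the two bodies differ anywhere else, the reconstruction no longer matches what the CT logs signed, and verification fails, which is exactly what happened to the mis-issued certificates.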
During the incident, Let's Encrypt issued a number of TLS certificates where the precertificate and issued certificate weren't identical. These TLS certificates didn't pass browser CT checking and also implied a technical compliance failure that made them improper as TLS certificates (see Andrew Ayer's explanation). As explained by Let's Encrypt, one factor in this failure is that Let's Encrypt constructed the issued certificate completely separately from the precertificate, rather than by taking the precertificate and manipulating it. The reason for this decision is, well, let me quote Let's Encrypt directly (without the embedded links, sorry; see the comment itself):
As Rob Stradling suggests in Comment #2, having requests for pre- and final certificate issuance routed to CA instances with different profiles configured would not be an issue if the final certificate was produced as a direct manipulation of the precertificate (effectively, by reversing the algorithm described in RFC6962 Section 3.1).
However, Let’s Encrypt is aware of multiple incidents that have arisen due to CAs trusting client input (e.g. SANs or extensions in a CSR) and/or directly manipulating DER in this way: Bug 1672423, Bug 1445857, Bug 1716123, Bug 1542793, and Bug 1695786 are just a few examples.
We designed our issuance pipeline specifically to avoid bugs such as these. Every issuance, both of precertificates and of final certificates, follows the same basic pattern: a limited set of variables are combined with a strict profile to produce a new certificate from scratch.
TLS certificates are complex structured objects defined by an arcane and famously intricate nested set of standards, X.509 using ASN.1 and all sorts of other fun things. Manipulating complex structured objects that use complex formats is a famously dangerous thing, especially if you need the result to be exactly identical at the binary level and what you're dealing with is a flexible serialization format. We've seen security bugs with '<X> serialization' for many <X>s for years, if not decades. For entirely sensible reasons, Let's Encrypt opted to completely sidestep all of this by constructing each variant of the certificate from scratch, as they described.
(Unfortunately Let's Encrypt could do this in two different places, and for a brief period the configurations that drove all of this in the two places diverged, creating the incident.)
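As a hypothetical sketch of how this can go wrong, here's the 'build from scratch' pattern with the same variables fed through two profiles that have drifted apart. All the names and profile fields here are illustrative inventions, not Let's Encrypt's actual code or configuration.

```python
# Both the precertificate and the final certificate are produced by
# applying a profile to a small set of variables, never by editing DER.
def build_cert(variables: dict, profile: dict, *, precert: bool) -> dict:
    cert = {
        "subject": variables["subject"],
        "validity_days": profile["validity_days"],
        "extensions": list(profile["extensions"]),
    }
    # The only intended difference between the two variants:
    cert["extensions"].append("ct_poison" if precert else "embedded_scts")
    return cert

variables = {"subject": "example.com"}
profile_a = {"validity_days": 90, "extensions": ["key_usage"]}
# The failure mode: a second CA instance briefly ran a diverged profile.
profile_b = {"validity_days": 90, "extensions": ["key_usage", "extra_ext"]}

pre = build_cert(variables, profile_a, precert=True)
final = build_cert(variables, profile_b, precert=False)

def body_without_ct(cert: dict) -> dict:
    # Strip the expected CT-related difference, leaving what must match.
    return {**cert, "extensions": [e for e in cert["extensions"]
                                   if e not in ("ct_poison", "embedded_scts")]}

# With diverged profiles the bodies no longer match, so a browser's
# precertificate reconstruction fails: this is the shape of the incident.
assert body_without_ct(pre) != body_without_ct(final)
```

The same construction with a single shared profile makes the two bodies identical by design, which is the property the pattern is meant to guarantee.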
My personal view is that Let's Encrypt made the right decision on how to construct precertificates and certificates, even though it was one factor in their issuance failure. This particular issuance failure is much less severe than other sorts of potential failures you could get from trying to manipulate TLS certificates, so I'd rather have it. And the failure caused things to 'fail closed', with the certificates failing to validate in browsers that check Certificate Transparency status.
Overall, I think this is an interesting failure case. A sensible security-focused decision combined with an oversight when planning a deployment created a surprise issue. It feels like there's no obvious moral, though (and as always, saying it was human error to not catch the deployment issue is the wrong answer).