2021-02-10
Let's Encrypt is preparing for an emergency and that's good for TLS in general
Recently I read Let's Encrypt's Preparing to Issue 200 Million Certificates in 24 Hours (via), which is a high-level view of exactly what the title says. Let's Encrypt isn't anywhere near that volume level in normal operation, but their reason to prepare for much, much more is a good one, so good that I'll just quote the start of their article:
On a normal day Let’s Encrypt issues nearly two million certificates. When we think about what essential infrastructure for the Internet needs to be prepared for though, we’re not thinking about normal days. We want to be prepared to respond as best we can to the most difficult situations that might arise. In some of the worst scenarios, we might want to re-issue all of our certificates in a 24 hour period in order to avoid widespread disruptions. [...]
This preparation is a good thing for at least three reasons. The first reason is that someday Let's Encrypt might actually have an emergency where the current rules for CAs would require them to revoke and thus reissue all of their certificates in a day (the rules are nominally set through the CA/Browser Forum, although in practice the browsers drive them). Since Let's Encrypt certificates turn over rapidly, a core flaw in (say) parameters in the signed TLS certificates that they issue that wasn't detected for a few months could poison most or all of their current certificates, theoretically forcing mass revocation and reissuing under the CA/B rules.
The second reason is that if Let's Encrypt (and its clients) are prepared for such a mass reissuing, it becomes much more likely that Let's Encrypt will actually do this and that the browsers will require them to if such a problem is discovered. If Let's Encrypt could not handle a mass reissuing scenario, there would be a lot of practical pressure not to force them to revoke all of those certificates, and to bend the nominal CA/B rules in favor of practical security.
(If Let's Encrypt revoked a mass of certificates that they couldn't reissue on the spot, the two practical effects would be to demonstrate once again how little TLS certificate revocation actually does and to take a potentially significant number of HTTPS websites off the air. Since Firefox checks OCSP status and Chrome doesn't, this would probably also drive more people from Firefox to Chrome, which is not good for the web ecology.)
The third reason is that this puts pressure on all of the other TLS Certificate Authorities out there to also be prepared, and makes it more likely that the browser vendors will force other CAs to live up to the CA/B rules even if doing so means revoking a lot of current certificates. If Let's Encrypt is prepared, then everyone can point to Let's Encrypt and say 'you should have seen this coming and been ready, just like they were'. It also means that Let's Encrypt may be better placed to absorb a flood of new certificates if some other CA has to do a mass revocation and affected people turn to Let's Encrypt, even just for one-time TLS certificates to bridge them over.
(Not letting Certificate Authority mistakes and errors slide is good for the overall TLS ecosystem, especially since many CAs are in effect the weakest point in TLS security in practice. In the past, CA mistakes have been allowed to slide for various reasons, although I think that the CA/B rules (and the browsers) were weaker then.)
PS: Given things like Heartbleed, flaws in Certificate Authority practices aren't the only thing that could trigger a need for mass revocation and reissuance. However, another issue like Heartbleed would hopefully not have quite as large a blast radius as all of Let's Encrypt's certificates.
The issue of IOPS versus latency on SSDs and NVMe drives
Famously, SSDs and especially NVMe drives are very good at handling random IO, unlike spinning rust. If you look at performance information for drives and Wikipedia information on IOPS, you can find very large and very impressive numbers. You'll also usually find footnotes or side notes to the effect that these numbers are usually achieved with high queue depths and concurrency, in order to keep these voraciously fast storage systems fed at all times with the IO requests they need to deliver maximum performance.
In the process of writing another entry, I was about to confidently turn these IOPS numbers into typical access latencies for SSDs and NVMe drives. Then it occurred to me that this conversion is not necessarily valid, because we're in the old realm of bandwidth versus latency (which I originally encountered in networking). Flooding a drive with all the IO requests it could possibly consume maximizes the 'bandwidth' of IO operations but it doesn't necessarily predict the latency that we would experience if we submitted an isolated request.
(I'm not sure that it lets us predict the average latency experienced by requests either, but I'm on more shaky ground there and I'd want to think hard about this and draw some little diagrams of toy models.)
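As a toy illustration of the gap, here's a small Python calculation using made-up round numbers (not any real drive's specifications). Naively inverting an advertised high queue depth IOPS figure gives one number; Little's law, which says that the average number of requests in flight equals the request rate times the average latency, gives a rather different one, and neither is necessarily the latency of an isolated request:

  # Toy numbers to illustrate why high queue depth IOPS figures don't
  # directly give you isolated request latency. These are made-up round
  # figures, not any particular drive's specifications.
  iops_at_qd32 = 500_000    # advertised random read IOPS at queue depth 32
  queue_depth = 32

  # Treating the drive as if it handled requests strictly one at a time:
  naive_latency_us = 1e6 / iops_at_qd32
  # Little's law: queue_depth = IOPS * average latency, so with 32 requests
  # in flight the average time each one spends in the system is:
  littles_law_latency_us = queue_depth * 1e6 / iops_at_qd32

  print(f"naive 1/IOPS 'latency':       {naive_latency_us:.0f} us")       # 2 us
  print(f"Little's law average latency: {littles_law_latency_us:.0f} us")  # 64 us
  # Neither number is necessarily what a single isolated request would see;
  # that depends on how much of the 64 us is queueing versus actual service time.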
The latency of isolated requests is probably not a useful number to try to measure for general performance information. One problem is that it's going to depend a lot on the operating system and the overall hardware for a fast SSD or a very fast NVMe drive. Flooding a drive with requests to determine its IOPS 'bandwidth' is relatively system neutral and so relatively reproducible, since all you need to figure out is how to get enough simultaneous requests, but an isolated request latency number is hard both to use and to verify (or reproduce). Even modest changes in operating system internals could affect how fast a single request can flow through, which means that even applying software updates could invalidate previous results and make it impossible to cross-compare with older numbers.
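If you did want a rough sense of isolated request latency on a particular system, one approach is to time single synchronous reads at random offsets with the page cache bypassed. Here's a minimal Python sketch of that idea for Linux; the device path, block size, and sample count are all assumptions for illustration, it needs read permission on the device, and the usual O_DIRECT alignment caveats apply:

  import mmap, os, random, statistics, time

  PATH = "/dev/nvme0n1"   # hypothetical device; any readable block device or large file
  BLOCK = 4096
  SAMPLES = 1000

  # O_DIRECT (Linux) bypasses the page cache, so we time the drive, not RAM.
  fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
  size = os.lseek(fd, 0, os.SEEK_END)

  # O_DIRECT wants an aligned buffer; an anonymous mmap is page-aligned.
  buf = mmap.mmap(-1, BLOCK)

  lat_us = []
  for _ in range(SAMPLES):
      offset = random.randrange(size // BLOCK) * BLOCK
      start = time.perf_counter_ns()
      os.preadv(fd, [buf], offset)    # one synchronous 4 KiB read at queue depth 1
      lat_us.append((time.perf_counter_ns() - start) / 1000)
  os.close(fd)

  print(f"median isolated read latency: {statistics.median(lat_us):.1f} us")
  print(f"implied queue depth 1 IOPS:   {1e6 / statistics.mean(lat_us):.0f}")

Even this simple measurement illustrates the reproducibility problem: the numbers include the system call and kernel block layer overheads of the particular machine and kernel you run it on, which is exactly why such figures are hard to compare across systems.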
At the same time, the latency of isolated requests is often important for practical system performance, especially as drives get faster and faster (and so spend less and less time with queued requests for a given load level). The latency of isolated random reads is especially relevant, since random reads are often synchronous in practice because some piece of software is waiting for the result and can't proceed without it. For instance, walking through many on-disk data structures (including filesystem directory trees for pathname lookups) is random but almost always synchronous for the code doing it.
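To make that concrete, here's a deliberately simplified Python sketch of walking a chain of on-disk records where each record holds the offset of the next one (the record format is invented purely for illustration). Because every read depends on the result of the previous one, the walk can't be overlapped, so its total time is roughly the number of hops times the isolated read latency, no matter how many IOPS the drive could deliver with a deep queue:

  import os
  import struct

  BLOCK = 4096

  def walk_chain(fd, offset):
      """Follow a chain of (invented) on-disk records, where the first 8 bytes
      of each block hold the offset of the next block and 0 means 'end'."""
      hops = 0
      while offset != 0:
          data = os.pread(fd, BLOCK, offset)          # synchronous: we must wait for it
          (offset,) = struct.unpack_from("<Q", data)  # next offset comes from this block
          hops += 1
      return hops

  # With an isolated read latency of L, walking N records costs about N * L;
  # there is no queue depth to exploit because the next offset isn't known
  # until the current read completes.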