How we're dealing with our expiring OpenVPN TLS root certificate
Recently I wrote about my failure to arrange a graceful TLS root certificate rollover for our OpenVPN servers. This might leave you wondering what we're doing about this instead, and the answer is that we've opted to use a brute force solution, because we know it works.
Our brute force solution is to set up a new set of OpenVPN servers (we have two of them for redundancy) under a new official name, with a new TLS root certificate that is good for quite a while (I opted to be cautious and not cross into 2038) and, along with it, a new host certificate. With the new servers set up and in production, we've updated our support site to use the new official name and the new TLS root certificate, so people who set up OpenVPN from now onward will be using the new environment.
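(The caution about 2038 is presumably the usual one: a signed 32-bit time_t runs out in January 2038, and some software may still mishandle certificate times past that point. As a minimal Python sketch of the check, with a made-up candidate expiry date rather than our actual certificate's:)

```python
from datetime import datetime, timezone

# Largest value a signed 32-bit time_t can hold, in seconds since the Unix epoch.
MAX_32BIT_TIME_T = 2**31 - 1
LIMIT = datetime.fromtimestamp(MAX_32BIT_TIME_T, tz=timezone.utc)


def fits_in_32bit_time_t(not_after: datetime) -> bool:
    """True if this expiry time is representable in a signed 32-bit time_t."""
    return not_after <= LIMIT


# Hypothetical expiry date, chosen only for illustration.
candidate = datetime(2037, 12, 31, 23, 59, 59, tzinfo=timezone.utc)

print(LIMIT.isoformat())                 # 2038-01-19T03:14:07+00:00
print(fits_in_32bit_time_t(candidate))   # True
```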
Since these servers are using a new official name, they and the current (old) OpenVPN servers can operate at the same time. People with the new client configuration go through our new servers; people with the old client configuration and old TLS certificate go through our old servers. There's no flag day where we have to change the TLS root certificate on the old servers, and in fact they won't change; we're going to run them as-is right up until the TLS root certificate expires and no one can connect to them any more.
This leaves us with all of the people who are currently using our old OpenVPN servers with the expiring TLS root certificate. We're just going to have to contact all of them and ask them to update (ie change) their client configuration, changing the OpenVPN server name and getting and installing the new TLS root certificate. This is not quite as bad as it might sound, because we were always going to have to contact the current people to get them to update their TLS root certificate. So they only have to do one extra thing, although that extra thing may be quite a big pain.
(Some environments have nice, simple OpenVPN configuration systems. But on some platforms, the configuration is 'open a text editor and ...', and one of them is probably not one you're thinking of.)
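For the 'open a text editor' case, the change people have to make amounts to editing two directives in their client configuration: 'remote' (the server name) and 'ca' (the file with the TLS root certificate). Here is a rough Python sketch of that edit. The host names and file names are invented for illustration, and it assumes the root certificate is referenced through a 'ca' line rather than embedded inline in the configuration file.

```python
def update_client_config(text: str, new_server: str, new_ca_path: str) -> str:
    """Point an OpenVPN client config at a new server name and root certificate.

    Only plain 'remote' and 'ca' directive lines are rewritten; everything
    else (including the port and protocol on 'remote' lines) is kept as-is.
    """
    out = []
    for line in text.splitlines():
        words = line.split()
        if len(words) >= 2 and words[0] == "remote":
            words[1] = new_server          # keep any port/protocol arguments
            line = " ".join(words)
        elif len(words) >= 2 and words[0] == "ca":
            line = f"ca {new_ca_path}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical old configuration; these names are placeholders.
old_config = "client\nremote vpn-old.example.org 1194 udp\nca old-root.crt\n"
new_config = update_client_config(old_config, "vpn-new.example.org", "new-root.crt")
print(new_config, end="")
```

People still have to fetch the new root certificate file themselves; a script like this only changes which file and which server the configuration points at.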
Doing the change this way 'costs' us two extra servers for a while, which we have to spare, and more importantly it meant that we needed a new official name for our OpenVPN service. This time around this was acceptable, because our old official name was in retrospect perhaps not the best option. If we have to do this again, we may have a harder time coming up with a good new name, but hopefully next time around we'll be able to roll over the TLS root certificate instead of having to start all over from scratch.
(From my perspective, the most annoying thing about this is that I just rebuilt the OpenVPN servers in January in order to update them to a modern OpenBSD. If I'd known all of this back then, we could have gone straight to our new end state and saved one round of building and installing machines.)
TLS certificate durations turn out to be complicated and subtle
The surprising TLS news of the time interval is that Let's Encrypt made a systematic mistake in issuing all of their TLS certificates for years. Instead of being valid for exactly 90 days, Let's Encrypt certificates were valid for 90 days plus one second. This isn't a violation of the general requirements for Certificate Authorities on how long TLS certificates can be, but it was a violation of Let's Encrypt's own certificate policy.
TLS certificates have 'not before' and 'not after' times. For ages, Let's Encrypt (and almost everyone else) has been generating these times by taking a start time and adding the desired duration to it. You can see an example of this in some completely unrelated code in my entry on how TLS certificates have two internal representations of time, where the certificate starts and ends on the same hour, minute, and second (19:40:26 in the entry). However, it turns out that the TLS certificate time range includes both the start and the end times; it's not 'from the start time up to but not including the end time'. Since this includes both the second at the start and the second at the end, a simple 'start time plus duration' is one second too long.
(A properly issued literal 90 day certificate from Let's Encrypt now has a not after seconds value that's one lower than its not before seconds value, for example having a not before of 2021-06-10 15:31:37 UTC and a not after of 2021-09-08 15:31:36 UTC.)
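The arithmetic is easy to check for yourself. This is a small Python sketch using the example times above; the helper function is mine, not anything from Let's Encrypt's code, and it simply counts whole seconds with both endpoints included.

```python
from datetime import datetime, timedelta, timezone

NINETY_DAYS = timedelta(days=90)
NINETY_DAYS_SECONDS = 90 * 24 * 60 * 60


def inclusive_seconds(not_before: datetime, not_after: datetime) -> int:
    """Count the whole seconds in the range, including both endpoints."""
    return int((not_after - not_before).total_seconds()) + 1


not_before = datetime(2021, 6, 10, 15, 31, 37, tzinfo=timezone.utc)

# The traditional approach: start time plus duration.
naive_not_after = not_before + NINETY_DAYS
# The corrected approach: back off one second so the inclusive range is exact.
exact_not_after = not_before + NINETY_DAYS - timedelta(seconds=1)

print(exact_not_after)  # 2021-09-08 15:31:36+00:00
print(inclusive_seconds(not_before, naive_not_after) - NINETY_DAYS_SECONDS)  # 1
print(inclusive_seconds(not_before, exact_not_after) - NINETY_DAYS_SECONDS)  # 0
```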
This is already a tricky issue, but the Mozilla bug gets into an even trickier one, which is fractional seconds. If a certificate has a 'not after' of 15:31:36, is it valid right up until 15:31:37.000, or does it stop being valid at some time after 15:31:36.000 but before 15:31:37.000? The current answer is that it's valid all the way up to but not including 15:31:37.000, per Ryan Sleevi's comment, but there's some discussion of that view in general and it's possible there will be a revision to consider these times to be instants.
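To make the two readings concrete, here is a hedged Python sketch of both. The function names are mine, and this only illustrates the two interpretations under discussion; it isn't how any particular TLS library is written.

```python
from datetime import datetime, timedelta, timezone


def valid_whole_second(now: datetime, not_after: datetime) -> bool:
    """'not after' names a whole second; valid until the next second begins."""
    return now < not_after + timedelta(seconds=1)


def valid_instant(now: datetime, not_after: datetime) -> bool:
    """'not after' is a single instant; anything later is too late."""
    return now <= not_after


not_after = datetime(2021, 9, 8, 15, 31, 36, tzinfo=timezone.utc)
half_a_second_later = not_after + timedelta(milliseconds=500)

print(valid_whole_second(half_a_second_later, not_after))  # True
print(valid_instant(half_a_second_later, not_after))       # False
```

The two readings only disagree for the fraction of a second after the not after time, which is why this mostly matters to people writing requirements.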
(People are by and large ignoring leap seconds, because everyone ignores them.)
All of this careful definition of not before and not after lives in the abstract world of RFCs and requirements for Certificate Authorities, and isn't necessarily what actual software does. Some versions of OpenSSL apparently treat both the not before and not after times as exclusive when validating TLS certificates (cf); the time must be after the not before time and before the not after time. Other software may have similar issues, especially treating the not after time as the point where the certificate becomes invalid. I would like to say that it also doesn't matter in actual practice, but with TLS's luck someone is eventually going to find an attack that exploits this. Weird things happen in the TLS world.
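The difference between the two kinds of check only shows up at the boundary seconds. A minimal Python sketch, assuming a clock that reports whole seconds; this is my illustration of the two comparisons, not code taken from OpenSSL or anything else:

```python
from datetime import datetime, timezone


def valid_inclusive(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """Both endpoints count as valid, as the requirements describe."""
    return not_before <= now <= not_after


def valid_exclusive(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """Both endpoints are rejected; the time must be strictly in between."""
    return not_before < now < not_after


not_before = datetime(2021, 6, 10, 15, 31, 37, tzinfo=timezone.utc)
not_after = datetime(2021, 9, 8, 15, 31, 36, tzinfo=timezone.utc)

for now in (not_before, not_after):
    # Inclusive accepts both boundary seconds, exclusive rejects both.
    print(now, valid_inclusive(now, not_before, not_after),
          valid_exclusive(now, not_before, not_after))
```

So software with the exclusive check gives such a certificate two fewer seconds of validity than it's supposed to have.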
PS: Let's Encrypt's just-updated CPS (their Certification Practice Statement) deals with the whole issue by simply saying they will issue certificates for less than 100 days.
PPS: Some certificate reporting software may not even print the seconds for the not before and not after fields. I can't entirely blame it, even though that's currently a bit inconvenient.