2019-05-12
What we'll want in a new Let's Encrypt client
Over on Twitter, I said:
It looks like we're going to need a new Let's Encrypt client to replace acmetool (which we love); acmetool uses the v1 API and seems to no longer be actively developed, and the v1 API runs into problems in November: <link: End of Life Plan for ACMEv1>
(There is an unfinished ACMEv2 branch of acmetool, but it has not been completed. It would be ideal if the community stepped forward to continue acmetool development, but sadly I don't see signs of that happening so far and I can't help with such work myself.)
November is when Let's Encrypt will turn off new account registrations through ACMEv1, which is a problem for us because we don't normally re-use Let's Encrypt accounts (for good reasons, and because it's easier). So in November, we would stop being able to install acmetool on new machines without changing our procedures to deliberately reuse accounts. Since that would only postpone the problem, we should get a new client instead. As it happens, we would like something that is as close to acmetool as possible, because acmetool is basically how we want to handle things.
Rather than try to write a lot of words about why we like acmetool so much (with our custom configuration file), I think it's simpler to demonstrate it by showing you the typical install steps for a machine:

    apt-get install acmetool
    mkdir /var/lib/acme/conf
    cp <master>/responses /var/lib/acme/conf/
    acmetool quickstart
    acmetool want NAME1 ALIAS2 ...
(Alternately, we copy /var/lib/acme from the live version of the server. We may do both, using 'acmetool want' during testing and then overwriting it with the official version when we go to production.)
After this sequence, we have a new Let's Encrypt account, a cron job that automatically renews certificates at some random time of the day when they are 30 days (or less) from expiry, and a whole set of certificates, intermediate chains, and keys accessible through /var/lib/acme/live/<NAME1>/ and so on, with appropriate useful permissions (keys are root only normally, but everything else is generally readable). When a certificate is renewed, acmetool will reload or restart any potentially certificate-using service that is active on the machine. If we want to add additional certificates for different names, that's another 'acmetool want NAME2' (and then the existing cron job automatically renews them). All of this works on machines that aren't running a web server as well as machines that are running a properly configured one (and these days the Ubuntu 18.04 acmetool package sets that up for Apache).
(We consider it a strong feature that acmetool doesn't otherwise attempt to modify the configurations of other programs to improve their ability to automatically do things with Let's Encrypt certificates.)
Acmetool accomplishes this with a certain amount of magic. Not only does it keep track of state (including what names you want certificates for, even if you haven't been able to get them yet), but it also has some post-issuance hook scripts that do that magic reloading. The reloading is blind (if you're running Apache, it gets restarted whether or not it's using TLS or acmetool's certificates), but this hasn't been a problem for us and it sure is convenient.
We can probably duplicate a lot of this by using scripts on top of some other client, such as lego. But I would like us to not need a collection of home-grown scripts (and likely data files) to mimic the simplicity of operation that acmetool provides. Possibly we should explore Certbot, the more or less officially supported client, despite my long-ago previous experiences with it as a heavyweight, complex, and opinionated piece of software that wanted to worm its way into your systems. Certbot seems like it supports all of what we need and can probably be made to cooperate, and it has a very high chance of continuing to be supported in the future.
(A lot of people like minimal Let's Encrypt clients that leave you to do much of the surrounding work yourself. We don't, partly because such additional work adds many more steps to install instructions and opens the door to accidents like getting a certificate but forgetting to add a cron job that renews it.)
(My only experimentation with Certbot was so long ago that it wasn't called 'certbot' yet. I'm sure that a lot has changed since then, and that may well include the project's focus. At the time I remember feeling that the project was very focused on people who were entirely new to TLS certificates and needed a great deal of hand-holding and magic automation, even if that meant Certbot modifying their system in all sorts of nominally helpful ways.)
Committed address space versus active anonymous pages in Linux: a mystery
In Linux, there are at least two things that can happen when your system runs out of memory (or at least the kernel thinks it has): the kernel can activate the Out-of-Memory killer, killing one or more processes but leaving the rest alone, or it can start denying new allocation requests, which causes a random assortment of programs to start failing. As I found out recently, systems with strict overcommit turned on can still trigger the OOM killer, depending on your settings for how much memory the system is allowed to commit (see here). Normally, systems with strict overcommit turned off don't get themselves into situations where they're so out of memory that they start denying allocation requests.
Starting early this morning, some of our compute servers have periodically been reporting 'out of memory, cannot allocate/fork/etc' sorts of errors. There are two things that make this unusual. The first is that these are single-user compute servers, where we turn strict overcommit off; as a result, I would expect them to trigger the OOM killer but never actually run out of memory and start refusing allocations. The second is that according to all of the data I have, these machines have only modest and flat use of committed address space, which is my usual proxy for 'how much memory programs have allocated'.
(The kernel tracks committed address space even when strict overcommit is off, and while it doesn't necessarily represent how much memory programs actually need, it should normally be an upper bound on how much they can use. In fact until today I would have asserted that it definitely was.)
These machines have 96 GB of RAM, and during an incident I can see the committed address space be constant at 3.7 GB while /proc/meminfo's MemAvailable declines to 0 and its Active and Active(anon) numbers climb up to 90 GB or so. I find this quite mysterious, because as far as I understand Linux memory accounting, it should be impossible to have anonymous pages that are not part of the committed address space. You get anonymous pages by operations such as a MAP_ANONYMOUS mmap(), and those are exactly the operations that the kernel is supposed to carefully account for in working out Committed_AS, for obvious reasons.
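(For the record, all of the numbers here come straight out of /proc/meminfo. The following little C program is only an illustrative sketch of pulling out the three fields in question so they can be watched over time; it is not a tool we actually run, and grep does the same job interactively.)

    /* Illustrative sketch: print the /proc/meminfo fields discussed above. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *want[] = { "Committed_AS:", "MemAvailable:", "Active(anon):" };
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");
        if (!f) {
            perror("/proc/meminfo");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            for (size_t i = 0; i < sizeof(want) / sizeof(want[0]); i++) {
                /* Print a line if it starts with one of the field names. */
                if (strncmp(line, want[i], strlen(want[i])) == 0)
                    fputs(line, stdout);
            }
        }
        fclose(f);
        return 0;
    }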
Inspecting /proc/<pid>/smaps and other data for the sole gigantic Python process currently running on such a machine says that it has a resident set size of 91 GB, a significant number of 'rw-' anonymous mappings (roughly 96 GB worth, mostly in 64 MB mappings), and on hand inspection, a surprising number of those mappings have a VmFlags: field that does not have the ac flag that apparently is associated with an 'accountable area' (per the proc(5) manpage and other documentation). I don't know if not having an ac flag causes an anonymous mapping to not count against committed address space, but it seems plausible, or at least the best theory I currently have.
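To make that hand inspection less tedious, something like the following rough C sketch can total up a process's writable mappings from /proc/<pid>/smaps by whether or not their VmFlags line includes 'ac'. This was written for this entry and is only a guess at a useful breakdown, not a tool we rely on; for simplicity it counts all writable mappings, not just anonymous ones.

    /* Rough sketch: sum the sizes of a process's writable mappings, split by
     * whether the mapping's VmFlags line contains the 'ac' flag. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        char path[64], line[512], perms[8] = "";
        long size_kb = 0, acct_kb = 0, noacct_kb = 0;
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s PID\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
        f = fopen(path, "r");
        if (!f) { perror(path); return 1; }

        while (fgets(line, sizeof(line), f)) {
            unsigned long start, end;
            char p[8];
            /* A mapping header looks like "START-END PERMS OFFSET DEV INODE ...". */
            if (sscanf(line, "%lx-%lx %7s", &start, &end, p) == 3) {
                strcpy(perms, p);
                size_kb = (long)((end - start) / 1024);
            } else if (strncmp(line, "VmFlags:", 8) == 0 && perms[1] == 'w') {
                /* Attribute this writable mapping by whether 'ac' is present. */
                if (strstr(line, " ac"))
                    acct_kb += size_kb;
                else
                    noacct_kb += size_kb;
            }
        }
        fclose(f);
        printf("writable mappings with 'ac':    %ld kB\n", acct_kb);
        printf("writable mappings without 'ac': %ld kB\n", noacct_kb);
        return 0;
    }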
(It would help if I could create such mappings myself to test what happens to the committed address space and so on, but so far I have only a vague theory that perhaps they can be produced through use of mremap() with MAP_PRIVATE and MREMAP_MAYMOVE on a MAP_SHARED region. This is where I need to write a C test program, because sadly I don't think I can do this through something like Python. Python can do a lot of direct OS syscall testing, but playing around with memory remapping is asking a bit much of it.)