Wandering Thoughts

2023-02-07

What I want in Prometheus (as a whole) is aggregating alert notifications

I recently looked at Prometheus's new feature to keep alerts firing for a while (often to avoid flapping alerts) and in the process realized that it wasn't really what I want. The simple way to put it is that what I care about getting less of is not the alerts themselves, but the alert notifications. And for that, what I really want is notifications that can (at some point) aggregate together information about multiple alerts over time. Instead of getting one notification each time the alert triggers, perhaps I would get one notification every twenty minutes telling me, say, that the alert triggered three times in the last twenty minutes and was active for a total of seven minutes (and I can look at a dashboard if I want to know exactly when). This preserves relatively precise alert times in Prometheus itself while not dumping too many notifications on us.

(Specifically it preserves accurate details about when alerts were firing in the metrics database.)

This aggregation obviously can't happen in Prometheus itself; Prometheus cares about alerts, not alert notifications. It also doesn't really fit in the current model for Alertmanager. Alert aggregation over time and how to present it in notifications is a complex area; trying to put this in Alertmanager would add a lot of complication to a core component that a lot of people are pretty happy with today (us included). Practically speaking this probably needs to be a separate component that will keep its own ongoing database of (recent) past alerts, notification times, and so on; the obvious implementation approach today would be as an Alertmanager webhook.

(The list of webhook receivers includes a logging one and one that dumps alerts into MySQL, which I want to note since I've looked at it now.)

With that said, you can do some alert aggregation today in Prometheus if you're willing to have 'alerts' that don't always turn off (or perhaps turn on) when the underlying condition does. You can, for example, suppress or extend an alert when it has triggered enough times in the recent past through creative use of the changes() function (and I mentioned this possibility back in my entry on maybe avoiding flapping alerts). This will indirectly 'aggregate' notifications about the alert triggering and resolving by not actually resolving and then re-triggering the alert.
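
(For illustration, here's a minimal sketch of the 'extend the alert' shape of this trick; the metric name, thresholds, and time ranges are invented and aren't from our actual rules:)

- alert: PingFlapping
  # Fire while the probe is failing, or keep firing if the probe has
  # changed state at least four times in the past twenty minutes, so
  # that brief recoveries don't resolve the alert only to have it
  # re-trigger (and re-notify) moments later.
  expr: probe_success == 0 or changes(probe_success[20m]) >= 4
  for: 3m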

Within Alertmanager, your only 'aggregation' choice today is a long group_interval. This may be tolerable if you don't care about getting relatively promptly notified about resolved alerts. Unfortunately, from what I remember of the Alertmanager code involved here it would be hard to have a second version of group_interval that only applied to resolved alerts.
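
(For context, group_interval lives in Alertmanager's routing configuration; a minimal sketch with made-up values:)

route:
  receiver: default
  group_by: ['alertname']
  group_wait: 30s
  # Updates to a group that has already been notified about, including
  # resolved alerts, are batched and sent at most this often.
  group_interval: 20m
  repeat_interval: 4h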

(I wouldn't say that Alertmanager is 'stateless', but it does try to keep relatively little state, especially once things are over. This is sensible if you're in a large scale environment where a ton of alerts from a ton of different groups go sluicing through the system.)

Since there's no way to do it today, I haven't thought very much about what we'd want in a hypothetical alert notification aggregation environment. There's an obvious tradeoff between prompt notifications of a new situation and aggregating quick-cycling alerts together, so we'd probably want no aggregation to happen until an alert had bounced around 'too much' in the recent past. Or maybe this should be phrased as 'only send N notifications about any particular group of alerts in X minutes', so you'd have an initial notification budget that could be used up in individual alert and resolution notifications, but once you'd hit the rate limit, things would get aggregated.

(Rate-limiting separate alert notifications strikes me as a useful mental model, although as mentioned I wouldn't want rate-limited notifications to disappear entirely; I'd want some sort of summary of them. A crude approach would be to append all of the individual notifications together, following the old model of getting mailing list messages in periodic digests.)

PrometheusAlertsAndAggregation written at 22:57:48

2023-02-06

Rsync'ing (only) some of the top level pieces of a directory

Suppose, not hypothetically, that you have a top level directory which contains some number of subdirectories, and you want to selectively create and maintain a copy of only part of this top level directory. However, what you want to copy changes over time, and you want unwanted things to disappear on the destination (because otherwise they'll stick around using up space that you need for things you care about). Some of the now-unwanted things will still exist on the source but you don't want them on the copy any more; others will disappear entirely on the source and need to disappear on the destination too.

This sounds like a tricky challenge with rsync but it turns out that there is a relatively straightforward way to do it. Let's say that you want to decide what to copy based (only) on the modification time of the top level subdirectories; you want a copy of all recently modified subdirectories that still exist on the source. Then what you want is this:

cd /data/prometheus/metrics2
find * -maxdepth 0 -mtime -365 -print |
 sed 's;^;/;' |
  rsync -a --delete --delete-excluded \
        --include-from - --exclude '/*' \
        . backupserv:/data/prometheus/metrics2/

Here, the 'find' prints everything in the top level directory that's been modified within the last year. The 'sed' takes that list of names and sticks a '/' on the front, turning names like 'wal' into '/wal', because to rsync this definitely anchors them to the root of the directory tree being (recursively) transferred (per rsync's Pattern Matching Rules and Anchoring Include/Exclude Patterns). Finally, the rsync command says to delete now-gone things in directories we transfer, delete things that are excluded on the source but present on the destination, include what to copy from standard input (ie, our 'sed'), and then exclude everything that isn't specifically included.

(All of this is easier than I expected when I wrote my recent entry on discovering this problem; I thought I might have to either construct elaborate command line arguments or write some temporary files. That --include-from will read from standard input is very helpful here.)

If you don't think to check the rsync manual page, especially its section on Filter Rules, you can have a little rsync accident because you absently think that rsync is 'last match wins' instead of 'first match wins' and put the --exclude before the --include-from. This causes everything to be excluded, and rsync will dutifully delete the entire multi-terabyte copy you made in your earlier testing, because that's what you told it to do when you used --delete-excluded.
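
(A generic precaution, and not something from my original process, is to do a dry run first and look at what rsync says it would delete before letting it loose with --delete-excluded:)

cd /data/prometheus/metrics2
find * -maxdepth 0 -mtime -365 -print |
 sed 's;^;/;' |
  rsync -a -n -v --delete --delete-excluded \
        --include-from - --exclude '/*' \
        . backupserv:/data/prometheus/metrics2/ | grep deleting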

(In general I should have carefully read all of the rsync manual page's various sections on pattern matching and filtering. It probably would have saved me time, and it would definitely have left me better informed about how rsync actually behaves.)

RsyncRecentDirectoryContents written at 23:08:38

2023-02-05

Some things on Prometheus's new feature to keep alerts firing for a while

In the past I've written about maybe avoiding flapping Prometheus alerts, which is a topic of interest to us for obvious reasons. One of the features in Prometheus 2.42.0 is a new 'keep_firing_for' setting for alert rules (documented in Recording rules, see also the pull request). As described in the documentation, it specifies 'how long an alert will continue firing after the condition that triggered it has cleared' and defaults to being off (0 seconds).

The obvious use of 'keep_firing_for' is to avoid having your alerts flap too much. If you set it to some non-zero value, say a minute, then if the alert condition temporarily goes away only to come back within a minute, you won't potentially wind up notifying people that the alert went away and then notifying them again that it came back. I say 'potentially', because when you can get notified about an alert going away is normally quantized by your Alertmanager group_interval setting. This simple alert rule setting can replace more complex methods of avoiding flapping alerts, and so various people will likely use it.
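
(In an alert rule this looks something like the following sketch; the metric and the durations are invented for illustration:)

- alert: HostDown
  expr: up{job="node"} == 0
  for: 5m
  # Once firing, stay firing until the condition has been clear for a
  # full ten minutes, instead of resolving immediately.
  keep_firing_for: 10m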

When 2.42.0 came out recently with this feature, I started thinking about whether we would use it. My reluctant conclusion is that we probably won't in most places, because it doesn't do quite what we want and it has some side effects that we care about (although these side effects are the same as most of the other ways of avoiding flapping alerts). The big side effect is that this doesn't delay or suppress notifications about the alert ending; it delays the end of the alert itself. The delay in notification is a downstream effect of the alert itself remaining active. If you care about being able to visualize the exact time ranges of alerts in (eg) Grafana, then artificially keeping alerts firing may not be entirely appealing.

(This is especially relevant if you keep your metrics data for a long time, as we do. Our alert rules evolve over time, so without a reliable ALERTS metric we might have to go figure out the historical alert rule to recover the alert end time for a long-past alert.)

This isn't the fault of 'keep_firing_for', which is doing exactly what it says it does and what people have asked for. Instead it's because we care (potentially) more about delaying and aggregating alert notifications than we do about changing the timing of the actual alerts. What I actually want is something rather more complicated than Alertmanager supports, and is for another entry.

PrometheusOnExtendingAlerts written at 22:55:15

2023-02-04

The practical appeal of a mesh-capable VPN solution

The traditional way to do a VPN is that your VPN endpoint ('server') is the single point of entry for all traffic from VPN clients. When a VPN client talks to anything on your secured networks, it goes through the endpoint. In what I'm calling a mesh-capable VPN, you can have multiple VPN endpoints, each of them providing access to a different network area or service. Because it's one VPN, you still have a single unified client identity and authentication and a single on/off button for the VPN connection on clients.

(WireGuard is one of the technologies that can be used to implement a mesh-capable VPN. WireGuard can be used to build a full peer to peer mesh, not just a VPN-shaped one.)
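
(To make that concrete, a wg-quick style client configuration for such a setup might look roughly like this, with one [Peer] per endpoint; all of the names, keys, and addresses here are invented:)

[Interface]
PrivateKey = <client private key>
Address = 10.9.0.12/32

# The general purpose endpoint that covers most internal networks.
[Peer]
PublicKey = <endpoint A public key>
Endpoint = vpn-a.example.org:51820
AllowedIPs = 10.1.0.0/16

# A per-service endpoint running on a critical server itself, so the
# service stays reachable over the VPN as long as the server is up.
[Peer]
PublicKey = <endpoint B public key>
Endpoint = important-service.example.org:51820
AllowedIPs = 10.2.3.4/32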

A standard, non-mesh VPN is probably going to be simpler to set up and it gives you a single point of control and monitoring over all network traffic from VPN clients. Despite that, I think that mesh-capable VPNs have some real points of appeal. The big one is that you don't have to move all of your VPN traffic through a single endpoint. Instead you can distribute the load of the traffic across multiple endpoints, going right down to individual servers for particular services. As an additional benefit, this reduces the blast radius of a given VPN endpoint failing, especially if you give critical services their own on-machine VPN endpoints so that if the service is up, people can reach it over the VPN.

This is probably not a big concern if your VPN isn't heavily or widely used. It becomes much more important if you expect many people to access most of your services and resources over your VPN, for example because you've decided to make your VPN your primary point of Multi-Factor Authentication (so that people can MFA once to the VPN and then be set for access to arbitrary internal services). If you're expecting hundreds of people to send significant traffic through your VPN to heavily used services, you're normally looking at a pretty beefy VPN server setup. If you can use a mesh-capable VPN to offload that traffic to multiple endpoints, you can reduce your server needs. If you can push important, heavy traffic down to the individual servers involved, this can take your nominal 'VPN endpoint' mostly out of the picture except for any initial authentication it needs to be involved in.

Another feature of a mesh-capable VPN is that the VPN endpoints don't even have to all be on the same network. For example, if you split endpoints between internal and external traffic, you could put the external traffic VPN endpoint in a place that's outside of your regular network perimeter (and so isn't contending for perimeter bandwidth and firewall capacity). In some environments you wouldn't care very much about external traffic and might not even support it, but in our environment we need to let people use our IPs for their outgoing traffic if they want to.

A mesh-capable VPN can also be used for additional tricks if you can restrict access to parts of the mesh based on people's identities. This can be useful to screen access to services that have their own authentication, or to add authentication and access restrictions to services that are otherwise open (or at least have uncomfortably low bars on their 'authentication', and perhaps you don't trust their security). If you can easily extract identification information from the VPN's mesh, you could even delegate authentication to the VPN itself rather than force people to re-authenticate to services.

(In theory this can be done with a normal VPN endpoint too, but in practice there are various issues, including a trust issue where everyone else has to count on the VPN endpoint always assigning the right IP to the right person and doing the access restrictions right. In practice people will likely be more comfortable with a bunch of separate little VPNs; there's the general use one, the staff one, the one a few people can use to get access to the subnet of laboratory equipment that has never really heard of 'access control', and so on.)

VPNMeshAppeal written at 22:15:28

2023-02-02

A gotcha when making partial copies of Prometheus's database with rsync

A while back I wrote about how you can sensibly move or copy Prometheus's time series database (TSDB) with rsync. This is how we moved our TSDB, with metrics data back to late 2018, from a mirrored pair of 4 TB HDDs on one server to a mirrored pair of 20 TB HDDs on another one. In that entry I also mentioned that we were hoping to use this technique to keep a partial backup of our TSDB, one that covered the last year or two. It turns out that there is a little gotcha in doing this that makes it trickier than it looks.

The ideal way to do such a partial backup would be if rsync could exclude or include files based on their timestamp. Unfortunately, as far as I know it can't do that. Instead the simple brute force way is to use find to generate a list of what you want to copy and feed that to rsync:

cd /data/prometheus/metrics2
rsync -a \
   $(find * -maxdepth 0 -mtime -365 -type d -print) \
   backupserv:/data/prometheus/metrics2/

As covered (more or less) in the Prometheus documentation on local storage, the block directories in your TSDB are frozen after a final 31-day compaction, and conveniently their final modification time is when that last 31-day compaction happened. The find with '-maxdepth 0' filters the command line arguments down to only things a year or less old; this catches the frozen block directories for the past year (and a bit), plus the chunks_head directory of the live block and the wal directory of the write-ahead log.

However, it also captures other block directories. Blocks initially cover two hours, but are then compacted down repeatedly until they eventually reach their final 31-day compaction. During this compaction process you'll have a series of intermediate blocks, each of which is a (sub)directory in your TSDB top level directory. Most of these intermediate block directories will be removed over time. Well, they'll be removed over time in your live TSDB; if you replicate your TSDB over to your backupserv the way I have, there's nothing that's going to remove them on your backup. These directories for intermediate blocks will continue to be there in your backup, taking up space and containing duplicate data (which may cause Prometheus to be unhappy with your overall TSDB if you ever have to use this backup copy).

This can also affect you if you repeatedly rsync your entire TSDB without using '--delete'. Fortunately I believe I used 'rsync -a --delete' when moving our TSDB over.

The somewhat simple and relatively obviously correct approach to dealing with this is to send over a list of the directories that should exist to the backup server, and have something on the backup server remove any directories not listed. You'd want to make very sure that you've sent and received the entire list, so that you don't accidentally remove actually desired bits of your backups.
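
(A rough and untested sketch of this, with made-up paths; the important property is that the removal step on the backup server is separate and can be audited before anything is deleted:)

cd /data/prometheus/metrics2
find * -maxdepth 0 -type d -mtime -365 -print | sort > /tmp/tsdb-wanted
scp /tmp/tsdb-wanted backupserv:/tmp/tsdb-wanted

# Then on backupserv, in its /data/prometheus/metrics2:
#   ls | sort | comm -23 - /tmp/tsdb-wanted
# lists everything present that isn't on the wanted list, to be
# inspected (and perhaps removed) by hand or by a careful script.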

The more tricky approach would be to have rsync do the deletion as part of the transfer. Instead of selectively transferring named directories on the command line, you'd build an rsync filter file that only included directories that were the right age to be transferred, and then use that filter as you transferred the entire TSDB directory with rsync's --delete-excluded argument. This would automatically clean up both 31-day block directories that were now too old and young block directories that had been compacted away.

(You'd still determine the directories to be included with find, but you'd have to do more processing with the result. You could also look for directories that were too old, and set up an rsync filter that excluded them.)
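
(A sketch of that exclude-based variant; note that find's day rounding means '-mtime +365' isn't an exact complement of '-mtime -365', so you'd want to pick the boundary carefully:)

cd /data/prometheus/metrics2
find * -maxdepth 0 -type d -mtime +365 -print |
 sed 's;^;/;' |
  rsync -a --delete --delete-excluded \
        --exclude-from - \
        . backupserv:/data/prometheus/metrics2/

Here --delete cleans up intermediate block directories that have been compacted away on the source, while --delete-excluded removes blocks that have aged out of the one year window.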

I'm not sure what approach we'll use. I may want to prototype both and see which feels more dangerous. The non-rsync approach feels safer, because I can at least have the remote end audit what it's going to delete for things that are clearly wrong, like deleting a directory that's old enough that it should be a frozen, permanent one.

(Possibly this makes rsync the wrong replication tool for what I'm trying to do here. I don't have much exposure to alternatives, though; rsync is so dominant in this niche.)

PrometheusMovingTSDBWithRsyncII written at 22:56:47

2023-01-30

One reason I still prefer BIOS MBR booting over UEFI

Over on the Fediverse I said something I want to elaborate on:

One of the reasons I still prefer BIOS MBR booting over UEFI is that UEFI firmware is almost always clever and the failure mode of clever is 💥. I dislike surprises and explosions in my boot process.

Old fashioned BIOS MBR booting is very simplistic but it's also very predictable; pretty much the only variable in the process is which disk the BIOS will pick as your boot drive. Once that drive is chosen, you'll know exactly what will get booted and how. The MBR boot block will load the rest of your bootloader (possibly in a multi-step process) and then your bootloader will load and boot your Unix. If you have your bootloader completely installed and configured, this process is extremely reliable.

(Loading and booting your Unix is possibly less so, but that's more amenable to tweaking and also understandable in its own way.)

In theory UEFI firmware is supposed to be predictable too. But even in theory it has more moving parts, with various firmware variables that control and thus change the process (see efibootmgr and UEFI boot entries). If something changes these variables in a way you don't expect, you're getting a surprise, and as a corollary you need to inspect the state of these variables (and find what they refer to) in order to understand what your system will do. In practice, UEFI firmware in the field at least used to do weird and unpredictable things, such as search around on plausible EFI System Partitions, find anything that looked bootable, 'helpfully' set up UEFI boot entries for them, and then boot one of them. This is quite creative and also quite unpredictable. What will this sort of UEFI firmware do if part of the EFI System Partition gets corrupted? Your guess is as good as mine, and I don't like guessing about the boot process.
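
(On Linux the usual tool for inspecting and nudging this state is efibootmgr; a couple of illustrative invocations, with the entry numbers obviously being examples:)

# Show the boot entries, BootOrder, BootCurrent, and so on.
efibootmgr -v

# Boot entry Boot0001 on the next boot only (firmware permitting).
efibootmgr -n 0001

# Change the persistent boot order.
efibootmgr -o 0001,0000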

(There's a wide variety of corruptions and surprises you can have with UEFI. For example, are you sure which disk your UEFI firmware is loading your bootloader from, if you have more than one?)

In theory UEFI could simplify your life by letting you directly boot Unix kernels. In practice you want a bootloader even on UEFI, or at least historically you did, and I doubt that the core issues have changed recently, since Windows also uses a bootloader (which means that there's no pressure on UEFI firmware vendors to make things like frequent updates to EFI variables work).

It's possible that UEFI firmware and the tools to interact with it will evolve to the point where it's solid, reliable, predictable, and easy to deal with. But I don't think we're there yet, not even on servers. And it's hard to see how UEFI can ever get as straightforward as BIOS MBR booting, because some of the complexity is baked into the design of UEFI (such as UEFI boot entries and using an EFI System Partition with a real filesystem that can get corrupted).

BIOSMBRBootingOverUEFI written at 22:40:48

2023-01-27

Some thoughts on whether and when TRIM'ing ZFS pools is useful

Now that I've worked out how to safely discard (TRIM) unused disk blocks in ZFS pools, I can think about if and when it's useful or important to actually do this. In theory, explicitly discarding disk blocks on SSDs speeds up their write performance because it gives the SSD more unused flash storage space it can pre-erase so the space is ready to be written into. So the first observation is that how much TRIM'ing a pool matters depends on how much you're writing to it (well, to filesystems and perhaps zvols in it). If you're writing almost nothing to the pool, you have almost no need of fresh chunks of flash storage.

(As far as I know, TRIM'ing SSDs isn't normally expected to speed up their read performance.)

Next, the amount of help you can get from TRIM'ing SSDs depends on how much space is unused in your ZFS pools, because ZFS can only TRIM unused space. If your pool is 90% full, only 10% of the disk space can be TRIM'd at all. This implies that there's little point in TRIM'ing an almost completely full pool (if you let your pools get that full). On the positive side, triggering a ZFS TRIM of devices in that pool will go quite fast.

(On the negative side, if you scrub after the TRIM, it may take a while because you have lots of data.)

A pretty full pool can still see a significant write volume if people are overwriting existing data, or churning through creating, removing, and recreating files. If you trust ZFS's TRIM support, you might TRIM regularly in order to try to give your SSDs as much explicitly unused space to work with as possible (or even set autotrim on in the pool). On the other hand, if write performance is important to you, you probably should buy bigger SSDs; in general they'll have more headroom for writes.

(I believe that you can preserve this headroom by partitioning the SSDs and only using part of them. Our ZFS fileservers effectively get this some of the time for some SSDs, because we divide our SSDs into standard sized partitions and then use the partitions in ZFS pool vdevs. If a partition isn't assigned to a pool, it will only be written to if it's activated as a spare.)
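
(For completeness, the mechanics of explicitly TRIM'ing a pool are simple; a sketch using an assumed pool name of 'tank':)

# Start an on-demand TRIM of the pool's unused space and then watch
# its progress.
zpool trim tank
zpool status -t tank

# Or have ZFS issue TRIMs continuously as space is freed.
zpool set autotrim=on tank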

A relatively ideal case for using TRIM would be a ZFS pool that's not too full but that sees a significant amount of writes through churn in its data, either through overwriting existing data or through creating and deleting files. You would get the former from things like hosting active virtual machines (which overwrite their virtual disks a lot) and the latter from frequently compiling things in a source tree.

(Because ZFS never overwrites data in place, even repeatedly updating the same blocks in a file (such as a virtual disk image) will eventually write all over the (logical) disk blocks on the SSD and force the SSD to consider them as having real data. Filesystems that will overwrite data in place don't have this behavior, so the SSD may get to keep a lot more logical blocks marked as 'has never been written to'.)

Given ZFS's copy on write behavior, I suspect that it's useful to periodically TRIM even low write volume, relatively empty ZFS pools. This depends a fair bit on how much ZFS reuses disk space over a long time period, but TRIM'ing is probably relatively harmless. Repeatedly TRIM'ing pools that have low write levels and plenty of space free is probably not really necessary, though; with low write volume, mostly what you'll be doing is telling the SSDs things they already know (that the block you TRIM'd before and haven't written to since then is still unused).

For our ZFS fileservers, we're in the process of migrating from 2 TB SSDs to new 4 TB SSDs, which effectively resets the 'TRIM clock' for everything and gives us much more headroom in the form of completely unused partitions. Given this I don't think we're likely to try to TRIM our pools any time soon. Perhaps someday we'll use our metrics system to compare write performance from a year or three ago to write performance today, notice that it's clearly down for some things, and TRIM them.

PS: Much of this logic applies to any filesystem on SSDs, not just ZFS, although ZFS's copy-on-write makes it worse in that it's more likely to touch more of the SSD's logical blocks than other filesystems.

ZFSTrimUsefullnessQuestion written at 21:10:57

2023-01-23

I should always make a checklist for anything complicated

Today I did some work on the disk setup of my home desktop and I got shot in the foot, because when you remove disks from Linux software RAID arrays and then reboot, the boot process may reassemble those RAID arrays using the disks you removed (or even just one disk), instead of the actual live disks in the RAID array. There are a number of reasons that this happened to me, but one of them is that I didn't make a checklist for what I was doing and instead did it on the fly.
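
(For context, the removal itself is only a few commands; this is a generic sketch with made-up device names, not exactly what I typed:)

# Fail and remove the disk (well, partition) from the array.
mdadm /dev/md0 --fail /dev/sdb2
mdadm /dev/md0 --remove /dev/sdb2

# The step that matters for the reboot surprise: wipe the software
# RAID superblock on the removed partition so that nothing can
# reassemble the array from it later.
mdadm --zero-superblock /dev/sdb2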

I had a pair of bad justifications for why I didn't write out a checklist. First, I was doing this to my home desktop, not one of our servers at work, and it felt silly to go through the same process for a less important machine (never mind that it's a very important machine to me, especially as I was working from home at the time). Second, I hadn't planned in advance to make this change; it was an on the fly impulse because I was rebooting the machine anyway for a kernel update. I figured I was experienced with software RAID and I could remember everything I needed to. Obviously I was wrong; this is an issue that I've had at least twice before, and the moment it happened I realized what had gone wrong (but by then it was too late to fix it easily).

(The first time this happened was to one of my desktops but I'm not sure which of them. The second time was when I replaced a bad disk on my home desktop in 2019, and I seem to have forgotten the earlier time when I wrote that entry, since it has no pointer to the first one.)

I knew my software RAID changes were a multi-step process and there were uncertainties in the process. But I didn't take the next mental step to 'I should write up even a trivial little checklist', and so I paid for it with some excitement. Although there were positive bits in the end result, I would still have been better off writing out that checklist, even if I was doing everything on impulse.

If it's not trivial, I should make a checklist even on my home desktop, even if it feels weird. Checklists are a great thing and I should use them more often. Even if I don't completely follow the checklist, writing it will make me think through everything, and that has a much higher chance of jogging my mind about things I've already encountered before.

I have some more disk work to do as a result of the other hardware changes I also did (I'm replacing my remaining spinning rust), and it's not trivial. Since it's not trivial, I'll hopefully write out a checklist this time around instead of winging it. As I've been reminded today, it's too easy to forget things when you're working on the fly, and even the small stuff can benefit from not making mistakes.

(I should probably write more checklists when doing things on my desktops, but in my defense I rarely touch them for anything more intricate than Fedora kernel updates. Normal people have simple kernel updates, apart from needing to reboot; I complicate my life by also updating my ZFS on Linux version at the same time. At least it's simpler than it used to be.)

PS: These days I do make a checklist for my process of upgrading Fedora versions, partly because the post-upgrade steps have gotten complicated enough. Although now that I'm writing this, I have to admit that my current checklists are only for the post-upgrade parts. I should update the checklist to include the pre-upgrade and during upgrade parts, especially since I have in the past forgotten to do some of them.

AlwaysMakeAChecklist written at 22:31:35

2023-01-22

How Let's Encrypt accounts are linked to your certificates in Certbot

If life is simple, every machine you run will have its own Let's Encrypt account and you'll never do things like copy or move a TLS certificate (and possibly much or all of /etc/letsencrypt) from one machine to another. If you do wind up moving LE TLS certificates and perhaps all of Certbot's /etc/letsencrypt, you can wind up with shared Let's Encrypt accounts or stranded TLS certificates, and you may want to straighten this out. Certbot doesn't really document how accounts are set up and how they connect to certificates, that I've seen, so here are notes on the pragmatic bits I've had to work out.

In theory, starting from Certbot 1.23 you can find out information about your accounts with 'certbot show_account'. In practice, Ubuntu 22.04 LTS still has Certbot 1.21, and show_account doesn't show you one critical piece of information, namely Certbot's local identifier for the account. So instead you have to look under /etc/letsencrypt, where in accounts/acme-v02.api.letsencrypt.org/directory/ you will find one subdirectory per production LE account you have. Each account (ie subdirectory) has a name that's 32 hex digits, which is Certbot's (internal) name for this account. In each account's subdirectory, the meta.json will give you some basic information about the account, currently the creation date and hostname, although not necessarily the email address associated with it (which 'certbot show_account' can retrieve from Let's Encrypt).
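
(Concretely, something like this will show you what's there; the path is the one for production Let's Encrypt accounts, as above:)

# One 32-hex-digit subdirectory per production LE account.
ls /etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory/

# The creation date and hostname recorded for each account.
cat /etc/letsencrypt/accounts/acme-v02.api.letsencrypt.org/directory/*/meta.json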

Issued TLS certificates aren't directly tied to a Let's Encrypt account by Certbot. Instead, what's tied to the account is the renewal. Each TLS certificate has a /etc/letsencrypt/renewal/<name>.conf file, and one of the things listed in each file is the account that Certbot will try to use to renew the certificate:

# Options used in the renewal process
[renewalparams]
account = baf3e1c5a7[...]
authenticator = standalone
[...]

If the account isn't found under /etc/letsencrypt/accounts at renewal time, Certbot will fail with an error. To change the account used for renewal, you just edit the 'account =' line, which is where you really want to know the Certbot account name (those 32 hex digits) of the right account. As far as I know there is no Certbot command to do this by itself, although possibly if you re-request a TLS certificate for the names, Certbot will update the configuration file to use the account you have available.

If you have more than one Certbot account on a host (for example because you merged a locally created /etc/letsencrypt with one from another server), Certbot commands like 'certbot certonly' will pause to ask you what account to use (presenting you with useful information about each account, so you can make a somewhat informed choice). If this is annoying to you, you need to remove all but one account and then make sure all of your TLS certificates are being renewed by that account, generally by editing their files in /etc/letsencrypt/renewal.
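
(A quick way to see which renewals are tied to which account is to look at the renewal configuration files directly, for example with something like:)

grep '^account' /etc/letsencrypt/renewal/*.conf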

(I understand why Certbot is this way, but I wish there was a 'certbot fixup' command that would just do all of these updates for you. Along with a Certbot command specifically to change the status of certificate renewal between 'standalone' and 'webroot'. It would make life simpler for system administrators, or at least us.)

LetsEncryptCertbotAccounts written at 21:44:43

2023-01-15

Some weird effects you can get from shared Let's Encrypt accounts

Recently on the Fediverse I said:

How Let's Encrypt (well, the ACME protocol) handles proving 'you' have control over a domain when you request a TLS certificate involving it has some interesting potential effects if you move your websites from one machine to another.

(Authorization is tied to a LE account, not a host, and at least used to last for 60 days. If a LE account is shared between hosts, all of them are fishing in the same authorization pool, regardless of who has the website.)

(Per Matthew McPherrin, how long authorizations last is now down to 30 days and may drop further.)

To get TLS certificates from Let's Encrypt, you must create and register an 'account', which is really a keypair and some associated information. The normal practice is to have a separate LE account for each machine that you use to get TLS certificates, and I think this is a good idea, because authorization to issue TLS certificates for a given name is tied to the account, not to a host. If you move a (HTTPS) website from one host to another, there are two interesting effects that can happen.

(I'm supposing here that the old server still keeps operating with some websites.)

Sometimes the easiest approach is to copy /etc/letsencrypt (or the equivalent if you're not using Certbot) from your old multi-website server to the new server that will have some of those websites. This results in your two hosts sharing the same account, which means that each of them can probably get TLS certificates for the merged collection of websites. If you moved website X from host A to host B, host A can't directly get authorized for X any more, but once B gets the shared account authorized, A will be able to renew its TLS certificate for X. If renewals for the TLS certificates are close enough (as they probably will be if you just copied the TLS certificates over), A may fail an initial authentication attempt, but then B will do one and A's attempt to renew the TLS certificate will now work.

(You can also have weirder situations. For example, suppose that you're using the DNS authorization method, and host A has the appropriate permissions to trigger the necessary DNS updates but host B doesn't. In this case it's host B's renewal that may initially fail, then succeed the next time around, after A has authorized the shared account even though it no longer holds the website.)

If you have different Let's Encrypt accounts on the two hosts, you can still potentially have some issues. Back in the days of extended authorization periods (60 days or more), you could get a situation where the old host might be able to automatically renew the moved TLS certificate once, because its authorization was still (just) valid. These days that's harder to contrive in an automatic renewal, but with the right timing you might be able to manually renew or re-issue the TLS certificate for the moved website on the old host for a while after the move.

(For instance, suppose that you force an out-of-cycle certificate renewal just before the move, which as a side effect gets you a 30-day authorization on the old host.)

Both of these cases are of interest to system administrators because they make it look like things are working when in fact they aren't. You just won't find out about the problems for a little while longer (or you're passing the problems off as 'LE renewal glitches, nothing to worry about since they cure themselves in a few days'). Generally if you have separate Let's Encrypt accounts you're more likely to notice real problems, whether those are that host A is still trying to renew the TLS certificate for the moved website or that host B doesn't have the necessary setup to actually prove control.

LetsEncryptSharedAccountEffects written at 22:34:18
