We rebooted all of our servers remotely (more or less) and it all worked
Even under normal circumstances, we don't routinely reboot our Linux servers. Reboots are disruptive to our users (especially to the people who are logged in to the servers that reboot), and local policies require us to schedule an after-hours downtime for large scale user visible things like this, which is disruptive to our lives. We do reboot them periodically, either for significant enough Ubuntu kernel security issues or just because we want to get them back on up to date kernels. However, all of this is under normal circumstances, when we are actually physically in the office to deal with machines that fail to go down or come up cleanly.
The current situation is not normal. We've been out of the office since early March, and even in March it had been rather a while since our systems were last rebooted (through the magic of our Prometheus metrics and dashboards, I can tell you that at the end of March 12th, the last day I was in the office, most of our systems had been up for about 259 days). Since we were out of the office, we didn't even think about rebooting for a very long time, and by early September many of our machines had been up for over 400 days without a reboot. Things reached a critical point and we (by which I mean my co-workers, as I was on vacation) decided that we should take the risk to reboot everything, while taking some steps to mitigate the risks for very important machines.
(Said steps being that the reboot of those machines was scheduled for early morning, when a co-worker who is an extreme morning person would stop by the office.)
What happened was, well, nothing. Everything rebooted quietly, everything came back up again without problems, and I believe that the co-worker in the office didn't need to do anything. The less user visible machines that we rebooted beforehand all worked, the user visible machines that we rebooted during the downtime worked too, and none of our fears came to pass.
(Well, we did discover a machine or two with odd BIOS settings that caused problems, but they weren't particularly user visible machines; they were generic machines in our SLURM cluster.)
It would be nicer to have remote power control and a KVM over IP setup for all of our machines, so that we could deal with everything from home; that would make reboots almost completely risk free (and an unexpected hardware failure is hard to deal with during a scheduled downtime anyway). But it's reassuring to have a positive experience even just with the basics. It will also probably encourage us to do it again, sooner than last time around.
(Also, it's nice for a potentially risky operation to just work. A quiet day is its own reward, and sometimes our small successes deserve a little celebration.)
Using SPF on HELO/EHLO hostnames is repurposing SPF to validate a different thing
Back in June I discovered that in theory we should have SPF records for EHLO hostnames too. The conventional explanation for this (apart from 'big email providers say so', the usual reason to do anything in modern SMTP) comes from, for example, this writeup of small mailserver best current practices, and goes like this (I'm paraphrasing):
People use SPF to validate the envelope sender domain (the SMTP MAIL FROM). However, when you send a bounce, it has a null sender and thus no sender domain to use for SPF checks. So the sender domain is taken from the EHLO hostname, for lack of a better place to get it from (since there is no SMTP level 'the bounce claims to have been sent by X domain' information to be had, although this is commonly in the message headers).
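As a concrete illustration of why bounces pose this problem, here is a sketch of the start of a bounce delivery at the SMTP level (the hostnames are hypothetical):

```
EHLO mailhost.example.org
MAIL FROM:<>
RCPT TO:<someone@example.com>
```

Since the MAIL FROM is empty, the only domain the receiving server can see at this point is 'mailhost.example.org' from the EHLO, so that is what gets fed into the SPF check.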
This is of course kind of bogus. What is really happening here is that receiving mail servers are attempting to validate that the EHLO/HELO hostname itself is not forged and are using SPF for this purpose. This is a complete repurposing of SPF, which we can see since 'Sender' is right in the name 'Sender Policy Framework' and there's no 'sender' of the bounce that is visible at the SMTP level (and no entirely standard way that it's visible in the mail headers, either).
There are some lessons here for email related 'standards' and in general any Internet standards, which I can summarize this way: if there's a hole that people think needs filling, any nearby peg will get hammered into it regardless of what the peg was originally designed for.
PS: This elaborates on a recent tweet of mine that was sparked by adding SPF DNS records for our EHLO hostnames (and writing the official explanation of the change for our records).
Sidebar: This explanation and RFC 7208
RFC 7208 says two things in section 2.3, The "HELO" Identity:
- it's RECOMMENDED that you check the HELO identity all the time (but carefully), and that you do so before checking MAIL FROM.
- if the envelope sender is the null sender, the message is presumed to come from 'postmaster@<HELO name>' and this is used as the MAIL FROM to check (even if you already checked the HELO identity).
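The second rule is simple enough to sketch in a few lines of Python (a paraphrase of the RFC's behavior, not code from any real SPF library):

```python
def spf_mail_from_identity(mail_from, helo):
    """Return the identity an RFC 7208 verifier uses for the
    MAIL FROM check: the envelope sender if there is one, or
    'postmaster@<HELO name>' for a null-sender message such
    as a bounce."""
    if mail_from:
        return mail_from
    return "postmaster@" + helo
```

For a bounce from a (hypothetical) 'mailhost.example.org', this gives you 'postmaster@mailhost.example.org' as the identity to check, which is why the machine's EHLO hostname needs an SPF record.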
I haven't gone through RFC 7208's section on doing SPF checks to see if it treats HELO and MAIL FROM checks somewhat differently in its algorithm, because frankly I'm not interested enough.
This means that RFC 7208 itself is a superset of the conventional explanation I summarized above. In RFC 7208, you have SPF records for your EHLO hostnames both because of bounces and because receivers are recommended to check them all the time. This implies that all of your mail sending machines should have SPF records, not just the ones that can send bounces.
(Now that I've looked that up, I may need to update some of our DNS records. Again.)