2015-03-16
Solving our authenticated SMTP problem by rethinking it
Part of our mail system is a mail submission machine. Perhaps unlike many places, this machine has never done authenticated SMTP and as a result has never accepted connections from the outside world; to use it, you have to be 'inside' our network, either directly or by using our VPN (and at that point it just accepts your email). Recently this has become a growing pain point for our users, as it is increasingly common for devices such as smartphones to move between inside and outside.
Unfortunately, one reason we haven't supported authenticated SMTP
before now is that it's non-trivial to add to our mail submission
machine. There are two tricky aspects. The first is that as far as we
can see, any easy method to add authentication support to our Exim
configuration requires that our mail submission machine be rebuilt
to carry our full /etc/passwd (and /etc/shadow). The second is
that the mail submission machine still has to support unauthenticated
SMTP from internal machines; among other things, all of our servers
use it as their smarthost. This requires a somewhat messy and complex
Exim configuration, and being absolutely sure that we're reliably
telling apart internal machines from external machines and not
accidentally allowing external machines to use us without SMTP
authentication (because that would make one of our most crucial
mail machines into an open relay and get it blacklisted).
(Right now the mail submission machine has a strong defense in that our external firewall simply doesn't allow outside people to connect to it. It has its own access guard just in case, but its accuracy is less important. In the new world we'd have to open up access on the firewall and then count on its Exim configuration to do all the work.)
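(To make the rule concrete, here is a minimal Python sketch of the decision a combined configuration would have to get right on every connection. This is an illustration of the logic only, not Exim configuration; the network ranges and the function name are hypothetical stand-ins, not our real setup.)

```python
import ipaddress

# Hypothetical internal ranges; substitute whatever your real networks are.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def may_relay(client_ip: str, authenticated: bool) -> bool:
    """Decide whether a connection may submit mail through us.

    Internal machines are accepted without authentication (they use us
    as a smarthost); everything else must have authenticated first.
    """
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in INTERNAL_NETWORKS):
        return True
    return authenticated
```

The dangerous failure is the first branch matching an address it shouldn't, which is exactly the open relay case; a machine that only ever does authenticated SMTP has no such branch at all.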
Exim can use a local Dovecot instance for authentication, but that
doesn't help the mail submission machine directly; to run a local
Dovecot that did useful authentication, we'd still need a full local
/etc/passwd et al. But then we had a brainwave: we already have
a Dovecot-based IMAP server.
Rather than try to modify the mail submission machine's Exim
configuration to add authenticated SMTP for some connections, we
can turn the problem around and do it on the IMAP server instead.
The IMAP server already has Dovecot and our full /etc/passwd; all
it needs is to have Exim added with a configuration that only does
authenticated SMTP. Sure, we wind up with two mail submission
machines, but this way we don't have to mix the two somewhat different
mail submission roles and we get a much simpler change to our
existing machines. People also get a somewhat simpler IMAP client
configuration (and one that's probably more normal), since now their
(outgoing) mail server will be the same as their IMAP server.
(The actual Exim configuration on our IMAP server can be just a slight variation on the existing mail submission Exim configuration. Insisting on SMTP authentication all the time is an easy change.)
As a side benefit, testing and migration are going to be pretty easy. Nothing is trying to talk SMTP to the IMAP server today, so we can transparently add Exim there and then have people try out using it as their (outgoing) mail server. If something goes wrong, the regular mail submission machine is completely unaltered and people can just switch back.
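(One check worth making before inviting people to try it is that the new Exim really does turn away sessions that never authenticate. A rough sketch of such a probe with Python's smtplib follows; the function names are made up for illustration, the host, port, and addresses you would pass are your own, and it assumes the refusal happens at MAIL FROM or RCPT TO rather than later at DATA.)

```python
import smtplib


def refuses_unauthenticated(client, sender, recipient):
    """Return True if the server rejects mail from an unauthenticated session.

    `client` is a connected smtplib.SMTP object (or anything with the same
    ehlo/mail/rcpt/rset methods). No AUTH command is ever sent.
    """
    client.ehlo()
    code, _ = client.mail(sender)
    if code != 250:
        return True  # turned down as early as MAIL FROM
    code, _ = client.rcpt(recipient)
    client.rset()  # abandon the transaction; no message is sent
    return code not in (250, 251)


def probe(host, port, sender, recipient):
    """Connect to host:port and run the check above (needs network access)."""
    with smtplib.SMTP(host, port, timeout=10) as client:
        return refuses_unauthenticated(client, sender, recipient)
```

Run from outside the network against an external recipient address, a result of False would mean the machine accepted mail for relaying without authentication, which is the one thing it must never do.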
Our difficulties with OmniOS upgrades
We are not current on OmniOS and we've been having problems with
it. At some point, well-meaning people are going to suggest that we
update to the current release version with the latest updates and
mention that OmniOS makes this really quite easy with beadm and boot
environments. Well, yes and no.
Yes, mechanically (as far as I know) OmniOS package updates and even release version updates are easy to do and easy to revert from. Boot environments and snapshots of them are a really nice thing and they enable relatively low-risk upgrades, experiments, and so on. Unfortunately the mechanics of an upgrade are in many ways the easy part. The hard part is that we are running (unique) production services that are directly exposed to users. In short, users very much notice if one of our servers goes down or doesn't work right.
The first problem is that this makes reboots noticeable and since they're noticeable they have to be scheduled. Kernel and OmniOS release updates both require reboots (in fact I believe you really want to reboot basically immediately after doing them), which means pre-scheduled, pre-announced downtimes that are set up well in advance.
The second problem is that we don't want to put something into production and then find out that it doesn't work or that it has problems. This means updating is not as simple as updating the production server at a scheduled downtime; instead we need to put the update on a test server and then try our best to fully test it (both for load issues and to make sure that important functionality like our monitoring systems still works). This is not a trivial exercise; it's going to consume time, especially if we discover potential issues.
The final problem is that changes increase risk as well as potentially reducing it. Our testing is not and cannot be comprehensive, so applying an update to the production environment risks deploying something that will actually be worse than what we have now. The last thing we need is for our current fileservers to get worse than they are now. This means that even considering updates involves a debate over what we're likely to get versus the risks we're taking on, one in which we need to persuade ourselves that the improvements in the update are worth taking on the risks to a core piece of our infrastructure.
(In an ideal world, of course, an update wouldn't introduce new bugs and issues. We do not live in that world; even if people try to avoid it, such things can slip through.)
PS: Obviously, people with different infrastructure will have different tradeoffs here. If you can easily roll out an update on some production servers without anyone noticing when they're rebooted, monitor them in live production, and then fail them out again immediately if anything goes wrong, an OmniOS update is easy to try out as a pilot test and then either apply to your entire fleet or revert back from if you run into problems. This gets into the cattle versus pets issue, of course. If you have cattle, you can paint some of them pink without anyone caring very much.