Wandering Thoughts archives

2015-03-16

Solving our authenticated SMTP problem by rethinking it

Part of our mail system is a mail submission machine. Perhaps unlike many places, this machine has never done authenticated SMTP and as a result has never accepted connections from the outside world; to use it, you have to be 'inside' our network, either directly or by using our VPN (and at that point it just accepts your email). Recently this has become more and more of a pain point for our users, as it becomes increasingly common for devices to move between inside and outside (for example, smartphones).

Unfortunately, one reason we haven't supported authenticated SMTP before now is that it's non-trivial to add to our mail submission machine. There are two tricky aspects. The first is that as far as we can see, any easy method to add authentication support to our Exim configuration requires that our mail submission machine be rebuilt to carry our full /etc/passwd (and /etc/shadow). The second is that the mail submission machine still has to support unauthenticated SMTP from internal machines; among other things, all of our servers use it as their smarthost. This requires a somewhat messy and complex Exim configuration, and it requires being absolutely sure that we're reliably telling internal machines apart from external machines and not accidentally allowing external machines to use us without SMTP authentication (because that would make one of our most crucial mail machines into an open relay and get it blacklisted).

(Right now the mail submission machine has a strong defense in that our external firewall simply doesn't allow outside people to connect to it. It has its own access guard just in case, but its accuracy is less important. In the new world we'd have to open up access on the firewall and then count on its Exim configuration to do all the work.)
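To make the risk concrete, here is a minimal sketch in Python of the decision a combined configuration would have to get right. This is not our Exim configuration; the function names and the address ranges are made up for illustration, and a real setup would express this in Exim ACLs instead.

```python
import ipaddress

# Hypothetical "inside" networks, using private and documentation ranges.
# These are placeholders, not our real address space.
INTERNAL_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.0.2.0/24"),
]


def is_internal(client_ip, nets=INTERNAL_NETS):
    """Return True if client_ip falls inside one of the internal networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in nets)


def may_submit_combined(client_ip, authenticated, nets=INTERNAL_NETS):
    """Policy for one machine serving both roles.

    Internal hosts may relay without authentication; everyone else must
    authenticate. A mistake in the network list turns this into an open relay.
    """
    return authenticated or is_internal(client_ip, nets)


def may_submit_auth_only(authenticated):
    """Policy for a machine that only ever does authenticated SMTP.

    The client address never matters, so there is no network list to get wrong.
    """
    return authenticated
```

The second function is the whole appeal of an authentication-only machine: the decision no longer depends on correctly classifying where a connection came from.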

Exim can use a local Dovecot instance for authentication, but that doesn't help the mail submission machine directly; to run a local Dovecot that did useful authentication, we'd still need a full local /etc/passwd et al. But then we had a brainwave: we already have a Dovecot-based IMAP server.

Rather than try to modify the mail submission machine's Exim configuration to add authenticated SMTP for some connections, we can turn the problem around and do it on the IMAP server instead. The IMAP server already has Dovecot and our full /etc/passwd; all it needs is to have Exim added with a configuration that only does authenticated SMTP. Sure, we wind up with two mail submission machines, but this way we don't have to mix the two somewhat different mail submission roles and we get a much simpler change to our existing machines. People also get a somewhat simpler IMAP client configuration (and one that's probably more normal), since now their (outgoing) mail server will be the same as their IMAP server.

(The actual Exim configuration on our IMAP server can be just a slight variation on the existing mail submission Exim configuration. Insisting on SMTP authentication all the time is an easy change.)

As a side benefit, testing and migration are going to be pretty easy. Nothing is trying to talk SMTP to the IMAP server today, so we can transparently add Exim there and then have people try out using it as their (outgoing) mail server. If something goes wrong, the regular mail submission machine is completely unaltered and people can just switch back.
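For a quick test without reconfiguring a real mail client, something like the following Python sketch could be used. It assumes things the plan above doesn't specify: that the new Exim listens on the submission port (587) and offers STARTTLS. The host name, addresses, and function names are all placeholders.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_test_message(sender, recipient):
    """Build a small plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Authenticated SMTP test"
    msg.set_content("Sent through the IMAP server's authenticated SMTP.")
    return msg


def send_authenticated(host, user, password, msg, port=587):
    """Submit msg via SMTP with STARTTLS and SMTP AUTH.

    Port 587 and STARTTLS are assumptions about the eventual setup.
    Raises an smtplib exception if TLS, login, or submission fails.
    """
    context = ssl.create_default_context()
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.starttls(context=context)
        smtp.login(user, password)
        smtp.send_message(msg)


# Example use (placeholder host and account, not run here):
#   msg = build_test_message("user@example.org", "user@example.org")
#   send_authenticated("imap.example.org", "user", "secret", msg)
```

If this fails, nothing else is affected; the test only ever talks to the IMAP server, and the existing mail submission machine is not involved.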

sysadmin/AuthenticatedSMTPOurWay written at 23:41:40

Our difficulties with OmniOS upgrades

We are not current on OmniOS and we've been having problems with it. At some point, well meaning people are going to suggest that we update to the current release version with the latest updates and mention that OmniOS makes this really quite easy with beadm and boot environments. Well, yes and no.

Yes, mechanically (as far as I know) OmniOS package updates and even release version updates are easy to do and easy to revert from. Boot environments and snapshots of them are a really nice thing and they enable relatively low-risk upgrades, experiments, and so on. Unfortunately the mechanics of an upgrade are in many ways the easy part. The hard part is that we are running (unique) production services that are directly exposed to users. In short, users very much notice if one of our servers goes down or doesn't work right.

The first problem is that this makes reboots noticeable and since they're noticeable they have to be scheduled. Kernel and OmniOS release updates both require reboots (in fact I believe you really want to reboot basically immediately after doing them), which means pre-scheduled, pre-announced downtimes that are set up well in advance.

The second problem is that we don't want to put something into production and then find out that it doesn't work or that it has problems. This means updating is not as simple as updating the production server at a scheduled downtime; instead we need to put the update on a test server and then try our best to fully test it (both for load issues and to make sure that important functionality like our monitoring systems still works). This is not a trivial exercise; it's going to consume time, especially if we discover potential issues.

The final problem is that changes increase risk as well as potentially reducing it. Our testing is not and cannot be comprehensive, so applying an update to the production environment risks deploying something that will actually be worse than what we have now. The last thing we need is for our current fileservers to get worse than they are now. This means that even considering updates involves a debate over what we're likely to get versus the risks we're taking on, one in which we need to persuade ourselves that the improvements in the update are worth taking on the risks to a core piece of our infrastructure.

(In an ideal world, of course, an update wouldn't introduce new bugs and issues. We do not live in that world; even if people try to avoid it, such things can slip through.)

PS: Obviously, people with different infrastructure will have different tradeoffs here. If you can easily roll out an update on some production servers without anyone noticing when they're rebooted, monitor them in live production, and then fail them out again immediately if anything goes wrong, an OmniOS update is easy to try out as a pilot test and then either apply to your entire fleet or revert back from if you run into problems. This gets into the cattle versus pets issue, of course. If you have cattle, you can paint some of them pink without anyone caring very much.

solaris/OmniOSUpgradeDifficulties written at 00:35:39
