Thinking through issues a mail client may have with SMTP-time rejections
In response to my entry on who holds problematic email, Evaryont left a comment lamenting the general 'don't trust the user's mail client' approach of mail submission servers accepting more or less everything from clients and bouncing things later. This has prompted me to try to think about the issues involved, so today you get this entry in reaction.
First, the basics. We already do our best to verify SMTP sender addresses and immediately reject bad ones, for good reasons, and I hope that there's general agreement on this for sender addresses, so this is only about how we (and other people) should react to various sorts of bad destination addresses (SMTP RCPT TOs). For 'bad' destination addresses, there are two different cases; sometimes you know that the destination address is bad and could give a permanent SMTP rejection, and sometimes you can't verify the destination address and would give a SMTP temporary failure (a 4xx deferral).
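As a rough sketch of the two cases, the first digit of the SMTP reply to each RCPT TO is what separates them: 2xx is accepted, 4xx is a temporary deferral, and 5xx is a permanent rejection. The function name and the sample addresses and reply codes here are made up for illustration; a real mail client would get the codes from its SMTP library.

```shell
#!/bin/sh
# classify_rcpt_reply CODE
# Hypothetical helper: map an SMTP reply code for a RCPT TO onto the
# broad case it falls into.
classify_rcpt_reply() {
    case "$1" in
        2[0-9][0-9]) echo "accepted" ;;   # recipient is fine
        4[0-9][0-9]) echo "deferred" ;;   # temporary failure, retry later
        5[0-9][0-9]) echo "rejected" ;;   # permanent failure, known bad
        *)           echo "unknown"  ;;
    esac
}

# Made-up replies for three recipients of one message.
for pair in "a@example.org:250" "b@example.org:451" "c@example.org:550"; do
    addr=${pair%%:*}
    code=${pair##*:}
    printf '%s %s\n' "$addr" "$(classify_rcpt_reply "$code")"
done
```

Everything after this classification (what to show the user, what to queue and retry) is the hard part that the rest of this entry is about.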
For permanent rejections, the question is whether the user's mail client will do a better job of telling them about these bad addresses than your bounce message does. This is a straightforward question of UI design (and whether the mail client expects rejection at all, which it really should and these days almost certainly does). In theory a mail client can do a substantially better job of helping the user deal with bad addresses than a bounce can; for example, the mail client could immediately abort message submission entirely, report to the user 'these addresses are bad, I have marked them in red for you', and give the user the opportunity to correct or remove them before (re-)sending the message. In practice a bounce may give the user a better record of the failures than, say, a temporary popup dialog box about 'these addresses failed' that gives them no way to record the information or react to it.
(Correcting or removing the bad addresses before the message is sent at all is an overall better experience for everyone involved in the email; consider future replies to all addresses, for example. Bounces are also much less convenient for correcting bad addresses and resending your message, since there's no straightforward path from the bounce to sending a new copy of the original to corrected addresses.)
For temporary deferrals, things get a lot more complicated in both mail client handling and UI design. Some temporary deferrals will be cured in time; if they are pushed back to the client, the client must maintain its own queue of messages and addresses to be re-submitted, manage scheduling further delivery attempts, and decide on warning messages after sufficient time. For many clients this is going to be complicated by intermittent and perhaps variable connectivity (where you're on the Internet but you can't talk to the original mail submission server). Some temporary deferrals will never be cured in time, and for them the client also has to eventually give up and somehow present this to the user to do something (alternately, the client can just let the user keep trying endlessly until the user themselves clicks a 'stop trying, give up' button). Notifying the user at all about initial temporary deferrals is potentially a bad idea, especially with an intrusive alert; unlike permanent rejections, this is not something the user really needs to deal with right away.
(The mail client could also immediately abort message submission when it gets temporary deferrals and give the user a chance to change or remove addresses, but it's not clear that this is the right choice. There are a lot of things that can cause curable temporary deferrals (in fact some may be cured in seconds, when DNS results finally show up and so on), and you probably don't want to not send your message to such addresses.)
Reliably maintaining queues and handling retries is fairly complicated, especially for a mail client that may only be run intermittently and have network connectivity only some of the time. My guess is that mail servers are probably in a much better position to do this most of the time, and for temporary deferrals that will be rapidly cured (for example, ones caused by slow to respond DNS servers) a mail server will probably get the message delivered sooner. Also, when the mail client is the one to handle temporary deferrals, it's going to wind up having to send (much) more data over its connection, especially if the message has multiple temporarily deferred destinations and they cure themselves at different times. Having the server handle all retries means that the server holds the message and the mail client only has to upload it to the server once.
(On mobile devices these extra message submissions are also going to burn more battery power, as Aristotle Pagaltzis noted in the context of excessive web page fetching in a comment on this recent entry.)
One tradeoff in email system design is who holds problematic email
When you design parts of a mail system, for example a SMTP submission server that users will send their email out through or your external MX gateway for inbound email, you often face a choice of whether your systems should accept email aggressively or be conservative and leave email in the hands of the sender. For example, on a submission server should you accept email from users with destination addresses that you know are bad, or should you reject such addresses during the SMTP conversation?
In theory, the SMTP RFCs combined with best practices give you an unambiguous answer; here, the answer would be that clearly the submission server should reject known-bad addresses at SMTP time. In practice things are not so simple; generally you want problematic email handled by the system that can do the best job of dealing with it. For instance, you may be extremely dubious about how well your typical mail client (MUA) will handle things like permanent SMTP rejections on RCPT TO addresses, or temporary deferrals in general. In this case it can make a lot of sense to have the submission machine accept almost everything and sort it out later, sending explicit bounce messages to users if addresses fail. That way at least you know that users will get definite notification that certain addresses failed.
A similar tradeoff applies on your external MX gateway. You could insist on 'cut-through routing', where you don't say 'yes' during the initial SMTP conversation until the mail has been delivered all the way to its eventual destination; if there's a problem at some point, you give a temporary failure and the sender's MTA holds on to the message. Or you could feel it's better for your external MX gateway to hold inbound email when there's some problem with the rest of your mail system, because that way you can strongly control stuff like how fast email is retried and when it times out.
Our current mail system (which is mostly described here) has generally been biased towards holding the email ourselves. In the case of our user submission machines this was an explicit decision because at the time we felt we didn't trust mail clients enough. Our external MX gateway accepted all valid local destinations for multiple reasons, but a sufficient one is that Exim didn't support 'cut-through routing' at the time so we had no choice. These choices are old ones, and someday we may revisit some of them. For example, perhaps mail clients today have perfectly good handling of permanent failures on RCPT TO addresses.
(An accept, store, and forward model exposes some issues you might want to think about, but that's a separate concern.)
(We haven't attempted to test current mail clients, partly because there are so many of them. 'Accept then bounce' also has the benefit that it's conservative; it works with anything and everything, and we know exactly what users are going to get.)
In praise of uBlock Origin's new 'element zapper' feature
The purpose of the element zapper is to quickly deal with the removal of nuisance elements on a page without having to create one or more filters.
uBlock Origin has always allowed you to permanently block page elements, and a while back I started using it aggressively to deal with the annoyances of modern websites. This is fine and works nicely, but it takes work. I have to carefully pick out what I want to target, maybe edit the CSS selector uBlock Origin has found, preview what I'm actually going to be blocking, and then I have a new permanent rule cluttering up my filters (and probably slightly growing Firefox's memory usage). This work is worth it for things that I'm going to visit regularly, but some combination of the amount of work required and the fact that I'd be picking up a new permanent rule made me not do it for pages I was basically just visiting once. And usually things weren't all that annoying.
Enter Medium and their obnoxious floating sharing bar at the bottom of pages. These things can be blocked on Medium's website itself with a straightforward rule, but the problem is that tons of people use Medium with custom domains. For example, this article that I linked to in a recent entry. These days it seems like every fourth article I read is on some Medium-based site (I exaggerate, but), and each of them has the Medium sharing bar, and each of them needs a new site-specific blocking rule unless I want to globally block all <div>s with the class js-stickyFooter (at least until Medium changes the name).
(Globally blocking such a <div> is getting really tempting, though. Medium feels like a plague at this point.)
The element zapper feature deals with this with no fuss or muss. If I wind up reading something on yet another site that's using Medium and has their floating bar, I can zap it away in seconds. The same is true of any number of floating annoyances. And if I made a mistake and my zapping isn't doing what I want, it's easy to fix; since these are one-shot rules, I can just reload the page to start over from scratch. This has already started encouraging me to do away with even more things than before, and just like when I started blocking elements, I feel much happier when I'm reading the resulting pages.
(Going all the way to using Firefox's Reader mode is usually too much of a blunt hammer for most sites, and often I don't care quite that much.)
PS: Now that I think about it, I probably should switch all of my per-site blocks for Medium's floating bar over to a single '##div.js-stickyFooter' block. It's unlikely to cause any collateral damage and I suspect it would actually be more memory and CPU efficient.
(And I should probably check over my personal block rules in general, although I don't have too many of them.)
Links: Git remote branches and Git's missing terminology (and more)
Mark Jason Dominus's Git remote branches and Git's missing terminology (via) is about what it sounds like. This is an issue of some interest to me, since I flailed around in the same general terminology swamp in a couple of recent entries. Dominus explains all of this nicely, with diagrams, and it helped reinforce things in my mind (and reassure me that I more or less understood what was going on).
He followed this up with Git's rejected push error, which covers a 'git push' issue with the same thoroughness as his first article. I more or less knew this stuff, but again I found it useful to read through his explanation to make sure I actually knew as much as I thought I did.
My situation with Twitter and my Firefox setup (in which I blame pseudo-XHTML)
Although it is now a little bit awkward to do this, let's start with my tweet:
Twitter does this with a <noscript> meta-refresh, for example:
<noscript><meta http-equiv="refresh" content="0; URL=https://mobile.twitter.com/i/nojs_router?path=%2Fthatcks%2Fstatus%2F877738130656313344"></noscript>
Because I block JavaScript in Firefox (via NoScript), Twitter included, my Firefox acts on this <noscript> block. What is supposed to happen here is that you wind up on the mobile version of the tweet and then just sit there with things behaving normally. In my development tree Firefox, the version of this page that I get also contains another <noscript> meta-refresh:
<noscript><meta content="0; URL=https://mobile.twitter.com/i/nojs_router?path=%2Fthatcks%2Fstatus%2F877738130656313344" http-equiv="refresh" /></noscript>
This is the same URL as the initial meta-refresh, and so Firefox sits there going through this cycle over and over and over again, and in the mean time I see no content at all, not even the mobile version of the tweet.
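To make the cycle concrete, here is a small sketch that pulls the refresh target out of each of the two <noscript> blocks quoted above and compares them. The function name is hypothetical and the pattern matching is deliberately crude (it assumes the meta tag directly follows <noscript>, as it does in Twitter's HTML here, and it relies on 'grep -o', which GNU and BSD grep both have); it is not a general HTML parser.

```shell
#!/bin/sh
# Print the URL that a <noscript> meta-refresh in the HTML on stdin
# points to, regardless of the order of the meta tag's attributes.
noscript_refresh_target() {
    grep -o '<noscript><meta[^>]*>' |
        grep -i 'http-equiv="refresh"' |
        sed -n 's/.*[Uu][Rr][Ll]=\([^"]*\)".*/\1/p' |
        head -n 1
}

# The two blocks as quoted in this entry.
first_html='<noscript><meta http-equiv="refresh" content="0; URL=https://mobile.twitter.com/i/nojs_router?path=%2Fthatcks%2Fstatus%2F877738130656313344"></noscript>'
second_html='<noscript><meta content="0; URL=https://mobile.twitter.com/i/nojs_router?path=%2Fthatcks%2Fstatus%2F877738130656313344" http-equiv="refresh" /></noscript>'

first_target=$(printf '%s\n' "$first_html" | noscript_refresh_target)
second_target=$(printf '%s\n' "$second_html" | noscript_refresh_target)

if [ -n "$first_target" ] && [ "$first_target" = "$second_target" ]; then
    echo "refresh cycle: both pages send you to $first_target"
fi
```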
In other environments, such as Fedora 25's system version of Firefox 54, Lynx, and wget, the mobile version of the tweet is a page without the circular meta-refresh. At first this difference mystified me, but then I paid close attention to the initial HTML I was seeing in the page source. Here is the start of the broken version:
<!DOCTYPE html> <html dir="ltr" lang="en"> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0" /> <noscript>[...]
(I suspect that this is HTML5.)
And here is the start of the working version:
<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.1//EN" "http://www.openmobilealliance.org/tech/DTD/xhtml-mobile11.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> [... much more verbiage ...]
Although this claims to be some form of XHTML in its declarations, Twitter is serving this with a Content-Type of text/html, which makes it plain old HTML soup as far as Firefox is concerned (which is a famous XHTML issue).
What I don't understand is why Twitter serves HTML5 to me in one browser and pseudo-XHTML to me in another. As far as I can tell, the only significant thing that differs here between the system version of Firefox and my custom-compiled one is the User-Agent (and in particular both are willing to accept XHTML). I can get Twitter to serve me HTML5 using wget, but it happens using either User-Agent string:
wget -O - --user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0' https://mobile.twitter.com/thatcks/status/877738130656313344 | less
Sidebar: How I worked around this
Initially I went on a long quest to try to find an extension that would turn this off or some magic trick that would make Firefox ignore it (and I failed). It turns out that what I need is already built into NoScript; the Advanced settings have an option for 'Forbid META redirections inside <NOSCRIPT> elements', which turns off exactly the source of my problems. This applies to all websites, which is a bit broader of a brush than would be ideal, but I'll live with it for now.
(I may find out that this setting breaks other websites that I use, although I hope not.)
Why we're not running the current version of Django
We have a small web app that uses Django (or is based on Django, depending on your perspective). As I write this, the latest version of Django is 1.11.2, and the 1.11 series of releases started back in April. We're running Django 1.10.7 (I'm actually surprised we're that current) and we're probably going to keep on doing that for a while.
(Now that I'm looking at this, I see that Django 1.10 will get security fixes through December (per here), so we'll probably stick with it until the fall. Summer is when new graduate students show up, which means that it's when the app is most important and most heavily used.)
On the one hand, you might ask what's the problem with this. On the other hand, you might ask why we haven't updated. As it turns out, both questions wind up in the same place. The reason I don't like being behind on significant Django version updates is that the further behind we get, the more work we have to do all at once to catch up with Django changes (not all of which are clear and well documented, and some of which are annoying). And the reason we're behind is exactly that Django keeps changing APIs from version to version (including implicit APIs).
The net result is that Django version updates are not drop in things. Generally each of them is a not insignificant amount of work and time. At a minimum I have to read a chunk of the release notes very carefully (important things are often mentioned in passing, if at all), frequently I need code changes, and sometimes Django deprecates an API in a way that leaves me doing a bunch of research and programming to figure out what to replace it with (my notes say that this is the case for 1.11). Our small web app otherwise needs basically no attention, so Django version updates are a real pain point.
At the same time, I've found out the hard way that if I start delaying this pain for multiple version updates, it gets much worse. For a start, I fast-forward through any opportunity to get deprecation warnings; instead a feature or an API can go straight from working (in our current version) to turned off. I also have to absorb and understand all of the changes and updates across multiple versions at once, rather than getting to space them out bit by bit. So on the whole it goes better to go through every Django version.
I don't expect this to ever change. I don't think the Django developers have any sort of policy about how easy it should be to move from, say, one Django long term support release to another. And in general I don't expect them to ever get such a policy that would significantly slow them down. The Django developers clearly like the freedom to change their APIs relatively fast, and the overall Django community seems to be fine with it. We're outliers.
(This issue is one reason to not stick with the version of Django that's supplied by Ubuntu. Even Ubuntu 16.04 only packages Django 1.8. I really don't want to think about what it would be like to jump forward four years in Django versions when we do our typical 'every second Ubuntu LTS release' update of the web server where our Django app runs. Even Ubuntu 14.04 to Ubuntu 16.04 is a jump from Django 1.6 to Django 1.8 all at once.)
The oddity of CVE-2014-9940 and the problem of recognizing kernel security patches
Let's start with my tweet:
Do I want to know why a reported and disclosed in 2017 Linux kernel vulnerability has a 2014 CVE number? Probably not.
Today, Ubuntu came out with USN-335-1, a security advisory about their Ubuntu 14.04 LTS kernel. Among the collection of CVEs fixed was one that caught my eye, CVE-2014-9940. This was simply because of the '2014' bit, which is normally the year of the CVE. At first I thought this might be Ubuntu's usual thing where they sometimes repeat old, long-patched issues in their update announcements, but no; as far as I can tell this is a new issue. Ubuntu's collection of links led to the May Android security bulletin, which says that CVE-2014-9940 was only reported on February 15th, 2017.
(I think that the Android security bulletin is the first report.)
So where does the 2014 come from? That's where I wound up looking more closely at the kernel commit that fixes it:
Author: Seung-Woo Kim
Date: Thu Dec 4 19:17:17 2014 +0900
regulator: core: Fix regualtor_ena_gpio_free not to access pin after freeing
After freeing pin from regulator_ena_gpio_free, loop can access the pin. So this patch fixes not to access pin after freeing.
This fix was authentically made between 3.18-rc1 and 3.19-rc1, and so it appears that the CVE number was assigned based on when the fix was made, not when the issue was reported.
The next question is why it took until 2017 for vendors using old kernels to patch them against this issue. Although I don't know for sure, I have a theory, namely that this simply wasn't recognized as a security vulnerability until early this year. Many fixes go into every kernel version, far too many to backport them all, so Linux distributions have to pick and choose. Naturally distributions grab security fixes, but that requires everyone involved to actually recognize that what they've written is a security fix. I rather suspect that back in 2014, no one realized that this use-after-free issue was an (exploitable) vulnerability.
It's interesting that this seems to have been reported around the time of CVE-2017-6074, where I started to hear that use-after-free issues in the kernel were increasingly exploitable. I wonder if people went trawling through kernel changelogs to find 'fixed use-after-free issue' changes like this one, then did some digging to see if the issues could be weaponized into vulnerabilities and if any currently used older kernels (such as Android kernels and old Ubuntu LTSes) had missed picking up the patches.
(If people are doing this sort of trawling, I suspect that we can expect a whole series of future CVEs on similar issues.)
If I'm right here, the story of CVE-2014-9940 makes for an excellent example of how not all security fixes are realized to be so at the time (which makes it hard to note them as security fixes in kernel changelogs). As this CVE demonstrates, sometimes it may take more than two years for someone to realize that a particular bugfix closes a security vulnerability. Then everyone with old kernels gets to scramble around to fix them.
(By the way, the answer to this is not 'everyone should run the latest kernel'. Nor is it 'the kernel should stop changing and everyone should focus on removing bugs from it'. The real answer is that there is no solution because this is a hard problem.)
Plan for manual emergency blocks for your overall mail system
Last year, I wrote about how your overall anti-spam system should have manual emergency blocks. At the time I was only thinking about incoming spam, but after some recent experiences here, let me extend that and say that all entry points into your overall mail system should have emergency manual blocks. This isn't just about spam or bad mail from the outside, or preventing outgoing spam, although those are important things. It's also because sometimes systems just freak out and explode, and when this happens your mail system can get deluged as a result. Perhaps a monitoring system starts screaming in email, sending thousands of messages over a short span of time. Perhaps someone notices that a server isn't running a mailer and starts it, only to unleash several months worth of queued email alerts from said server. Perhaps some outside website's notification system malfunctions and deluges some of your users (or many of them) with thousands of messages.
(There are even innocent cases. Email between some active upstream email source (GMail, your organization's central email system, etc) and your systems might have clogged up, and now that the clog has been cleared the upstream is trying to unload all of that queued email on you as fast as it can. You may want some mechanisms in place to let you slow down that incoming flood once you notice it.)
We now have an initial set of blocks, but I'm not convinced that they're exactly what you should have; our current blocks are partly a reaction to the specific incidents that happened to us and partly guesswork about what we might want in the future. Since anticipating the exact form of future explosions is somewhat challenging, our guesswork is probably going to be incomplete and imperfect. Still, it beats nothing and there's value in being able to stop a repeat incident.
(Our view is that what we've built are some reasonably workable but crude tools for emergency use, tools that will probably require on the spot adjustment if and when we have to turn them on. We haven't tried to build reliable, always-on mechanisms similar to our anti-spam internal ratelimits.)
We have a reasonably complicated mail system with multiple machines running MTAs; there's our inbound MX gateway, two submission servers, a central mail processing server, and some other bits and pieces. One of the non-technical things we've done in the fallout from the recent incidents is to collect in one spot the information about what you can do on each of them to block email in various ways. Hopefully we will keep this document updated in the future, too.
(You may laugh, but previously the information was so dispersed that I actually forgot that one blocking mechanism already existed until I started doing the research to write all of this up in one place. This can happen naturally if you develop things piecemeal over time, as we did.)
Sidebar: Emergency tools versus routine mechanisms
Some people would take this as a sign that we should have always-on mechanisms (such as ratelimits) that are designed to automatically keep our mail system from being overwhelmed no matter what happens. My own view is that designing such mechanisms can be pretty hard unless you're willing to accept ones that are set so low that they have a real impact in normal operation if you experience a temporary surge.
Actually, not necessarily (now that I really think about it). It is in our environment, but that's due to the multi-machine nature of our environment combined with some of our design priorities and probably some missing features in Exim. But that's another entry.
How I'm currently handling the mailing lists I read
I recently mentioned that I was going to keep filtering aside email from the few mailing lists that I'm subscribed to, instead of returning it to being routed straight into my inbox. While I've kept to my decision, I've had to spend some time fiddling around with just how I was implementing it in order to get a system that works for me in practice.
What I did during my vacation (call it the vacation approach) was to use procmail recipes to put each mailing list into a file. I'm already using procmail, and in fact I was already recognizing mailing lists (to ensure they didn't get trapped by anti-spam stuff), so this was a simple change:
:0:
* ^From somelist-owner@...
lists/somelist
#V#$DEFAULT
This worked great during my vacation, when I basically didn't want
to pay attention to mailing lists at all, but once I came back to
work I found that filing things away this way made them too annoying
to deal with in my mail environment. Because MH
doesn't deal directly with mbox format files, I needed to go through
a whole dance with
inc and then rescanning my inbox and various
other things. It was clear that this wasn't the right way to go.
If I wanted it to be convenient to read this email (and I did), incoming mailing list messages had to wind up in MH folders. Fortunately, procmail can do this if you specify a destination that ends in '/.' (the '/.' is the magic). So:
:0:
* ^From somelist-owner@...
/u/cks/Mail/inbox/somelist/.
(This is not quite a complete implementation, because it doesn't do things like update MH's unseen sequence for the folder. If you want these things, you need to pipe messages to an MH-aware delivery command instead (nmh's rcvstore is the usual one). In my case, I actually prefer not having an unseen sequence be maintained for these folders for various reasons.)
The procmail stuff worked, but I rapidly found that I wanted some way to know which of these mailing list folders actually had pending messages in them. So I wrote a little command which I'm calling 'mlists'. It goes through my .procmailrc to find all of the MH destinations, then uses ls to count how many message files there are and reports the whole thing as:
:; mlists
+inbox/somelist: 3
:;
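For illustration, a command along these lines can be sketched in a few lines of shell. This is not the actual mlists script, just a guess at its general shape: the default paths, the choice to skip empty folders, and the assumption that every MH destination is an absolute path on a line of its own in .procmailrc are all mine.

```shell
#!/bin/sh
# Sketch of an 'mlists'-style report.
# Usage: mlists [procmailrc [MH mail directory]]
mlists() {
    rcfile=${1:-$HOME/.procmailrc}
    maildir=${2:-$HOME/Mail}
    # MH folder destinations are action lines that end in '/.'
    grep '^[[:space:]]*/.*/\.[[:space:]]*$' "$rcfile" |
    while read -r dest; do
        dir=${dest%/.}
        [ -d "$dir" ] || continue
        # MH stores each message as a file with an all-digit name.
        count=$(ls "$dir" | grep -c '^[0-9][0-9]*$' || true)
        if [ "$count" -gt 0 ]; then
            echo "+${dir#"$maildir"/}: $count"
        fi
    done
}
```

Counting only all-digit filenames means that things like subfolders don't get counted as messages.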
If there's enough accumulated messages to make looking at the folder worthwhile, I can then apply standard MH tools to do so (either from home with standard command line MH commands, or with exmh at work).
It's early days with this setup, but so far I feel satisfied. The filtering and filing stuff works and the information mlists provides is enough to be useful but sufficiently minimal to push me away from actually looking at the mailing lists for much of the time, which is basically the result that I want.
PS: there's probably a way to assemble standard MH commands to give you a count of how many messages are in a folder. I used ls because I couldn't be bothered to read through MH manpages to work out the MH way of doing it, and MH's simple storage format makes this kind of thing easy.
One reason you have a mysterious Unix file called 2 (or 1)
Suppose, one day, that you look at the ls of some directory and you notice that you have an odd file called '2' (just the digit). If you look at the contents of this file, it probably has nothing that's particularly odd looking; in fact, it likely looks like plausible output from a command you might have run.
Congratulations, you've almost certainly fallen victim to a simple typo, one that's easy to make in interactive shell usage and in Bourne shell scripts. Here it is:
echo hi >&2
echo oop >2
The equivalent typo to create a file called
1 is very similar:
might-err 2>&1 | less
might-oop 2>1 | less
('1' files created this way are often empty, although not always, since many commands rarely produce anything on standard error.)
In each case, accidentally omitting the '&' in the redirection converts it from redirecting one file descriptor to another (for example, to have echo report something to standard error) into a plain redirect-to-file redirection where the name of the file is your target file descriptor number.
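If you want to see this happen for yourself, the following sketch runs both versions in a scratch directory (made with mktemp, so nothing lands in your real directories):

```shell
#!/bin/sh
# Work in a scratch directory so the stray file is easy to see.
scratch=$(mktemp -d)
(
    cd "$scratch" || exit 1
    echo hi >&2      # correct: 'hi' goes to standard error, no file is made
    echo oop >2      # typo: creates a file named '2' containing 'oop'
    ls               # lists: 2
    cat 2            # prints: oop
)
```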
Some of the time you'll notice the problem right away because you don't get output that you expect, but in other cases you may not notice for some time (or ever notice, if this was an interactive command and you just moved on after looking at the output as it was). Probably the easiest version of this typo to miss is in error messages in shell scripts:
if [ ! -f "$SOMETHING" ]; then
    echo "$0: missing file $SOMETHING" 1>2
    echo "$0: aborting" 1>&2
    exit 1
fi
You may never run the script in a way that triggers this error condition, and even if you do you may not realize (or remember) that you're supposed to get two error messages, not just the 'aborting' one.
(After we stumbled over such a file recently, I grep'd all of my scripts for '>2' and '>1'. I was relieved not to find any.)
(For more fun with redirection in the Bourne shell, see also how to pipe just standard error.)
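A slightly pickier version of grepping scripts for '>2' and '>1' might look like the sketch below. The function name and the exact pattern are my own and it is only a heuristic; it will also flag things like a numeric comparison written as '> 2' in an awk one-liner, so the matches need a human look.

```shell
#!/bin/sh
# Heuristic: report lines where '>' is followed directly by a bare 1 or 2
# (optionally after spaces) instead of by '&1' or '&2'. Names that merely
# start with those digits, such as '>10', are not reported.
find_fd_typos() {
    grep -nE '>[[:space:]]*[12]([[:space:];|)]|$)' "$@"
}

# Usage (the path is only an example):
#   find_fd_typos ~/bin/*
```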