2016-06-30
What makes an email MIME part an attachment?
If you want to know what types of files your users are getting
in email, it's likely that an important
prerequisite is being able to recognize attachments in the first
place. In a sane and sensible world, this would be easy; it would
just be any MIME part with a Content-Disposition header of
attachment.
I regret to tell you that this is not a sane world. There are mail clients that
give every MIME part an inline Content-Disposition, so naturally
this means that most mail clients can't trust an inline C-D and
make their attachment versus non-attachment decisions based on other
things. (I expect that there are mail clients that ignore a C-D of
attachment, too, and will display some of those parts inline if
they feel like it, but for logging we don't care much about that.)
MIME parts may have (proposed, nominal) filenames associated with
them, from either Content-Type or Content-Disposition. However,
neither the presence nor the absence of a MIME filename determines
something's attachment status. Real attachments may have no proposed
filename, and there are mail clients that attach filenames to things
like inline images. And really, I can't argue with them; if the user
told you that this (inline) picture is mydog.jpg, you're certainly
within your rights to pass this information on in the MIME headers.
The MIME Content-Type provides at least hints, in that you can
probably assume that most mail clients will treat things with any
application/* C-T as attachments and not try to show them inline.
And if you over-report here (logging information on 'attachments'
that will really be shown inline), it's relatively harmless. It's
possible that mail clients do some degree of content sniffing, so
the C-T is not necessarily going to determine how a mail client
processes a MIME part.
(At one point web browsers were infamous for being willing to
do content sniffing on HTTP replies, so that what you served
as eg text/plain might not be interpreted that way by some
browsers. One can hope that mail clients are more sane, but
I'm not going to hold my breath there.)
One caution here: trying to make decisions based on things having
specific Content-Type values is a mug's game. For example, if you're
trying to pick out ZIP files based on them having a C-T of
application/zip, you're going to miss a ton of them; actual real
email has ZIP files with all sorts of MIME types (including the
catch-all value of application/octet-stream). My impression is that
the most reliable predictor of how a mail client will interpret an
attachment is actually the extension of its MIME filename.
(While the gold standard for figuring out if something is a ZIP
file or whatever is actually looking at the data for the MIME part,
please don't use file (or libmagic) for general file classification.)
One solution is certainly to just throw up our hands and log everything; inline, attachment, whatever, just log it all and we can sort it out later. The drawback of this is that it's going to be pretty verbose, even if you exclude inline text/plain and text/html, since lots of email comes with things like attached images and so on.
The current approach I'm testing is to use a collection of signs
to pick out attachment-like things, with some heuristics attached.
Does a MIME part declare a MIME filename ending in .zip? Then
we'll log some information about it. Ditto if it has a Content-Disposition
of attachment, ditto if it has a Content-Type of application/*,
and so on. I'm probably logging information about some things that
mail clients will display inline, but it's better than logging too
little and missing things.
(Then there is the fun game of deciding how to exclude frequently emailed attachment types that you don't care about because you'll never be able to block them, like PDFs. Possibly the answer is to log them anyways just so you know something about the volume, rather than to try to be clever.)
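As a rough illustration of the 'collection of signs' approach, here is a minimal sketch using Python's standard email package. It is not the code I actually run; the function names, the sign labels, and the INTERESTING_EXTENSIONS tuple are all made up for the example, and the only signs it checks are the three mentioned above.

```python
from email import message_from_string
from email.message import Message

# Hypothetical: filename extensions we want to hear about. Extend to taste.
INTERESTING_EXTENSIONS = (".zip",)


def attachment_signs(part: Message) -> list:
    """Return the reasons (if any) that this MIME part looks attachment-like."""
    signs = []
    # get_filename() looks at Content-Disposition's filename parameter first,
    # then falls back to the name parameter on Content-Type.
    filename = part.get_filename()
    if filename and filename.lower().endswith(INTERESTING_EXTENSIONS):
        signs.append("filename-extension")
    if part.get_content_disposition() == "attachment":
        signs.append("disposition-attachment")
    if part.get_content_maintype() == "application":
        signs.append("application-type")
    return signs


def loggable_parts(msg: Message):
    """Yield (content type, filename, signs) for each attachment-like leaf part."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers are never attachments themselves
        signs = attachment_signs(part)
        if signs:
            yield part.get_content_type(), part.get_filename(), signs
```

Any single sign is enough to get a part logged, which deliberately errs on the side of over-reporting. Recording which signs fired makes it possible to sort out later how often, say, an inline C-D shows up on something with a .zip filename.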
2016-06-27
If you send email, don't expect people to help you with abuse handling
I'll start with the tweets:
@thatcks: I see these spammers used @MailChannels to hit us once before, in April. I reported them then, but I have no time for this shit any more.
Back in April, a persistent long-term spammer of one of our addresses attempted to send it spam via MailChannels, a commercial email sending outfit. I complained to MC's abuse contacts at the time, because I'm an optimist, and someone at MC got back to me to tell me this spammer had been fixed. Then they came back now (well, a couple of days ago).
@thatcks: As has been said many, many times before, expecting the receivers of email to be your anti-spam detection method is utterly broken.
Some people might say that I should do the 'responsible' thing and once again report this incident to MailChannels. These people are wrong. It is always the sender's responsibility to detect that they are sending spam and take steps to deal with it; as was said many years ago, abuse reports are a gift (one that comes from fewer and fewer people these days). In my case, my only real interest is in making the spam stop and generally I have far more effective ways of doing this than sending in complaints.
(By the way, I hope we can agree that there is absolutely no moral basis for saying that people have a responsibility to report spam. If your service is spamming me, I am getting absolutely nothing out of this and I accordingly owe you absolutely nothing. In fact, morally speaking you owe me for inflicting costs on me.)
In this specific situation, it's also clear that sending in complaints is not effective. After all, I already did that once, got an assurance that it was dealt with, and the spammer came back a couple of months later. A repeat report is likely to net exactly the same result at best.
Then MailChannels popped up:
@MailChannels: @thatcks We don't take abuse of our network lightly and are keen to investigate. Please send us sample messages to support@mailchannels.com
This is a form tweet. It betrays at least an inability to read my original message.
(Replying to aggravated people with form tweets that betray a lack of thinking human involvement is, at the least, going to aggravate them further. So it proved here.)
@thatcks: .@MailChannels You're asking me to do more work to help you out. Why would I do that? If you want, you have enough information already.
I gave the form tweet all the response I felt that it deserved. And it's true that MailChannels has all the information they need; they could just search their April abuse reports for my name, find the address here that I reported was hit, and see if that address was sent to recently. Why yes, yes it was. MailChannels' email to it was even rejected this time around too, which really ought to be one of a number of danger signs for MailChannels. Certainly this would take some work on MailChannels' part, but you know, they're the people that this benefits, not me; I've already taken effective steps on our side.
(MailChannels benefits because they get rid of a spammer who may drag their reputation down and damage the deliverability of email for other paying customers, which would cost MailChannels money.)
Of course, I expect that MailChannels did nothing here. That's the easy way to blow off problem indicators while feeling good about yourself; you can say 'well, if it was real the person would have totally taken us up on our offer'. They can tick off the 'we tried' box and consider the matter done. And really, what mail sending service can afford to actually do a good job with spam?
(Applications of this pattern to, say, bug reports and bug trackers are left as an exercise for the reader.)
2016-06-12
There are (at least) two sorts of DNS blocklists
Here is a trite and obvious thing that I nevertheless feel like writing down: in practice, there are (at least) two sorts of DNS anti-spam blocklists. Since I want to use value neutral terms here, let us call these 'simple' and 'complex' blocklists.
The operation of a simple DNSBL is, well, simple. If it sees spam from an IP, it lists the IP (or if it sees whatever is the DNSBL's idea of 'bad stuff'). Usually the IP gets automatically delisted after a while, but in some DNSBLs the listing lasts forever unless someone takes action to have it get cleared, appealed, or whatever.
A complex DNSBL attempts to have more complex balancing criteria for adding listings than the simple presence of spam; for instance, it may somehow assess how much apparently legitimate traffic it's seen from the source IP as well as spam volume. A complex DNSBL is sometimes going to be slower to list an IP than a simple DNSBL.
A simple DNSBL does not have 'false positives' as such (assuming that it's honestly run), but that's because a listing means something very narrow; it means that the IP did a bad thing within the time horizon. People who reject email based on a listing in a simple DNSBL may have false positives in that rejection, though, because an IP doing a bad thing once doesn't necessarily mean that it will do it every time. Complex DNSBLs can have false positives because they're fundamentally intended to assert that the email you're getting is probably bad. Good operators of complex DNSBLs attempt to minimize such false positives.
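To make the narrow meaning of a simple DNSBL listing concrete, here is a toy model of one. This is only a sketch under assumptions; the class name, the method names, and the seven-day default time horizon are all invented for illustration and don't describe any real DNSBL.

```python
import time


class SimpleDNSBL:
    """Toy model: an IP is listed if it did a bad thing within the time horizon."""

    def __init__(self, listing_seconds=7 * 24 * 3600, clock=time.time):
        # listing_seconds is a hypothetical time horizon, not any real DNSBL's.
        self._listing_seconds = listing_seconds
        self._clock = clock
        self._last_seen = {}  # IP -> time we last saw spam from it

    def saw_spam(self, ip):
        """Record 'bad stuff' from an IP; this (re)starts its listing."""
        self._last_seen[ip] = self._clock()

    def is_listed(self, ip):
        """True if the IP did a bad thing within the time horizon."""
        seen = self._last_seen.get(ip)
        if seen is None:
            return False
        if self._clock() - seen >= self._listing_seconds:
            del self._last_seen[ip]  # automatic delisting after a while
            return False
        return True
```

Notice that nothing in here looks at how much legitimate email the IP also sends. That is exactly the balancing that a complex DNSBL tries to do and a simple one doesn't.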
To give an example of each, the Spamhaus SBL is a complex DNSBL (or at least generally it is). The CBL is a simple DNSBL, but one that (theoretically) uses very narrow listing criteria that are very strongly correlated with sending only spam.
Unfortunately not all DNSBLs make it clear what sort of DNSBL they are in their description (or sometimes they wave their hands about it a bit). At least at the moment, one quite strong signal that you are dealing with a simple DNSBL is if it ever lists one of GMail's outgoing mail servers.
(I feel that rejecting email based on a simple DNSBL is not necessarily a mistake, but the sidebar attempting to explain this got long enough that it's going to be another entry.)
2016-06-02
Spammers can abandon SMTP connections not infrequently
As a result of looking at my SMTP session logs, one of the things that I've started tracking on my 'sinkhole' spamtrap SMTP server is how many senders reach the point where they actively get rejected by my server versus how many senders just disconnect with incomplete sessions where everything has gone fine up to that point. My SMTP session logging said that at least some just gave up, but I wasn't sure how many did this.
(Under normal circumstances you'd expect real sending mailers to almost never just abandon an incomplete session. It's not 'never' because there will always be some sending mailers that have their machine reboot out from underneath them or the like as they're trying to send out a message, but this is not exactly common so it should be very low.)
My results so far are early and somewhat incomplete, but I'll give
you representative numbers anyways. The numbers I have handy right
now are that over the past two and a half days, I've seen 123
abandoned sessions to 440 sessions with refused SMTP commands, so
about a fifth of these sessions are just being abandoned. I don't
particularly have data on where the sessions are being abandoned,
but my SMTP logs say that some senders drop the connection
while I'm sending my initial SMTP greeting banner and some drop it
as I answer their EHLO or HELO.
Now, I don't and can't know why senders are choosing to abandon their SMTP sessions to my sinkhole server. But one thing that my server does is trickle out its SMTP replies rather slowly (including the initial banner), specifically at a rate of one character every tenth of a second. I took this idea from OpenBSD's spamd, but when I put it in I didn't really expect it to do anything. It may be that I'm wrong here and there is a not insignificant amount of spammer software that either specifically recognizes this behavior or simply isn't interested in wasting its time on too-slow mailers.
(I don't yet feel like experimenting by turning this feature off and seeing if the number of abandoned sessions basically goes almost to zero.)
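For illustration, the trickling itself is simple. The following is a minimal sketch and not my sinkhole server's actual code; the trickle() function and its parameters are hypothetical, with the 0.1 second default taken from the rate mentioned above.

```python
import time


def trickle(send, reply: bytes, delay: float = 0.1, sleep=time.sleep) -> None:
    """Send an SMTP reply one byte at a time, pausing `delay` seconds after each.

    `send` is any callable that accepts bytes, such as a connected
    socket's sendall method. `sleep` is only a parameter so that the
    pause can be stubbed out when testing.
    """
    for i in range(len(reply)):
        send(reply[i:i + 1])
        sleep(delay)
```

With a real connection you would call something like trickle(conn.sendall, b"220 example.org ESMTP\r\n"), where conn is an accepted socket and the banner text is made up. At one character every tenth of a second, a 40-character reply takes about four seconds to go out.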
Applications of this to real, non-sinkhole mailers are left as an exercise. As far as I know, no real sending mailer cares about somewhat slow responses at this level, but I admit I haven't exactly attempted to get every major ISP and so on to send my sinkhole server some email just to see if it would work. Big places like Google and Outlook don't seem to have had any problems coping with my sinkhole server, for what that's worth.
Sidebar: what I consider an abandoned session versus a rejected one
A session counts as 'rejected' if the most recent valid HELO/EHLO,
MAIL FROM, RCPT TO, DATA or final '.' on messages was either
5xx'd or 4xx'd. This doesn't consider QUIT, RSET, or other
similar commands and it doesn't consider out of sequence commands.
A session counts as 'abandoned' if it got 'go ahead' 2xx/354 responses
to every valid, in-sequence SMTP command it tried but the sender
either closed the TCP connection or sent a QUIT.
Sessions with things like TLS setup failures don't count as either abandoned or rejected. I see some amount of those, some for sad reasons.
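The definitions in this sidebar can be sketched as a small classification function. This is an illustration under assumptions, not my real logging code; it assumes each session has already been reduced to a list of (command, reply code) pairs for valid, in-sequence commands, and the names here are hypothetical. It also assumes that a session which got a message accepted is complete and so counts as neither.

```python
# Only these commands count; QUIT, RSET and similar are ignored.
COUNTED = {"HELO", "EHLO", "MAIL FROM", "RCPT TO", "DATA", "."}


def is_go_ahead(code: int) -> bool:
    """True for 2xx replies and for the 354 reply to DATA."""
    return 200 <= code < 300 or code == 354


def classify_session(replies, tls_failed=False) -> str:
    """Classify a session as 'rejected', 'abandoned', or 'neither'.

    `replies` is a list of (command, reply_code) pairs in session order.
    """
    if tls_failed:
        return "neither"  # TLS setup failures don't count either way
    counted = [(cmd, code) for cmd, code in replies if cmd in COUNTED]
    # Rejected: the most recent counted command was 4xx'd or 5xx'd.
    if counted and 400 <= counted[-1][1] < 600:
        return "rejected"
    # Assumption: an accepted message means the session was completed.
    if any(cmd == "." and is_go_ahead(code) for cmd, code in counted):
        return "neither"
    # Abandoned: everything tried got a go-ahead, then the sender left.
    if all(is_go_ahead(code) for _, code in counted):
        return "abandoned"
    return "neither"
```

A sender that drops the connection during the greeting banner has an empty list of replies, which this treats as abandoned. A session with an earlier rejection followed by an accepted command falls through to 'neither', since it fits neither definition as written.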