2015-09-17
We know what you are
Recently, one of our administrative aliases got email from 'no-reply@researchgate.net' that looked like this:
Dewayne Perry invited you to join their network on ResearchGate and confirm authorship of your publications.
[...]
Software Engineering
3 Publications - 3 Citations[...]
While the department does do research and publications in software engineering (among other fields, of course), our administrative alias does not publish anything and as a result its nonexistent publications do not have citations.
People and organizations who send this sort of email are not fooling anyone. They might have fooled people a decade or two ago, when the Internet was young and spam was new and not something everyone had a great deal of exposure to, but not today. Today everyone who gets this sort of bogus email citing a random name they have never heard of knows exactly what this is. As a result, they know exactly what the organization sending the email is. So do bystanders who simply hear about this.
(Of course I did not use the 'unsubscribe' link that had been helpfully primed in the email. We use much more definite and final methods of ceasing to receive email from people like this. Neither did I bother to waste my time by attempting to send in any sort of complaint.)
I wish all of these sorts of people would stop pretending, but I suppose that's the naive part of this entry. As long as it fools a few people a little bit of the time, these sorts of people will keep banging away. Heck, I expect them to keep banging away even after that point, should it ever arrive.
(See also why you can no longer have an 'invite-your-friends' feature.)
2015-08-15
Spam scoring systems are often not deliberately designed
In theory, my concerns about how other people's systems will react to us DKIM-signing only some of our email have a simple answer; if we don't add DMARC information that says to react to unsigned email in some way, they should do nothing. This is the spec compliant behavior and you'd have to be really obnoxious to decide to do otherwise. But that assumes that spam scoring systems are in fact deliberately designed, and my current belief is that the custom systems major email providers use are not in that sense. By that I mean that no human being sat down to write out and set up more than a small fraction of the scoring rules they use.
In today's world, one obvious path to a sophisticated spam scoring system is through various forms of statistical reasoning and machine learning (of which Bayesian spam filtering is a simple starting point). All of these techniques uncover correlations between message features and outside spam scores (as determined in various ways, such as through users telling you), and they're all blind to what those features mean as such and whether or not they 'should' be used for some purpose or interpreted in some way.
I assume that every major email provider is running such a system
as part of their overall spam filtering (and there's some evidence
for this in the behavior of their systems). I further assume that
they're all shoveling every message feature they can get their hands
on into these systems, because why not; the more features the better.
I also think it's extremely likely that one of these features is
DKIM information. At this point it's not particularly hard to come
up with scenarios where you can objectively find correlations between
things like the lack of a DKIM signature in email From: a particular
domain and the likelihood of such a message being seen as spam.
That there are legitimate email messages like this doesn't matter
to a machine learning system any more than the fact that you're not
supposed to use lack of DKIM signatures this way; all it cares about
is useful correlations.
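As a rough illustration of how such a correlation can emerge without anyone writing a rule for it, here is a toy naive Bayes scorer. Everything in it is an assumption made for the sketch: the single `has_dkim` feature, the made-up training counts, and the function names. It is not a description of how any real provider's system works.

```python
import math
from collections import Counter


def train(samples):
    """Count labels and (label, feature, value) occurrences.

    samples is a list of (features_dict, label) pairs, where label is
    either 'spam' or 'ham'.
    """
    label_counts = Counter()
    feature_counts = Counter()
    for features, label in samples:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name, value)] += 1
    return label_counts, feature_counts


def spam_log_odds(features, label_counts, feature_counts):
    """Naive Bayes log-odds of spam versus ham for boolean features.

    Uses add-one smoothing; the '+ 2' is because each feature here has
    two possible values (True and False).
    """
    score = math.log(label_counts["spam"] / label_counts["ham"])
    for name, value in features.items():
        p_spam = (feature_counts[("spam", name, value)] + 1) / (label_counts["spam"] + 2)
        p_ham = (feature_counts[("ham", name, value)] + 1) / (label_counts["ham"] + 2)
        score += math.log(p_spam / p_ham)
    return score


# Entirely made-up training data: most ham happens to be signed and most
# spam happens to be unsigned. Nobody told the model what DKIM means.
TRAINING = (
    [({"has_dkim": True}, "ham")] * 90
    + [({"has_dkim": False}, "ham")] * 10
    + [({"has_dkim": True}, "spam")] * 30
    + [({"has_dkim": False}, "spam")] * 70
)

label_counts, feature_counts = train(TRAINING)
unsigned_score = spam_log_odds({"has_dkim": False}, label_counts, feature_counts)
signed_score = spam_log_odds({"has_dkim": True}, label_counts, feature_counts)
```

With this invented data the unsigned message gets a positive (spam-leaning) score and the signed one a negative score, purely because of the counts. The legitimate unsigned messages in the training set do not prevent that.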
No one set out to create a system that (ab)used lack of DKIM signatures this way and the generated scoring system is not deliberately designed by anyone; the most that people did was design the machine learning meta-system that trained itself on the massive collection of accumulated message data in order to create the generated scoring system. No one understands the generated rules (even Bayesian systems are hard to peer into, never mind more sophisticated approaches) and so no one can even consider auditing them for things that shouldn't be done.
The only way to avoid having some message feature inadvertently become part of a signal deep inside a machine learning system is to exclude it. I can't make GMail's and Hotmail's and Yahoo's spam filtering systems exclude DKIM signature information from the set of message features that they train their systems on. The best I can do is not provide them with the signal in the first place by never doing DKIM signatures, making all of our email identical in this.
(Of course, by doing so I'm also sending a signal, namely the total lack of any DKIM signatures for our domains. At the moment this seems like a less dangerous signal to send for various reasons.)
(I said a much shorter version of this in a comment on my previous entry, but I feel like writing it out in full as an entry.)
2015-08-14
My current views on using DomainKeys (DKIM) here
Almost five years ago I wrote about my then-new view of DKIM and how we might someday use it ourselves when we'd updated our mailers enough. Well, the mailers have been updated for a while and not only aren't we using DKIM, I'm not inclined to do so any time soon. Prompted by someone here asking for my opinions on DKIM today, here are my current views.
As far as inbound email goes, I've experimented with a Thunderbird extension to verify DKIM signatures, which showed me that a bunch of perfectly good email gets either warnings or outright failures. Given this result it's clear that our inbound mail gateway can't do anything active with DKIM results, like start rejecting or visibly marking such email; the false positives would swamp any genuine benefit or signal that might be present.
In terms of spam and DKIM, I've seen plenty of spam that has DKIM signatures (and I assume they're valid ones). I've also seen plenty that doesn't. If DKIM data provides some sort of useful signal about spam versus non-spam for email, making use of it is best left up to the black box commercial anti-spam system that we use.
(DKIM does have some clear use in anti-spam stuff since it's a component of DMARC and some people are actively using DMARC these days. But for a collection of reasons we're not going to start enforcing other people's DMARC policies on our inbound mail gateway, although the anti-spam system may take that into account when it scores email.)
For outgoing email, my major concern remains what it was before, namely how other people's systems will behave. I simply don't know how other systems will react to all of our valid DKIM signed email, email we DKIM signed but that then got changed in transit, and email 'From:' us but without a DKIM signature from us. Without confidence that adding DKIM signing will be harmless, I don't feel any enthusiasm for doing so. At this point I'd probably only enable DKIM if there was some significant recipient system that started more or less demanding that we provide it in order to get our email delivered to them.
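To make those cases concrete, here is a minimal sketch of how a receiving system might sort messages by what it sees in the headers. The function names and category labels are made up for illustration. It only looks at whether a DKIM-Signature header names the From: domain in its d= tag. It does not verify anything cryptographically, so it cannot tell a valid signature from one that broke in transit; that would take a real DKIM verifier.

```python
from email import message_from_string
from email.utils import parseaddr


def from_domain(msg):
    """Return the lowercased domain of the From: header address."""
    _, addr = parseaddr(str(msg.get("From", "")))
    return addr.rpartition("@")[2].lower()


def dkim_domains(msg):
    """Return the set of d= domains named in DKIM-Signature headers.

    This reads the tag list only; no signature is verified.
    """
    domains = set()
    for value in msg.get_all("DKIM-Signature", []):
        for tag in str(value).split(";"):
            name, sep, val = tag.partition("=")
            if sep and name.strip().lower() == "d":
                domains.add(val.strip().lower())
    return domains


def classify(raw_message, our_domain):
    """Sort a message into one of three hypothetical buckets."""
    msg = message_from_string(raw_message)
    if from_domain(msg) != our_domain:
        return "not-ours"
    if our_domain in dkim_domains(msg):
        # May still fail verification if it was changed in transit.
        return "signed-by-us"
    return "unsigned-from-us"
```

The "unsigned-from-us" bucket is the one that matters here, since it covers legitimate email sent From: our addresses but not through our machines.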
(I'm sure that eg GMail would like us to start doing DKIM signing,
but that they'd like us to do that is exactly why I don't want to.
Almost anyone who actively cares about us doing DKIM is going to
use it as input into a spam scoring system, and since we consider
it fully valid to send email From: our addresses but not through
our machines, the last thing I want to do is enable that particular
signal.)
2015-07-27
Spammers mine everything, Github edition
It's not news that spammers will trawl everything they can easily get their hands on for anything that looks like email addresses. But every so often I get another illustration of this effect and it strikes me as interesting. This time around it's with the email address I use for Github.
This email address is of course an expendable address, since it's exposed in git commits that I push to Github. It's also exposed to Github itself, but I don't think Github leaks it (at least not trivially). Certainly the address remained untouched by spam for years. Then back in late May the address appeared in the plain text of a commit message. Last week, the spam started showing up.
(The actual spam was one offer from an email spam service provider, one student loan repayment scam, and one relatively incomprehensible one. All came from Chinese IPs; the second and the third came from the same /24 subnet, and the first one came from an SBL CSS listed IP.)
I find the delay of a couple of months interesting but probably not too surprising. It's also probably not surprising that spammers mine Github in some way; there's a lot of email addresses exposed there. I'd like to say that spammers probably only mine web pages on Github instead of looking at Git repositories themselves, but that may not be the case; although I'm on Github, my repos are nowhere near as visible as the project where this address appeared.
Still, I found the whole thing kind of interesting (and kind of irritating, too, because now I will probably have to enact increasingly strong defenses on this address until I abandon it).
2015-07-19
'Retail' versus 'wholesale' spam
A while back I mentioned that the spam received by my spamtrap SMTP server is boring; it's mostly advance fee frauds, phishes, and the like. In light of that, and of how GMail based people keep trying to send me spam, I've been thinking about how one way to split up spam is between what I'll call retail spam and wholesale spam.
Wholesale spam is the high volume emitters, the people who are doing it in enough volume that they have real infrastructure and automation of some sort. These are the 'email marketing' people and the people who wind up on the SBL and so on and so forth. The modern problem for them is that their very volume makes them recognizable and thus blockable. We have DNS blocklists, we have spam feature recognition in filtering systems, and so on and so forth. As a result of this, I think that wholesale spam is a mostly solved problem for most systems.
Retail spam is the small volume and often hand entered stuff. It is people sitting in Internet cafes using stolen webmail credentials to send out more or less hand-written messages. This is the domain of a great deal of advance fee fraud and phish spam, and as a result of its comparatively small volume and hand done nature it's hard to do a really good job of blocking it today. It's probably always going to be hard to fully block this, and as a result I can unhappily look forward to GMail emitting this stuff in my direction for years to come.
(GMail is far from alone here, of course; any freemail service is a sending source for this stuff. I just notice GMail more than the others for various reasons.)
Maybe someday we'll figure out really effective tools against retail spam, but I doubt it. Stopping retail spam runs up against the fundamental problem of spam.
2015-06-19
Sometimes looking into spam for a blog entry has unexpected benefits
Today, I was all set to write an entry about how I especially hate slimy companies that gain access to people's address books. In fact I had a particular company in mind, because it's clear that they did this to one of our users recently. As part of starting to write that entry, I decided to do some due diligence research on the company involved. What I found turned out to be rather more alarming than I expected.
There are two usual run of the mill ways to steal people's address books. The 'not even sort of theft' way is to just ask people to give you their address books so you can connect them to any of their friends on your service, and then perhaps send some invitation mails yourself. The underhanded way is to persuade people to give you access to their GMail or Yahoo or whatever email account for some innocent-sounding purpose, then take a copy of their address book while you're there.
These people went the extra mile; they made a browser extension. Of course it does a lot more than just take copies of your address book and none of what it does seems particularly pleasant (at least to me). Getting a browser extension into people's browsers is probably harder than getting their address books in the usual way, but I imagine it's much more lucrative (and much more damaging).
What this means is that our user didn't just give a company access to their address book; instead they've wound up infected by something that is more or less malware (and of course this means that their machine may also have other problems). And I wouldn't have found any of this if I hadn't decided to turn over this particular rock as part of writing a blog entry.
(It turns out this company has a Wikipedia entry. It's rather eyebrow raising in a 'this seems so whitewashed it's blinding' kind of way. Since it was so obviously white, I dipped into the edit history and the talk page and found both rather interesting, ie there was and may still be a roiling controversy that is not reflected in the page contents. I'm kind of sad to see Wikipedia (ab)used this way, but I'm not wading into that particular swamp for any reason.)
2015-06-12
Red Hat are marketing email spammers now (in the traditional way)
We used to use Red Hat Enterprise Linux (in our previous fileserver generation and in a few other roles), although we've wound up switching to CentOS. As part of having those RHEL machines we have an RHN account, which is registered with a specific email alias here. RHN uses that email address to do things like notify us about important security updates, machines not responding, and so on. Although in practice all of those are basically noise, that's okay; that's what we registered the email address for and RHN is only doing what we told it to.
The other day we got the following email to that address from a Red Hat address, sent from Red Hat's own SMTP servers:
Subject: Red Hat Forum: Build an Efficient and Agile IT Organization for the Future - On Behalf Red Hat
Dear Valued Client,
We would like thank you for attending our Mobile Enterprise Application Workshop. We hope you enjoyed it. Since may of the attendees have requested, we are pleased to share with you our upcoming forum you may be interested in.
Join our annual Red Hat Forum on June 18 , 2015 for an insightful morning with industry leading analysts from IDC [...]
This is not RHN notification email. More than that, the first paragraph is a further lie; we didn't attend (and haven't ever attended) any Red Hat 'Mobile Enterprise Application Workshop'. Oh, and this claims to have been sent from Red Hat's Canadian office but includes no unsubscribe link, which means that it is clearly in violation of recent Canadian anti-spam legislation on top of everything else.
At one level I'm not particularly surprised. Companies do this all the time, often although not exclusively as a result of address list creep. Red Hat is just the latest one, and why would I be surprised at that? Everyone screws you eventually (it's why modern email is such a hassle).
At another level I'm terribly disappointed. At one time I could think of Red Hat as clearly good guys, people who would never ever behave in such an unethical and frankly slimy way. Clearly those days are over now, as Red Hat has given me a clear and unambiguous sign that marketing is winning over morals. I'm not sure what I can expect next, but I'm sure I'm not going to like it.
(Maybe Red Hat marketing will win the argument that everyone who has ever submitted a RHEL related Bugzilla report is fair game for RHEL related marketing emails.)
PS: I sent email to Red Hat when we got this email. I have of course received no reply.
(This elaborates on my tweet at the time.)
2015-05-25
Email providers cannot stop spam by scanning outgoing email
One of the things that Amazon SES advertises that it (usually) does is that it scans the outgoing email that people send through it to block spam. This sounds great and certainly should mean that Amazon SES emits very low levels of spam, right? Well, no, not so fast. Unfortunately, no outgoing mail scanning on a service like this can eliminate spam. All it can do is stop certain sorts of obvious spam. This is intrinsic in the definition of 'spam' and the limitations of what a mail sending system like Amazon SES does.
Essentially perfect content scanning can tell you two things: whether the email has markers of known types of spam, such as phish, advance fee fraud, malware distribution, and so on, and whether the email will be scored as spam by however many spam scoring systems you can get your hands on the rules for. These are undeniably useful things to know (provided that you act on them), but messages that fail these tests are far from the only sorts of spam. In particular, basically all sorts of advertising and marketing emails cannot be blocked by such a system because what makes these messages spam is not their content, it's that they are unsolicited (cf, cf).
The only way to even theoretically tell whether a message is solicited or unsolicited is to control not just the sending of outgoing email but the process of choosing destination email addresses. If you only scan messages but don't control addresses, you have very little choice but to believe the sender when they tell you 'honest, all of these addresses want this email'. And then the marketing department of everyone and sundry descends on Amazon SES with their list of leads and prospects and people to notify about their very special whatever it is that of course everyone will be interested in, and then Amazon SES is sending spam.
(Or the marketing people buy 'qualified email addresses' from spam providers because why not, you could get lucky.)
There is absolutely nothing content filtering can do about this. Nothing. You could have a strong AI reading the messages and it wouldn't be able to stop all of the UBE.
(I wrote a version of this as a comment reply on my Amazon SES entry but I've decided it's an important enough point to state and elaborate in an entry.)
2015-05-22
Unsurprisingly, Amazon is now running a mail spamming service
I recently got email from an amazonses.com machine, cheerfully
sending me a mailing list message from some random place that
desperately wanted me to know about their thing. It was, of course,
spam, which means that Amazon is now in the business of running a
mail spamming service. Oh, Amazon doesn't call what they're running
a mail spamming service, but in practice that's what it is.
For those that have not run into it, amazonses.com is 'Amazon Simple Email Service', where Amazon carefully sends out email for you in a way that is designed to get as much of it delivered as possible and to let you wash annoying people who complain out of your lists as effectively as possible (which probably includes forwarding complaints from those people to you, which is something that has historically caused serious problems for people who file complaints due to spammer retaliation). I translate from the marketing language on their website, of course.
In the process of doing this amazonses.com sends from their own
IP address space, using their own HELO names, their own domain
name, and completely opaque sender envelope address information.
Want to get some email sent through amazonses.com but not the email
from spammers you've identified? You're plain out of luck at the
basic SMTP level; your only option is to parse the actual message
during the DATA phase and look for markers. Of course this helps
spammers, since they get a free ride on the fact that you may not
be able to block amazonses.com email in general.
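A minimal sketch of what that DATA-phase checking might look like, assuming you already have the raw message bytes in hand and a list of markers for senders you've identified. The marker sets, the `spammer.example` domain, and the `should_reject` name are all hypothetical; how you hook this into your mailer depends entirely on the mailer.

```python
from email import message_from_bytes
from email.utils import parseaddr

# Hypothetical markers for senders already identified as spammers.
BLOCKED_FROM_DOMAINS = {"spammer.example"}
BLOCKED_HEADER_SUBSTRINGS = {
    "List-Unsubscribe": ["spammer.example"],
}


def should_reject(raw_message: bytes) -> bool:
    """Decide whether a message received during DATA matches a marker.

    The SMTP envelope is no help here, so this looks only at message
    headers: the From: domain and substrings in selected headers.
    """
    msg = message_from_bytes(raw_message)

    # Check the domain of the header From: address.
    _, addr = parseaddr(str(msg.get("From", "")))
    if addr.rpartition("@")[2].lower() in BLOCKED_FROM_DOMAINS:
        return True

    # Check other headers for known substrings.
    for header, needles in BLOCKED_HEADER_SUBSTRINGS.items():
        for value in msg.get_all(header, []):
            text = str(value).lower()
            if any(needle in text for needle in needles):
                return True
    return False
```

Note that this is strictly more work than rejecting at MAIL FROM or by sending IP, since the whole message has to be accepted and parsed before a decision can be made.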
I'm fairly sure that Amazon does not deliberately want to run a mail spamming service. It's just that, as usual, not running a mail spamming service would cost them too much money and too much effort and they are in a position to not actually care. So everyone else gets to lose. Welcome to the modern Internet email environment, where receiving email from random strangers to anything except disposable email addresses gets to be more and more of a problem every year.
(As far as I can tell, Amazon does not even require you to use their own mailing list software, managed by Amazon so that Amazon can confirm subscriptions and monitor things like that. You're free to roll your own mail blast software and as far as I can tell my specific spammer did.)
2015-04-29
The 'EHLO ylmf-pc' plague of SMTP authentication guessers
If you run a mail server on the Internet and look at your logs, you may
have noticed a lot of connections from machines that EHLO with the
name ylmf-pc. There are many pages about this on the web, and the
general consensus is that this is some sort of long standing brute force
SMTP authentication guessing botnet or piece of software. Whatever it
is, it's quite annoying and may also be unevenly distributed in action.
(I've mentioned them before in passing.)
I can't say with any confidence what it is, because it also seems to be pretty dumb and limited. Our new authenticated SMTP server doesn't offer authentication before you STARTTLS, but it will afterwards. This can't be an uncommon configuration, yet I see a whole plague of ylmf-pc machines connecting to it and then immediately disconnecting without trying anything more (and in particular without STARTTLS). It's as if they connect, examine the EHLO response, see no authentication advertised, and then immediately disconnect.
Of course, that's when the real annoyance comes in; these machines aren't content with doing this just once. Oh no. A ylmf-pc machine will do this same connect, EHLO, then disconnect cycle over and over and over again, very fast. Our logs typically show multiple connects and disconnects a second. We have firewall connection limiters that cut in to temporarily block these IPs, but otherwise a ylmf-pc machine will also keep doing this for quite a while. This creates quite a bunch of log spam, even with the firewall blocking IPs for us.
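The general idea behind a connection limiter like this can be sketched as a per-IP sliding window with a temporary block. This is an illustration of the technique only, not our actual firewall configuration; the class name, thresholds, and block duration are all made up, and a real firewall does this in the kernel rather than in Python.

```python
from collections import defaultdict, deque


class ConnectionLimiter:
    """Temporarily block IPs that connect too often in a short window."""

    def __init__(self, max_connects=5, window=10.0, block_for=300.0):
        self.max_connects = max_connects  # connects allowed per window
        self.window = window              # window length in seconds
        self.block_for = block_for        # block duration in seconds
        self.recent = defaultdict(deque)  # ip -> recent connect times
        self.blocked_until = {}           # ip -> time the block expires

    def allow(self, ip, now):
        """Record a connection attempt at time 'now'; return True if allowed."""
        if self.blocked_until.get(ip, 0.0) > now:
            return False

        times = self.recent[ip]
        times.append(now)
        # Drop connection times that have aged out of the window.
        while times and now - times[0] > self.window:
            times.popleft()

        if len(times) > self.max_connects:
            self.blocked_until[ip] = now + self.block_for
            times.clear()
            return False
        return True


# A machine connecting several times a second trips the limit quickly.
limiter = ConnectionLimiter(max_connects=3, window=1.0, block_for=60.0)
results = [
    limiter.allow("192.0.2.1", t)
    for t in (0.0, 0.2, 0.4, 0.6, 0.8, 30.0, 61.0)
]
```

Here the fourth connection within a second triggers the block, later attempts are refused until the block expires, and the final attempt at 61 seconds is allowed again. Every refused attempt can still produce a log line somewhere, which is why the log spam doesn't entirely go away.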
I was going to confidently say that the ylmf-pc plague hits some of our machines much more than other ones and speculate about why, but it turns out that I can't; our inbound MX gateway doesn't even log machines that do this connect then disconnect game, so I can't tell whether or not the ylmf-pc brigade is ignoring them. They do seem to do at least a little bit of scanning of the Internet in general, but they also seem much more concentrated on machines with MX entries and machines with suggestive DNS names (such names seem to cause spammers to show up fast, although I haven't tried a scientific test of this).
(This is apparently the signature of a botnet called 'PushDo' or 'Cutwail', per this stackoverflow question and answer (also). The oldest mention I can find in my own logs is November of 2013, but it looks like this pattern may go back to 2012 and possibly earlier.)