Email anti-spam (and really all anti-spam) is all heuristics now

August 30, 2023

On the Fediverse, I noted something:

This is my sad face when Spamhaus puts ( in the SBL CSS. Something went wrong here. Well, several things, starting with Canter & Siegel.

Back in the day, one of the things some people said about DNS blocklists in general and sometimes Spamhaus in particular was that they were opaque, capricious, and didn't actually validate what they were putting in their blocklists, so who knows what could wind up in there for who knows what reason. Those people would take this incident as a validation of their view.

(I was going to say that this was a long standing IP address used to send Ubuntu security announcements, but it looks like we only just started to get them from this IP, although the entire IP range is owned by Canonical.)
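(For anyone who hasn't looked at the mechanics, here is a minimal Python sketch of how a DNS blocklist is usually queried. The zone name `dnsbl.example.org` and the function name are placeholders of mine, not any real blocklist's details, and the sketch only handles IPv4. It builds the name a mail server would look up; it doesn't do the DNS query itself.)

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str = "dnsbl.example.org") -> str:
    """Build the DNS name to look up when checking `ip` against a blocklist.

    `zone` is a hypothetical placeholder; a real deployment would use the
    zone name published by the blocklist operator.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    # The octets are reversed and the blocklist zone is appended,
    # so 192.0.2.10 becomes 10.2.0.192.<zone>.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.10"))
```

If the lookup of that name returns an address record, the IP is listed; if the name doesn't exist, it isn't. All the receiving mail server gets back is that yes-or-no answer (sometimes with a return code), which is part of why blocklists look opaque from the outside.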

I have bad news for such people. This is what all email anti-spam systems are doing today. There are no effective anti-spam systems that are based only on sure positive signs of spam. Everything is an opaque black box full of heuristics and uncertainty, with misfires that are hopefully only occasional and hopefully not too spectacular. Sometimes people hand-write rules and try to assess them, sometimes people take straightforward statistical approaches (eg, Bayesian scoring), and sometimes companies go for the complicated statistics that are generally known as 'Machine Learning' or these days 'AI' (in press releases, at least).
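(To make 'Bayesian scoring' a bit more concrete, here is a toy sketch in Python. It is a simplified naive-Bayes-style scorer that I made up for illustration, not the algorithm of any particular filter: it only looks at which words are present, uses add-one smoothing, and the function names and tiny training set are all hypothetical.)

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, how many messages each token appears in.

    `messages` is an iterable of (text, is_spam) pairs.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(set(text.lower().split()))
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Return an estimated probability that `text` is spam (never exactly 0 or 1)."""
    # Start from the smoothed prior odds of spam versus non-spam.
    log_odds = math.log((totals[True] + 1) / (totals[False] + 1))
    for token in set(text.lower().split()):
        # Add-one smoothing keeps unseen tokens from producing zero probabilities.
        p_spam = (counts[True][token] + 1) / (totals[True] + 2)
        p_ham = (counts[False][token] + 1) / (totals[False] + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    corpus = [
        ("cheap pills online", True),
        ("win cheap prize now", True),
        ("meeting agenda for monday", False),
        ("lunch on monday", False),
    ]
    counts, totals = train(corpus)
    print(spam_probability("cheap pills", counts, totals))
    print(spam_probability("monday meeting", counts, totals))
```

The thing to notice is that the output is a number between 0 and 1, not a verdict. Someone still has to pick a threshold, and wherever they put it, some legitimate email will land on the wrong side.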

This is not an accident and it's not because people are lazy. It's because anti-spam isn't working against a blind natural phenomenon; instead, anti-spam is engaged in an iterated game against human driven spam. If there's a sure-fire signal of spam that can be used to reject or filter email, the humans driving spam are highly incentivized to get rid of it, and only the ones who are successful at that will survive.

This is simply one of the prices that spam exacts from us. We can no longer live in a world of certainty, where we can be confident that our anti-spam systems are right about things. And sometimes we'll see things that are so obvious (to us humans, on the spot, only having to look at this one incident) that they make us have sad faces.

(There's also the related issue that no one can afford to pay enough humans enough to constantly be evaluating and updating anti-spam rules and heuristics all of the time. All effective anti-spam systems have to operate partially automatically, and sometimes that will pass things that an alert human would not have.)

Comments on this page:

I see what you're saying, and I think I agree with most of it. However, I do take issue when people who get their email on such an "opaque black box full of heuristics and uncertainty" claim it's the sender's problem when email is accepted by the black box but not received by the black box's users.

If your mail server's admins are going to have your mail server perform complex operations that drop received email into something other than your inbox, however necessary and unavoidable those operations are on the modern internet, and you miss email you wanted to read as a result, that is something you need to take up with your mail server's admins, not the sender. If your mail server's admins are unresponsive, possibly because you're not actually paying them but getting email as a free service, that is still your problem, and still nothing to do with the sender.

I do feel a clearer understanding of the line of responsibility here would improve email for everyone.

The only real solutions I see to this are whitelists, blacklists, and Hashcash. The last has been ruthlessly stopped by large corporations in favour of reputation, because these large corporations all want to send spam. Personally, I believe accidentally missing an e-mail to be unacceptable. These same large corporations like to have their servers lie about accepting e-mails, because they want to destroy what they can't control.

"It's because anti-spam isn't working against a blind natural phenomenon; instead, anti-spam is engaged in an iterated game against human driven spam."

Let's not forget that proof-of-work is bad for the environment, that message brought to us by the same corporations that callously waste resources and produce mountains of plastic.

By cks at 2024-02-16 15:31:29:

I disagree, for two reasons. First, charging for email in general is not going to stop spam, although it will change what sort of spam you get. This includes Hashcash, especially now that you can rent compute capacity as you need it (so people who want to send out a marketing email campaign can literally pay for the Hashcash costs, were they to exist). Second, Hashcash harshly penalizes legitimate senders of significant amounts of email, including mailing lists, who see their compute needs and thus costs go up drastically. Hashcash is a non-starter in anything like even a traditional, pre-spam Internet email environment, much less today's non-spam email environment.

(Plus, active criminal spammers have plenty of compute capacity they can rent for cheap, cf botnets for hire.)
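(For readers who haven't seen it, here is a rough Python sketch of the general Hashcash idea, to make the cost argument concrete. It is deliberately simplified and is an assumption-laden illustration: real Hashcash uses SHA-1 and a specific stamp format, while this sketch uses SHA-256 and a made-up `resource:counter` layout, and the function names are hypothetical.)

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource: str, bits: int) -> str:
    """Search for a stamp whose hash has at least `bits` leading zero bits.

    On average this takes about 2**bits hash attempts.
    """
    for counter in count():
        stamp = f"{resource}:{counter}"
        digest = hashlib.sha256(stamp.encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return stamp


def verify(stamp: str, resource: str, bits: int) -> bool:
    """Check a stamp with a single hash; the receiver's side is cheap."""
    if not stamp.startswith(resource + ":"):
        return False
    digest = hashlib.sha256(stamp.encode()).digest()
    return leading_zero_bits(digest) >= bits


if __name__ == "__main__":
    stamp = mint("user@example.org", 12)
    print(stamp, verify(stamp, "user@example.org", 12))
```

Each extra bit of difficulty roughly doubles the sender's expected work while the receiver still does one hash. If a stamp is needed per recipient, the sender's total cost scales with the number of recipients, which is the mailing list problem, and anyone with rented or stolen compute can simply buy their way past it.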

I disagree, for more than two reasons. As e-mail isn't time-sensitive, it would be feasible to have proof-of-work requirements that last into the tens of minutes or hours. This basically kills the drive-by spam. In a system without an authority, the idea of legitimacy loses most of its meaning. Regardless, Hashcash could be used as a way to get a sender address into a whitelist, and no address in a whitelist would need Hashcash to be accepted. Hashcash could be used entirely as a signal for automatic categorization like this. Free Software mailing lists, and similar such lists, could be collected into some whitelist for interested parties.

"Hashcash is a non-starter in anything like even a traditional, pre-spam Internet email environment, much less today's non-spam email environment."

Explain why. I'm not seeing it. The main opposition to Hashcash comes from the aforementioned companies, which want to send their spam to everyone.

Written on 30 August 2023.

Last modified: Wed Aug 30 21:08:03 2023