2024-08-04
The speed of updates for signatures of bad things matters (a lot)
These days (and for a long time), most spam, phish, malware, and so on (in email and other things) is recognized not through general rules, patterns, and processes, but by seeing if the content matches any known signatures. Sometimes this is literally matching cryptographic hashes, but more often there's some sort of signature matching engine involved with various matching operators, conditions for combining them, and so on. ClamAV is one example that's mostly a matching engine, which means that in practice you need a collection of signatures to make it useful. Since signatures aren't general things, each one has to be created by someone, and then you have to get that newly created (or perhaps updated) signature.
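To make the two styles of matching concrete, here is a minimal sketch in Python. The signature names, the sample content, and the function match_signatures are all invented for illustration; real engines such as ClamAV have far richer operators than 'all of these strings appear'.

```python
import hashlib

# Hypothetical signature collections; the names and patterns are made up.
HASH_SIGNATURES = {
    hashlib.sha256(b"known bad attachment bytes").hexdigest(): "Example.Hash.1",
}
PATTERN_SIGNATURES = {
    # Every listed string must appear for the signature to match.
    "Example.Pattern.1": [b"verify your account", b"http://"],
}

def match_signatures(content: bytes) -> list[str]:
    """Return the names of all signatures that match the content."""
    hits = []
    digest = hashlib.sha256(content).hexdigest()
    if digest in HASH_SIGNATURES:
        hits.append(HASH_SIGNATURES[digest])
    lowered = content.lower()
    for name, needles in PATTERN_SIGNATURES.items():
        if all(needle in lowered for needle in needles):
            hits.append(name)
    return hits
```

Either way, content that nobody has written a signature for yet produces no hits, which is why the collection of signatures matters as much as the engine.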
What people have collectively found is that in practice, the speed of updating signatures matters, often a lot; in fact it matters enough that people are willing to pay for faster updates to collections of signatures. Why it matters is pretty straightforward; you're in a race against attackers. Attackers are perfectly well aware that the effectiveness of what they're doing goes down fast once signatures are available for it (or in general once people have had time to recognize what's going on, get their web landing page killed off, or whatever), so they generally try to get things done as fast as possible.
(I'm sure there are some slow-moving spam, phish, and malware campaigns that keep on going and going, but I don't think they're very common.)
However, attackers have their own speed limits; they can only send so much so fast, to you and to everyone else. Against many attackers, this gives you the chance to cut off at least some of their activities if 'you' can react fast enough, which broadly means if you can get signature updates fast enough. In more sophisticated environments, fast signature updates may also give you the chance to re-scan people's recently received email messages before people open them (or when they open them).
(Similar things apply to scanning files or recognizing signs of active malware, especially since these may already be delayed from the initial attack depending on how the attacker got to people. If you're getting people to download malware from a web page by sending them a bait message, you have to wait for people to read their email.)
So in general, the faster you get signature updates, the less you'll be exposed to (and for a shorter amount of time). The slower the updates, the more you're exposed to and the longer you're exposed. In the extreme case, sufficiently delayed updates are mostly useless, because the attacker campaign they're reacting to is over by the time you get the updates active.
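As a toy illustration of this tradeoff, the sketch below assumes an attacker who sends at a constant rate for the whole campaign, which is a simplification and not a measurement of any real campaign. The function name and its parameters are made up for the example.

```python
def exposed_fraction(campaign_hours: float, signature_delay_hours: float) -> float:
    """Fraction of a campaign's messages that arrive before the signature is active.

    Toy assumption: messages arrive at a constant rate for campaign_hours, and
    the signature becomes active signature_delay_hours after the campaign starts.
    """
    if campaign_hours <= 0:
        raise ValueError("campaign_hours must be positive")
    delay = max(signature_delay_hours, 0.0)
    return min(delay / campaign_hours, 1.0)
```

Under this model, an eight hour campaign with a two hour signature delay exposes you to a quarter of it, while a twelve hour delay exposes you to all of it (the 'mostly useless' extreme case).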
(Of course you can try to delay receiving things (and thus checking them), but this tends to be unpopular with people. Like it or not, modern email is expected to get through rapidly and as a result is used for time sensitive things.)
We've seen this ourselves when we changed from a commercial anti-spam system for our email to one mostly based on free software and free signature data sources for the anti-malware, anti-virus (and anti-phish) part. Even with paying for some signature sources, the free system clearly was less effective at matching and blocking new malware, and we're fairly certain that part of this was that the commercial system's signatures updated quite frequently (and the company involved had a bunch of people working on keeping them up to date).
(I think this is something that's well known to people in the communities that use signatures, like anti-spam and (anti-)malware, but is perhaps not so obvious to people outside those communities.)
2024-07-09
Some (big) mail senders do use TLS SNI for SMTP even without DANE
TLS SNI (Server Name Indication) is a modern TLS feature where clients that are establishing a TLS session with a server tell it what name they are connecting to, so the server can give them the right TLS server certificate. TLS SNI is essential for the modern web's widespread HTTPS hosting, and so every usable HTTPS-capable web client uses SNI. However, other protocols also use TLS, and whether or not the software involved uses SNI is much more variable.
DANE is a way to bind TLS certificates to domain names through DNS and DNSSEC. In particular it can be used to authenticate the SMTP connections used to deliver email (RFC 7672). When you use DANE with TLS over SMTP, using SNI is required and is also straightforward, because DNSSEC and DANE have told you (the software trying to deliver email over SMTP) what server name to use.
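As a rough sketch of why the name is straightforward with DANE: the TLSA records for an SMTP server are looked up under a name built from the port, the protocol, and the MX host name, and that same host name is what goes into SNI. The helper names below are invented, and this ignores details that RFC 7672 covers, such as CNAME handling.

```python
def tlsa_query_name(mx_host: str, port: int = 25) -> str:
    """Build the DNS name where TLSA records for an SMTP server would be found."""
    return f"_{port}._tcp.{mx_host.rstrip('.').lower()}."

def dane_sni_name(mx_host: str) -> str:
    """The server name to send in SNI: the same host name, without the trailing dot."""
    return mx_host.rstrip(".").lower()
```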
Recently, SNI came up on the Exim mailing list, where I learned that when it's sending email, Exim doesn't normally use SNI when establishing TLS over SMTP (unless it's using DANE). According to Exim developers on the mailing list, the reasons for this include not being sure of what TLS SNI name to use and uncertainties over whether SMTP servers would malfunction if given SNI information. This caused me to go look at our (Exim) logs for our incoming mail gateway, where I noticed that although we don't use DANE and don't have DNSSEC, a number of organizations sending email to us were using SNI when they established their TLS sessions (helpfully, Exim logs this information). In fact, the SNI information logged is more interesting than I expected.
We have a straightforward inbound mail situation; our domains have a single DNS MX record to a specific host name that has a direct DNS A record (IP address). Despite that, a small number of senders supplied wild SNI names of 'dummy' (which look like mostly spammers), a RFC 1918 IP address (a sendnode.com host), and the IP address of the inbound mail gateway (from barracuda.com). However, most sending mailers that used SNI at all provided our inbound mail gateway's host name as the SNI name.
Using yesterday's logs because it's easy, roughly 40% of the accepted messages were sent using SNI; a better number is that about 46% of the messages that used TLS at all were using SNI (roughly 84% of the accepted incoming messages used TLS). One reason the percentage of SNI is so high is that a lot of the SNI sources are large, well known organizations (often ones with a lot invested in email), including amazonses.com, outlook.com, google.com, quora.com, uber.com, mimecast.com, statuspage.io, sendgrid.net, and mailgun.net.
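The two sets of percentages can be cross-checked with a little arithmetic. This is only a sanity check on the rounded figures quoted above; the function name is made up.

```python
def sni_share_of_all(tls_share: float, sni_share_of_tls: float) -> float:
    """Share of all accepted messages that used SNI, given the TLS breakdown."""
    return tls_share * sni_share_of_tls

share = sni_share_of_all(0.84, 0.46)
print(f"{share:.1%}")  # a bit under 39%, in line with 'roughly 40%'
```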
Given this list of organizations that are willing to use SNI when talking to what is effectively a random server on the Internet with nothing particularly special about its DNS setup, my assumption is that today, sending SNI when you set up TLS over SMTP doesn't hurt delivery very much. At the same time, the fact that some people's software sends bogus values suggests that fumbling the SNI name doesn't do too much harm, which is often unlike the situation with HTTPS.
PS: I suspect that the software setting 'dummy' as the SNI name isn't actually mail software, but is instead some dedicated spam sending software that's using a TLS library that has a default SNI name set and is (of course) not overriding the name, much as some web spider software doesn't specifically set the HTTP User-Agent and so inherits whatever vague User-Agent their HTTP library defaults to.
2024-05-31
Spammers do forge various noreply@<you> sender addresses
It is probably not news to anyone reading this that some of the time, spammers sending you email will forge the email as being from various addresses at your domain, for either or both of the SMTP 'MAIL FROM' envelope sender address and the From: header address. Spammers have been doing this to us for years. What I hadn't realized until now, when I looked at the actual addresses being forged, was that spammers were forging many variations on 'noreply@<us>', differing in wording, punctuation, and case. Over the past ten days we've seen all of 'noreply@', 'Noreply@', 'Nonreply@', 'no_reply@', 'NOREPLY@', 'no-reply@', and 'NO-REPLY@'.
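If you wanted to pick these variations out of logs, one pattern covers all seven. This is a sketch; the regular expression and the function name are assumptions made for illustration, and the pattern is only as broad as the variants listed above.

```python
import re

# Matches 'no' or 'non', an optional '-' or '_', then 'reply', in any case.
NOREPLY_RE = re.compile(r"non?[-_]?reply", re.IGNORECASE)

def is_noreply_like(address: str) -> bool:
    """True if the local part of the address looks like a noreply variant."""
    local_part = address.rsplit("@", 1)[0]
    return NOREPLY_RE.fullmatch(local_part) is not None
```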
Of course, spammers also forge various plausible administrative addresses, such as 'Administrator@', 'Admin@', 'cpanel@', 'support@' (and 'Support@'), and one case of 'hr@', as well as the expected 'postmaster@'. These are almost all addresses that don't exist here and never have, so I'm pretty confident that spammers are just making them up instead of drawing them from a list of (past) legitimate email addresses of people here. I suspect that some or perhaps many of these forged addresses are being used on phish spams, and this is probably the case for the various 'noreply@' addresses.
(Spammers clearly use old email address lists to generate their envelope sender addresses, because we reject a lot of SMTP 'MAIL FROM' addresses that used to be real email addresses here but which have since been removed (we do eventually close some accounts). Interestingly, there is also a relatively frequently forged sender address that is a single-letter typo for a real person's email address.)
One of the lessons I draw from this little exercise in curiosity is that if we've created administrative-like email addresses in our system simply to reserve them, and we aren't using them, we should actively block their use as external sender addresses. If we want to create a dummy 'cpanel@' address, for example, we should definitely make it so that it's not accepted as a SMTP envelope sender.
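The policy itself is simple to state. Here is a minimal sketch of the check, with a made-up local domain and a made-up list of reserved addresses; how you would actually express this depends on your mailer's configuration language.

```python
# Hypothetical values for illustration only.
LOCAL_DOMAIN = "example.org"
RESERVED_LOCAL_PARTS = {"cpanel", "admin", "administrator"}

def accept_external_sender(mail_from: str) -> bool:
    """Decide whether an externally supplied envelope sender is acceptable.

    Reserved-but-unused local addresses are refused as senders; everything
    else (including the null sender) is left to other checks.
    """
    local_part, _, domain = mail_from.rpartition("@")
    if domain.lower() != LOCAL_DOMAIN:
        return True
    return local_part.lower() not in RESERVED_LOCAL_PARTS
```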
(Because of some features of our mail environment, people here can create valid email addresses without our involvement (this has various entirely legitimate uses, including expendable personal email addresses). Historically this has meant that we grabbed a number of addresses simply as precautions to reserve them, without ever intending them to be 'legitimate'.)
PS: We do have a local noreply-like address, for internal use. However, spammers don't seem to forge it on their messages, perhaps because it basically never appears on email we send to actual people and thus has never made it onto various spammer lists of email addresses here.
(All of the email that we send to people has real sender and reply addresses that are read by us, even if the mail is sent by automated systems.)
2023-12-23
A DKIM signature on email by itself means very little
In yesterday's entry on what I think the SMTP Smuggling attack enables, I casually said that you were safe if you ignored SPF results and only paid attention to DKIM. As sometimes happens, this was my thoughts eliding some important qualifications that I just take as given when talking about DKIM, but that I should spell out. The most important qualification is that a (valid) DKIM signature by itself means almost nothing, which is a bit unlike how SPF works.
First off, anyone can DKIM sign a message, provided that they control a bit of DNS (you could probably even do it in a mail client). Quite a lot of people, including spammers, can even DKIM sign email that is 'aligned' with the 'From:' header, which means that the DKIM signature is from the From: domain, not just from some random domain. A valid DKIM signature does provide definite attribution, and if it's for the From: domain, it more or less identifies who authorized the mail. Also, in practice lack of a DKIM signature is itself a signal, because an increasing number of places more or less require a DKIM signature, sometimes one that is from the From: domain.
(However, some people only have SPF records and this can be deliberately used to create email that can't be easily forwarded.)
A valid DKIM signature for the From: domain is at least as strong a sign as an SPF pass result. However, this doesn't mean that the email is any good, any more than an SPF pass does; spammers can and do pass both checks. Similarly, lack of a valid DKIM signature for the From: domain doesn't mean that it's not from that domain. To have some idea of that you need to check the domain's DMARC policy. In effect, the equivalent of SPF is the combination of DKIM and DMARC (or something like it).
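The way DKIM results and DMARC policy combine can be sketched as a small decision function. The verdict names are invented, and this uses only strict alignment (an exact domain match); DMARC's default relaxed alignment compares organizational domains, which needs the Public Suffix List and is left out here. It also deliberately ignores SPF.

```python
from typing import Iterable, Optional

def dkim_verdict(from_domain: str,
                 valid_dkim_domains: Iterable[str],
                 dmarc_policy: Optional[str]) -> str:
    """Classify a message using DKIM results plus the From: domain's DMARC policy.

    valid_dkim_domains: the d= domains of signatures that verified.
    dmarc_policy: 'none', 'quarantine', 'reject', or None if there is no record.
    """
    from_domain = from_domain.lower().rstrip(".")
    signers = {d.lower().rstrip(".") for d in valid_dkim_domains}
    if from_domain in signers:
        # Attribution only; this says nothing about the mail being wanted.
        return "attributed"
    if dmarc_policy in ("quarantine", "reject"):
        # The domain says its mail should be authenticated, and this isn't.
        return "suspicious"
    # No aligned signature and no policy telling us what that means.
    return "unknown"
```

Note that a valid signature from some unrelated domain lands in the same place as no signature at all, which is the sense in which a DKIM signature by itself means very little.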
So when I casually wrote about (only) paying attention to DKIM, I was implicitly thinking of using DKIM along with something else to tell you when DKIM results matter. This might be specific knowledge of which important domains you deal with DKIM sign their email (including your own domain), or it might mean checking DMARC, or both. And of course you can ignore both SPF and DKIM signatures, apart perhaps from logging DKIM results.
(We don't explicitly use DKIM signatures and DMARC in our Exim configuration, but these days we use rspamd for spam scoring and I think it makes some use of DKIM and perhaps DMARC.)
2023-12-22
What I think the 'SMTP Smuggling' attack enables
The very brief summary of SEC Consult's "SMTP Smuggling" attack is that under the right circumstances, it allows you (the attacker) to cause one mail server to 'submit' an email with contents and SMTP envelope information that you provide to a second mail server. To the second email server, this smuggled email will appear to have come from the first mail server (because it did), and can inherit some of the authentication the first mail server has.
(It's important to understand that the actual vulnerability is in the second mail server, not the first one; the first one can and often must be completely RFC compliant in its behavior.)
The obvious authentication that the smuggled email inherits is SPF, because that's based on the combination of the sending IP (the first mail server) and the SMTP envelope sender (and possibly message From:), which is under your control. So you can put in a SMTP envelope sender (and a From:) that claims to be 'from' the first mail server, and the second mail server will accept it as authentic.
(An almost as obvious thing is that the smuggled email gets to share in whatever good reputation the sending email server has with the receiver. This is most useful if you can get a big, high reputation mail system to be the first server, which is possible (or perhaps 'was' by the time you're reading this).)
If you forge email as being from something that has a DMARC policy that passes the policy if SPF passes, you can also get your forged email to pass DMARC checks. The same is true if the second email server happens to be something that imposes its own implicit DMARC-like policy that accepts email if SPF passes (and possibly requires that SPF is 'aligned' with the From: message address).
What you can't fully do is inherit DKIM authentication. You can add your own valid DKIM headers to your smuggled email, but you can only do this for domains with DNS under your control (or domains where you've managed to obtain the DKIM signing keys). This probably doesn't include the first email server and its domain, and because the first email server doesn't recognize your smuggled email as an actual email message, it won't DKIM sign the email for you. The only way you can get the domain of the first email server to DKIM sign your second email for you is if the second email server is also an internal one belonging to the same domain and it will DKIM sign outgoing messages. This general configuration is reasonably common (incoming and outgoing email servers are often different), but usually they run the same mail software and so they won't have the different interpretations of the email message(s) that SMTP Smuggling needs.
The result of this is that if the second (receiving) email server doesn't check SPF results and only pays attention to DKIM (which is increasingly mandatory in practice), it's almost completely safe from SMTP Smuggling even if it accepts things other than 'CR LF . CR LF' as the email message terminator. Since SPF also breaks things, this is what I feel you should already be doing.
2023-12-20
The (historical) background of 'SMTP Smuggling'
The recent email news is SEC Consult's SMTP Smuggling - Spoofing E-Mails Worldwide, which I had a reaction to. I found the article's explanation of SMTP Smuggling a little hard to follow, so for reasons that don't fit within the scope of today's entry, I'm going to re-explain the central issue in my own way.
SMTP is a very old Internet protocol, and like a variety of old Internet protocols it has what is now an odd and unusual core model. Without extensions, everything in SMTP is line based, with the sender and receiver exchanging a series of 7-bit ASCII lines for commands, command responses, and the actual email messages (which are sent as a block of text in the 'DATA' phase, ie after the sender has sent a 'DATA' SMTP command and the receiver has accepted it). Since SMTP is line based, email messages are also considered to be a series of lines, although the contents of those lines is (mostly) not interpreted. SMTP needs to signal the end of the email text being transmitted, and as a line based protocol it does this by a special marker line; a '.' on a line by itself marks the end of the message.
(In theory there's a defined quoting and de-quoting process if an actual line of the message starts with a '.'; see RFC 821 section 4.5.2, which is still there basically intact in RFC 5321 section 4.5.2. In practice, actual mailer behavior has historically varied.)
When you have a line based protocol you must decide how the end of lines are marked (the line terminator). In SMTP, the official line terminator is the two byte (two octet) sequence 'CR LF', because this was the fashion at the time. This includes the lines that are part of the email message that is sent in the DATA phase, and so the last five octets sent at the end of a standard compliant SMTP message are 'CR LF . CR LF'. The first 'CR LF' is the end of the last line of the actual message, and then '. CR LF' makes up the '.' on a line by itself.
(This means that all lines of the message itself are supposed to be terminated with 'CR LF', regardless of whatever the native line terminator is for the systems involved. If you're doing SMTP properly, you can't just blast out or read in the raw bytes of the message, even apart from RFC 5321 section 4.5.2 concerns. There are various ESMTP extensions that can change this.)
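Here is a minimal sketch of what a sender starting from Unix-style lines has to do to produce the wire format, including the leading '.' quoting from RFC 5321 section 4.5.2. The function name is made up, and this sketch deliberately leaves any bare CR in the message alone, which is one of the awkward cases a real mailer has to decide about.

```python
def encode_smtp_data(message: bytes) -> bytes:
    """Encode a message (LF or CR LF line endings) for the SMTP DATA phase."""
    lines = message.replace(b"\r\n", b"\n").split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a trailing newline
    out = []
    for line in lines:
        if line.startswith(b"."):
            line = b"." + line  # dot-stuffing: double a leading '.'
        out.append(line + b"\r\n")
    # The '.' on a line by itself that marks the end of the message.
    return b"".join(out) + b".\r\n"
```

For any non-empty message the result ends with the five octets 'CR LF . CR LF'.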
Unfortunately, SMTP's definition makes life quite inconvenient for systems that don't use CR LF as their native line ending, such as Unix (which uses just LF, \n). Because SMTP considers the email message itself to be a sequence of lines (and there's a line length limit), a Unix SMTP mailer has to keep translating all of the lines in every email message it sends or receives back and forth between lines ending in \n (the native format) and \r\n (the SMTP wire format). Doing this translation raises various questions about what you should send if you encounter a \r (or a \r\n) in a message as you send it, or encounter a bare \n (or \r) in a message as you receive it. It also invites shortcuts, such as turning \r\n into \n as you read data and then dealing with everything as Unix lines.
Partly for this reason and partly because CR LF line endings make various people grumpy, there has been somewhat of a tradition of mailers accepting other things as line endings in SMTP, not just CR LF. Historically a variety of Unix mailers accepted just LF, and I believe that some mailers have accepted just CR. Even today, finding SMTP listeners that absolutely require 'CR LF' as the line ending on SMTP commands isn't entirely common (GMail's SMTP listener doesn't, for example, although possibly this will cause it to be unhappy with your email, and I haven't tested its behavior for message bodies). As a result, such mailers can accept things other than 'CR LF . CR LF' as the SMTP DATA phase message terminator. Exactly what a mailer accepts can vary depending on how it implemented things.
(For instance, a mailer might turn '\r\n' into '\n' and accept '\n' as a line terminator, but only after checking for a line that was an explicit '. CR LF'. Then you could end messages with 'LF . CR LF', without the initial 'CR'; the bare LF would be taken as the line terminator for the last data line, then you have the '. CR LF' of the official terminator sequence. But if you sent 'LF . LF', that wouldn't be recognized as the message terminator.)
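The hypothetical mailer in that example can be sketched in a few lines. This is an illustration of that one imagined implementation, not the behavior of any particular real mailer, and the function name is made up.

```python
from typing import Optional

def find_end_of_data(stream: bytes) -> Optional[int]:
    """Return the offset just past the terminator line, or None if there isn't one."""
    # Any LF ends a line, but only a raw line that is exactly
    # '.' CR LF counts as the end of the message.
    pos = 0
    while True:
        newline = stream.find(b"\n", pos)
        if newline == -1:
            return None
        raw_line = stream[pos:newline + 1]
        if raw_line == b".\r\n":
            return newline + 1
        pos = newline + 1
```

With this logic both 'CR LF . CR LF' and 'LF . CR LF' end the message, but 'LF . LF' does not, because the final line is '.' LF rather than '.' CR LF.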
This leads to the core of SMTP Smuggling, which is embedding an improper SMTP message termination in an email message (for example, 'LF . LF'), then after it adding SMTP commands and message data to submit another message (the smuggled message). To make this do anything useful we need to find a SMTP server that will accept our message with the embedded improper terminator, then send the whole thing to another mail server that will treat the improper terminator as a real terminator, splitting what was one message into two, sent one after the other. The second mail server will see the additional mail message as coming from the first mail server, although it really came from us, and this may allow us to forge message data that we couldn't otherwise.
(There are various requirements to make this work; for example, the second mail server has to accept being handed a whole block of SMTP commands all at once. These days this is a fairly common thing due to an ESMTP extension for 'pipelining', and also because SMTP receivers have to do extra work to detect and reject getting handed a block of stuff like this. See the original article for the gory details and an extended discussion.)
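The disagreement between the two servers can be shown with a small sketch. Everything here is illustrative: the addresses are placeholders, the payload is invented, and the 'lenient' receiver is a hypothetical one that also accepts 'LF . LF' as a terminator.

```python
def split_at_terminator(data: bytes, terminators: tuple[bytes, ...]) -> tuple[bytes, bytes]:
    """Split a DATA stream at the earliest terminator this receiver accepts.

    Returns (message, leftover); leftover is what the receiver would go on to
    read as further SMTP commands.
    """
    best = None
    for term in terminators:
        idx = data.find(term)
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, len(term))
    if best is None:
        raise ValueError("no end-of-data marker found")
    idx, length = best
    return data[:idx], data[idx + length:]

STRICT = (b"\r\n.\r\n",)
LENIENT = (b"\r\n.\r\n", b"\n.\n")  # hypothetical receiver that also takes LF . LF

payload = (
    b"Subject: innocent\r\n\r\nhello\n.\n"
    b"MAIL FROM:<forged@example.org>\r\n"
    b"RCPT TO:<victim@example.net>\r\n"
    b"DATA\r\n"
    b"Subject: smuggled\r\n\r\nsurprise\r\n.\r\n"
)

first_view = split_at_terminator(payload, STRICT)    # one message, nothing left over
second_view = split_at_terminator(payload, LENIENT)  # message ends early
```

The strict server sees one (odd looking) message, while the lenient one stops after 'hello' and is left holding what look like SMTP commands for a second message.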
What you can do with SMTP Smuggling in practice has some limitations and qualifications, but that's for another entry.
2023-11-04
The various meanings of DKIM signing message headers
When I talked about the issue of what headers to include in email DKIM signatures, I didn't really cover the specifics of how you DKIM sign email headers and what the various options mean. The specifics can matter, especially since they help you (me) understand and navigate through the options that mailers (such as Exim) offer here.
In email messages, DKIM signatures appear in a DKIM-Signature header, which lists a bunch of parameters:
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed; d=list.zfsonlinux.org; h=from:to:subject:message-id:in-reply-to:references:date [....]
The 'h=' list (which isn't complete here) is a list of headers that have been signed. More specifically, it's a list of instances of headers. If there are multiple instances of a given header in a message, DKIM defines an order to them (from the bottom of the header block upward, per RFC 6376 section 5.4.2) and the instances of the header are checked (or used) in that order. So if you include 'from' once in the DKIM header list, you are saying that your DKIM signature includes DKIM's first 'From:' header in the message. If a second 'From:' header is added to the message, it's not included in what's covered by your DKIM signature; it can have any value and the message will still pass DKIM validation.
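A sketch of how the 'h=' list maps onto header instances may help. The function name is made up, and this only does the selection step; it does no canonicalization or cryptography. It follows RFC 6376 section 5.4.2 in consuming instances from the bottom of the header block upward.

```python
def select_signed_instances(headers, h_tag):
    """Pick the header instances covered by an h= list.

    headers: list of (name, value) pairs in message order, top to bottom.
    h_tag:   the h= value, e.g. 'from:to:subject'.
    Returns (name, value) pairs in h= order; value is None when no instance
    of that header is left (it is then treated as absent).
    """
    remaining = {}
    for name, value in headers:
        remaining.setdefault(name.lower(), []).append(value)
    selected = []
    for name in (n.strip().lower() for n in h_tag.split(":")):
        values = remaining.get(name, [])
        # pop() takes the bottom-most instance that hasn't been used yet.
        selected.append((name, values.pop() if values else None))
    return selected
```

With 'from' listed once, a second From: header that has been added above the original one is simply never selected, so it isn't covered by the signature.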
As mentioned last time, including a header that doesn't exist in the DKIM signature signs its absence; if that header is then added to the message, the DKIM signature will become invalid. DKIM signing things that aren't there is sometimes called oversigning a header; you're not just signing what's present, you're also signing what's not. As a corollary of this, if you want to seal a message against having extra copies of some headers added, you can deliberately oversign existing headers. This is done by including their names an extra time in the h= list; the first time signs the existing header, and the second time signs that there's no second header. So if we wanted to make sure no one added a second 'From:' to a message, we'd sign 'h=from:from:[....]'.
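On the signing side, building an 'h=' list with oversigning might look like the following sketch. The function and parameter names are invented, and real signers have their own rules about which headers to include; this just lists each existing instance of the headers you ask for, plus one extra entry for each header you want to seal.

```python
def build_h_tag(message_header_names, to_sign, oversign=()):
    """Build an h= value that signs existing headers and oversigns some of them.

    message_header_names: names of the headers present in the message.
    to_sign:  header names to sign when present.
    oversign: header names that get one extra entry, sealing them against
              additional copies (or against being added at all if absent).
    """
    present = [name.lower() for name in message_header_names]
    oversigned = {name.lower() for name in oversign}
    names = []
    for name in to_sign:
        name = name.lower()
        names.extend([name] * present.count(name))  # one entry per existing instance
        if name in oversigned:
            names.append(name)  # the extra entry signs 'no further instance'
    return ":".join(names)
```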
One reason to oversign existing headers that should only appear once is that anyone who adds a second 'From:', 'Date:' or whatever to your message is probably up to no good. Another reason is that it's hard to predict which instance of the header a mail client will show to people reading the message, and there are probably some mail clients that will show the wrong instance of the header (the instance that isn't covered by your DKIM signature and so can be set to anything by an attacker).
This creates several options and decisions:
- do you make it so that certain headers can't be added to the message later, like the List-* and Resent-* families, or allow them to be added later?
- what headers do you sign if they're present? For example, should you sign Resent-* or List-* headers at all?
- do you oversign some existing headers so that no additional copies can be added?
Based on a quick skim of email that I have handy, relatively few sources of mail seem to be oversigning existing headers. However, GMail does oversign at least some email for core headers like From: and Subject:. Since Google is one of the eight hundred pound gorillas of email, if they're doing it people's DKIM signature validation is at least prepared to cope with this.
(I suspect that having two From:, Subject:, or so on headers trips enough spam detection systems that attackers don't normally do it.)
2023-10-26
The issue of what headers to include in your DKIM signatures
Increasingly, you have to sign your outgoing email messages with DKIM. When you use DKIM to sign things, in one sense you're signing an abstract 'email message', and in another, more concrete sense, you're signing the email body plus some of the email message headers. You might innocently think that the message headers to sign are standardized and obvious, but I've recently learned that neither is the case due to a recent discussion on the Exim mailing list. Different mail systems may sign different sets of headers in ways that are more or less aggressive, and some of these ways have downstream effects.
(This is especially relevant to Exim, where the default configuration of what headers to sign is perhaps somewhat aggressive.)
A basic part of DKIM signing is that if a message doesn't have a particular header and you include it in the DKIM signature headers anyway, what you're doing is signing that there is no such header in the email; basically, the header is interpreted as having a null value. If someone adds the header later, it will have a non-null value and so fail the DKIM signature check. Signing nonexistent headers is important if you think that adding them would change the meaning of the message as people perceive it (or as they see it).
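The effect can be shown with a toy stand-in for DKIM's header hashing. This is not real DKIM (there is no canonicalization and no signature, and the function name is made up); it only shows that a listed-but-absent header contributes nothing to the hash, so adding it later changes the result.

```python
import hashlib

def header_fingerprint(headers, h_names):
    """Hash the header instances named in h_names, DKIM-style but simplified.

    headers: list of (name, value) pairs in message order, top to bottom.
    A listed header with no remaining instance contributes nothing.
    """
    pools = {}
    for name, value in headers:
        pools.setdefault(name.lower(), []).append(value)
    digest = hashlib.sha256()
    for name in h_names:
        values = pools.get(name.lower(), [])
        if values:
            digest.update(f"{name.lower()}:{values.pop()}\r\n".encode())
    return digest.hexdigest()

base = [("From", "a@example.org"), ("Subject", "x")]
with_list = base + [("List-Id", "<l.example.org>")]
```

If 'list-id' is in the signed list, the fingerprints of base and with_list differ (the check fails); if it isn't, they are the same and the added header goes unnoticed.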
As far as what headers to include goes, RFC 6376 provides relatively little guidance in section 5.4 and then a big and somewhat questionable list in section 5.4.1. Some headers are in practice part of the meaning of the message as people reading it will perceive things; in this category I'd include From: (which is required anyway), Subject: and Date:, and probably To:, cc:, and Reply-To:, and in practice I'd roll in In-Reply-To and References and some others. Some headers will change the interpretation of the message body if modified so must be protected by the DKIM signature; this includes all MIME related headers.
But then you have headers that may or may not change what you see as the meaning of the message if they're added to it after your signature. In this category are both the Resent-* family of headers for resent messages and especially the List-* family of mailing list headers. In some environments, whether a message was sent directly to people or came through a (visible) mailing list matters, as does what mailing list; in those environments you probably want to include the List-* headers in your DKIM signatures. But in other environments, this is not critical and in fact your people may be sending messages to outside mailing lists and want this to not break the DKIM signatures of their messages so the post-mailing-list version of their email is still accepted by, for example, GMail.
(You can have a similar discussion about Resent-*. Maybe these headers should never be signed, maybe they should be signed only if they're present, and maybe they should always be signed so that if someone visibly resends a signed message, it no longer passes DKIM verification.)
Now that I'm aware of this issue, we're probably going to change away from the Exim default (which signs all of the section 5.4.1 headers, plus the MIME headers) to something where we definitely don't sign the List-* headers and probably don't sign the Resent-* headers.
PS: One of the reasons to not sign Resent-* and List-* headers is that in both cases, you can do resending and mailing lists without changing the headers at all. Breaking DKIM signatures if people actually do add headers thus only encourages them to not add the headers; since adding the headers is useful and nice, we shouldn't discourage people from doing so.
2023-09-05
Having ClamAV reject email using the Malwarepatrol database seems unwise
In practice, ClamAV is both a virus and malware recognition engine and a collection of malware signatures. ClamAV only comes with a limited set of signatures, so supplementing it with additional third party sources is popular (and perhaps almost essential). Often people use update tools and scripts to configure and fetch these additional signatures, such as Fangfrisch. One of the popular providers of third party signatures is Malware Patrol, who have a number of tiers of access, including a (free) tier for educational institutions. Since we are an educational institution, we signed up for this tier and added it to the configuration of the third party update script we were using at the time so that it would be part of our email anti-spam filtering (when we switched over to ClamAV from our prior solution). Well, we thought we'd added it; in fact we'd made a configuration mistake such that we were silently failing to fetch the Malware Patrol database. We only noticed and fixed this mistake when we switched to Fangfrisch for our third party updates.
Soon afterward, our logs started reporting rather a lot of Malware Patrol hits and some people here started complaining that email to them was being rejected. Investigation showed that the rejections were from Malware Patrol signatures and the ones we could decode had what I would call alarmingly broad text matches that they were looking for (Malware Patrol uses ClamAV's body-based signature content format, generally with just a string it's looking for).
(One reason we couldn't decode what some Malware Patrol signatures were matching was that the Malware Patrol data is updated frequently, with signatures regularly being removed.)
Malware Patrol is fairly open and unapologetic about these broad matches in an article called Whitelisting for Block Lists. They specifically say:
Malware Patrol’s #1 goal is to protect customers from malware and ransomware infections. These days, this can mean blocking mainstream domains. Consequently, our customers report potential false positives for sites like docs(.)google(.)com, drive(.)google(.)com, dropbox(.)com and github(.)com. Systems like Google Docs serve files from their root directories. This forces some block list formats to then block the entire domain, frustrating users.
[...]
Although Malware Patrol doesn't say this explicitly, it appears that the ClamAV database format is one such format that sometimes forces them to block entire domains like 'drive.google.com' (we observed this in one signature). They suggest filtering their database before using it, but this has a number of problems; the ClamAV format hex-encodes the ASCII bytes, for example, and on a larger scale it would mean we'd only be excluding things after people here had run into problems and reported them to us.
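Decoding such a signature to see what it matches is not hard, but it is an extra step. The sketch below assumes lines in ClamAV's body-based layout of 'Name:TargetType:Offset:HexSignature' and only handles signatures that are plain hex with no wildcards; the function name and the sample lines in the test are made up.

```python
import string

def decode_body_signature(line: str):
    """Return (name, decoded_text) for a plain hex body signature line.

    decoded_text is None when the signature contains wildcards or other
    operators that this sketch doesn't try to interpret.
    """
    fields = line.strip().split(":")
    if len(fields) < 4:
        raise ValueError("expected at least Name:TargetType:Offset:HexSignature")
    name, hexsig = fields[0], fields[3]
    if len(hexsig) % 2 or any(c not in string.hexdigits for c in hexsig):
        return name, None
    return name, bytes.fromhex(hexsig).decode("ascii", errors="replace")
```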
I don't fault Malware Patrol for their choice. The balance between false positives and false negatives is not one with a clear single answer, and Malware Patrol seems to have come down on the side of not having false negatives, even at the cost of false positives. But it does mean that Malware Patrol's objectives and ours aren't in alignment, as we care more about avoiding (too many) false positives than we do about avoiding every last false negative.
Our resolution to this was to take Malware Patrol out of our third party ClamAV data sources. I'm sure there are situations where using their database as part of ClamAV screening makes sense, but my view is that if you're rejecting email based on ClamAV signature matches, you likely can't use Malware Patrol's data. It's too dangerous unless you have a quite high tolerance for false positives. Even in a system where a Malware Patrol signature match only contributed to a message's spam score, I think you could only really add a modest increase in the odds of the message being spam.
(As far as I know, ClamAV stops looking once it's found a signature and the order it checks signature databases isn't documented. This means there's no way to tell it to check signature databases you trust more before Malware Patrol.)
PS: I don't know how common it is to use ClamAV signature matches to reject email, but it is, for example, an obvious way to configure Exim, especially since Exim's malware scanning documentation does this in its example.
2023-08-30
Email anti-spam (and really all anti-spam) is all heuristics now
On the Fediverse, I noted something:
This is my sad face when Spamhaus puts lists.ubuntu.com (185.125.189.65) in the SBL CSS. Something went wrong here. Well, several things, starting with Cantor & Siegel.
Back in the day, one of the things some people said about DNS blocklists in general and sometimes Spamhaus in particular was that they were opaque, capricious, and didn't actually validate what they were putting in their blocklists, so who knows what could wind up in there for who knows what reason. Those people would take this incident as a validation of their view.
(I was going to say that this was a long standing IP address used to send Ubuntu security announcements, but it looks like we only just started to get them from this IP, although the entire IP range is owned by Canonical.)
I have bad news for such people. This is what all email anti-spam systems are doing today. There are no effective anti-spam systems that are based only on sure positive signs of spam. Everything is an opaque black box full of heuristics and uncertainty, with hopefully occasional misfires that are hopefully not too spectacular. Sometimes people hand write rules and try to assess them, sometimes people take straightforward statistical approaches (eg, Bayesian scoring), and sometimes companies go for the complicated statistics that are generally known as 'Machine Learning' or these days 'AI' (in press releases, at least).
This is not an accident and it's not because people are lazy. It's because anti-spam isn't working against a blind natural phenomenon; instead, anti-spam is engaged in an iterated game against human driven spam. If there's a sure-fire signal of spam that can be used to reject or filter email, the humans driving spam are highly incentivized to get rid of it, and only the ones who are successful at that will survive.
This is simply one of the prices that spam exacts from us. We can no longer live in a world of certainty, where we can be confident that our anti-spam systems are right about things. And sometimes we'll see things that are so obvious (to us humans, on the spot, only having to look at this one incident) that they make us have sad faces.
(There's also the related issue that no one can afford to pay enough humans enough to constantly be evaluating and updating anti-spam rules and heuristics all of the time. All effective anti-spam systems have to operate partially automatically, and sometimes that will pass things that an alert human would not have.)