Our narf filter

Our filter is a by now highly tortured version of Jeremy Nixon's cleanfeed-inn 0.94.4 (and modified with some of the good stuff from 0.95) that he probably wouldn't recognize. Jeremy has written a very good introduction to cleanfeed that, among other things, discusses what it filters; you can read it here. We're mostly going to discuss where our filter differs from Jeremy's.

Our filter has been sufficiently adapted that it can no longer run as an INN perl filter. This is due both to its use of Perl's use statement to load an MD5 module, and to its looking at any header we thought useful instead of restricting itself to the set of headers that INN gives a filter.

What we've added rejections for

Although this is an attempt to describe what our filter does well enough so that people can decide whether they want to use it or not, the ultimate reference is the source code (it may help to search for XXX in the source code, as this often but not always annotates changes from the original cleanfeed-inn-0.94.4).

Our filter is very aggressive (you can see just how aggressive by reading our daily reports). It also contains a number of policy decisions about things we want and don't want, some of which you may disagree violently with.

Rejecting cancels

Unlike most filters we reject cancels when it appears safe to do so. Axiomatically it is safe to do so based on excessive crossposting or poison newsgroups, because we would have refused the original article too (proper spam cancels are crossposted to the same set of newsgroups as the original article). While this does prevent the cancels from propagating to our downstreams, we wouldn't have propagated the spam to them either; if they got fed the spam from a backup path, they can get the cancels that way too.
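The "safe to reject" reasoning above can be sketched as follows. This is a hypothetical illustration in Python, not the actual narf code; the threshold and the poison-group set are made-up placeholders (the real values live in the filter source).

```python
# Hypothetical sketch: a cancel is safe to reject when the article it
# targets would itself have been refused for excessive crossposting or
# for touching a poisoned newsgroup.

MAX_CROSSPOSTS = 9                    # assumed threshold, see the source
POISON_GROUPS = {"alt.test.poison"}   # illustrative placeholder set

def would_reject_article(newsgroups):
    """True if an article posted to these groups would be refused."""
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    if len(groups) > MAX_CROSSPOSTS:
        return True
    return any(g in POISON_GROUPS for g in groups)

def safe_to_reject_cancel(cancel_newsgroups):
    # Proper spam cancels are crossposted to the same set of newsgroups
    # as the original article, so judging the cancel's own Newsgroups
    # header is equivalent to judging the original's.
    return would_reject_article(cancel_newsgroups)
```

The key property is that the decision uses only the cancel's own Newsgroups header, so no lookup of the original article is needed.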

We also reject cancels that are from various known sources of forged or illicit cancels, or that match certain patterns generally used by third-party cancel forgers. Cleanfeed had a certain number of these, but we have expanded its set significantly.

Early rejection is good

Because narf defaults to dropping cancels for rejected articles when it seems safe to do so, our filter attempts to present as many safe rejections to narf as possible. The general rule this leads to is that it's better to reject spam by an explicit rule than by learning it as a new spam. This in turn leads us to somewhat of a no mercy attitude on Usenet spam and likely-seeming future Usenet spam.

The filter also attempts to reject as many articles as possible by only looking at the headers. Looking at the headers is cheap, but looking at the article bodies may be expensive since they are often quite large compared to the headers.
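The headers-first ordering can be illustrated with a small sketch. This is a minimal Python illustration under assumed rules, not the filter's real checks; the specific header and body tests here are invented stand-ins.

```python
# Sketch of headers-first filtering: cheap header checks run before the
# (possibly large) body is ever fetched or examined.

def check_headers(headers):
    """Cheap checks on the already-parsed headers; reject reason or None."""
    # Invented example rule: an article with no lines and no subject.
    if int(headers.get("Lines", 0)) == 0 and not headers.get("Subject"):
        return "empty article"
    return None

def check_body(body):
    """More expensive checks on the body; reject reason or None."""
    # Invented example rule: a NUL byte suggests a raw binary.
    return "binary article" if "\x00" in body else None

def filter_article(headers, get_body):
    # get_body is a callable so the body is only loaded when no header
    # rule fires; this is the cost saving described above.
    reason = check_headers(headers)
    if reason is not None:
        return reason
    return check_body(get_body())
```

Passing the body as a callable (or equivalent lazy mechanism) is what makes the early header rejections actually cheap.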

Recognizing new spam via MD5 body hashes

The normal cleanfeed filters recognize new spam using techniques that, while the best that can be done in a perl INN filter in reasonable time, have the potential for false positives. Our filter's primary means of recognizing new spams is to perform an MD5 hash on the article body and then count the SBI for things that hash identically (SBI is covered in the spam FAQ). Counting SBI is substantially more aggressive for crossposted articles than either the normal BI or simply counting each post as one (as the normal filters do).

There are two side effects of hashing only the body. First, articles with identical bodies but different subjects will trigger the detection. Second, articles with a small (or no) body are likely to be considered identical. We find the likely junking of empty articles to be aesthetically pleasing, but you may disagree.

Shunning open news servers

We refuse all articles (both normal and cancels) from open news servers that we find out about, because such sites are almost always spammer magnets. Indeed, we usually find out about new open news servers by examining our "top sources of rejected articles" reports.


Recruiter-Be-Gone

Recruiters are currently overrunning all jobs groups, producing enough volume both to make the groups unusable and to hurt your news server's performance through large directories. Rather than drop the groups entirely we have decided to reject articles from recruiters to the best of our ability, whether or not they would normally qualify as (cancellable) spam. There is no automated code to recognize new recruiters; instead we just read misc.jobs.offered and whack new ones into the filter by name.

Crossposts-Be-Gone

We aggressively reject articles crossposted into a number of newsgroups. The newsgroups tend to fall into four categories:

In addition to the usual cleanfeed rejection of excessive crossposting, we also reject articles posted into certain hierarchies when they are crossposted to more than a few newsgroups; in our experience such articles are almost always spam.
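The hierarchy rule can be sketched as below. This is a hedged Python illustration only: the hierarchy list and the "more than a few" threshold are invented placeholders, not the values the filter actually uses.

```python
# Sketch: reject an article crossposted to more than a few groups if
# any of its groups falls inside a watched hierarchy.

WATCHED_HIERARCHIES = ("biz.", "alt.sex.")   # illustrative only
HIERARCHY_XPOST_LIMIT = 3                    # assumed "more than a few"

def reject_hierarchy_crosspost(newsgroups):
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    if len(groups) <= HIERARCHY_XPOST_LIMIT:
        return False
    return any(g.startswith(h)
               for g in groups for h in WATCHED_HIERARCHIES)
```

Note that lightly-crossposted articles in the watched hierarchies pass through; only the combination of a watched hierarchy and a wide crosspost triggers the rejection.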

Bad web site, no articles from you

We reject articles that name any of a large number of web sites; most of these web sites belong to sex spammers. We generally operate on the assumption that anyone spamming to advertise their sex web site is likely to do it again, so we add them to our lists.

We also add the web sites of people who trigger our spam detectors at least twice with ads for the same web site, on the grounds that if they've done it twice already they'll probably make up another variation soon.

People we don't like

There are a few persistent spammers (slow or otherwise) that we really don't like. Where these people have recognizable patterns, we've blocked them. There is no master list; read the filter source.

Various spam signatures

When finesse is unavailable, we descend to brute force. At any one time there are a number of crude heuristic spam signatures, often very specific, that we look for and discard. Sometimes we write quite specific code; at other times, we attempt to create a general feature and then write specific pattern matches to exploit it. The source is the best (and often only) reference for these.

For gored oxes:

It is possible that you will find yourself listed in the filter as something we reject. If this is the case, please do not bother mailing us to tell us that what you originate is not really spam; whatever you call it, we've decided that we don't want it. If on the other hand you were formerly an open news server and have closed yourself, please let us know and we'll remove you.

Tuning our filter

There are a number of filter options in our filter that can affect what it rejects and how much memory and CPU time it uses. This is an incomplete list (sometimes deliberately so; there are some options of interest mostly to people hacking the code) of some of them, with overviews of their effects. Please read the comments in the source before changing any of them.

$block_cancel
If set to 0 the filter does not block cancels that are excessively crossposted or posted to poisoned newsgroups (or that would normally be dropped for any of the other reasons outlined above).
$allow_roving_binaries
Allow binaries in non-binary groups as long as they are crossposted to a binary group.
&filter_logname
This subroutine controls what rejected articles are logged into $DUMPDIR and under what names.
$md5max
A modest CPU saving might be had by reducing this number. However, don't reduce it too far; we routinely detect encoded binaries as new spam.
$BIHistSize
A lower value for this (or $ArticleHistSize) will lower the memory usage at the potential expense of letting some new spam go unlearned.
$EMPHistSize
Reducing this number reduces the filter's memory usage, but also reduces the number of learned spam signatures that the filter can reject articles against at any one time.
$maxpeerat
This is the amount of an article body that is looked at in quest of several spam signatures. Reducing this may reduce the memory usage of narf, especially over time, due to less memory getting copied around.
$block_can_badcyber
Blocks certain sorts of misformatted cyberspam cancels. We believe that this is safe to leave on.

Further information

This page is part of our narf pages.



This page and much of our precautions are maintained by Chris Siebenmann, who hates junk email and other spam.