The performance benefits of a despammed newsfeed

Surprisingly, despamming an incoming news feed can make a news machine perform better than before, provided that you despam in the right place. The benefits can be substantial even with a modest rejection level.

Our news machine went from a perpetual backlog to keeping up when we installed our first serious set of spam filters and started rejecting about 30% of our incoming feed. Other news administrators have reported similar results on news.software.nntp. The conclusion seems simple: spam filtering can stretch a news machine's capacity.

But why?

Where the bottleneck is

On most machines that receive enough of a newsfeed to notice, use a traditional news layout (each newsgroup is a directory and each article is a single file), and use a normal filesystem derived from the Berkeley Fast FileSystem (FFS), it turns out that the bottleneck is in writing the articles into the spool. This is primarily because creating and linking new files is a slow operation, especially in large directories.
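To make the layout concrete, here is a minimal sketch of how the traditional one-directory-per-group spool maps an article to a path; the spool root and function name are hypothetical, not taken from any particular news server:

```python
import os

SPOOL = "/var/spool/news"  # hypothetical spool root

def article_path(newsgroup, artnum):
    """Map a newsgroup and article number to a traditional spool path.

    Each dot-separated component of the group name becomes a directory,
    and the article number becomes the filename within it.
    """
    return os.path.join(SPOOL, *newsgroup.split("."), str(artnum))

# comp.lang.c article 4242 lands in /var/spool/news/comp/lang/c/4242
print(article_path("comp.lang.c", 4242))
```

Since successive incoming articles are usually for different groups, successive file creations land in widely separated directories, which is what forces all the seeks.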

This seems to hold true (at least on our news machine) whether or not one allows the underlying metadata writes to be done asynchronously (and this is in general dangerous to allow). This is probably because even with asynchronous writes the directory operations are widely distributed across the disk (one rarely gets a stream of successive articles all going into the same newsgroup), which forces a lot of seeks.

Implications

Since the bottleneck in receiving news is in writing it to the spool, every article that the system can drop before it has to be filed is a win. This is why spam filtering before your news system has to file the article gives a net performance increase, even if it increases the amount of CPU burned.

This remains the case even when rejection takes a moderately substantial amount of CPU; on current hardware, a modern CPU is not seriously challenged by receiving news. Even quite expensive filtering, such as performing an MD5 hash on the body of almost every received article, can be a net win if it lets one reject enough articles.
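As an illustration of that kind of expensive-but-worthwhile filter, here is a simplified sketch of MD5-based excessive-multipost detection: hash each article's body and reject once the same body has been offered too many times. All names and the threshold are hypothetical, and a real filter would also expire old digests rather than remember them forever:

```python
import hashlib
from collections import Counter

# how many times we've seen each body digest (a real filter expires entries)
body_counts = Counter()

def body_digest(article):
    """MD5-hash the body of an article (everything after the first blank line)."""
    _headers, _, body = article.partition("\n\n")
    return hashlib.md5(body.encode("utf-8", "surrogateescape")).hexdigest()

def should_reject(article, threshold=5):
    """Return True once the same body has been offered more than 'threshold' times."""
    digest = body_digest(article)
    body_counts[digest] += 1
    return body_counts[digest] > threshold
```

The point is that all of this work happens on an article that is still in memory; an article rejected here never costs the spool a single directory operation.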

A corollary

If your spam filter removes articles after they've been filed, you are not going to get much of a performance increase (you may get a small one by keeping directories less full). It is thus better to filter than to count on spam cancels to clean things up (especially since spam cancels are themselves articles that have to be filed in the spool). NoCEMs are better than spam cancels because you don't have to file so many articles.

This is also why purging binaries in non-binary groups as part of a spam filter is a win over running a separate program to scan new articles and delete them.

How fast is despamming itself?

Our experience is that despamming itself is very fast and doesn't take up very much time, even with quite expensive filters. Anecdotes from other news administrators on news.software.nntp generally echo this.

This makes sense: modern systems run much faster than their disks, so peering at articles in memory is going to be much faster than shoving the disk around.

Further information

This discussion is part of our Usenet despamming software, which is part of the antispam software we are making available to the Internet community. You might also be interested in the reports produced by our software.


This page and many of our precautions are maintained by Chris Siebenmann, who hates junk email and other spam.