The question of how long our greylisting interval should be
Our frontend anti-spam system offers opt-in greylisting for people who don't mind its downsides. One of the decisions you need to make with greylisting is how long sending mail servers have to wait before you'll accept their message; right now, our system is configured with a delay of an hour.
It doesn't have this delay because I thought carefully about it or did research. It has an hour's delay because that's the default for the particular open source greylisting software that we're using, or more specifically it was the default as of late 2006 or perhaps early 2007 when we first installed it. Even if the default was sensible then, a lot can happen on the Internet in a few years, especially in spam and anti-spam (which is a fast-moving field in general).
All of this raises the question of how long a greylisting interval we should use today, and how to figure out what it should be. If we generously assume that no legitimate SMTP server will give up on retrying, what we want to study is basically the decay rate of the sending sources that do give up, which by assumption are bad sources. If we see that, say, 90% have given up after a minute, 95% have given up after five minutes, and 99% have given up after ten minutes, we can conclude that an hour of greylisting delay is a lot of overkill; we could turn it down to ten minutes and still get rid of almost as much spam while not delaying legitimate email anywhere near as much.
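To make the decay rate idea concrete, here is a minimal Python sketch. It assumes that we have somehow obtained, for each source that never got through greylisting, the number of seconds between its first and its last delivery attempt; the function name and the input format are made up for illustration. A source whose retries stopped before a candidate interval would still be blocked if we used that interval, so the fraction of such sources is how much of the current benefit we would keep.

```python
from bisect import bisect_left

def fraction_given_up(retry_spans, thresholds):
    """Fraction of abandoned sources that stopped retrying before each threshold.

    retry_spans: seconds between the first and last attempt, one value per
                 source that never passed greylisting (hypothetical input).
    thresholds:  candidate greylisting intervals, in seconds.
    """
    spans = sorted(retry_spans)
    total = len(spans)
    result = {}
    for t in thresholds:
        # Spans strictly shorter than t belong to sources that would
        # still be blocked if the greylisting interval were t.
        blocked = bisect_left(spans, t)
        result[t] = blocked / total if total else 0.0
    return result

if __name__ == "__main__":
    # Invented sample data, not real measurements.
    spans = [0, 0, 0, 30, 45, 200, 500, 900, 0, 0]
    for t, frac in fraction_given_up(spans, [60, 300, 600]).items():
        print(f"{t:>5}s: {frac:.0%} had already given up")
```

If the fraction barely changes between a ten minute threshold and the current one hour, that would be the sign that the longer delay is buying very little.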
The greylisting daemon itself is the best place to capture this information; it already keeps track of first-seen and last-seen information for every greylisting record, and it could log these as it expires entries from its internal database. Unfortunately our greylisting daemon doesn't support doing this. I've considered trying to reconstruct this information from the Exim logs, but so far it's struck me as sufficiently annoying that I haven't looked into how to do it (which is laziness speaking).
(If you write a greylisting daemon, please include an option to log this sort of information.)
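As a rough idea of what such logging could look like, here is a hypothetical sketch in Python. It is not based on our daemon or on any real greylisting software; the `GreylistEntry` fields, the `expire_entries` function, and the log line format are all assumptions made up for illustration. The point is only that the expiry pass already visits every record, so writing out first-seen and last-seen times at that moment is cheap.

```python
from dataclasses import dataclass

@dataclass
class GreylistEntry:
    # Hypothetical record layout; a real daemon will differ.
    first_seen: float   # Unix time of the first delivery attempt
    last_seen: float    # Unix time of the most recent attempt
    passed: bool        # whether the source ever got through greylisting

def expire_entries(db, now, max_idle, log_line):
    """Drop entries idle for more than max_idle seconds, logging each one.

    db is a dict mapping some greylisting key to a GreylistEntry, and
    log_line is any callable that accepts a string (for example a
    logger method). Returns the number of entries expired.
    """
    expired = [key for key, e in db.items() if now - e.last_seen > max_idle]
    for key in expired:
        e = db.pop(key)
        span = int(e.last_seen - e.first_seen)
        log_line(
            f"expired key={key} first_seen={int(e.first_seen)} "
            f"last_seen={int(e.last_seen)} span={span} "
            f"passed={'yes' if e.passed else 'no'}"
        )
    return len(expired)
```

The `span` values from entries logged with `passed=no` are exactly the per-source give-up times that the analysis needs.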
I'm also not sure how much useful data I can generate from our logs, since we only have a relatively small number of addresses that have opted in to greylisting in the first place. Unless we have greylisted addresses that are getting a sufficiently large amount of spam from a diversity of sources (in terms of programs and spam senders), looking at our current data might just give me biased answers.
(Ideally someone would have already done all of this analysis on a big site with a lot of email and thus a lot of spam from all sorts of sources and senders, and published the results. I'm not holding my breath on that one, partly because I suspect that any site large enough to generate interesting data is not going to share it because it's a competitive advantage.)