Wandering Thoughts archives

2013-03-22

Looking at how many recipients our average inbound email has

One of the niggling problems of SMTP in the modern world (at least for us) is the mixed address problem: at DATA time, your accept-or-reject answer applies to all of the message's recipients at once. It would be much more convenient if all email messages had only a single recipient; then you could always apply just that recipient's content filtering views and enable much more rejection at SMTP time. Which leads to the question: how many recipients does an average message here have, especially inbound messages?

(Inbound messages are the most interesting ones, because those are the ones that all of our anti-spam stuff is applied to.)

Today, I decided to answer that question for our external MX gateway. The answer turns out to be that the overwhelming majority of email has only one recipient. The stats break down like this:

1 recipient 93%
2 recipients 3.6%
3 recipients 1.2%
4 recipients 0.6%
5 recipients 0.4%
6 recipients 0.2%
10 recipients 0.2%

(I think I'll stop there.)

This is from 89 days of logs, totaling 1.29 million messages received. It counts only actual accepted recipients so some of these messages may have had some of their RCPT TOs rejected already (I suspect that this is not a really big factor but I haven't looked).
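The counting itself is simple. Here is a minimal sketch of the idea, assuming the logs have already been reduced to one (message id, recipient) pair per accepted recipient; that input shape and the function names are illustrative assumptions, not our actual log format or tooling.

```python
from collections import Counter


def recipient_distribution(deliveries):
    """Map 'number of accepted recipients' to 'number of messages'.

    `deliveries` is an iterable of (message_id, recipient) pairs, one per
    accepted recipient. This shape is an assumption for illustration.
    """
    # First count how many accepted recipients each message had...
    per_message = Counter(msg_id for msg_id, _rcpt in deliveries)
    # ...then count how many messages had each recipient count.
    return Counter(per_message.values())


def as_percentages(distribution):
    """Turn message counts into percentages of all messages."""
    total = sum(distribution.values())
    return {n: 100.0 * count / total
            for n, count in sorted(distribution.items())}


# Made-up sample data: m2 has two recipients, the others have one.
sample = [
    ("m1", "a@example.org"),
    ("m2", "a@example.org"),
    ("m2", "b@example.org"),
    ("m3", "c@example.org"),
]
dist = recipient_distribution(sample)
print(as_percentages(dist))
```

Since only accepted recipients appear in the input, a message with some rejected RCPT TOs is counted by what was left, which matches the caveat above.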

The largest number of (accepted) recipients for a single message is 82 recipients (one message). There is a similar handful of other messages with large recipient counts. Interestingly, the most common 'large' recipient count is 20 (but it's still only 0.09% of all messages). There seems to be a hard break at 20 recipients; only 98 messages out of the 1.29 million had more recipients than that.

This has been interesting. Before I did these stats I would not have expected single-recipient messages to be so totally dominating (even though I'm familiar with things like VERP that strongly bias some traffic towards that). Possibly much more of our inbound email is mailing lists (including spam lists) than I expect.

Sidebar: detailed message counts for 6-20 recipients

This actually forms an interesting pattern so I'm going to give you the raw data:

   cnt   recipients
  1210   20
   641   19
   372   18
   184   17
   136   16
   113   15
   153   14
   173   13
   289   12
   820   11
  2081   10
  1428   9
  1568   8
  1925   7
  2156   6

(For 2-7 there is a steady dropoff.)

My guess is that a bunch of mailing list software really prefers to cut things at nice even (small) numbers of recipients.
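To illustrate why a cap would produce this pattern, here is a small sketch of how a sender that limits recipients per message would split a larger set of addresses at one domain. The function is hypothetical and the cap of 20 is just the number the data suggests, not a known setting of any particular software.

```python
def batch_sizes(n_recipients, max_per_message):
    """Sizes of the messages sent when recipients are capped per message."""
    full, rest = divmod(n_recipients, max_per_message)
    # Every full batch is exactly at the cap; any leftover forms one smaller batch.
    return [max_per_message] * full + ([rest] if rest else [])


# 45 addresses with a cap of 20 arrive as two full messages and one remainder.
print(batch_sizes(45, 20))
```

Every destination with more addresses than the cap contributes at least one message at exactly the cap, while the remainders are spread over all of the smaller sizes. That would pile up counts at 10 and 20 in the way the table shows.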

spam/RecipientsDistribution written at 17:08:05

The problem with trying to make everything into a Python module

One of the reasons for Django's unpleasant project restructuring is that they want your website directory (ie the directory that your project sits in) to be a module that can be imported. This in fact seems to be somewhat of a general trend; all sorts of things rather want you to have not just a collection of files in a directory but an actual module. I wish they'd stop. Modules are not the be-all and end-all in Python, at least not as currently implemented, and not everything needs or wants to be a module.

The general reason for making things into modules is namespaces for imports. If you're sitting in your project's directory and do 'import fred', in theory this is ambiguous; you might mean your fred.py or you might mean some global fred module installed in Python. The absolute form of 'import mystuff.fred' is more or less unambiguous.

(This preference for modules also goes with the fact that the relative import syntax, 'from . import fred', is only valid in an actual module. I think that this is a terrible mistake, but no one asked me for my opinion.)

I have no problem with modules as such. The problem I have is how you get a directory to be a module, namely that you add the directory's parent to the Python search path (in one of a number of ways), and then the directory becomes a module (or technically I think a package) called its directory name. This is bad in at least two ways. It tightly couples together the directory name and the module name and it also makes everything else in the directory's parent available as a potential module. What both of these have in common is undesired name collisions. For example, you cannot be working on two versions of a 'fred' module that are sitting in a directory as, say, src/fred-1 and src/fred-2, not unless you want to have a src/fred symlink that you keep changing back and forth.
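Both effects are easy to see in a throwaway directory tree. This is only a sketch: the 'fred' and 'bob' names come from the examples here and the file contents are made up.

```python
import importlib
import os
import sys
import tempfile

# Build <tmp>/src/fred/__init__.py and <tmp>/src/bob/__init__.py.
root = tempfile.mkdtemp()
src = os.path.join(root, "src")
for name in ("fred", "bob"):
    os.makedirs(os.path.join(src, name))
    with open(os.path.join(src, name, "__init__.py"), "w") as f:
        f.write("NAME = %r\n" % name)

# To make src/fred importable we must put its *parent* on the search path.
sys.path.insert(0, src)
importlib.invalidate_caches()

# The module name is dictated by the directory name...
fred = importlib.import_module("fred")
# ...and every sibling directory is now importable too, wanted or not.
bob = importlib.import_module("bob")

print(fred.__name__, bob.__name__)
```

Nothing in this code says 'this directory is the fred module'; the name falls out of where the directory happens to sit and what it happens to be called.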

(The natural structure seems to be to isolate each module in its own artificial parent directory (eg src/fred-1/fred) or to ignore the whole issue, put everything in src/, and assume you will never have any collisions or be developing a new version of fred that you don't want src/bob getting when it does an 'import fred'.)

What would make this situation okay is a simple way to tell Python 'directory X is module Y', where 'X' might be '.' (the current directory). This should be available both on the Python command line and from inside Python code. Sadly I don't expect this to arrive any time soon.
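From inside Python code, something close to this can be approximated with importlib, at least in Python 3.5 and later. The following is a minimal sketch under that assumption; load_dir_as_module is a hypothetical helper of my own naming, it requires the directory to have an __init__.py, and it does nothing for the command line half of the wish.

```python
import importlib
import importlib.util
import os
import sys
import tempfile


def load_dir_as_module(directory, name):
    """Load the package in `directory` under the module name `name`.

    Hypothetical helper; assumes `directory` contains an __init__.py.
    """
    init_py = os.path.join(directory, "__init__.py")
    spec = importlib.util.spec_from_file_location(
        name, init_py, submodule_search_locations=[directory]
    )
    module = importlib.util.module_from_spec(spec)
    # Register first so imports inside the package can find it by name.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


# Demo: a directory called 'fred-2' loaded as the module 'fred'.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "fred-2")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VERSION = 2\n")
with open(os.path.join(pkg_dir, "helper.py"), "w") as f:
    f.write("def hello():\n    return 'hello from fred.helper'\n")

fred = load_dir_as_module(pkg_dir, "fred")
# Submodules are found through the package's own directory, not sys.path.
helper = importlib.import_module("fred.helper")
print(fred.VERSION, helper.hello())
```

The parent directory never goes on sys.path here, so nothing else in it becomes importable, and the directory name and the module name are decoupled.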

(This stuff irritates me for reasons that are hard to pin down. Partly it just feels wrong (eg '/src' or wherever isn't a directory of modules, so why am I telling Python that it is?).)

python/EverythingModuleProblem written at 00:18:26



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.