Wandering Thoughts archives

2007-02-25

Ordered lists with named fields for Python

I periodically find myself dealing with structures that are basically ordered lists with named fields, where elements 0, 1, and 2 are naturally named 'a', 'b', and 'c' and sometimes you want to refer to them by name instead of having to remember their position. This pattern even crops up in the standard Python library, often with functions that started out just returning an ordered list and grew the named fields portion later as people discovered how annoying it was to have to remember that the hour was field 3.

This being Python, I've built myself some general code to add named fields on top of sequence types like list or tuple. For maximum generality my code supports using field names both as attribute names and as indexes, so you can use both obj.field and obj["field"], and you can even do crazy things like obj["field":-1]. The code:

class GetMixin(object):
    # Maps field names to sequence indexes; subclasses override this.
    fields = {}

    def _mapslice(self, key):
        # Translate field names used as slice endpoints into indexes.
        s, e, step = key.start, key.stop, key.step
        if s in self.fields:
            s = self.fields[s]
        if e in self.fields:
            e = self.fields[e]
        return slice(s, e, step)
    def _mapkey(self, key):
        # Tuples pass through untouched; slices and names are translated.
        if isinstance(key, tuple):
            pass
        elif isinstance(key, slice):
            key = self._mapslice(key)
        elif key in self.fields:
            key = self.fields[key]
        return key

    def __getitem__(self, key):
        key = self._mapkey(key)
        return super(GetMixin, self).__getitem__(key)
    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name in self.fields:
            return self[self.fields[name]]
        raise AttributeError(
              "object has no attribute '%s'" % name)

class SetMixin(GetMixin):
    def __setitem__(self, key, value):
        key = self._mapkey(key)
        super(SetMixin, self).__setitem__(key, value)
    def __setattr__(self, name, value):
        if name in self.fields:
            o = self.fields[name]
            self[o] = value
        else:
            self.__dict__[name] = value

class Example(SetMixin, list):
    fields = {'a': 0, 'b': 1, 'c': 2}
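
A quick usage sketch may help here. So that it runs standalone, the block first restates the mixins in condensed form; the condensed _mapkey is a rearrangement for brevity, not the code above verbatim, and the sample values are made up.

```python
class GetMixin(object):
    fields = {}

    def _mapkey(self, key):
        # Translate field names (alone or as slice endpoints) to indexes.
        if isinstance(key, slice):
            start = self.fields.get(key.start, key.start)
            stop = self.fields.get(key.stop, key.stop)
            return slice(start, stop, key.step)
        if not isinstance(key, tuple) and key in self.fields:
            return self.fields[key]
        return key

    def __getitem__(self, key):
        return super(GetMixin, self).__getitem__(self._mapkey(key))

    def __getattr__(self, name):
        if name in self.fields:
            return self[self.fields[name]]
        raise AttributeError("object has no attribute '%s'" % name)


class SetMixin(GetMixin):
    def __setitem__(self, key, value):
        super(SetMixin, self).__setitem__(self._mapkey(key), value)

    def __setattr__(self, name, value):
        if name in self.fields:
            self[self.fields[name]] = value
        else:
            self.__dict__[name] = value


class Example(SetMixin, list):
    fields = {'a': 0, 'b': 1, 'c': 2}


obj = Example([10, 20, 30])
first = obj.a            # attribute access to element 0
second = obj["b"]        # field name as an index, element 1
front = obj["a":-1]      # field name as a slice endpoint
obj.c = 99               # assignment by attribute writes element 2
obj["a"] = 1             # assignment by name writes element 0
```

Plain integer indexes and slices keep working as usual, since anything that isn't a known field name is passed through unchanged.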

The fields class variable is a dictionary mapping the names of the fields to their index offsets; it need not include all fields, and not all named fields necessarily have values for a particular list (since nothing checks the list length). GetMixin just lets you read the named fields and can be mixed in with tuples; SetMixin lets you write to them by name too, and so needs to be mixed in with lists or other writable sequence types. The easiest way to generate the fields value for the usual case of sequential field names starting from the first element of the list is to use a variant of the enumerate function from the itertools recipes:

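from itertools import count
def enum_args(*args):
    # Pair each name with its index: ('a', 0), ('b', 1), ...
    # zip stops when args runs out, so the endless count() is fine.
    return zip(args, count())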

class Example(SetMixin, list):
    fields = dict(enum_args('a', 'b', 'c'))

(If you're going to do this a lot, make a version of enum_args that does the dict() step too.)
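
A sketch of what such a version might look like; field_map is a hypothetical name, not something from the code above.

```python
from itertools import count


def field_map(*names):
    """Return a dict mapping each field name to its position, in order."""
    # zip stops at the end of names, so the endless count() is harmless.
    return dict(zip(names, count()))


fields = field_map('a', 'b', 'c')
```

A class would then just say fields = field_map('a', 'b', 'c') in its body.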

Inheriting from list, tuple, etc. does have one practical wart: you probably want to avoid field names that are also the names of methods you want to use, because obj.field will then give you the method instead of the field (__getattr__ is only called when normal attribute lookup fails). Amusingly, you will still be able to set such fields using that syntax, because __setattr__ gets called for everything, existing attributes included (which is why it needs the dance at the end with the instance's __dict__).
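
To make the wart concrete, here is a minimal standalone sketch. The class Named and the choice of 'count' as a field name are made up for illustration; 'count' is picked because it collides with the list.count method.

```python
class Named(list):
    fields = {'count': 0}   # deliberately collides with list.count

    def __getattr__(self, name):
        # Only reached when normal attribute lookup finds nothing.
        if name in self.fields:
            return self[self.fields[name]]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Reached for every attribute assignment, collisions included.
        if name in self.fields:
            self[self.fields[name]] = value
        else:
            self.__dict__[name] = value


obj = Named([5])
looked_up = obj.count    # the bound list.count method, not element 0
obj.count = 7            # but assignment still writes element 0
```

Reading the field in this situation still works through obj["count"] or obj[0].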

(This code is not quite neurotically complete; truly neurotically complete code would make the available fields appear in dir()'s output. But I don't want to try to think what sort of hacks that would take, since I am seeing visions of dancing metaclasses that automatically create properties for each field name.)
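
For the curious, here is one possible shape for the property idea. It is sketched with a class decorator instead of a metaclass, which is a deliberately simpler substitute and needs Python 2.6 or later; named_fields is a hypothetical helper, and this replaces the __getattr__ and __setattr__ approach instead of extending it.

```python
def named_fields(cls):
    """Class decorator: add one property per entry in cls.fields."""
    def make_property(index):
        def getter(self):
            return self[index]

        def setter(self, value):
            # Raises TypeError on read-only sequences such as tuple.
            self[index] = value
        return property(getter, setter)

    for name, index in cls.fields.items():
        setattr(cls, name, make_property(index))
    return cls


@named_fields
class Example(list):
    fields = {'a': 0, 'b': 1, 'c': 2}


obj = Example([1, 2, 3])
obj.b = 20
names = dir(obj)   # property names show up here like any class attribute
```

Because properties live on the class and are found before inherited methods, a field name that collides with a method of list would shadow the method in this version.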

python/ListsWithNamedFields written at 20:13:09

How CSLab currently does email anti-spam stuff

The Computer Science department is strongly against rejecting email just because it might be spam (at least by default); enough people would rather sort through spam than risk rejecting legitimate email. People are willing to have known viruses removed from their email (although not executables in general).

(For clarity: the weekly spam summaries I do are not for CSLab's mail system.)

I once summarized CSLab's general rule as 'thou shalt not reject email just because it smells bad'. We can reject email that has narrow technical failings such as nonexistent origin address domains, and do things that don't cause any problems with legitimate mailers but get spammers to give up. We can't reject on stuff that isn't a clear technical failing, and we can't do anything that causes problems for legitimate mailers.

All external email goes through a frontend machine running Exim 4. This machine does the following spam-related things:

  1. it waits a few seconds before spitting out the initial greeting banner and the response to EHLO/HELO; this is an attempt to persuade spam clients that they are being tarpitted so that they give up. Connections from IP addresses listed in zen.spamhaus.org are delayed longer.

    (This is not as good as the real OpenBSD spamd, which trickles out replies one character at a time; Exim just sits on the whole line for N seconds and then blasts it out. I got the general idea from Bob Beck's spamd presentation.)

  2. the MAIL FROM domain has to exist (if it's one of our domains, the full address has to be valid).
  3. the RCPT TO address has to be to us and valid. The frontend machine has a list of valid local usernames (including aliases and mailing lists and so on), so it can immediately reject email to nonexistent local users.
  4. at RCPT TO time, addresses that have opted into it immediately reject email from senders in zen.spamhaus.org, and greylist most everyone else (using greylistd, which is a general daemon for doing this). At the moment we have no convenient way for users to opt into this, so it is mostly protecting system aliases.

  5. if the sender is in zen.spamhaus.org, we add a message header about it.
  6. the message is run through Sophos PureMessage, which removes known viruses and, if the message has a high enough spam score, adds a note about it to the start of the Subject: header.
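
Steps 1, 4, and 5 all depend on asking whether the connecting IP address is listed in zen.spamhaus.org. The sketch below shows only how the usual DNSBL query name is formed (the IPv4 octets reversed, followed by the zone); an address counts as listed if looking up an A record for that name succeeds. The function name is made up for illustration, and this is not how the Exim configuration itself expresses the check.

```python
def dnsbl_query_name(ipv4, zone="zen.spamhaus.org"):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    octets = ipv4.split(".")
    if len(octets) != 4:
        raise ValueError("not a dotted-quad IPv4 address: %r" % ipv4)
    # DNSBLs are queried with the octets in reverse order.
    return ".".join(reversed(octets)) + "." + zone


# 192.0.2.1 is a documentation-only address, used here as a placeholder.
name = dnsbl_query_name("192.0.2.1")
```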

After all this the email message is delivered to our central email machine for actual processing and delivery and so on. We don't do anything special with messages tagged as spam; each person gets to decide for themselves how they want to handle such emails, whether that is to filter them on the server with procmail or leave it up to their IMAP client's filtering or do nothing at all.

For an organization that doesn't want to reject email outright, I think that this sort of tagging is a big win; it makes things visible and it makes it easy for all sorts of clients to filter things. You need a reliable spam filter that doesn't need training, though.
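
As an illustration of how little a client needs in order to act on such a tag, here is a sketch of a filter that checks the start of the Subject: header. The tag text "[SPAM]" is purely an assumption for the example; this entry doesn't say what note PureMessage actually adds, and looks_tagged is a made-up name.

```python
from email import message_from_string

# Hypothetical tag text; substitute whatever your filter really adds.
SPAM_TAG = "[SPAM]"


def looks_tagged(raw_message):
    """Return True if the message's Subject: starts with the spam tag."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "")
    return subject.lstrip().startswith(SPAM_TAG)


sample = "From: a@example.org\nSubject: [SPAM] cheap watches\n\nbody\n"
tagged = looks_tagged(sample)
```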

We use Sophos PureMessage because the university has a site-wide license for it, so it doesn't cost us anything, and the central campus email system uses it and likes it. In my experience it does a good but not perfect job at recognizing spam, and I've only gotten a few reports of false positives. (And Sophos maintains the spam and virus filtering rules instead of us.)

Things we don't do (that sometimes surprise people):

  • reject HELOs that claim to be from us. This is merely a bad smell, not a narrow technical defect.
  • general greylisting, because there are legitimate mailers that are known to have problems with it.
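
For readers who haven't met greylisting, which comes up both in step 4 and in this list, here is a toy sketch of the general technique: the first delivery attempt for a given (client IP, sender, recipient) triplet gets a temporary rejection, and retries are accepted once some delay has passed. This is not greylistd's code; the class name, the in-memory dict, and the 600 second delay are all assumptions for illustration.

```python
import time


class Greylist(object):
    """Toy in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, delay=600, clock=time.time):
        self.delay = delay        # seconds a new triplet must wait (assumed)
        self.clock = clock        # injectable so the example is repeatable
        self.first_seen = {}

    def check(self, ip, sender, recipient):
        """Return True to accept, False to answer with a temporary failure."""
        triplet = (ip, sender, recipient)
        now = self.clock()
        first = self.first_seen.setdefault(triplet, now)
        return now - first >= self.delay


fake_now = [0]
greylist = Greylist(delay=600, clock=lambda: fake_now[0])
first_try = greylist.check("192.0.2.1", "a@example.org", "b@example.com")
fake_now[0] = 900   # the sending machine retries fifteen minutes later
retry = greylist.check("192.0.2.1", "a@example.org", "b@example.com")
```

Mailers that never retry, or that retry from a different IP address each time, are the ones that run into trouble with this.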

Exim does reject some badly formed HELOs by default, and we have left that on; I consider that to be a narrow technical defect issue. We also reject email to IP address domain literals, which I believe is another Exim default. We are not currently doing nolisting, but we may in the future; there are defensible technical reasons for having a lower preference MX pointing to our internal central email machine, and its SMTP port isn't reachable from the outside world any more.

spam/CSLabSpamFiltering written at 16:28:51

Weekly spam summary on February 24th, 2007

This week, we:

  • got 15,188 messages from 253 different IP addresses.
  • handled 21,573 sessions from 1,281 different IP addresses.
  • received 238,853 connections from at least 71,848 different IP addresses.
  • hit a highwater of 10 connections being checked at once.

Connection and session volume is down a bit from last week. Day to day volume fluctuated up and down through the week:

Day         Connections   different IPs
Sunday           29,706         +11,012
Monday           40,386         +12,084
Tuesday          41,718         +12,719
Wednesday        34,748         +10,352
Thursday         36,413          +9,568
Friday           32,318          +9,189
Saturday         23,564          +6,924
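
The per-day figures add up to the weekly totals given earlier, which suggests that the '+' column counts IP addresses seen for the first time that day; that reading is an inference from the numbers, not something stated outright. A quick check:

```python
# (day, connections, different IPs) copied from the table above.
daily = [
    ("Sunday", 29706, 11012),
    ("Monday", 40386, 12084),
    ("Tuesday", 41718, 12719),
    ("Wednesday", 34748, 10352),
    ("Thursday", 36413, 9568),
    ("Friday", 32318, 9189),
    ("Saturday", 23564, 6924),
]

total_connections = sum(conns for _, conns, _ in daily)
total_new_ips = sum(ips for _, _, ips in daily)
```

These come to 238,853 connections and 71,848 IP addresses, matching the weekly summary.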

Kernel level packet filtering top ten:

Host/Mask           Packets   Bytes
205.152.59.0/24       27609   1252K
207.145.125.204       25029   1272K
206.223.168.238       15375    843K
213.29.7.0/24          8533    512K
211.136.0.0/14         7240    386K
67.95.56.42            6865    319K
203.89.173.58          6836    301K
204.202.15.102         6800    336K
81.201.105.157         5045    242K
204.202.23.184         4987    246K

This is up substantially from last week. The big news this week is that I blocked 205.152.59.0/24 very early on in the week; this is Bellsouth's outgoing mail servers. We no longer accept email from Bellsouth because they have gotten into the free webmail business, and as a result are now active participants in the advance fee fraud spam business. (Many US ISPs have apparently gone this direction, for reasons I don't understand.)

  • 207.145.125.204, 67.95.56.42, 204.202.15.102, and 204.202.23.184 all kept trying to send email with an origin address that had already tripped our spamtraps, mostly for what looks like phish spam (certain sorts of origin addresses are dead giveaways).
  • 206.223.168.238 is in the CBL.
  • 203.89.173.58 kept trying with a bad HELO.
  • 81.201.105.157 is in the NJABL.

All that makes this a highly atypical week; for example, we don't have a single top-10 IP address that we've seen before. On the good news front, 208.99.198.64/27 continued not sending us so much as a single connection attempt over the week, and has thus dropped off my radar for future reports.

Connection time rejection stats:

  69674 total
  43536 dynamic IP
  17981 bad or no reverse DNS
   6394 class bl-cbl
    295 class bl-njabl
    250 class bl-sdul
    220 class bl-pbl
    159 acceleratebiz.com
    147 class bl-sbl
    144 class bl-dsbl
     33 inetekk.com
     15 cuttingedgemedia.com

Overall volume is about the same as last week. The SBL breakdown is slightly interesting:

59 SBL51080 phish spam source
17 SBL49074 hijacked server that's spamming (13 Dec 2006)
11 SBL49046 advance fee fraud spam source (13 Dec 2006)
10 SBL50375 a /25 ROKSO listing for Eric Reinertsen (29 Jan 2007)
10 SBL49248 saigonnet.vn webmail, listed as an advance fee fraud spam source (18 Dec 2006)

Of these, SBL49046 and SBL50375 appeared in my summary last week, at about the same volume.

Three of the top 30 most rejected IP addresses were rejected 100 times or more this week: 193.4.194.142 (216 times, bad reverse DNS), 64.166.14.222 (168 times, dynamic IP), and 81.201.105.157 (153 times, on the NJABL). Eight of the top 30 are currently in the CBL, eight are currently in bl.spamcop.net, 10 are in the PBL, a grand total of 17 are in the combined zen.spamhaus.org zone, and one is in the SBL: 69.15.58.106, SBL51080.

This week Hotmail managed:

  • 4 messages accepted, two of them probably legitimate.
  • no messages rejected because they came from non-Hotmail email addresses.
  • 57 messages sent to our spamtraps.
  • 10 messages refused because their sender addresses had already hit our spamtraps.
  • 5 messages refused due to their origin IP address (3 from the Cote d'Ivoire, one from Nigeria, and one in the CBL).

And the final numbers:

what          # this week   (distinct IPs)   # last week   (distinct IPs)
Bad HELOs             877              101           979              155
Bad bounces            16               12             9                8

The winner of the bad HELO contest this week was 72.165.125.122, with 125 rejections until it got blocked; the next highest source only managed 61. It's sad to see the bad bounce numbers start rising again, but they're still low, and this week they seem to have come from all over, including a darpa.mil machine and something in the Arab Emirates that has been forging its HELO name and so won't be talking to us any more.

Bad bounces were sent to 13 different usernames this week, mostly to real ex-users and plausible usernames. There was one alphabetical jumble, and E07 and 3E4B also put in appearances. The most popular bad bounce targets (admittedly at 3 and 2 hits respectively) were both ex-users.

spam/SpamSummary-2007-02-24 written at 01:11:28



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.