2007-02-25
Ordered lists with named fields for Python
I periodically find myself dealing with structures that are basically ordered lists with named fields, where elements 0, 1, and 2 are naturally named 'a', 'b', and 'c' and sometimes you want to refer to them by name instead of having to remember their position. This pattern even crops up in the standard Python library, often with functions that started out just returning an ordered list and grew the named fields portion later as people discovered how annoying it was to have to remember that the hour was field 3.
This being Python, I've built myself some general code to add named
fields on top of sequence types like list or tuple. For maximum
generality my code supports using field names both as attribute names
and as indexes, so you can use both obj.field and obj["field"], and
you can even do crazy things like obj["field":-1]. The code:
    class GetMixin(object):
        fields = {}

        def _mapslice(self, key):
            s, e, step = key.start, key.stop, key.step
            if s in self.fields:
                s = self.fields[s]
            if e in self.fields:
                e = self.fields[e]
            return slice(s, e, step)

        def _mapkey(self, key):
            if isinstance(key, tuple):
                pass
            elif isinstance(key, slice):
                key = self._mapslice(key)
            elif key in self.fields:
                key = self.fields[key]
            return key

        def __getitem__(self, key):
            key = self._mapkey(key)
            return super(GetMixin, self).__getitem__(key)

        def __getattr__(self, name):
            if name in self.fields:
                return self[self.fields[name]]
            raise AttributeError("object has no attribute '%s'" % name)

    class SetMixin(GetMixin):
        def __setitem__(self, key, value):
            key = self._mapkey(key)
            super(SetMixin, self).__setitem__(key, value)

        def __setattr__(self, name, value):
            if name in self.fields:
                o = self.fields[name]
                self[o] = value
            else:
                self.__dict__[name] = value

    class Example(SetMixin, list):
        fields = {'a': 0, 'b': 1, 'c': 2}
The fields class variable is a dictionary mapping the names of the
fields to their index offsets; it need not include all fields, and
not all named fields necessarily have values for a particular list
(since nothing checks the list length).
GetMixin just lets you read the named fields and can be mixed in with
tuples; SetMixin lets you write to them by name too, and so needs to
be mixed in with lists or other writable sequence types.
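To make the behaviour concrete, here is a minimal sketch of the same two mixins with a few example accesses. It assumes Python 3 (hence the zero-argument super(), which differs from the code above), and the Point class is a made-up example of mixing GetMixin in with a tuple.

```python
class GetMixin(object):
    fields = {}

    def _mapslice(self, key):
        # Translate field names used as slice endpoints into offsets.
        s, e, step = key.start, key.stop, key.step
        if s in self.fields:
            s = self.fields[s]
        if e in self.fields:
            e = self.fields[e]
        return slice(s, e, step)

    def _mapkey(self, key):
        if isinstance(key, tuple):
            pass
        elif isinstance(key, slice):
            key = self._mapslice(key)
        elif key in self.fields:
            key = self.fields[key]
        return key

    def __getitem__(self, key):
        return super().__getitem__(self._mapkey(key))

    def __getattr__(self, name):
        # Only reached when normal attribute lookup has already failed.
        if name in self.fields:
            return self[self.fields[name]]
        raise AttributeError("object has no attribute '%s'" % name)


class SetMixin(GetMixin):
    def __setitem__(self, key, value):
        super().__setitem__(self._mapkey(key), value)

    def __setattr__(self, name, value):
        if name in self.fields:
            self[self.fields[name]] = value
        else:
            self.__dict__[name] = value


class Example(SetMixin, list):
    fields = {'a': 0, 'b': 1, 'c': 2}


class Point(GetMixin, tuple):      # hypothetical read-only example
    fields = {'x': 0, 'y': 1}


obj = Example([10, 20, 30])
obj.b = 99                  # write by attribute name
by_attr = obj.a             # read by attribute name
by_name = obj["b"]          # read by field name used as an index
sliced = obj["a":-1]        # field name as a slice endpoint

pt = Point((3, 4))
print(by_attr, by_name, sliced, pt.x, pt["y"])
```

Slices come back as plain lists, since that is what list's own __getitem__ returns for a slice.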
The easiest way to generate the fields value for the usual case
of sequential field names starting from the first element of the list
is to use a variant of the enumerate function from the itertools
recipes:
    from itertools import *

    def enum_args(*args):
        return izip(args, count())

    class Example(SetMixin, list):
        fields = dict(enum_args('a', 'b', 'c'))
(If you're going to do this a lot, make a version of enum_args that
does the dict() step too.)
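One way such a helper might look, as a sketch only: the name field_map is made up here, and it uses the built-in enumerate instead of izip and count, which also lets it run on Python 3.

```python
def field_map(*names):
    """Map each field name to its position: ('a', 'b') -> {'a': 0, 'b': 1}."""
    return {name: index for index, name in enumerate(names)}


fields = field_map('a', 'b', 'c')
print(fields)
```

A class would then just say fields = field_map('a', 'b', 'c') in its body.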
Inheriting from list, tuple, etc does have one practical wart: you
probably want to avoid using field names that are the names of methods
that you want to use, because you won't be able to use the obj.field
syntax for accessing them. Amusingly, you will be able to set them
using that syntax, because __setattr__ gets called for everything,
existing attributes included (which is why it needs the dance at the
end with the instance's __dict__).
(This code is not quite neurotically complete; truly neurotically
complete code would make the available fields appear in dir()'s
output. But I don't want to try to think what sort of hacks that
would take, since I am seeing visions of dancing metaclasses that
automatically create properties for each field name.)
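For what it's worth, on Python 3 the dir() part may not need metaclasses at all, because a class can supply its own __dir__ method. The following is a rough sketch under that assumption; DirMixin is a hypothetical name, and it only changes what dir() reports, not how attribute access works.

```python
class DirMixin(object):
    fields = {}

    def __dir__(self):
        # Start from the normal listing and add the field names to it.
        names = set(super().__dir__())
        names.update(self.fields)
        return sorted(names)


class Example(DirMixin, list):
    fields = {'a': 0, 'b': 1, 'c': 2}


listing = dir(Example())
print([name for name in listing if not name.startswith('_')])
```

In real use this would be combined with GetMixin or SetMixin, so that the names dir() reports can actually be read.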
How CSLab currently does email anti-spam stuff
The Computer Science department is strongly against rejecting email just because it might be spam (at least by default); enough people would rather sort through spam than risk rejecting legitimate email. People are willing to have known viruses removed from their email (although not executables in general).
(For clarity: the weekly spam summaries I do are not for CSLab's mail system.)
I once summarized CSLab's general rule as 'thou shalt not reject email just because it smells bad'. We can reject email that has narrow technical failings such as nonexistent origin address domains, and do things that don't cause any problems with legitimate mailers but get spammers to give up. We can't reject on stuff that isn't a clear technical failing, and we can't do anything that causes problems for legitimate mailers.
All external email goes through a frontend machine running Exim 4. This machine does the following spam-related things:
- it waits a few seconds before spitting out the initial greeting banner and the response to EHLO/HELO; this is an attempt to persuade spam clients that they are being tarpitted so that they give up. Connections from IP addresses listed in zen.spamhaus.org are delayed longer.
  (This is not as good as the real OpenBSD spamd, which trickles out replies one character at a time; Exim just sits on the whole line for N seconds and then blasts it out. I got the general idea from Bob Beck's spamd presentation.)
- the MAIL FROM domain has to exist (if it's one of our domains, the full address has to be valid).
- the RCPT TO address has to be to us and valid. The frontend machine has a list of valid local usernames (including aliases and mailing lists and so on), so it can immediately reject email to nonexistent local users.
- at RCPT TO time, addresses that have opted into it immediately reject email from senders in zen.spamhaus.org, and greylist most everyone else (using greylistd, which is a general daemon for doing this). At the moment we have no convenient way for users to opt into this, so it is mostly protecting system aliases.
- if the sender is in zen.spamhaus.org, we add a message header about it.
- the message is run through Sophos PureMessage, which removes known viruses and, if the message has a high enough spam score, adds a note about it to the start of the Subject: header.
After all this the email message is delivered to our central email
machine for actual processing and delivery and so on. We don't do
anything special with messages tagged as spam; each person gets to
decide for themselves how they want to handle such emails, whether
that is to filter them on the server with procmail or leave it
up to their IMAP client's filtering or do nothing at all.
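The order of the checks on the frontend can be sketched as a small decision function. This is only a model of the policy described above, not Exim configuration: it assumes Python 3.7 or later for dataclasses, the Message fields are stand-ins for what Exim, greylistd and PureMessage would actually determine, the header name and Subject: tag are invented for illustration, and the greeting delay and virus removal are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender_domain_exists: bool   # MAIL FROM domain resolves
    rcpt_is_valid_local: bool    # RCPT TO is one of our valid addresses
    rcpt_opted_in: bool          # recipient opted into DNSBL rejection and greylisting
    sender_in_zen: bool          # sending IP is listed in zen.spamhaus.org
    greylist_passed: bool        # stand-in for greylistd's answer
    spam_score_high: bool        # stand-in for the PureMessage verdict
    subject: str
    headers: list = field(default_factory=list)


def frontend_policy(msg):
    """Return ('reject', reason), ('defer', reason) or ('accept', msg)."""
    if not msg.sender_domain_exists:
        return ("reject", "MAIL FROM domain does not exist")
    if not msg.rcpt_is_valid_local:
        return ("reject", "no such local recipient")
    if msg.rcpt_opted_in:
        if msg.sender_in_zen:
            return ("reject", "sender listed in zen.spamhaus.org")
        if not msg.greylist_passed:
            return ("defer", "greylisted")
    # From here on nothing is rejected; the message is only annotated.
    if msg.sender_in_zen:
        msg.headers.append("X-Hypothetical-DNSBL: zen.spamhaus.org")
    if msg.spam_score_high:
        msg.subject = "[SPAM?] " + msg.subject   # hypothetical tag text
    return ("accept", msg)


tagged = Message(True, True, False, True, False, True, "hello")
verdict, delivered = frontend_policy(tagged)
print(verdict, delivered.subject, delivered.headers)
```

The point of the ordering is that rejections happen only on narrow technical grounds or for recipients who opted in; everything that merely smells bad is tagged and passed along.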
For an organization that doesn't want to reject email outright, I think that this sort of tagging is a big win; it makes things visible and it makes it easy for all sorts of clients to filter things. You need a reliable spam filter that doesn't need training, though.
We use Sophos PureMessage because the university has a site-wide license for it, so it doesn't cost us anything, and the central campus email system uses it and likes it. In my experience it does a good but not perfect job at recognizing spam, and I've only gotten a few reports of false positives. (And Sophos maintains the spam and virus filtering rules instead of us.)
Things we don't do (that sometimes surprise people):
- reject HELOs that claim to be from us. This is merely a bad smell, not a narrow technical defect.
- general greylisting, because there are legitimate mailers that are known to have problems with it.
Exim does reject some badly formed HELOs by default, and we have left
that on; I consider that to be a narrow technical defect issue. We also
reject email to IP address domain literals, which I believe is another
Exim default.
We are not currently doing nolisting, but we may in the
future; there are defensible technical reasons for having a lower
preference MX pointing to our internal central email machine, and
its SMTP port isn't reachable from the outside world any more.
Weekly spam summary on February 24th, 2007
This week, we:
- got 15,188 messages from 253 different IP addresses.
- handled 21,573 sessions from 1,281 different IP addresses.
- received 238,853 connections from at least 71,848 different IP addresses.
- hit a highwater of 10 connections being checked at once.
Connection and session volume is down a bit from last week. Day to day volume fluctuated up and down through the week:
Day | Connections | different IPs |
Sunday | 29,706 | +11,012 |
Monday | 40,386 | +12,084 |
Tuesday | 41,718 | +12,719 |
Wednesday | 34,748 | +10,352 |
Thursday | 36,413 | +9,568 |
Friday | 32,318 | +9,189 |
Saturday | 23,564 | +6,924 |
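The daily figures add up to the weekly totals quoted above, which suggests that the '+' column counts IP addresses not seen earlier in the week (that reading is an inference; the table itself does not say so). A quick check of the arithmetic:

```python
# (connections, newly seen IPs) per day, copied from the table above.
daily = {
    "Sunday": (29706, 11012),
    "Monday": (40386, 12084),
    "Tuesday": (41718, 12719),
    "Wednesday": (34748, 10352),
    "Thursday": (36413, 9568),
    "Friday": (32318, 9189),
    "Saturday": (23564, 6924),
}

total_connections = sum(conns for conns, _ in daily.values())
total_new_ips = sum(ips for _, ips in daily.values())
print(total_connections, total_new_ips)
```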
Kernel level packet filtering top ten:
    Host/Mask          Packets  Bytes
    205.152.59.0/24      27609  1252K
    207.145.125.204      25029  1272K
    206.223.168.238      15375   843K
    213.29.7.0/24         8533   512K
    211.136.0.0/14        7240   386K
    67.95.56.42           6865   319K
    203.89.173.58         6836   301K
    204.202.15.102        6800   336K
    81.201.105.157        5045   242K
    204.202.23.184        4987   246K
This is up substantially from last week. The big news this week is that I blocked 205.152.59.0/24 very early on in the week; this is Bellsouth's outgoing mail servers. We no longer accept email from Bellsouth because they have gotten into the free webmail business, and as a result are now active participants in the advance fee fraud spam business. (Many US ISPs have apparently gone this direction, for reasons I don't understand.)
- 207.145.125.204, 67.95.56.42, 204.202.15.102, and 204.202.23.184 all kept trying to send email with an origin address that had already tripped our spamtraps, mostly for what looks like phish spam (certain sorts of origin addresses are dead giveaways).
- 206.223.168.238 is in the CBL.
- 203.89.173.58 kept trying with a bad HELO.
- 81.201.105.157 is in the NJABL.
All that makes this a highly atypical week; for example, we don't have a single top-10 IP address that we've seen before. On the good news front, 208.99.198.64/27 continued not sending us so much as a single connection attempt over the week, and has thus dropped off my radar for future reports.
Connection time rejection stats:
    69674 total
    43536 dynamic IP
    17981 bad or no reverse DNS
     6394 class bl-cbl
      295 class bl-njabl
      250 class bl-sdul
      220 class bl-pbl
      159 acceleratebiz.com
      147 class bl-sbl
      144 class bl-dsbl
       33 inetekk.com
       15 cuttingedgemedia.com
Overall volume is about the same as last week. The SBL breakdown is slightly interesting:
59 | SBL51080 | phish spam source |
17 | SBL49074 | hijacked server that's spamming (13 Dec 2006) |
11 | SBL49046 | advance fee fraud spam source (13 Dec 2006) |
10 | SBL50375 | a /25 ROKSO listing for Eric Reinertsen (29 Jan 2007) |
10 | SBL49248 | saigonnet.vn webmail, listed as an advance fee fraud spam source (18 Dec 2006) |
Of these, SBL49046 and SBL50375 appeared in my summary last week, at about the same volume.
Three of the top 30 most rejected IP addresses were rejected 100
times or more this week: 193.4.194.142 (216 times, bad reverse DNS),
64.166.14.222 (168 times, dynamic IP), and 81.201.105.157 (153
times, on the NJABL). Eight of the top 30 are currently in the
CBL, eight are currently in bl.spamcop.net, 10 are in the PBL, a grand
total of 17 are in the combined zen.spamhaus.org zone, and one is in
the SBL: 69.15.58.106, SBL51080.
This week Hotmail managed:
- 4 messages accepted, two of them probably legitimate.
- no messages rejected because they came from non-Hotmail email addresses.
- 57 messages sent to our spamtraps.
- 10 messages refused because their sender addresses had already hit our spamtraps.
- 5 messages refused due to their origin IP address (3 from the Cote d'Ivoire, one from Nigeria, and one in the CBL).
And the final numbers:
what | # this week | (distinct IPs) | # last week | (distinct IPs) |
Bad HELOs | 877 | 101 | 979 | 155 |
Bad bounces | 16 | 12 | 9 | 8 |
The winner of the bad HELO contest this week was 72.165.125.122,
with 125 rejections until it got blocked; the next highest source
only managed 61. It's sad to see the bad bounce numbers start rising
again, but they're still low, and this week they seem to have come
from all over, including a darpa.mil machine and something in the
Arab Emirates that has been forging its HELO name and so won't be
talking to us any more.
Bad bounces were sent to 13 different usernames this week, mostly to
real ex-users and plausible usernames. There was one alphabetical
jumble, and E07 and 3E4B also put in appearances. The most popular
bad bounce targets (admittedly at 3 and 2 hits respectively) were both
ex-users.