Wandering Thoughts: Recent Entries

Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web.

2009-11-20

Spam and the attraction of reach

Here is a thesis: the larger or more standardized the environment for sending messages, the more spam you should expect to get in or through it. Accordingly, email is heavily abused because it is hugely standardized.

The spammer's motivation for abusing larger, standardized environments is obvious; the larger the environment, the more people you can reach with a single technique, approach, or system. Larger environments have better return on effort, since generally (but not always) most of the effort in spamming in an environment is figuring out how to do it well.

(This ties in to how spammers are lazy but not stupid, at least not in the aggregate.)

This is depressing because it implies that any well used service that allows push messages is going to have spam no matter what you do. If you build such a service or protocol and it gets popular, you'll get spam. (In fact, degree of spam is not a bad metric for degree of popularity. And if the spammers abandon you, well, worry.)

It is tempting to say that one important way to discourage spammers is to shift the relative costs so that as much effort as possible is per-message effort; if nothing else, this might make you less attractive than the next target. However, I think that the general history of people's anti-spam efforts in new systems shows that this ultimately doesn't work; if you're attractive enough for regular users, you're easy enough for spammers.

(See also DeterringAbuseProblem on this general issue.)

spam/StandardizedSpam written at 01:01:48

2009-11-19

The corollary for effective anti-spam heuristics

Last time I mentioned that spammers were perfectly capable of adapting their practices to defeat anti-spam heuristics like requiring a valid EHLO or reverse DNS, and so such heuristics were, if effective and widely adopted, at best a temporary fix. This raises an obvious corollary about good anti-spam heuristics.

Since spammers will adapt when it is both useful and possible, a good anti-spam heuristic is some characteristic of the message or of how it is transmitted that the spammer cannot easily change. While people have made various stabs at this in the past (and will no doubt continue to do so in the future), the problem for anti-spam efforts is that such characteristics have been hard to find, partly because spammers have proven to be very ingenious about finding ways to change them.

(For a small example, are anti-spam systems matching on the characteristic phrases of your advance fee frauds in email? No problem, just put your pitches in file attachments. I await with resignation the day the spammers start sending PDFs, not just Word .doc files, since a sufficiently ingenious spammer can make a PDF that is very hard to analyse.)

I am not convinced that it's even theoretically possible to come up with good (under this definition) anti-spam heuristics in any sort of general environment, partly for reasons that run up against the fundamental spam problem.

(While current heuristics are effective, my strong impression is that they are a laboriously maintained and ever-evolving collection of more or less ad-hoc rules. This doesn't necessarily scale, and it's expensive.)

spam/HeuristicsCorollary written at 00:47:39

2009-11-18

Universities are open environments

One of the things that's led to the university Internet environment changing (per an earlier entry) is that universities are open environments in general and especially in terms of services. In this they are fundamentally different from companies, which can be much more closed and closeted environments.

I think that there are three reasons for this. First, there is a much different relationship between many people at the university and the university. In a company, everyone 'at' or 'in' the company is working for the company, but in a university the majority of the user base is effectively customers, and this creates significantly different expectations.

Second, one way that these expectations manifest is that a company has much more scope to plead security and secrecy in order to keep services inside its walls. In a company you can assert with a straight face that you have privacy concerns in putting company email on some outside provider. In a university, the students will say 'so what? I don't care'. And in general I think that there is more acceptance of secrecy and security as valid concerns at a company than at a university; at a company they are defaults, while at a university there is at least a theory of transparency and operating in the open.

Finally and I think significantly, universities are open in good part because people are flowing through them all the time. Every year N people show up and N people leave, more or less, and at least in theory these people should be significant users of your services. This constant and significant flow works to destroy any insularity and ignorance about the outside world's progress that might build up in general, and when combined with the relation between students and university creates an environment where you are constantly justifying your services to the next generation of arrivals (whether or not you realize it).

(This degree of turnover is also another strike against claims of secrecy and security. As I've said before, at a university you have to assume that there are plenty of evil people already inside your organization.)

Or in short: the university is open because people keep walking through, bringing in knowledge of the outside (and leaving with knowledge of the university).

tech/UniversityOpenEnvironments written at 00:44:42

2009-11-17

Finally understanding the appeal of 'Interfaces'

I spent a long time not really getting the need for coded, explicit implementations of 'Interfaces', by which I mean things like zope.interface. It didn't help that I generally encountered them as part of very large, complex systems like Zope and Twisted, and they tended to come with a lot of extra magic features, which made the whole idea seem like the sort of thing you only needed if you had to deal with such a beast.

Then, recently, the penny dropped and I finally saw the light. Shorn of complexity and extra features, what Interface implementations give you is an explicit and easily used way to assert and ask 'is-a' questions. Need to find out if this object is a compiled regular expression? Just ask if it supports the ICRegexp interface. Want to be accepted as a compiled regular expression? Assert that you support ICRegexp.

(Assuming the best and forging ahead is still the most Pythonic approach, but per my original problem you sometimes do need to know this sort of thing. And per yesterday's entry, requiring inheritance is not the answer, especially if you want to build decoupled systems.)

When I put it this way, it's easy to see why you'd like a basic interface implementation. If you have to test at all, simple 'is-a' tests beat both 'is-a-descendant-of' restrictions and probing for duck typing with its annoyances and ambiguities (cf an earlier entry).

In this view, the important thing is really to have a unique name (really an object) for each interface, so that you avoid the duck typing ambiguity. A basic implementation is almost trivial; treat interfaces as opaque objects, and just register classes as supporting interfaces and then have an 'isinterface()' function that works by analogy to isinstance().

(This demonstrates the old computer science aphorism that there's no problem that can't be solved by an extra level of indirection, since that is basically what this does: it adds a level of indirection to isinstance(), so that instead of asking 'is this object an instance of one of these classes', you ask 'is this object an instance of a class that supports this interface'.)
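
As a minimal sketch of such a basic implementation (in current Python; the Interface class, register(), the module-level registry, and the GlobMatcher stand-in are all names made up for illustration, and none of this is zope.interface's API):

```python
import re


class Interface(object):
    """An opaque marker object; its identity is the interface's unique name."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return "<Interface %s>" % self.name


# Maps each interface object to the set of classes registered as supporting it.
_supporters = {}


def register(interface, cls):
    """Declare that instances of cls support interface."""
    _supporters.setdefault(interface, set()).add(cls)


def isinterface(obj, interface):
    """Like isinstance(), but indirected through the registry."""
    return isinstance(obj, tuple(_supporters.get(interface, ())))


# Hypothetical usage, following the ICRegexp example.
ICRegexp = Interface("ICRegexp")
register(ICRegexp, type(re.compile("")))  # the re module's own pattern type


class GlobMatcher(object):
    """A stand-in alternate engine with no relation to re's classes."""

    def match(self, string):
        return None


register(ICRegexp, GlobMatcher)

print(isinterface(re.compile("a+"), ICRegexp))
print(isinterface(GlobMatcher(), ICRegexp))
print(isinterface("a+", ICRegexp))
```

Because registration is keyed on the interface object rather than on method names, two unrelated interfaces that happen to share a method name can never be confused with each other.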

More complex implementations are of course possible; you could give the interface objects actual behavior and information, add checks for basic duck typing compatibility with the interface, make it so that isinterface() can optionally check to see if the object seems to implement the interface without having declared it, and so on.
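
To make one of these elaborations concrete, here is a rough sketch of an interface object that carries a list of required method names and an isinterface() that can optionally probe undeclared objects. The 'required' list, the 'probe' flag, and the Undeclared class are assumptions invented for this sketch, not anything from an existing library:

```python
import re


class Interface(object):
    """An interface object that knows which method names it expects."""

    def __init__(self, name, required=()):
        self.name = name
        self.required = tuple(required)
        self._classes = set()

    def register(self, cls):
        """Explicitly declare that cls supports this interface."""
        self._classes.add(cls)
        return cls

    def declared_by(self, obj):
        return isinstance(obj, tuple(self._classes))

    def looks_like(self, obj):
        """Basic duck typing probe: every required name is a callable."""
        return all(callable(getattr(obj, n, None)) for n in self.required)


def isinterface(obj, interface, probe=False):
    """True if obj's class declared the interface; with probe=True,
    also accept objects that merely seem to implement it."""
    if interface.declared_by(obj):
        return True
    return probe and interface.looks_like(obj)


ICRegexp = Interface("ICRegexp", required=["match", "search"])
ICRegexp.register(type(re.compile("")))


class Undeclared(object):
    """Has the right methods but never declared anything."""

    def match(self, string):
        return None

    def search(self, string):
        return None
```

Note that probing brings back exactly the ambiguity that explicit registration avoids, which is why it is off by default here.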

(Sooner or later you end up back at zope.interface.)

python/GettingInterfaces written at 00:19:21

2009-11-16

'Is-a' versus 'is-a-descendant-of'

One of the things that my issue with the Python re module not exposing its types has firmly mashed my nose into is the difference between 'is-a-descendant-of' and 'is-a' in object-oriented languages. It's conventional to think of them as more or less the same thing, even in a loose duck typed language like Python; it just seems to make sense for all compiled regular expressions to descend from a single base class, just as it theoretically makes sense for both plain bytestrings and Unicode strings to descend from an abstract generic string class.

(Technically, some of the things that I am calling classes here are actually types. In Python this is a distinction that can usually be ignored.)

Of course, when I write it out like this it's evident that it doesn't necessarily make sense. For example, the actual implementation of Python's base string class has no code and no behavior; it exists only for the convenience of programmers who want to isinstance() a single class. Similarly, a hypothetical version of the C based regular expression module that used different classes for different sorts of regular expressions (in order to have different matching engines) could perfectly well have no common abstract class (especially since the re module does not expose such a class today).

On the flipside, it would be nice to be able to write alternate regular expression engines and have their objects accepted as 'compiled regular expressions'. Right now, anything that does duck typing will accept them, but things that look at types won't, purely because they don't descend from the current implementation of the regexp class (and you can't fix that, partly for reasons that I covered yesterday).

What this gets down to is that 'is-a' is effectively a question of interface, not of inheritance. In fact, duck typing in a nutshell is that your object 'is-a' compiled regular expression if it satisfies the expected interface behavior for such objects. Even in Python, we almost always use 'is-a-descendant-of' tests only as a convenient proxy for answering this 'is-a' question, but they are not quite the same thing and the difference can trip you (or other people) up.
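
A small illustration of the gap between the two questions, written for current Python. PrefixPattern is a made-up alternate 'engine', and the two matches_*() helpers are hypothetical callers, one that duck types and one that checks types:

```python
import re

# The re module's concrete compiled-pattern type, obtained indirectly.
CompiledRegexp = type(re.compile(""))


class PrefixPattern(object):
    """Hypothetical alternate engine: 'matches' strings with a fixed prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def match(self, string):
        # Returns a non-None value on success and None on failure, which
        # is all that the callers below rely on.
        return self.prefix if string.startswith(self.prefix) else None


def matches_duck(pattern, string):
    """'is-a' by behavior: anything with a suitable match() is accepted."""
    return pattern.match(string) is not None


def matches_typed(pattern, string):
    """'is-a-descendant-of': only re's own pattern type is accepted."""
    if not isinstance(pattern, CompiledRegexp):
        raise TypeError("not a compiled regular expression")
    return pattern.match(string) is not None
```

Here matches_duck() accepts a PrefixPattern while matches_typed() raises TypeError for it, even though the object behaves acceptably for this caller's purposes.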

(I'm sure I've read about this before, but there is a certain vividness to things this time around because I've had my nose rubbed in this.)

python/InheritanceVsInterface written at 00:50:50

2009-11-15

A limitation of Python types from C extension modules

It's recently struck me that there is an important difference between types (and classes) created in a Python module and types/classes that come from a C-level extension module.

Suppose that duck typing is not enough and so you really want to make a class that inherits from an outside class (one in another module), yet overrides all of its behavior. This lets you create objects that work the way you need them to but will pass isinstance() checks that are insisting on instances of the original class. Specifically, you want to be able to create instances of your new class without going through the normal object initialization process of your parent class.

(Yes, you'll need to do your own initialization instead to make your version of the behavior all work out, since once you're not using the parent type's initialization you can't assume that any of the parent's other methods keep working.)

If the outside module is a Python module, you can always (or perhaps almost always) do this. If the outside module is a C extension module, there is no guarantee that you will be able to do this (and sometimes you may not even be able to create your descendant class, much less initialize new instances of it). Fundamentally, this is for the same reason that you can't use object.__new__ on everything; the C module is the only thing that knows how to set up the C-level structures for its own objects, so it has to be involved in creating new instances.

This means that types created in C modules can be effectively sealed against descent and impersonation; they simply can't be substituted for in a way that will fool isinstance(). The corollary is that using isinstance() can in some situations be a much stronger guard than you might be expecting.

(It's possible to make a C-level type inheritable; all of the core Python types are C-level types, after all, and you can do things like inherit from list and str and so on.)
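
A short demonstration of the difference, using current Python. The Outside and Impersonator classes are invented stand-ins for 'a class from someone else's Python module' and your replacement for it; the compiled regular expression type is used as the sealed C-level example because it refuses to act as a base class:

```python
import re


class Outside(object):
    """Stands in for a class from someone else's pure-Python module."""

    def __init__(self, path):
        self.data = open(path).read()  # setup we want to avoid entirely

    def describe(self):
        return "outside: %d bytes" % len(self.data)


class Impersonator(Outside):
    """Inherits from Outside but never runs Outside.__init__."""

    def __init__(self, label):
        self.label = label  # our own initialization only

    def describe(self):
        return "impersonator: %s" % self.label


fake = Impersonator("test")
print(isinstance(fake, Outside))

# The C-level compiled regexp type refuses to be a base class at all.
CompiledRegexp = type(re.compile(""))
sealed = False
try:
    class FakeRegexp(CompiledRegexp):
        pass
except TypeError as exc:
    sealed = True
    print("cannot subclass:", exc)


# An inheritable C-level type such as list works normally.
class MyList(list):
    pass
```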

python/CModuleTypeLimitation written at 01:27:50

2009-11-14

How to defer things in Exim

Normally, Exim routers will only accept or fail addresses (or be uninterested in them). This is good enough for normal handling of addresses, but if you are using routers to their full power, there are times when you want to force routers to defer addresses instead. There are two general ways to do this.

(Unsuccessful DNS lookups can cause addresses to defer, but this is not normally under your control.)

The straightforward way is to use a separate router to explicitly defer the address using the :defer: action of the redirect driver, like so:

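defer_addr:
  driver = redirect
  # allow_defer permits the use of :defer: in the redirect data
  allow_defer
  # defer the address; the text after :defer: is the reason given
  data = :defer:stalling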

  [... whatever condition needed ...]

Using a separate router is straightforward and makes for clear log messages about what is going on. However, it's not always possible (or desirable) to use a separate router. In that case you can abuse string expansion to cause an expansion failure while expanding some option where this will force the router to defer.

This is moderately tricky for two reasons. First, you cannot just force string expansion to fail explicitly (via an ${if} or the like), because explicit failure doesn't wind up causing options to defer this way; instead, the router generally fails or passes on the address. Only 'natural' expansion failure, for reasons that Exim thinks are outside of your control, causes a deferral. The one case that I know of is if you use ${readfile} on a nonexistent file.

Second, you need to pick a router option where expansion failure causes a deferral and, ideally, that you are not already using. The Exim documentation is the final authority on what router options will do for this (see generic options for routers and check what each option does on non-forced expansion failure); the one that I have found useful in our mailer configuration is address_data. Thus, part of our deliver-to-/var/mail router looks like:

postbox:
  driver = accept
  transport = local_delivery
  # make sure it's mounted
  address_data = ${readfile{/var/mail/.MOUNTED}}

  [....]

(Our /var/mail is NFS mounted on the mail server, and obviously we only want to do deliveries there if it is the real, NFS-mounted filesystem, not the empty directory that's visible if the mount has failed for some reason. .MOUNTED is just an empty file.)

The drawback of this approach is that Exim will log alarmed looking and rather cryptic error messages if the condition ever fails and forces messages to be deferred, so it is best reserved for conditions that you don't expect to happen very often.

sysadmin/EximDeferRouters written at 02:39:55

2009-11-13

(Ab)using Exim routers for their full power

Officially, as reflected in the documentation, Exim routers are expected to take more or less disjoint sets of addresses; for example, you have one router to do DNS lookups and SMTP for external addresses, one router to handle aliases, one router to expand the .forwards of people with them, and one router to deliver to people's mailboxes for people without .forwards. This makes the ordering of the routers relatively unimportant; approached this way, it is used mostly to make writing routers more convenient by letting you be less neurotically careful about what addresses a router applies to.

(There is one exception; traditional .forward handling absolutely requires ordering and cannot be done with router conditions.)

If you want to really do powerful things with Exim routers, you need to go beyond this view. Instead, you should think of routers as (conditional) steps, or decision points, in a peculiar programming language. Not all decision points apply (or potentially apply) to all addresses, but it is entirely natural that multiple routers potentially apply (depending on circumstances) to the same set of addresses; each such router is a step in the conditional handling logic for these addresses.

(This mindset sounds simple when I explain it, but I don't think that it's obvious from the current Exim documentation. I've certainly seen a fair number of 'how to do X' questions asked on the Exim mailing list by people who clearly hadn't made this conceptual leap.)

Once you think of routers this way, ordering becomes important; for routers that handle the same set of addresses, the relative ordering of the routers is the ordering of decision steps about those addresses. Often you have something close to a total order of routers because you will want to do some common things with all addresses.

To make all of this less abstract, here is the list of decisions that our central mail system makes about external addresses, each implemented with a separate router:

  1. is this a locally generated bounce of a spam message? discard if so
  2. is this a looping bounce message? discard if so
  3. is all further handling of this address being manually deferred?
  4. if this is a spam message, has it exceeded the timeout interval for this address's domain? bounce if so
  5. route the address with DNS lookups and deliver the message via SMTP

(Some but not all of these also apply to internal addresses.)
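
The 'routers as ordered decision steps' view of the list above can be modelled in a few lines of Python. This is purely a toy model of the mindset, not a description of how Exim is implemented or configured; the message fields, the verdict strings, HELD_ADDRESSES, and the timeout table are all invented for the sketch:

```python
DECLINE = None  # what a router returns when it is uninterested

HELD_ADDRESSES = {"held@example.org"}     # hypothetical manual hold list
SPAM_TIMEOUTS = {"example.org": 3600}     # hypothetical per-domain seconds
DEFAULT_SPAM_TIMEOUT = 86400


def spam_bounce(msg):
    """Step 1: locally generated bounce of a spam message."""
    if msg.get("local_bounce") and msg.get("spam"):
        return "discard"
    return DECLINE


def looping_bounce(msg):
    """Step 2: looping bounce message."""
    return "discard" if msg.get("looping_bounce") else DECLINE


def manual_hold(msg):
    """Step 3: handling of this address is being manually deferred."""
    return "defer" if msg.get("address") in HELD_ADDRESSES else DECLINE


def spam_timeout(msg):
    """Step 4: spam that has waited too long for its domain."""
    limit = SPAM_TIMEOUTS.get(msg.get("domain"), DEFAULT_SPAM_TIMEOUT)
    if msg.get("spam") and msg.get("age", 0) > limit:
        return "bounce"
    return DECLINE


def smtp_delivery(msg):
    """Step 5: everything that gets this far is delivered."""
    return "deliver"


# The order of this list is the order of the decision steps.
ROUTERS = [spam_bounce, looping_bounce, manual_hold, spam_timeout,
           smtp_delivery]


def route(msg, routers=ROUTERS):
    """Run the steps in order; the first one with a verdict wins."""
    for router in routers:
        verdict = router(msg)
        if verdict is not DECLINE:
            return router.__name__, verdict
    return None, "unrouteable"
```

In this model, moving manual_hold after spam_timeout would change what happens to old spam for a held address, which is the sense in which ordering matters once several routers apply to the same addresses.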

Sidebar: why .forward handling requires ordering

I cheated in my example description of Exim routers. Traditional .forward semantics allow you to put your own email address in your .forward again; this means 'deliver to me, bypassing my .forward', which usually winds up putting a copy of the message in /var/mail. If you want to support these semantics under Exim, the router that delivers messages to /var/mail cannot apply only to people who do not have .forwards, and thus has to be ordered after the router that handles .forwards.

(How Exim makes these semantics work is a little bit complicated.)

sysadmin/EximRouterPower written at 00:09:56

2009-11-12

What makes Exim work as a mailer construction kit

In light of Postfix versus Exim, you might wonder what features make Exim into a mailer construction kit. For me, the easiest way to summarize the answer is to say that Exim has the idea of what I will call a user-written mail processing pipeline (actually two of them, sort of).

By a mail processing pipeline I mean a series of steps that messages go through to decide what will happen to them and how they will be delivered. In many MTAs, this processing pipeline is more or less fixed, with you having opportunities to add a table lookup here or mangle addresses there. In Exim, there is no fixed processing pipeline; you write it entirely from scratch yourself, using relatively generic components to do most of the work. The result is that you have a great deal of flexibility in what happens in those pipelines; in other words, how messages get handled and delivered is to a large extent under your direct control.

(The two drawbacks of this are that you have to write the pipeline yourself and that it is much easier to screw things up in various ways, some of them subtle.)

Conceptually, Exim has two major places with this sort of processing flexibility. The first is deciding how to route an address to one or more delivery destinations; you write a series of what Exim calls 'routers', and then they get used in sequence to process each address in various ways, hopefully ultimately delivering them somewhere.

(The Exim documentation describes routers and this routing process in a way that makes it sound less powerful than it is.)

The other major place with such a processing pipeline is deciding what reply code to give for each SMTP command in an SMTP conversation. In Exim you do this by writing a series of ACL rules for each command, again using relatively generic components to do most of the hard work. Individual rules can do quite powerful and generic things, and in combination they are more powerful still.

(Exim also gets a fair bit of its general power from its crazy string expansion language; this comes up when writing both routers and SMTP ACL rules.)

sysadmin/EximMailerKit written at 01:36:25

2009-11-11

For universities, the Internet world has fundamentally changed

Once upon a time, the Internet was just something that you used to communicate with other universities (and companies). One consequence of this was that the university needed to provide everything for its own people; all of the services they needed had to come from the university.

This is no longer the case for universities. Increasingly, people no longer want you to be their service provider (partly because they already have their own), and on top of that other people can do bits of it better than you can (consider Google Mail versus your typical university webmail interface).

This is a major, wrenching change in how you think about providing services, and part of what makes it wrenching is that expectations have to be changed too. To put it bluntly, you can't be held responsible for the service being available, because there will be times that the service is unavailable or broken for reasons that are completely beyond your control.

This is, I think, not a trivial thing. 'Responsibility' is burned very deeply into organizations; it's in people's attitudes towards their jobs, in mission statements and organizational descriptions, and in expectations by higher administration. Letting go is hard, because it is such a fundamental change; you stop being responsible for user email, for example, and instead become 'responsible' merely for making the best choice of outside provider (or running it yourself, but let's be honest here, Google is better if you can use it).

(This assumes that you do just become responsible for picking the best outside provider. If in practice you will be held responsible if something unforeseeable goes horribly wrong with the outside provider, then the sensible and predictable managerial response is to keep doing as much in house as possible.)

PS: application to the general university tension between locally provided services and centrally provided services is left as an exercise for the reader.

tech/UniversityInternetWorld written at 00:54:52

This is part of CSpace, and is written by ChrisSiebenmann.
