Wandering Thoughts archives


An obvious thing about dealing with web spider misbehavior

Here's an important hint for operators of web spiders:

If I have to touch robots.txt or my web server configuration to deal with your spider's behavior, the easiest and safest change is to block you entirely.

It is therefore very much in the interests of web spider operators to keep me from ever having to touch my robots.txt file in the first place, because you are not Google. You should consider things like crawl-delay to be desperate last-ditch workarounds, not things that you routinely recommend to people you are crawling.
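To make it concrete how little work blocking is, here is a small sketch using Python's standard urllib.robotparser. The spider name ExampleBot and the example.org URL are made up for illustration; the point is only that two lines of robots.txt shut a well-behaved spider out completely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the two-line "block you entirely" change.
# "ExampleBot" is an invented spider name, not a real crawler.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /
"""


def allowed(robots_txt, agent, url):
    """Return True if robots_txt lets the given User-agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The blocked spider is refused; spiders not named in the file are unaffected.
    print(allowed(ROBOTS_TXT, "ExampleBot/1.0", "https://example.org/page"))
    print(allowed(ROBOTS_TXT, "OtherBot/2.0", "https://example.org/page"))
```

Anything subtler than this (per-path rules, rate hints) takes more thought and more testing on my side, which is exactly why it is less likely to happen.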

(Yes, this means that web spiders should notice issues like this for themselves. The best use for crawl-delay that I've seen suggested is as a way to tell spiders to speed up, not slow down, their usual crawling rate.)
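As a rough sketch of what 'noticing for themselves' could look like, a spider might keep a per-host delay that it lengthens when the server looks unhappy and shortens only gradually when things are fine. The class name, the thresholds, and the doubling and 10% recovery factors below are all assumptions picked for illustration, not a recommendation of specific numbers.

```python
class AdaptiveDelay:
    """Hypothetical per-host politeness delay, in seconds.

    Backs off when responses are slow or failing and recovers slowly
    when they are healthy. All thresholds are illustrative assumptions.
    """

    def __init__(self, base=1.0, minimum=1.0, maximum=300.0, slow_after=2.0):
        self.delay = base
        self.minimum = minimum
        self.maximum = maximum
        self.slow_after = slow_after

    def record(self, status, elapsed):
        """Update the delay from one response and return the new value."""
        struggling = status == 429 or status >= 500 or elapsed > self.slow_after
        if struggling:
            # Double the wait, up to the ceiling.
            self.delay = min(self.delay * 2, self.maximum)
        else:
            # Creep back toward the floor, 10% at a time.
            self.delay = max(self.delay * 0.9, self.minimum)
        return self.delay
```

A crawler loop would call record() with each response's HTTP status and elapsed time, then sleep for the returned delay before its next request to that host.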

A corollary to this is that your spider information page should do its best to concisely tell people what they get out of letting you crawl their web pages, because if people have to change their robots.txt to deal with your spider you want them to have as much reason not to block you as possible.

(You had better have a spider information page and mention it in your spider's User-agent. Ideally it will also rank high in searches for your spider's official name and for its User-agent string.)
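For illustration, one common convention is to put the information page's URL in parentheses after the spider's name and version. The sketch below builds such a User-agent and attaches it to a request using Python's standard urllib; the name ExampleBot and the example.com URL are placeholders, and no request is actually sent.

```python
import urllib.request


def build_user_agent(name, version, info_url):
    """Build a User-agent string that points at the spider's information page."""
    return f"{name}/{version} (+{info_url})"


def make_request(url, user_agent):
    """Prepare (but do not send) a request that carries the User-agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Placeholder spider name and info page URL, invented for this example.
    ua = build_user_agent("ExampleBot", "1.0", "https://example.com/bot.html")
    req = make_request("https://example.org/", ua)
    print(req.get_header("User-agent"))
```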

web/SpiderRobotsTxtHint written at 22:59:34

Principles of email in the modern age

This is not the Internet that we used to have, and so email is not what it used to be; now it is less. So I think that we need (or could do with) some principles of email for the modern age of the Internet, things that can guide people writing applications that might use email as part of their interactions with the world.

Now, a disclaimer: people are going to have different views of this. My view is a tired and somewhat cynical anti-spam biased one, added to sysadmin caution; optimists will be, well, more optimistic.

So, in my view, here are some principles of email in the modern age:

  • email can only be sent to people who've already registered with you; among other consequences, you can never send email to person B just because person A said it was okay.

  • email is not reliable; there are too many spam filters and too many people hitting delete really fast because your subject line looks suspect or they've never heard of you or whatever.

  • email is not trustable, or at least you should not train your users that it is, because your users are generally incapable of correctly judging whether or not they should trust a specific piece of email.

The last principle is a bit subtle. If your users get specific trustable information in email, you are training them to trust the information that they read in 'your' email. Phishers and other malicious parties love that, because they can forge your email and most people, who are not suspicious, will believe it.

There are probably more sensible principles that I am not thinking of right now. Suggestions are welcome.

(Note that I am skipping operational issues.)

spam/ModernEmail written at 00:20:37
