2009-03-18
An obvious thing about dealing with web spider misbehavior
Here's an important hint for operators of web spiders:
If I have to touch robots.txt or my web server configuration to deal with your spider's behavior, the easiest and safest change is to block you entirely.
It is therefore very much in the interests of web spider operators
to keep me from ever having to touch my robots.txt file in the
first place, because you are not Google.
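To make the asymmetry concrete, here is a minimal sketch of what "block you entirely" looks like from the spider's side, using Python's standard urllib.robotparser module. The spider name ExampleSpider, the second agent name OtherBot, the example.org URL, and the robots.txt contents are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ExampleSpider" is an invented name; the first
# stanza is the two-line change that blocks it from the whole site.
ROBOTS_TXT = """\
User-agent: ExampleSpider
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def parse_robots(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = parse_robots(ROBOTS_TXT)
page = "https://example.org/index.html"

# The blocked spider may fetch nothing; everyone else falls under "*".
spider_allowed = parser.can_fetch("ExampleSpider", page)
other_allowed = parser.can_fetch("OtherBot", page)
other_delay = parser.crawl_delay("OtherBot")
```

The blanket block is two lines and cannot go wrong, which is why it is the change a busy operator reaches for first.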
You should consider things like crawl-delay to be desperate
last-ditch workarounds, not things that you routinely recommend to
people you are crawling.
(Yes, this means that web spiders should notice issues like this for
themselves. The best use for crawl-delay that I've seen suggested is
as a way to tell spiders to speed up, not slow down, their usual
crawling rate.)
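As a rough sketch of a spider noticing problems for itself, the class below backs off when a server looks stressed and slowly speeds back up when it looks healthy. The class name, the status codes treated as overload signals, the two-second threshold, and the multipliers are all assumptions picked for illustration, not recommended values.

```python
class AdaptiveDelay:
    """Per-host crawl delay that the spider adjusts on its own.

    All thresholds here are illustrative assumptions.
    """

    def __init__(self, initial=5.0, minimum=1.0, maximum=300.0):
        self.delay = initial      # seconds to wait before the next request
        self.minimum = minimum
        self.maximum = maximum

    def record(self, status, response_seconds):
        """Update the delay after one request and return the new value."""
        # Treat explicit "slow down" statuses and slow responses as
        # signs that the server is struggling.
        overloaded = status in (429, 503) or response_seconds > 2.0
        if overloaded:
            # Back off quickly: double the delay, up to the ceiling.
            self.delay = min(self.delay * 2, self.maximum)
        else:
            # Recover slowly: shave 10% off, down to the floor.
            self.delay = max(self.delay * 0.9, self.minimum)
        return self.delay
```

Backing off fast and recovering slowly is a deliberate choice here: it errs on the side of the site being crawled, which is the side that decides whether you get blocked.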
A corollary to this is that your spider information page should do its best to concisely tell people what they get out of letting you crawl their web pages, because if people have to change their robots.txt to deal with your spider you want them to have as much reason not to block you as possible.
(You had better have a spider information page and mention it in your
spider's User-agent. Ideally it will also rank high in searches for
your spider's official name and for its User-agent string.)
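One common way to mention the information page is to put its URL in the User-agent string itself, prefixed with a plus sign inside parentheses. The sketch below assumes that convention; the spider name, version, and URLs are hypothetical, and nothing is actually sent over the network.

```python
from urllib.request import Request


def build_user_agent(name, version, info_url):
    """Return a User-agent string that points at the spider's info page."""
    return f"{name}/{version} (+{info_url})"


# Hypothetical spider identity and information page.
USER_AGENT = build_user_agent(
    "ExampleSpider", "1.0", "https://spider.example.com/about"
)

# Build (but do not send) a request carrying that header.
request = Request(
    "https://example.org/index.html",
    headers={"User-Agent": USER_AGENT},
)
```

Someone who finds this string in their logs can then get to your explanation without having to search for it.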
Principles of email in the modern age
This is not the Internet that we used to have, and so email is not what it used to be; now it is less. So I think that we need (or could do with) some principles of email for the modern age of the Internet, things that can guide people writing applications that might use email as part of their interactions with the world.
Now, a disclaimer: people are going to have different views of this. My view is a tired and somewhat cynical anti-spam biased one, added to sysadmin caution; optimists will be, well, more optimistic.
So, in my view, here are some principles of email in the modern age:
- email can only be sent to people who've already registered with you;
among other consequences, you can never send email to person B
just because person A said it was okay.
- email is not reliable; there are too many spam filters and too many
people hitting delete really fast because your subject line looks
suspect or they've never heard of you or whatever.
- email is not trustable, or at least you should not train your users that it is, because your users are generally incapable of correctly judging whether or not they should trust a specific piece of email.
The last principle is a bit subtle. If your users get specific trustable information in email, you are training them to trust the information that they read in 'your' email. Phishers and other malicious parties love that, because they can forge your email and most people, who are not suspicious, will believe it.
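For application writers, the three principles might translate into something like the sketch below. It is only one possible reading: the Notifier class and its fields are hypothetical, and lowercasing addresses for comparison is a simplifying assumption. Email goes only to addresses that registered themselves, the authoritative copy of every message lives inside the application, and the email itself carries no specific information that a user would need to trust.

```python
class Notifier:
    """Hypothetical notifier that treats email as an untrusted, lossy hint."""

    GENERIC_BODY = "You have a new notice; log in to read it."

    def __init__(self, registered_addresses):
        # Principle one: only addresses whose owners signed up themselves.
        self.registered = {addr.lower() for addr in registered_addresses}
        self.in_app = []   # authoritative copy, shown after login
        self.outbox = []   # best-effort email queue; delivery is not assumed

    def notify(self, address, message):
        """Queue a notice; return False if the address never registered."""
        address = address.lower()
        if address not in self.registered:
            # Person A vouching for person B is not enough.
            return False
        # Principle two: the real message never depends on email arriving.
        self.in_app.append((address, message))
        # Principle three: the email holds nothing specific to trust.
        self.outbox.append((address, self.GENERIC_BODY))
        return True
```

A forged copy of such an email gains an attacker little, because users were never taught to act on anything the email itself says.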
There are probably more sensible principles that I am not thinking of right now. Suggestions are welcome.
(Note that I am skipping operational issues.)