2006-03-02
A robots.txt surprise
Because I don't really like banning MSNBot, MSN Search's web spider, I decided to drop our ban and see if its behavior had improved since last September. The process of doing this has led me to a little surprise about how at least MSNBot matches User-Agent lines in robots.txt.
From looking at our logs, I already knew that MSNBot was still visiting; it pulled robots.txt at least once a day. So all I needed to do was change robots.txt so that it wouldn't be banned.
Since I wanted to note down when I removed the ban, I just added a suffix on the User-Agent string, changing from banning 'msnbot' to banning 'msnbot-reenabled-2006-02-14'. To my surprise nothing happened, so I changed it again, putting 'X-20060222-' on the front. Still nothing happened.
Finally, yesterday evening I changed 'msnbot' to 'mXsXnbXot'. Within 12 hours, MSNBot had started crawling pages here.
The MSNBot web page is rather non-specific about how MSNBot decides
whether or not it's excluded; all of their examples certainly use just
'msnbot' as the User-Agent string. A prefix match made sense to me,
since it doesn't hose people who put things like 'msnbot/1.0' in their
robots.txt, but the rest was surprising.
It turns out that this is actually recommended behavior; the Standard for Robot Exclusion web page says:
The robot should be liberal in interpreting [the User-Agent] field. A case insensitive substring match of the name without version information is recommended.
I don't know how many robots follow this, but MSNBot evidently does. Good for them.
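As a minimal sketch of what that recommendation implies (this is not MSNBot's actual code, which isn't public; the function name `field_matches_robot` is made up for illustration), a case-insensitive substring match explains all three results above. The second test string is an assumption about what the edited line looked like after the prefix was added, and the `*` wildcard is deliberately not handled here.

```python
def field_matches_robot(field_value: str, robot_name: str) -> bool:
    """Return True if a robots.txt User-Agent field value applies to the robot.

    Follows the recommendation quoted above: a case-insensitive substring
    match of the robot's name (without version information) against the
    field value. The '*' wildcard is not handled in this sketch.
    """
    return robot_name.lower() in field_value.lower()


ROBOT_NAME = "msnbot"  # the robot's bare name, without version information

# User-Agent field values from the experiment described above. The second
# 'X-20060222-' string is an assumed reconstruction of the edited line.
candidates = [
    "msnbot",
    "msnbot/1.0",
    "msnbot-reenabled-2006-02-14",
    "X-20060222-msnbot-reenabled-2006-02-14",
    "mXsXnbXot",
]

for value in candidates:
    verdict = "still banned" if field_matches_robot(value, ROBOT_NAME) else "not matched"
    print(f"{value!r}: {verdict}")
```

Only 'mXsXnbXot' fails to contain 'msnbot' as a substring, which is consistent with it being the only change that let the crawler back in.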
The :; shell prompt trick
For years, I've had a somewhat unusual shell prompt. It looks like this:
: <host> ;
(where <host> is the hostname of the current machine.)
Putting the hostname in your prompt is pretty ordinary, but what's
the other stuff? These days, a more typical shell prompt is something
like 'cks@newman:~$', to quote a Debian example. (And many
people use more elaborate prompts, such as Jamie Zawinski's.)
The trick here is that the : and ; turn my prompt into a valid shell
command that does nothing. This makes cutting and pasting previous
commands in things like xterm much easier, since I don't have to
carefully get just the command while avoiding the prompt. (In xterm
it's just a quick triple click, but then xterm is very good at this.)
(In practice I am sufficiently neurotically neat that I select just the
command, because seeing a doubled prompt looks wrong. This might be
different if my prompt was just ':;', but I need the host name in it
to keep things straight.)
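The reason the trick works is that ':' is the shell's null command (it ignores its arguments and succeeds) and ';' ends that command, so whatever follows runs normally. A small sketch to check this, assuming a POSIX `sh` is on the PATH; the hostname 'newman' is just borrowed from the Debian example above, and `run_pasted_line` is a hypothetical helper name:

```python
import subprocess


def run_pasted_line(line: str) -> str:
    """Run a pasted 'prompt + command' line through sh and return its stdout.

    Assumes a POSIX 'sh' is available; raises CalledProcessError if the
    line fails to run. (capture_output needs Python 3.7 or later.)
    """
    result = subprocess.run(
        ["sh", "-c", line],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# A whole line as it would be triple-click selected: prompt, then command.
pasted = ": newman ; echo hello"
print(run_pasted_line(pasted), end="")

# Pasting onto a fresh prompt in the same style doubles the prefix; it is
# still just two no-op commands followed by the real one.
doubled = ": newman ; : newman ; echo hello"
print(run_pasted_line(doubled), end="")
```

By contrast, a line pasted with a prompt like 'cks@newman:~$' in front would normally fail, since the shell would try to run the prompt text as a command name.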
This trick is not original to me; I believe I got it from observing Geoff Collyer, many years ago.
Sidebar: xterm's double-click selections
One reason I don't use this more is that xterm's double-click
selection mode makes selecting most things pretty fast anyways.
For those who aren't aware of it, when you start a selection by
double-clicking instead of single-clicking, xterm grows the selection
by words instead of characters. (Try it; it's more intuitive than I
make it sound.)
Embarrassingly, I spent years using xterm before I found out about
this. Now I use it all the time, and hardly ever have to select by
characters.