A robots.txt surprise

Because I don't really like banning MSNBot, MSN Search's web spider, I decided to drop our ban and see if its behavior had improved since last September. The process of doing this has led me to a little surprise about how at least MSNBot matches User-Agent lines in robots.txt.

From looking at our logs, I already knew that MSNBot was still visiting; it pulled robots.txt at least once a day. So all I needed to do was change robots.txt so that it wouldn't be banned.

Since I wanted to note down when I removed the ban, I just added a suffix to the User-Agent string, changing from banning 'msnbot' to banning 'msnbot-reenabled-2006-02-14'. To my surprise nothing happened; MSNBot still didn't crawl anything. So I changed it again, putting 'X-20060222-' on the front. Still nothing happened.

Finally, yesterday evening I changed 'msnbot' to 'mXsXnbXot'. Within 12 hours, MSNBot had started crawling pages here.

The MSNBot web page is rather non-specific about how MSNBot decides whether or not it's excluded; all of their examples use just 'msnbot' as the User-Agent string. A prefix match made sense to me, since it doesn't hose people who put things like 'msnbot/1.0' in their robots.txt, but still matching when 'msnbot' was buried in the middle of a longer string was surprising.

It turns out that this is actually recommended behavior; the Standard for Robot Exclusion web page says:

The robot should be liberal in interpreting [the User-Agent] field. A case insensitive substring match of the name without version information is recommended.
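
To make that recommendation concrete, here is a minimal Python sketch of such a match. It is my reading of the quoted sentence, not MSNBot's actual code; the function name ua_line_matches is made up for illustration, and treating everything after the first '/' as version information is an assumption.

```python
def ua_line_matches(robot_name: str, ua_field: str) -> bool:
    """Guess at whether a robots.txt User-Agent value applies to a robot.

    Follows the recommendation quoted above: a case-insensitive
    substring match of the robot's name, minus version information.
    """
    field = ua_field.strip().lower()
    if field == "*":
        # '*' is the catch-all record in the robots.txt standard.
        return True
    # Assumption: version information is whatever follows the first '/'.
    name = robot_name.split("/", 1)[0].strip().lower()
    if not name:
        return False
    # The robot's name only has to appear somewhere in the field.
    return name in field


# The second string is a guess at what robots.txt contained; the exact
# text after the 'X-20060222-' prefix is not recorded.
for value in ("msnbot-reenabled-2006-02-14",
              "X-20060222-msnbot-reenabled-2006-02-14",
              "mXsXnbXot"):
    print(value, ua_line_matches("msnbot", value))
```

Under this reading, both of my first two attempts still contain 'msnbot' and so still count as a ban; only 'mXsXnbXot' stops matching.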

I don't know how many robots follow this, but MSNBot evidently does. Good for them.

This is part of CSpace, and is written by ChrisSiebenmann.

Written on 02 March 2006.
Last modified: Thu Mar 2 16:20:13 2006