== A robots.txt surprise

Because I don't really like [[banning MSNBot BanningMSNBot]], MSN Search's web spider, I decided to drop our ban and see if its behavior had improved since [[last September CrazyMSNCrawler]]. The process of doing this has led me to a little surprise about how at least MSNBot matches User-Agent lines in [[robots.txt]].

From looking at our logs, I already knew that MSNBot was still visiting; it pulled [[robots.txt]] at least once a day. So all I needed to do was change robots.txt so that it wouldn't be banned. Since I wanted to note down when I removed the ban, I just added a suffix on the User-Agent string, changing from banning 'msnbot' to banning 'msnbot-reenabled-2006-02-14'.

To my surprise nothing happened, so I changed it again, putting 'X-20060222-' on the front. *Still* nothing happened. Finally, yesterday evening I changed 'msnbot' to 'mXsXnbXot'. Within 12 hours, MSNBot had started crawling pages here.

The [[MSNBot web page http://search.msn.com/msnbot.htm]] is rather non-specific about how MSNBot decides whether or not it's excluded; all of their examples certainly use just '_msnbot_' as the User-Agent string. A prefix match made sense to me, since it doesn't hose people who put things like '_msnbot/1.0_' in their robots.txt, but the rest was surprising.

It turns out that this is actually recommended behavior; the [[Standard for Robot Exclusion http://www.robotstxt.org/wc/norobots.html]] web page says:

> The robot should be liberal in interpreting [the User-Agent] field. A
> case insensitive substring match of the name without version
> information is recommended.

I don't know how many robots follow this, but MSNBot evidently does. Good for them.
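
Putting the inferred rule into code makes it concrete. The sketch below is only my reconstruction, in Python, of the behavior observed above; the 'ua_line_applies' function name is made up and this is certainly not MSNBot's actual code. It treats a robots.txt User-Agent value as applying to a robot whenever the robot's name, minus any version information, appears anywhere in the value, compared case-insensitively.

    # A sketch of the User-Agent matching rule MSNBot appears to use,
    # reconstructed from the behavior above; purely illustrative.
    def ua_line_applies(robots_value, robot_name):
        """Does a 'User-Agent: <robots_value>' robots.txt line apply to
        a robot that calls itself robot_name?"""
        # '*' is the standard's catch-all agent.
        if robots_value.strip() == "*":
            return True
        # Drop version information ('msnbot/1.0' -> 'msnbot') and compare
        # case-insensitively, as the standard recommends.
        name = robot_name.split("/")[0].lower()
        return name in robots_value.strip().lower()

    # Replaying the experiment against this rule:
    assert ua_line_applies("msnbot", "msnbot")                    # the original ban
    assert ua_line_applies("msnbot-reenabled-2006-02-14", "msnbot")   # suffix added: still banned
    assert ua_line_applies("X-20060222-msnbot-reenabled-2006-02-14", "msnbot")  # prefix too: still banned
    assert not ua_line_applies("mXsXnbXot", "msnbot")             # finally unbanned

Under this rule every one of my renamed tokens except the last still contains 'msnbot', which would explain why only the 'mXsXnbXot' change finally lifted the ban.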