A robots.txt surpriseBecause I don't really like banning MSNBot, MSN Search's web spider, I decided to drop our ban and see if its behavior had improved since last September. The process of doing this has led me to a little surprise about how at least MSNBot matches User-Agent lines in robots.txt. From looking at our logs, I already knew that MSNBot was still visiting; it pulled robots.txt at least once a day. So all I needed to do was change robots.txt so that it wouldn't be banned. Since I wanted to note down when I removed the ban, I just added a suffix on the User-Agent string, changing from banning 'msnbot' to banning 'msnbot-reenabled-2006-02-14'. To my surprise nothing happened, so I changed it again, putting 'X-20060222-' on the front. Still nothing happened. Finally, yesterday evening I changed 'msnbot' to 'mXsXnbXot'. Within 12 hours, MSNBot had started crawling pages here. The MSNBot web page is rather
non-specific about how MSNBot decides whether or not it's excluded;
all of their examples certainly use just ' It turns out that this is actually recommended behavior; the Standard for Robot Exclusion web page says:
I don't know how many robots follow this, but MSNBot evidently does. Good for them. |
These are my WanderingThoughts GettingAround This is part of CSpace, and is written by ChrisSiebenmann. * * * Atom feeds are available; see the bottom of most pages. Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web |