2011-07-02
Dear Googlebot: SMTP is not HTTP
From the logs of an SMTP server here:
    32301# remote from [66.249.67.36]
    32301r GET /robots.txt HTTP/1.1
    32301w 550 Syntax error
    32301r Host: 128.100.3.51:25
    32301w 550 Unknown command 'Host'
    32301r Connection: Keep-alive
    32301w 550 Syntax error
    32301r Accept: text/plain,text/html
    32301w 550 Syntax error
    32301r From: googlebot(at)googlebot.com
    32301w 550 Unknown command 'From'
    32301# aborted: session terminated
(The abort is from my server, which drops connections after too many syntax errors.)
Then it immediately tries the same thing with 'GET / HTTP/1.1' instead. Oh, and this is nowhere near the first time that Googlebot has tried this; the first instance in my logs dates from 2007.
Yes, I'm sure that somewhere there is something that looks like an HTTP link to port 25 on this IP address (although Google doesn't know about it; I've tried the obvious web search). But this is still a failure on Google's part, because they should be much more careful than this with any 'url' that involves a port that is known to be used for another protocol. Sure, someone could be running a web server on port 25 against all expectations, but the odds are far better that someone has created a bad or malicious link. And certainly when Googlebot has been receiving SMTP replies for years, it should stop attempting to crawl entirely.
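To make the 'be careful with known ports' idea concrete, here is a minimal sketch of the kind of check a crawler could run before fetching a URL. I have no idea how Googlebot actually works inside; the port table and the function name `suspicious_port` are made up for illustration, and a real list would be longer.

```python
from urllib.parse import urlsplit

# Hypothetical table for illustration: ports conventionally assigned to
# protocols other than HTTP. A real crawler would want a longer list.
NON_HTTP_PORTS = {
    21: "ftp",
    22: "ssh",
    23: "telnet",
    25: "smtp",
    110: "pop3",
    143: "imap",
    465: "smtps",
    587: "submission",
}


def suspicious_port(url):
    """Return the name of the protocol that conventionally owns the
    URL's explicit port, or None if the port is absent or not listed."""
    port = urlsplit(url).port
    if port is None:
        # No explicit port, so the scheme's default applies.
        return None
    return NON_HTTP_PORTS.get(port)
```

A crawler might treat a non-None result as a reason to give the URL extra scrutiny, for example giving up on the host after the first protocol error instead of trying again.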
The other failure is that Googlebot should not have made the second query for / after its attempt to retrieve robots.txt failed. This was not a web server telling Googlebot 'there is no such file here'; this was the retrieval itself failing with protocol errors. Even if Googlebot does not specifically have recognizers for SMTP responses (and I maintain that it should), an odd port plus protocol failures should mean 'this is probably not a web server, stop now'.
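A recognizer for this does not need to be elaborate. The following is a rough sketch, not anything Googlebot is known to do: it looks at the first line that comes back and decides whether it is an HTTP status line, something shaped like an SMTP reply (a three-digit code followed by text), or neither. The function names are invented for the example, and since other protocols such as FTP also use three-digit reply codes, the result is labelled 'smtp-like' rather than 'smtp'.

```python
import re

# An HTTP status line looks like "HTTP/1.1 404 Not Found".
HTTP_STATUS = re.compile(r"^HTTP/\d(\.\d)? \d{3}( |$)")

# An SMTP reply line is a three-digit code (first digit 2-5) followed
# by a space, a hyphen (for multi-line replies), or the end of the line.
SMTP_REPLY = re.compile(r"^[2-5]\d\d([ -]|$)")


def classify_first_line(line):
    """Classify the first line of a server response as 'http',
    'smtp-like', or 'unknown'."""
    line = line.rstrip("\r\n")
    if HTTP_STATUS.match(line):
        return "http"
    if SMTP_REPLY.match(line):
        return "smtp-like"
    return "unknown"


def should_keep_crawling(first_line):
    """Only continue with further requests if the server answered
    with something that is actually HTTP."""
    return classify_first_line(first_line) == "http"
```

With the session above, the first thing my server wrote back was '550 Syntax error', which this would classify as 'smtp-like', and the crawler would never get as far as asking for /.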
PS: I'm aware that part of the blame falls on my MTA for being so old that it doesn't immediately disconnect Googlebot for illegal pipelining (I assume that that's what's happening here).