Wandering Thoughts archives

2011-07-02

Dear Googlebot: SMTP is not HTTP

From the logs of an SMTP server here:

32301#  remote from [66.249.67.36]
32301r  GET /robots.txt HTTP/1.1
32301w  550 Syntax error
32301r  Host: 128.100.3.51:25
32301w  550 Unknown command 'Host'
32301r  Connection: Keep-alive
32301w  550 Syntax error
32301r  Accept: text/plain,text/html
32301w  550 Syntax error
32301r  From: googlebot(at)googlebot.com
32301w  550 Unknown command 'From'
32301#  aborted: session terminated

(The abort is from my server, which drops connections after too many syntax errors.)
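For illustration, the drop-after-too-many-errors behaviour can be sketched roughly as below. This is not my MTA's actual code; the threshold of five, the command list, and the `run_session` name are all assumptions made up for the sketch, and the replies are simplified to a single error message.

```python
# Hypothetical sketch of "drop the connection after too many syntax errors".
# MAX_ERRORS is an assumed threshold; the real server's limit is not stated.
MAX_ERRORS = 5

# A small, illustrative set of SMTP verbs the sketch accepts.
KNOWN_COMMANDS = {"HELO", "EHLO", "MAIL", "RCPT", "DATA",
                  "RSET", "NOOP", "VRFY", "QUIT"}


def run_session(lines, max_errors=MAX_ERRORS):
    """Return the list of server replies for the given client lines."""
    errors = 0
    replies = []
    for line in lines:
        parts = line.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb in KNOWN_COMMANDS:
            # Simplification: every recognised command just gets a 250.
            replies.append("250 OK")
            continue
        errors += 1
        replies.append("550 Syntax error")
        if errors >= max_errors:
            # Too many bad commands: give up on this client.
            replies.append("aborted: session terminated")
            break
    return replies


googlebot = [
    "GET /robots.txt HTTP/1.1",
    "Host: 128.100.3.51:25",
    "Connection: Keep-alive",
    "Accept: text/plain,text/html",
    "From: googlebot(at)googlebot.com",
]
for reply in run_session(googlebot):
    print(reply)
```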

Then it immediately tries the same thing with 'GET / HTTP/1.1' instead. Oh, and this is nowhere near the first time that Googlebot has tried this; the first instance in my logs dates from 2007.

Yes, I'm sure that somewhere there is something that looks like an HTTP link to port 25 on this IP address (although Google doesn't know about it; I've tried the obvious web search). But this is still a failure on Google's part, because they should be much more careful than this with any 'URL' that involves a port that is known to be used for another protocol. Sure, someone could be running a web server on port 25 against all expectations, but the odds are far better that someone has created a bad or malicious link. And certainly when Googlebot has been receiving SMTP replies for years, it should stop attempting to crawl entirely.

The other failure is that Googlebot should not have made the second query for / after its attempt to retrieve robots.txt failed. This was not a web server telling Googlebot 'there is no such file here'; this was the retrieval itself failing with protocol errors. Even if Googlebot does not specifically have recognizers for SMTP responses (and I maintain that it should), an odd port plus protocol failures should mean 'this is probably not a web server, stop now'.
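To make the kind of check I mean concrete, here is a minimal sketch of a crawler-side test on the first line of a response. I have no idea how Googlebot is actually structured; the function names, the port table, and the regular expressions are all my own assumptions, and the port list is deliberately incomplete.

```python
import re

# Illustrative, incomplete table of ports usually used by other protocols.
NON_HTTP_PORTS = {21: "ftp", 22: "ssh", 25: "smtp", 110: "pop3", 143: "imap"}

# An HTTP/1.x response starts with a status line like "HTTP/1.1 200 OK".
HTTP_STATUS_LINE = re.compile(r"^HTTP/\d(\.\d)? \d{3}( |$)")
# An SMTP reply line starts with a three-digit code, then a space or '-'.
SMTP_REPLY_LINE = re.compile(r"^[2-5]\d{2}([ -]|$)")


def classify_first_line(line):
    """Guess the protocol from the first line the server sent back."""
    line = line.rstrip("\r\n")
    if HTTP_STATUS_LINE.match(line):
        return "http"
    if SMTP_REPLY_LINE.match(line):
        return "smtp"
    return "unknown"


def should_keep_crawling(port, first_line):
    """Return (keep_going, reason) for a fetch that got first_line back."""
    kind = classify_first_line(first_line)
    if kind == "http":
        # Even a 404 is a real answer from a real web server.
        return True, "valid HTTP status line"
    usual = NON_HTTP_PORTS.get(port)
    if usual is not None:
        return False, "non-HTTP reply on a port normally used for " + usual
    return False, "non-HTTP reply; probably not a web server"


print(should_keep_crawling(25, "550 Syntax error\r\n"))
print(should_keep_crawling(80, "HTTP/1.1 404 Not Found\r\n"))
```

The point of the sketch is the distinction in the return value: a 404 for robots.txt is a legitimate answer and crawling / is reasonable, while anything that is not an HTTP status line means the fetch itself failed and there should be no second request.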

PS: I'm aware that part of the blame falls on my MTA for being so old that it doesn't immediately disconnect Googlebot for illegal pipelining (I assume that that's what's happening here).

GooglebotAndSMTP written at 00:35:17
