Wandering Thoughts archives

2006-02-11

The return of how to get your web spider banned

Today's entrant is 'Uptilt Inc', uptilt.com, aka 64.71.164.96/27. Let's see how they did on my scale:

Important update: it turns out I was fooled by out-of-date WHOIS information and Uptilt Inc isn't involved; for the full story, see UptiltUpdate. Keep that in mind when you read the historical references to them in the rest of this entry.

  • 1,159 requests in one day.
  • 25+ requests each for several URLs that are permanent redirections. The pages they redirect to haven't changed recently, either.

  • They had the generic user-agent string "NutchCVS/0.8-dev (Nutch; [...])". At least it included a URL to the Nutch page. (It of course did not include a URL to any of their own pages.)

  • They did fetch robots.txt frequently: 28 times in one day, in fact.

  • None of the 10 different IP addresses in 64.71.164.96/27 that hit us have reverse DNS. (In fact, nothing in the subnet has reverse DNS.)
  • The subnet has no useful contact information, apart from the fact that Hurricane Electric says it belonged to an 'Uptilt Inc'. There is an uptilt.com, but just to make you wonder, it lives in a different subnet and its WHOIS data has a different physical address. However, the uptilt.com website says Uptilt Inc's headquarters is at the same address that HE has for the owners of 64.71.164.96/27.

In short: even more searching than last time.
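For anyone wanting to pull numbers like the ones above out of their own logs, here is a minimal sketch. It assumes the web server writes Common Log Format lines, which this entry doesn't actually say; the `tally` function and the sample lines (dates, paths, and all) are hypothetical and only there for illustration. Only the subnet comes from this entry.

```python
import ipaddress
from collections import Counter

# The subnet named in this entry.
SUBNET = ipaddress.ip_network("64.71.164.96/27")


def tally(log_lines, subnet=SUBNET):
    """Count requests, distinct client IPs, and robots.txt fetches from one subnet.

    Assumes Common Log Format, where the client address is the first
    whitespace-separated field and the request path is the seventh.
    """
    per_ip = Counter()
    robots = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # too short to be a complete log line
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is a hostname or garbage, not an IP
        if addr not in subnet:
            continue
        per_ip[fields[0]] += 1
        if fields[6] == "/robots.txt":
            robots += 1
    return {"requests": sum(per_ip.values()), "ips": len(per_ip), "robots": robots}


# Hypothetical sample lines, made up for illustration (not real log data).
SAMPLE = [
    '64.71.164.100 - - [01/Jan/2006:10:00:00 -0500] "GET /robots.txt HTTP/1.0" 200 120',
    '64.71.164.101 - - [01/Jan/2006:10:00:05 -0500] "GET /blog/ HTTP/1.0" 301 0',
    '64.71.164.100 - - [01/Jan/2006:10:00:09 -0500] "GET /blog/ HTTP/1.0" 301 0',
    '64.71.164.128 - - [01/Jan/2006:10:00:12 -0500] "GET / HTTP/1.0" 200 500',
    '192.0.2.7 - - [01/Jan/2006:10:00:15 -0500] "GET / HTTP/1.0" 200 500',
]

if __name__ == "__main__":
    print(tally(SAMPLE))
```

Note that 64.71.164.128 falls just outside the /27 (which covers .96 through .127), so the fourth sample line isn't counted. Counting the repeated fetches of permanent redirections would need the status code field as well; that's left out here to keep the sketch short.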

  • Of course, www.uptilt.com has no information on any spidering activity they may be doing. Instead, it has lots of information on them being a "leading provider of Marketing Automation software solutions", and their subsidiary emaillabs.com being a "leading provider of advanced email marketing solutions".
  • They lose points for having prominent links to a website called 'crm.uptilt.com', which doesn't exist. Some of the links to their privacy policy and so on don't work either.
  • Since around here 'email marketing' tends to be spelled S-P-A-M, I wasn't exactly encouraged to send them any email about their spider. These days if you're involved in 'email marketing', I feel that you had better bend over backwards to reassure people that you're not a spammer and you understand all the rules and so on.

Overall score: BANNED. Since their spider uses a generic user-agent string (even though it does check robots.txt), their subnet now resides in our permanent kernel-level IP blocks alongside our first contestant.

(We actually banned them a bit under two weeks ago, but I've only gotten around to writing this up now. The kernel IP block counters show that they've tried to drop by a few times since their ban.)

web/HowToGetYourSpiderBannedII written at 00:26:28
