The return of how to get your web spider banned
Today's entrant is 'Uptilt Inc', uptilt.com, aka 220.127.116.11/27.
Let's see how they did on my scale:
Important update: it turns out I was fooled by out of date WHOIS information and Uptilt Inc isn't involved; for the full story, see UptiltUpdate. Remember that when you read the historical references to them in the rest of this entry.
- 1,159 requests in one day.
- 25+ requests for several URLs that are permanent redirections.
The redirected-to pages haven't changed recently, either.
- They had the generic user-agent string "NutchCVS/0.8-dev (Nutch; [...])".
At least it included a URL to the Nutch page. (Of course, it did not
include a URL to any of their own pages.)
- They did fetch robots.txt frequently: 28 times in one day, in fact.
- None of the 10 different IP addresses in 18.104.22.168/27 that hit us have reverse DNS. (In fact, nothing in the subnet has reverse DNS.)
- The subnet has no useful contact information, apart from the fact that Hurricane Electric says it belonged to an 'Uptilt Inc'. There is an uptilt.com but, just to make you wonder, it lives in a different subnet and its WHOIS data has a different physical address. However, the uptilt.com website says Uptilt Inc's headquarters is at the same address as HE has for the owners of 22.214.171.124/27.
In short: even more searching than last time.
- Of course, www.uptilt.com has no information on any spidering activity they may be doing. Instead, it has lots of information on them being a "leading provider of Marketing Automation software solutions", and their subsidiary emaillabs.com being a "leading provider of advanced email marketing solutions".
- Since around here 'email marketing' tends to be spelled S-P-A-M, I wasn't exactly encouraged to send them any email about their spider. These days if you're involved in 'email marketing', I feel that you had better bend over backwards to reassure people that you're not a spammer and you understand all the rules and so on.
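As a rough sketch of how tallies like the ones in the list above could be pulled out of a web server log (not necessarily how they were gathered here), the Python below counts requests, distinct client IPs, robots.txt fetches, and repeat hits on permanent redirects for one subnet. It assumes Common Log Format lines; the `tally` function, its field names, and the 192.0.2.0/27 documentation subnet are all made up for illustration.

```python
import ipaddress
import re
from collections import Counter

# Hypothetical subnet: 192.0.2.0/27 is a documentation range, not the real one.
SUBNET = ipaddress.ip_network("192.0.2.0/27")

# Assumed Common Log Format prefix:
#   client-ip ident user [timestamp] "METHOD path protocol" status ...
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3})')


def tally(lines, subnet=SUBNET):
    """Summarize one subnet's requests from an iterable of log lines."""
    total = 0
    ips = set()
    robots = 0
    redirects = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a line in the assumed format
        ip_text, _method, path, status = m.groups()
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # client field is a hostname or garbage
        if ip not in subnet:
            continue
        total += 1
        ips.add(ip)
        if path == "/robots.txt":
            robots += 1
        if status == "301":
            # Count repeat requests for each permanently redirected URL.
            redirects[path] += 1
    return {
        "requests": total,
        "distinct_ips": len(ips),
        "robots_fetches": robots,
        "redirect_hits": dict(redirects),
    }
```

Feeding it a day's worth of log lines, for example `tally(open("access.log"))` with a hypothetical log file name, gives the per-day figures in one pass. Reverse DNS and WHOIS lookups are left out because they need network access.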
Overall score: BANNED. Since their spider uses a generic user-agent string (even though it does check robots.txt), their subnet now resides in our permanent kernel-level IP blocks alongside our first contestant.
(We actually banned them a bit under two weeks ago, but I've only gotten around to writing this up now. The kernel IP block counters show that they've tried to drop by a few times since their ban.)