Notice to web spiders: an email address in your user-agent isn't good enough

Every so often I turn over a rock here at Wandering Thoughts by looking at what IP addresses are making a lot of requests. Most of the time that's Bing's bot, but now and then something else floats to the top of the list, and generally it's not something that leaves a favorable impression. Today's case was clearly a web spider, from an IP address that currently resolves to 'getzonefile.commedia.io' and with the User-Agent of:

"Mozilla/5.0 (compatible; Go-http-client/1.1; +centurybot9@gmail.com)"
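(For illustration, the rock-turning step of counting requests per IP address can be sketched in a few lines of Python. This is only a sketch under an assumption the entry doesn't state: that the access log is in the common or combined log format, where the client IP is the first whitespace-separated field of each line. The function name `top_client_ips` is made up for this example.)

```python
from collections import Counter


def top_client_ips(log_lines, n=10):
    """Return the n busiest client IPs as (ip, request_count) pairs.

    Assumes each log line starts with the client IP address followed by
    whitespace, as in the common and combined log formats.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)
```

Feeding it an open log file (for example `top_client_ips(open("access.log"))`, with a hypothetical file name) gives the list whose top entries are worth a closer look.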

This has caused me to create a new rule for web spiders: just having an email address in your User-Agent is not good enough, and in fact will almost certainly cause me to block that spider on contact.

What the User-Agent of a web spider is supposed to include is a website URL where I can read about what your web spider is and what benefit I get from allowing it to crawl Wandering Thoughts. Including an email address does not provide me with this information, and it doesn't even provide me with a meaningful way of reporting problems or complaining about your web spider, because in today's spam-laden Internet environment the odds that I'm going to send email to some random address are zero (especially to complain about something that its owner is nominally doing).
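(The rule can be sketched as a small check on the User-Agent string. This is a rough illustration, not the actual blocking mechanism used here, which this entry doesn't describe. The function name and both regular expressions are assumptions made for the example: the email pattern is deliberately loose, and a "URL" is taken to mean something starting with http:// or https://.)

```python
import re

# Loose pattern for something that looks like an email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Assumption: an informational URL starts with an explicit http(s) scheme.
URL_RE = re.compile(r"https?://[^\s;)\"]+", re.IGNORECASE)


def email_but_no_url(user_agent):
    """True if the User-Agent has an email address but no http(s) URL."""
    has_email = EMAIL_RE.search(user_agent) is not None
    has_url = URL_RE.search(user_agent) is not None
    return has_email and not has_url
```

Under these assumptions the User-Agent quoted above is flagged, while one that gives a URL (with or without an email address next to it) is not.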

Of course, it turns out that this is not the only such User-Agent that I've seen (and blocked). Other ones that have shown up in recent logs are:

"MauiBot (crawler.feedback+wc@gmail.com)"

"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36; +collection@infegy.com"

"Mozilla/5.0 (compatible; um-LN/1.0; mailto: techinfo@ubermetrics-technologies.com)"

"Mediatoolkitbot (complaints@mediatoolkit.com)"

The MauiBot crawler is apparently reasonably well complained-about. I haven't found any particular mentions of the 'infegy.com' one from casual searches, but it's probably real (in one sense) given infegy.com's website.

(I also found one feed fetcher that appears to be pulling my feed with a User-Agent that lists an email address and a program name of 'arss/testing', but I've opted not to block it for now or mention the email address. If its author is reading this, you need a URL in there too.)

I'm not sure what web spider authors are thinking when they set their User-Agents up this way, and frankly I don't care (just as I don't care whether these email addresses are genuine and functional, or simply made up and bogus). On the one hand they are admitting that this is a web spider at work, but on the other hand they're fumbling at informing web server operators about their spiders.

PS: I'm aware that blocking web spiders this way is a quixotic and never-ending quest. There are a ton of nasty things out there, even among the ones that more or less advertise themselves. But sometimes I do these things anyway, because once I've turned over a rock I'm not good at looking away.

web/WebSpiderEmailNotEnough written at 02:03:51